I imagine that many interaction researchers will have been curious about how a voice-activated internet-connected device might be integrated (or not) into conversations at home. Martin Porcheron along with Stuart Reeves, Joel Fischer and Sarah Sharples (all at the University of Nottingham) went the next step, and did the research. Here Martin and Stuart explain how the research was done…
Voice-based ‘smartspeaker’ products, such as the Amazon Echo, Google Home, or Apple HomePod have become popular consumer items in the last year or two. These devices are designed for use in the home, and offer a kind of interaction where users may talk to an anthropomorphised ‘intelligent personal assistant’ which responds to things like questions and instructions. The widespread adoption of this new kind of interaction modality (i.e. voice) provided us with a great opportunity to consider how we could bring ethnomethodology and conversation analysis to bear on talk with and around such devices.
In this guest blog post, we wanted to give some background to our study of these devices and to discuss something we think might interest the ROLSI community. We recently published our findings as a paper to be presented at CHI 2018, the ACM conference for Human-Computer Interaction. We also posted a couple of more easily digestible elaborations of our findings and data:
- A Medium post based on an Interaction18 conference talk, unpacking the notion of ‘talking with machines’
- A blog post summarising some findings from the CHI 2018 paper
There are many different ways you could study how interactions with such a device could unfold. Often people do lab studies or observational studies. However, for us the key considerations were: (1) getting the most ‘naturalistic’ interactions of people with such a device, and (2) recording the conversational context in which those interactions take place, before someone says ‘Alexa’ or ‘OK Google’ to wake up the device, as well as capturing what unfolds thereafter.
To achieve this, we provided a number of households with an Amazon Echo for about a month and also gave them a custom device (built by Martin) to record interactions with the Echo.
This device, the ‘Conditional Voice Recorder’ (CVR), is essentially a Raspberry Pi (a credit-card sized, but functionally complete, computer) with a conference microphone stuck on the top. It is always listening, much like the Amazon Echo, but differs in a range of ways. Firstly, it has lights to show when it is listening and when it is recording. Secondly, it has a button to enable or disable recording, as we wanted participant households to feel comfortable with the study and have control over the data that was being collected.
What it records
To collect contextual information of how an interaction was occasioned, the CVR listens continuously for an interaction with the Amazon Echo (the Amazon Echo interaction always starts with the word ‘Alexa’) and keeps the last minute of audio in the memory of the device. When an interaction with the Echo starts, it saves the last minute of audio to an internal memory card, and also records for one further minute. If people use the Echo again in that minute, recording is extended.
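The recording logic described above can be sketched in a few lines of Python. This is only an illustration and not the CVR's actual code: the class name `ConditionalRecorder`, the idea of one audio chunk per second, and the boolean `wake_word_heard` flag (standing in for real ‘Alexa’ detection) are all assumptions made for the example.

```python
from collections import deque


class ConditionalRecorder:
    """Hypothetical sketch of pre-roll / post-roll recording (not the real CVR code)."""

    def __init__(self, pre_roll=60, post_roll=60):
        # Assumption: one chunk is one second of audio, so 60 chunks is one minute.
        self.post_roll = post_roll
        self.buffer = deque(maxlen=pre_roll)  # rolling "last minute" held in memory
        self.current = None                   # clip in progress, or None when idle
        self.remaining = 0                    # chunks left before the clip is closed
        self.clips = []                       # finished clips (stand-in for the memory card)

    def feed(self, chunk, wake_word_heard):
        if self.current is None:
            # Idle: keep only the most recent audio, discarding older chunks.
            self.buffer.append(chunk)
            if wake_word_heard:
                # Start a clip with everything currently held in memory.
                self.current = list(self.buffer)
                self.buffer.clear()
                self.remaining = self.post_roll
        else:
            # Recording: keep appending audio to the clip.
            self.current.append(chunk)
            if wake_word_heard:
                # Another use of the Echo extends the recording.
                self.remaining = self.post_roll
            else:
                self.remaining -= 1
            if self.remaining <= 0:
                self.clips.append(self.current)
                self.current = None


def record(stream, pre_roll=60, post_roll=60):
    """Run a sequence of (chunk, wake_word_heard) pairs through the recorder."""
    recorder = ConditionalRecorder(pre_roll, post_roll)
    for chunk, heard in stream:
        recorder.feed(chunk, heard)
    return recorder.clips
```

With much smaller windows than the real one-minute values, for example `record([(i, i == 4) for i in range(10)], pre_roll=3, post_roll=2)`, the single saved clip runs from shortly before the wake word to shortly after it, and everything else is discarded.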
We encountered many challenges with the CVR, not least recognising the speech of people with non-English accents. In much the same way that commercial systems struggle with people’s accents, so did this one. Fortunately, we had built in a system to remotely update the CVR once it was deployed in people’s homes. We also adjusted settings remotely to make the device more, or sometimes less, sensitive when detecting ‘Alexa’. Another concern was what would happen if the recorder crashed, or if people left it turned off without realising (something that we found was easy to do during development); the solution here was to have the device automatically turn off and on again overnight.
How to analyse it?
As any CA researchers who work with audio-only recordings will know well, one of the challenges during our research was working out the situational relevance of what we heard, particularly of interactions that seemed not to involve the Echo but were nevertheless occurring alongside it (a video camera would have been useful!). A mitigating factor here, we found, was that participants routinely produced accounts of their interactions with the Echo and the relation of these to the current course of their activities in the home. Participants tended to audibly orient between (as well as embed) their interactional work with the Amazon Echo both alongside and ‘inside’ various other typical tasks of home life: cooking, watching TV, or eating dinner together.
To give a flavour of the sort of data we have collected, consider the short (but very rich) fragment below of a family using the Amazon Echo, taken from our paper. Here, Susan (the mother; all names changed) announces to the group her desire to play a particular game (called ‘Beat the Intro’) with the Amazon Echo, with a negative assessment from Liam (the son) overlapping Susan’s almost immediate instigation of the “wake word” to the Amazon Echo. Carl (the father) approves, inserting a quick ‘yeah’ in the pause between the wake word and the subsequent request by Susan:
However, the request fails, but not before Susan turns to Liam and instructs him to keep eating his food, with Carl providing support. Susan then repeats the request, implicating her assessment that the prior request has failed….
By adopting a CA approach, we were very quickly able to draw out some of the nuanced ways in which interaction with the Amazon Echo gets sequentially interleaved with other ongoing activities. Naturally, however, we also learned much about the character of our participants’ home life just by listening to these interactions with / around the Echo. Through just our audio recordings of Amazon Echo use, you begin to understand habits such as meal times, music tastes, TV interests, shopping habits, and so on. We have plans to release some of our audio for others to use in research, but ensuring we can maintain the confidentiality and anonymity of participants makes this a significant challenge.
In summary, we started out with a rather exciting challenge of trying to understand how interaction with new voice-based devices is practically achieved amidst the complex multiactivity setting of the home, facing a number of challenges in the process. Some of these we overcame with technology, such as running collected data through further automatic speech recognition software, and others, through analysing the audio recordings and drawing on EMCA.
Hopefully this guest blog provides some of the background to our work. If you’re interested in the outcomes of our analyses, we encourage you to check out the other posts linked to above, or the paper itself.