Virtual Assistant Interaction Process Overview

Will Styler - LIGN 6

Fundamentally, every interactional pair (e.g. you ask a question, it answers) with a virtual assistant will look relatively similar, and will be composed of roughly the same set of elements.

  1. Wake-Word Recognition
  2. Recording
  3. Transmission to Servers
  4. ASR Processing
  5. Linguistic processing of the text data
  6. Meaning Extraction
  7. Response Planning
  8. Information Retrieval and Command Implementation
  9. Text-to-Speech
  10. Transmission to Device and Playback
  11. Ongoing interaction
  12. Ethical Concerns

Although (12) isn’t strictly speaking a ‘part of the process’, it’s crucial that as you design any of these systems, you think about the ethical concerns involved, and this should be a dedicated section of your final paper.

Below, I’ll outline each of these steps in greater detail, highlighting some of the complexities involved with each.

In your final project (if you’ve chosen the ‘Design a Natural Language System’ approach), you’re going to want to address each of these steps in light of your very specific domain. In your write-up, you’ll need to pay particular attention to the ‘Things to consider’ raised below, addressing the relevant ones, and mentioning why the irrelevant ones aren’t relevant. Each of your projects and implementation approaches will have unique complexities, and I’ll want to see evidence that you’ve thought about those complexities in each of these steps. You’re welcome to use this as a template for the greater structure of your final project, but I’d still like interpretable prose, rather than bullet-by-bullet answers.

Now, step-by-step, assuming the query is “Alexa, will it rain today?”

Wake-Word Recognition

This is the ‘trigger’ which starts the recording: ‘Alexa’, ‘OK Google’, or ‘Hey Siri’.
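To make this concrete, here’s a minimal sketch (in Python) of the control flow involved: a small always-on loop scoring recent audio against a wake-word model. The score_wake_word classifier here is entirely hypothetical; real systems use specialized low-power, on-device models.

    import collections

    FRAME_MS = 30      # analyze audio in short ~30 ms frames
    THRESHOLD = 0.85   # confidence above which we 'wake up'

    def listen_for_wake_word(mic_frames, score_wake_word):
        """Score a rolling ~1 second buffer of audio against a wake-word model.

        mic_frames: an iterator of raw audio frames from the microphone.
        score_wake_word: a (hypothetical) classifier returning P(wake word).
        """
        buffer = collections.deque(maxlen=1000 // FRAME_MS)
        for frame in mic_frames:
            buffer.append(frame)
            if score_wake_word(list(buffer)) > THRESHOLD:
                return True  # hand the mic off to the recording stage
        return False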

Things to consider

Recording

Here, you’re capturing the human’s query (the audio immediately following the wake-word) acoustically in a reasonable form. This includes analog-to-digital sampling, as well as some filtering.
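For instance, using the third-party sounddevice library (one option among many), a fixed five-second, 16 kHz mono recording might look like the sketch below. Real assistants detect the end of your speech dynamically rather than using a fixed window.

    import sounddevice as sd  # third-party audio I/O library

    SAMPLE_RATE = 16000   # 16 kHz is a common rate for speech
    SECONDS = 5           # assume a fixed recording window, for simplicity

    # Analog-to-digital conversion: the microphone's continuous signal is
    # sampled 16,000 times per second into 16-bit integers.
    audio = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype='int16')
    sd.wait()  # block until the recording window is over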

Things to consider

Transmission to Servers

On most modern virtual assistants, every query is processed on a remote server, so your voice recording must be sent off to Apple/Amazon/Google/Microsoft/Nuance.
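In the simplest case, this is just an HTTP request. The endpoint and authentication scheme below are made up; every vendor has its own, and most stream audio as it’s recorded rather than uploading a finished file.

    import requests

    ASR_ENDPOINT = 'https://speech.example.com/v1/recognize'  # hypothetical

    with open('query.wav', 'rb') as f:
        response = requests.post(
            ASR_ENDPOINT,
            data=f.read(),
            headers={'Content-Type': 'audio/wav',
                     'Authorization': 'Bearer <your-api-token>'},
        )
    result = response.json()  # e.g. {'transcript': 'will it rain today'}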

Things to consider

Automatic Speech Recognition Processing

This turns the waveform into the text “Will it rain today”.
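As one sketch, the third-party speech_recognition library wraps several ASR engines behind a single interface; which engine you’d actually use (and how well it performs on your users’ speech) is a separate question.

    import speech_recognition as sr  # wraps several ASR backends

    recognizer = sr.Recognizer()
    with sr.AudioFile('query.wav') as source:
        audio = recognizer.record(source)  # load the whole recording

    # One of several available backends; accuracy varies by engine.
    text = recognizer.recognize_google(audio)
    print(text)  # -> 'will it rain today'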

Things to consider

Linguistic Processing of the Text Data

This takes the output from the ASR (“will it rain today”) and gathers linguistic information about the query. So, for instance, you might extract part-of-speech and syntactic information:

(S (MOD will) (NP it) (VP rain (ADV today)))

You might also get things like co-reference (e.g. “John(1) and I have a meeting(2) tomorrow. Email him(1) to remind him(1) about it(2)”), or verb sense information (e.g. “rain” is PropBank sense rain.01, ‘Pure Weather Phenomenon’).
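As one concrete option, the spaCy library will give you part-of-speech tags and dependency structure out of the box (the exact labels depend on the model version):

    import spacy

    nlp = spacy.load('en_core_web_sm')  # small English model, installed separately
    doc = nlp('Will it rain today')

    for token in doc:
        # e.g. 'Will' is tagged AUX, 'rain' VERB, 'today' NOUN,
        # with dependency links back to the root verb 'rain'
        print(token.text, token.pos_, token.dep_, token.head.text)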

Things to consider

Meaning Extraction (Semantic Parsing)

This is the component that maps elements of the linguistic representation to an actionable query. So, this is where we realize that this query is seeking information about the weather, and specifically, about rain. The query is made to fit a particular ‘weather request’ frame, with elements like ‘What time period do you want a forecast for?’ or ‘Where do you want to know about?’ or ‘What weather phenomenon are you interested in?’.

This also involves things like temporal reasoning, device location detection, and inference based on known data about the user (e.g. “when this human says ‘my wife’, he means contact #45ea ‘Jessica Styler’”).

Elements of the query are then normalized to a searchable format and fit into that frame.
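To make this concrete, the filled-in weather frame might look something like the structure below. Every slot name and value format here is invented for illustration; no vendor’s actual frame looks exactly like this.

    # Hypothetical normalized frame for 'Alexa, will it rain today?'
    weather_query = {
        'intent': 'weather.forecast',
        'phenomenon': 'rain',          # from the verb 'rain'
        'time_period': '2019-02-03',   # 'today', resolved by temporal reasoning
        'location': 'La Jolla, CA',    # inferred from device location
        'answer_type': 'yes_no',       # the user asked a polar question
    }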

For voice commands, this process will look similar, but instead of a question frame, you might have an action frame. If you say “Alexa, turn on the living room lights to 40%”, you’d have to parse down into a different frame, along the lines of the one sketched below.
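(A hypothetical rendering; the slot names and normalization choices are invented.)

    # Hypothetical action frame for 'turn on the living room lights to 40%'
    light_command = {
        'intent': 'device.set_state',
        'device_type': 'light',
        'device_group': 'living room',  # matched against the user's named rooms
        'action': 'turn_on',
        'brightness': 0.40,             # '40%' normalized to a 0-1 scale
    }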

Or, for an online ordering system, you might parse a command like “Hey, order me some 36″x34″ gray cargo pants from Mountain Khakis” into a frame like the one below.
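(Again, an invented structure, just to show the kinds of slots involved.)

    # Hypothetical order frame for the cargo pants request
    order_command = {
        'intent': 'commerce.order',
        'item': 'cargo pants',
        'brand': 'Mountain Khakis',
        'color': 'gray',
        'size': {'waist_in': 36, 'inseam_in': 34},  # 36"x34", normalized
        'account': '<current user>',
    }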

The key thing to do here is to examine the kinds of commands you’ll need, and then the kinds of information the human is giving you, and then describe what kinds of things you’ll need to pull from the language to get it.

Things to consider

Response Planning

At this stage, the computer needs to figure out what information to return to the query or what action to take. When asked ‘Will it rain today?’, do you want it to simply reply ‘Yes’? Or should it frame the answer in a carrier sentence (e.g. “Yes, it will rain today”)? Should it provide additional information (e.g. percent likelihood of rain, approximate time of rain start, expected amount of rain)?

This stage allows you to build a scaffolding of a response, including variables (marked with $) which stand in for things which will be retrieved later. Something like “$yesnoanswer, rain is predicted in $location starting at $precipstarttime today, and they’re predicting $InchesOfRain inches of rain will fall”.

Note, though, that you’ll need multiple answer scaffoldings depending on the response. You wouldn’t want to reply “No, rain is predicted in La Jolla at never today, and they’re predicting zero inches of rain”.
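Python’s built-in string.Template happens to use the same $variable convention, so a sketch of this scaffold-selection logic might look like the following (the field names in the forecast are invented):

    from string import Template

    # Two scaffoldings, chosen based on the retrieved answer
    RAIN_YES = Template("Yes, rain is predicted in $location starting at "
                        "$precipstarttime today, and they're predicting "
                        "$InchesOfRain inches of rain will fall.")
    RAIN_NO = Template("No, no rain is expected in $location today.")

    def plan_response(forecast):
        """Pick the right scaffold for this forecast, then fill in its blanks."""
        if forecast['will_rain']:
            return RAIN_YES.substitute(location=forecast['location'],
                                       precipstarttime=forecast['start_time'],
                                       InchesOfRain=forecast['inches'])
        return RAIN_NO.substitute(location=forecast['location'])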

For commands, you’ll want to figure out both how to carry out the action (e.g. sending a command to the Philips Hue smart bulb servers to trigger the lights in ‘Living Room’ associated with this account to turn on at 40% intensity), and how to respond verbally (e.g. “OK!”, “Sure thing, I’ve turned on the lights”, or just with silence).

Finally, you might need to follow up. If the person says “Turn on a light”, you might have to ask them “Which light do you want me to turn on?”

Things to consider

Information Retrieval and Command Implementation

This is the ‘boring’ part which is just computers talking to computers. Your assistant will take the schematic data produced above (e.g. location, desired phenomenon, time) and query an external (or internal) database to fill in the ‘blanks’ in the response above:
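For the weather query, that might amount to a single structured request to some weather service. The URL and response fields below are, of course, made up.

    import requests

    resp = requests.get('https://weather.example.com/v1/forecast',  # hypothetical
                        params={'location': 'La Jolla, CA',
                                'date': '2019-02-03',
                                'phenomenon': 'rain'})
    forecast = resp.json()
    # e.g. {'will_rain': True, 'location': 'La Jolla',
    #       'start_time': '2:00 PM', 'inches': 0.3}
    # ... exactly the values needed to fill the $variables in the scaffolding.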

Or, in the case of a command, it’ll issue API calls to whatever service to turn on the lights, or start music playing via Spotify, or what have you. Importantly, none of this is natural language at all.

For LIGN 6 final projects, you can oversimplify this process, and just assume that there’s a server someplace which will take your detailed queries and return detailed answers, or will implement the requested actions. Just describe the queries and the ideal responses, but feel free to be a bit handwavey about where you’d get (e.g.) detailed weather information, the closest restaurant with a four-star rating, a satellite picture of a certain region, or how you’d inform Philips to turn on Will’s living room lights. I care more about your natural language processing than the details of getting data from elsewhere.

Things to consider

Text-to-Speech

You’ll still need to turn your results back into spoken language. Even if you have a text template for the weather-report sentence, you’ll need to turn things like “03FEB2019 18:53” or a binary “1” response back into natural language text (text analysis). You’ll also need to think about prosody at this point.
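Here’s a minimal sketch of that text-analysis step for the timestamp case, using Python’s standard datetime parsing:

    from datetime import datetime

    def speakable_time(raw):
        """Turn a machine timestamp like '03FEB2019 18:53' into readable text."""
        dt = datetime.strptime(raw, '%d%b%Y %H:%M')
        hour = dt.hour % 12 or 12
        ampm = 'PM' if dt.hour >= 12 else 'AM'
        return dt.strftime(f'%B {dt.day}, {hour}:%M {ampm}')

    print(speakable_time('03FEB2019 18:53'))  # -> 'February 3, 6:53 PM'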

Things to consider

Transmission to Device and Playback

The audio of Alexa’s response will need to be returned to the device and played back. This is generally uninteresting, but ask yourself whether there are any specific elements of playback which are relevant to your use case.

Things to consider

Ongoing Interaction

It’s possible that after the above steps, your query or action is ‘complete’, and your system can shut down. But it might also require a further response from the human: perhaps there’s an expected followup, because your system asked a clarification question, or because the process is more interactional.
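One simple way to model this is to check the frame for unfilled required slots before acting, as in this sketch (slot names borrowed from the hypothetical light-command frame above):

    REQUIRED = ('device_type', 'device_group', 'action')

    def next_move(frame):
        """Ask a follow-up if a required slot is missing; otherwise act."""
        for slot in REQUIRED:
            if slot not in frame:
                # Keep listening: we expect an answer, not a new wake-word
                return ('ask', f"Which {slot.replace('_', ' ')} did you mean?")
        return ('execute', frame)

    # 'Turn on a light' fills device_type and action but not device_group,
    # so next_move() would return a clarification question.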

Things to consider

Ethical Concerns

Finally, as you’re doing all of this, it’s crucial to think about the ethical problems raised by these systems and the companies using them. In your paper, as in practice, you’ll need to address concerns related to bias, privacy, harm, and dual use.

Things to consider

Additional Elements

It’s quite possible that your specific domain might have additional steps or requirements. Please don’t feel obligated to limit yourself to the framework laid out here. You may want to include additional information, and you’re welcome to do so!