Fundamentally, every interactional pair (e.g. you ask a question, it
answers) with a virtual assistant will look relatively similar, and will
be composed of roughly the same set of elements.
- Wake-Word Recognition
- Recording
- Transmission to Servers
- ASR Processing
- Linguistic Processing of the Text Data
- Meaning Extraction
- Response Planning
- Information Retrieval and Command Implementation
- Text-to-Speech
- Transmission to Device and Playback
- Ongoing Interaction
- Ethical Concerns
Although the last item (Ethical Concerns) isn't strictly speaking a
'part of the process', it's crucial that as you design any of these
systems, you think about the ethical concerns involved, and this should
be a dedicated section of your final paper.
Below, I’ll outline each of these steps in greater detail,
highlighting some of the complexities involved with each.
In your final project (if you’ve chosen the ‘Design a Natural
Language System’ approach), you’re going to want to address each of
these steps in light of your very specific domain. In your
write-up, you’ll need to pay particular attention to the ‘Things to
consider’ raised below, addressing the relevant ones, and mentioning why
the irrelevant ones aren’t relevant. Each of your projects and
implementation approaches will have unique complexities, and I’ll want
to see evidence that you’ve thought about those complexities in each of
these steps. You’re welcome to use this as a template for the greater
structure of your final project, but I’d still like
interpretable prose, rather than bullet-by-bullet answers.
Now, step-by-step, assuming the query is “Alexa, will it rain
today?”
Wake-Word Recognition
This is the ‘trigger’ which starts the recording. ‘Alexa’, or ‘OK
Google’ or ‘Hey Siri’.
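If you'd like to make this concrete, here's a minimal sketch of a
wake-word loop in Python. The per-frame scorer is a stand-in (real
systems run a small on-device keyword-spotting model over each audio
frame), and the threshold and window sizes are illustrative assumptions,
not recommendations:

# A minimal wake-word loop (a sketch, not a production system)
import collections

THRESHOLD = 0.85   # tune to trade accidental activations against misses

def heard_wakeword(frames, score):
    """score(frame) stands in for a small keyword-spotting model
    returning P(wake-word) for one short (~30 ms) audio frame."""
    recent = collections.deque(maxlen=10)   # smooth over ~300 ms
    for frame in frames:
        recent.append(score(frame))
        if sum(recent) / len(recent) > THRESHOLD:
            return True   # trigger: start recording the query
    return False

# Toy demo with fake scores standing in for real model output
print(heard_wakeword(range(12), lambda frame: 0.9))   # True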
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- Do you need to use a wake-word, or will you have an always-listening
system?
- Ponder the privacy issues of submitting chunks of
non-computer-directed conversation to ASR!
- How do you minimize accidental activation?
- Is your wake-word too similar to frequent natural language
words?
- Separating the wake-word from noise
- Recognizing multiple voices
- Recognizing distant voices
- Do you allow multiple wake-words?
Recording
Here, you’re capturing the human’s query (the audio immediately
following the wake-word) acoustically in a reasonable form. This
includes analog-to-digital sampling, as well as some filtering.
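To make this concrete, here's a short sketch using the Python
sounddevice and scipy packages to record 16 kHz mono audio and
high-pass filter out low-frequency noise; the rate, duration, and
cutoff are assumptions you'd tune for your own use case:

# Sketch: capture 5 seconds of audio, then high-pass filter it
import sounddevice as sd
from scipy.signal import butter, lfilter

RATE = 16000   # 16 kHz is a common sampling rate for speech

audio = sd.rec(int(5 * RATE), samplerate=RATE, channels=1)
sd.wait()   # block until the recording finishes

# 4th-order Butterworth high-pass at 100 Hz removes fan/HVAC rumble
b, a = butter(4, 100, btype="highpass", fs=RATE)
filtered = lfilter(b, a, audio[:, 0])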
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- Sampling rate?
- Filtering to remove non-target noise? (e.g. a nearby loud computer
fan)
Transmission to Servers
On most modern virtual assistants, all queries are processed on a
remote server. So, your voice recording must be sent off to
Apple/Amazon/Google/Microsoft/Nuance.
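A sketch of what that hand-off might look like, using Python's requests
library; the endpoint URL and payload format below are entirely
hypothetical:

# Sketch: ship a compressed recording to a (hypothetical) ASR endpoint
import requests

with open("query.flac", "rb") as f:   # FLAC: lossless, smaller than WAV
    resp = requests.post(
        "https://asr.example.com/v1/recognize",   # hypothetical server
        data=f,
        headers={"Content-Type": "audio/flac"},
        timeout=10,   # fail fast on slow data connections
    )
resp.raise_for_status()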
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- Airplane mode?
- Slow data connections?
- Balancing sending high quality audio (e.g. less compressed, higher
sampling rate, larger files) with low data usage
Automatic Speech Recognition Processing
This turns the waveform into the text “Will it rain today”.
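One easy way to experiment with this step is Python's SpeechRecognition
package, which wraps several off-the-shelf ASR services. A minimal
sketch, assuming you've already got a recorded query on disk:

# Sketch: transcribe a recorded query with the SpeechRecognition package
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("query.wav") as source:
    audio = recognizer.record(source)

# Uses Google's free web ASR; other recognize_* backends are available
print(recognizer.recognize_google(audio))   # e.g. "will it rain today"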
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- You’ll want to try sample sentences or queries against an existing
dictation service (e.g. Dictation.io or your phone's built-in ASR)
- What words does it need to be able to recognize?
- Is there technical jargon? Unusual artist names?
- You’ll want to come up with a list of some very frequent vocab
items, along with their ARPAbet pronunciations
- You can use IPA as well if you’re familiar, but for most folks,
ARPAbet will be easier
- Fuzzy matching (e.g. turning "Play songs by the bed sit in for me"
into "Play songs by the Bedsit Infamy")
- Who’s talking?
- Regional or international accents?
- Which language(s) are being used?
- Are there any homophones of great importance?
- What kind of training data would be needed to build (or improve) the
ASR system, if your task is very different from the norm?
Linguistic Processing of the Text Data
This takes the output from the ASR (“will it rain today”) and gathers
linguistic information about the query. So, for instance, you might
extract part-of-speech and syntactic information:
(S (MOD will) (NP it) (VP rain (ADV today)))
You might also get things like co-reference (e.g. “John(1) and I have
a meeting(2) tomorrow. Email him(1) to remind him(1) about it(2)”), or
verb sense information (e.g. “rain” is PropBank sense rain.01
(‘Pure Weather Phenomenon’))
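If you want to inspect this kind of output yourself, here's a minimal
sketch using spaCy (one off-the-shelf parser among several) to pull POS
tags and a dependency parse; the exact labels you get will depend on
the model:

# Sketch: POS tags and a dependency parse with spaCy
# (requires `pip install spacy` plus the en_core_web_sm model)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Will it rain today?")

for token in doc:
    # e.g. Will/AUX/aux, rain/VERB/ROOT, today/NOUN/npadvmod
    print(token.text, token.pos_, token.dep_, token.head.text)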
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- You'll want to try sample sentences or queries against an existing
syntactic parser (e.g. the Stanford Parser or the CMU Phrase Parser) to
check performance on syntax and POS tagging.
- You'll also want to try sample sentences or queries against an online
coreference tool (e.g. Neural Coref).
- What kind of training data would you want to use to build the
language models for this system?
- Emails, twitter, NY Times?
- Would you need to create your own corpus?
- If you’re basing this on a corpus of novel data, please include
a sample of what such data would look like in your write-up!
- Do you expect full, grammatical sentences?
- Are you likely to have odd syntactic constructions?
- Do you know which language(s) will be used?
- Is there unusual jargon?
- Do you expect uncommon word senses (e.g. "neutralize" a person, in
the military context, or "involve" in the medical sense of "structure
is included within a growing tumor")?
Meaning Extraction
This is the component that maps elements of the linguistic
representation to an actionable query. So, this is where we realize that
this query is seeking information about the weather, and specifically,
about rain. The query is made to fit a particular ‘weather request’
frame, with elements like ‘What time period do you want a forecast for?’
or ‘Where do you want to know about?’ or ‘What weather phenomenon are
you interested in?’.
This also involves things like temporal reasoning, device location
detection, and inference based on known data about the user (e.g. "when
this human says 'my wife', he means contact #45ea 'Jessica Styler'").
Elements of the query are then normalized to a searchable format and
fit into that frame.
- Period of forecast: 03FEB2019 15:19:12 - 03FEB2019 23:59:59
- Phenomenon of interest: ‘rain’
- Location of interest: ‘92161’ or ‘32°52’44.9”N 117°14’26.9”W’
For voice commands, this process will look similar,
but instead of a question frame, you might have an action frame. If you
say “Alexa, turn on the living room lights to 40%”, you’d have to parse
down into a different frame, along the lines of:
- Action: Manipulate Lights
- Relevant Service: Philips Hue Lighting
- Main action: Activate Lights
- Relevant Lighting zone: ‘Living Room’ on Hue account
will@savethevowels.org
- Color: ‘Same as previous/unspecified’
- Intensity: 0.4
Or, for an online ordering system, you might parse a command like
“Hey, order me some 36”x34” gray cargo pants from Mountain Khakis”
into…
- Action: Purchase
- Relevant Service: BuySomePants.com
- Main Action: Purchase
- Item Type: Clothing, Pants, Cargo
- Size: 36x34
- Color: gray
- Brand: Mountain Khakis
- Recipient: User
The key thing to do here is to examine the kinds of commands you’ll
need, and then the kinds of information the human is giving you, and
then describe what kinds of things you’ll need to pull from the language
to get it.
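Here's a deliberately naive sketch of that frame-filling process, using
keyword matching; the frame fields, patterns, and defaults are
illustrative assumptions (a real system would use trained intent
classification and slot-filling):

# Sketch: naive keyword-based frame filling for weather queries
import re
from dataclasses import dataclass

@dataclass
class WeatherFrame:
    phenomenon: str = "any"      # rain, snow, wind...
    period: str = "today"        # normalized to a datetime range later
    location: str = "device"     # default: wherever the device is

def parse_weather_query(text):
    frame = WeatherFrame()
    if m := re.search(r"\b(rain|snow|wind|fog)\b", text, re.I):
        frame.phenomenon = m.group(1).lower()
    if m := re.search(r"\b(today|tonight|tomorrow)\b", text, re.I):
        frame.period = m.group(1).lower()
    return frame

print(parse_weather_query("Will it rain today?"))
# WeatherFrame(phenomenon='rain', period='today', location='device')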
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- What queries/commands do you expect?
- What sorts of elements does each type of query/command involve?
- For your project, you’ll want to design frames for a few of the
most common kinds of queries
- Which of these elements are mandatory?
- What do you do when these elements are missing?
- What additional information would you need to ask a human for to
answer or take action?
- What words or phrases help us understand which query/command is
being used?
- How can it tell you’re asking about the weather or adjusting the
lights?
- What are the different sorts of phrasings you might expect a human
to use to make this query/command?
- “Is it going to rain?” “Should I bring an umbrella?” “Do we expect
precipitation?” “What’s tonight’s weather?”
- Could your system handle weird things like “My wife wants to know if
she’s going to want an umbrella?”
- What kind of information would you need to be able to get the answer
from an external (or internal) data source?
- What kind of inference will your system need to do to
properly respond? For example…
- Knowing that ‘today’ means ‘from the time of the query until the end
of this calendar day’
- Knowing that “Should I bring an umbrella?” is asking about the
probability of rain
- Knowing that the question is asking about the location of the
user/device, not some other arbitrary point.
- Knowing that 'Living room lights' means "the lights located in the
'living room' zone of the account-holder's linked Hue account"
- What assumptions are built into the query as phrased?
- Which lights?
- What location?
- When do you want the lights turned on?
- Is the human also interested in whether it's going to snow instead of
rain?
Response Planning
At this stage, the computer needs to figure out what information to
return to the query or what action to take. When asked ‘Will it rain
today?’, do you want it to simply reply ‘Yes’? Or should it frame the
answer in a carrier sentence (e.g. “Yes, it will rain today”)? Should it
provide additional information (e.g. percent likelihood of rain,
approximate time of rain start, expected amount of rain)?
This stage allows you to build a scaffolding of a response, including
variables (marked with $) which stand in for things which will be
retrieved later. Something like "$yesnoanswer, rain is predicted in
$location starting at $precipstarttime today, and they're predicting
$InchesOfRain inches of rain will fall".
Note, though, that you’ll need multiple answer scaffoldings depending
on the response. You wouldn’t want to reply “No, rain is predicted in La
Jolla at never today, and they’re predicting zero inches of rain”.
For commands, you’ll want to figure out both how to carry out the
action (e.g. sending a command to the Philips Hue smart bulb servers to
trigger the lights in ‘Living Room’ associated with this account to turn
on at 40% intensity), and how to respond verbally (e.g. “OK!”, “Sure
thing, I’ve turned on the lights”, or just with silence).
Finally, you might need to follow up. If the person says “Turn on a
light”, you might have to ask them “Which light do you want me to turn
on?”
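In code, you might sketch this as a small decision tree that picks a
template before filling it, leaving $variables for the retrieval step;
the wording and variable names here are just illustrations:

# Sketch: choose a response template before filling it, so you never
# produce "No, rain is predicted at never today"
from string import Template

RAIN = Template("$yesnoanswer, rain is predicted in $location starting "
                "at $precipstarttime today, and they're predicting "
                "$inchesofrain inches of rain.")
NO_RAIN = Template("$yesnoanswer, no rain is expected in $location today.")

def pick_template(rain_expected):
    # A real system would branch further (snow, fog, mixed forecasts...)
    return RAIN if rain_expected else NO_RAIN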
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- How should the system respond to the queries above?
- Create response templates for some of the most likely
queries/commands
- What phrasing(s) should be used?
- What information variables need to be contained in the
response?
- Are there multiple response types depending on the results of the
information requested?
- Think “No rain” vs. “Rain” responses
- It’s not a bad idea to think about this as a decision tree (“If
there is rain, then this response. If snow, then this one. If none, then
this.”)
- What actions would the system need to carry out?
- What information is needed to carry out the proper actions?
- What is the desired outcome from this command?
- What verbal response or confirmation do you need the system to
provide?
Information Retrieval and Command Implementation
This is the 'boring' part which is just computers talking to
computers. Your assistant will take the schematic data produced above
(e.g. location, desired phenomenon, time) and query an external (or
internal) database to fill in the ‘blanks’ in the response above:
$yesnoanswer -> 1 (or 0, if no rain)
$location -> "La Jolla"
$precipstarttime -> 03FEB2019 18:53
$inchesofrain -> 0.07
Or, in the case of a command, it’ll issue API calls to whatever
service to turn on the lights, or start music playing via Spotify, or
what have you. Importantly, this isn’t natural language: This is
computers talking to computers.
For LIGN 6 final projects, you can oversimplify this process, and
just assume that there’s a server someplace which will take your
detailed queries and return detailed answers, or will implement the
requested actions. Just describe the queries and the ideal
responses, but feel free to be a bit handwavey about where you’d get
(e.g.) detailed weather information, the closest restaurant with a
four-star rating, a satellite picture of a certain region, or how you’d
inform Philips to turn on Will's living room lights. I care more about
your natural language processing than the details of getting data from
elsewhere.
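In that handwavey spirit, here's a sketch of filling the response's
blanks from a weather API; the endpoint, parameters, and JSON field
names are all made up for illustration:

# Sketch: fill the template's blanks from a hypothetical weather API
import requests

resp = requests.get(
    "https://weather.example.com/v1/forecast",   # hypothetical endpoint
    params={"zip": "92161", "phenomenon": "rain", "period": "today"},
    timeout=10,
)
forecast = resp.json()   # field names below are invented for this sketch

slots = {
    "yesnoanswer": "Yes" if forecast["rain_expected"] else "No",
    "location": forecast["place_name"],          # e.g. "La Jolla"
    "precipstarttime": forecast["precip_start"], # e.g. "03FEB2019 18:53"
    "inchesofrain": forecast["precip_inches"],   # e.g. 0.07
}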
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- Where could you, in theory, get the data you need to answer queries?
- What information will need to be available to your assistant to
answer these questions?
- What kind of websites exist (or would need to exist) that might have
these data?
- If the data are coming from your system itself, what data would you
need to store or generate to be able to make this function?
- What sort (roughly) of commands or queries would you need to
send?
- What sorts of service(s) will you need to interact with to take
these actions?
Text-to-Speech
You’ll still need to turn your results back into spoken language.
Even if you have a text template for the sentence for the weather
report, you’ll need to turn things like “03FEB2019 18:53” or a binary
“1” response back into natural language text (text analysis). You’ll
also need to think about prosody at this point.
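For instance, a tiny sketch of that text-analysis step, turning a
machine-readable timestamp into something speakable (the input format
matches the examples above; a fuller system would also spell numbers
out for the synthesizer):

# Sketch: turn '03FEB2019 18:53' into speakable text
from datetime import datetime

def speakable_time(stamp):
    dt = datetime.strptime(stamp, "%d%b%Y %H:%M")
    return dt.strftime("%I:%M %p").lstrip("0")   # '6:53 PM'

print(speakable_time("03FEB2019 18:53"))   # 6:53 PM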
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- What or who do you want the text-to-speech voice to sound like?
Which dialect?
- You'll want to try sample responses in existing Text-to-Speech
systems like Google's TTS or your computer's built-in TTS system to see
where common errors will be made.
- What words would need to be added to your text-to-speech dictionary?
- Cross-check some words in your use-case against CMUDict
- Are there specific person or place or product names you’d want to
have pre-recorded or added to the dictionary?
- Are there pre-recordable phrases that will help improve naturalness
in unit-selection synthesis?
- Think “The National Weather Service has issued a High Surf Warning
for”
- Are there any specific prosodic 'tunes' required to make this sound
reasonable?
- Question intonation? List intonation?
- Are there numbers which you’ll need to handle in a certain way?
- How much detail should be included in date/time playback?
Transmission to Device and Playback
The audio of Alexa’s response will need to be returned to the device
and played back. This is generally uninteresting, but ask yourself if
there are any specific elements of playback which are relevant to your
use case.
Things to consider
- How would you handle a situation where the connectivity is poor?
- Do you need to follow up? Can the command fail silently?
- Is the playback volume important?
- Would you want Alexa to come on full-blast volume if you’re asking
her to turn on the lights at 3am?
Ongoing Interaction
It’s possible that after the above steps, your query or action may be
‘complete’, and your system can shut down. But it also might require
further response from the human. Perhaps there's an expected followup
(because your system asked a clarification question, or because the
process is more interactional).
Things to consider
- Has the interaction ended?
- Will your system begin listening again for a response?
- How long will you keep listening?
Ethical Concerns
Finally, as you’re doing all of this, it’s crucial to think about the
ethical problems raised by these systems and the companies using them.
In your paper, as well as in practice, you'll need to address concerns
related to bias, privacy, harm, and dual use.
Things to consider
- How can you ensure equity? That is…
- Will the system work with different accents and dialects?
- Will the system treat all people equally?
- Will the system’s training data result in inherent bias in the
system?
- Will the system’s decision making favor some groups over
others?
- How can you be conscious of privacy?
- What personally identifiable or sensitive data (PID, for short)
must be stored or collected for the system to function?
- What kinds of PID can you avoid storing or collecting? Put
differently, what steps can you take to proactively anonymize your users
and avoid being responsible for PID?
- Imagine your servers have been completely hacked and owned, and
somebody now has unfettered access to all of your data and is posting it
on the internet. What harm would this cause to your users? What can you
do to reduce the harm of this?
- Note that “improving and hardening security” is not an answer here,
as everybody can be hacked.
- How can you let users know what your privacy practices look like in
a transparent way?
- How can you avoid harm to your users?
- In what way(s) could your system cause direct physical harm
to users? How can this be avoided?
- Along the lines of "The ship navigation system mishears 'Mars' as
'Maw' and warps the occupants into a black hole".
- In what way(s) could your system cause emotional harm to
users? How can this be avoided?
- Think along the lines of the emotional consequences of data loss for
a ‘virtual romantic partner’ app, or a bug in a virtual psychologist
resulting in an erroneously harsh response to vulnerability, or a
children’s virtual assistant accidentally displaying a scene of intense
movie violence.
- In what way(s) could your system cause social harm to
users? How can this be avoided?
- Think about an app accidentally ‘outing’ elements of somebody’s
sexual life or identity in a response to a query, or a virtual assistant
scheduling appointments with friends but failing to inform the user,
resulting in no-shows.
- Does your system have the potential to have a secondary use which
could do harm?
- If you were forced to give exceptional access to a government
(e.g. “Hand us the keys to your servers if you’d like to stay in
business and keep your passport”), what harms could it do?
- Are there other potential negative consequences for society or
civilization of this technology being developed?
Additional Elements
It’s quite possible that your specific domain might have additional
steps or requirements. Please don’t feel obligated to limit yourself to
the framework laid out here. You may want to include additional
information, and you’re welcome to do so!