Fundamentally, every interactional pair (e.g. you ask a question, it
answers) with a virtual assistant will look relatively similar, and will
be composed of roughly the same set of elements.
- Wake-Word Recognition
- Recording
- Transmission to Servers
- ASR Processing
- Linguistic Processing of the Text Data
- Meaning Extraction
- Response Planning
- Information Retrieval and Command Implementation
- Text-to-Speech
- Transmission to Device and Playback
- Ongoing Interaction
- Ethical Concerns
Although the last item (Ethical Concerns) isn't strictly speaking a
'part of the process', it's crucial that as you design any of these
systems, you think about the ethical concerns involved, and this should
be a dedicated section of your final paper.
Below, I’ll outline each of these steps in greater detail,
highlighting some of the complexities involved with each.
In your final project (if you’ve chosen the ‘Design a Natural
Language System’ approach), you’re going to want to address each of
these steps in light of your very specific domain. In your
write-up, you’ll need to pay particular attention to the ‘Things to
consider’ raised below, addressing the relevant ones, and mentioning why
the irrelevant ones aren’t relevant. Each of your projects and
implementation approaches will have unique complexities, and I’ll want
to see evidence that you’ve thought about those complexities in each of
these steps. You’re welcome to use this as a template for the greater
structure of your final project, but I’d still like
interpretable prose, rather than bullet-by-bullet answers.
Now, step-by-step, assuming the query is “Alexa, will it rain
today?”
Wake-Word Recognition
This is the ‘trigger’ which starts the recording. ‘Alexa’, or ‘OK
Google’ or ‘Hey Siri’.
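If you'd like to make this concrete, here's a minimal sketch of a
wake-word loop in Python. The per-frame scorer is a stand-in (real
systems run a small on-device keyword-spotting model over each audio
frame), and the threshold and window sizes are illustrative assumptions,
not recommendations:

# A minimal wake-word loop (a sketch, not a production system)
import collections

THRESHOLD = 0.85   # tune to trade accidental activations against misses

def heard_wakeword(frames, score):
    """score(frame) stands in for a small keyword-spotting model
    returning P(wake-word) for one short (~30 ms) audio frame."""
    recent = collections.deque(maxlen=10)   # smooth over ~300 ms
    for frame in frames:
        recent.append(score(frame))
        if sum(recent) / len(recent) > THRESHOLD:
            return True   # trigger: start recording the query
    return False

# Toy demo with fake scores standing in for real model output
print(heard_wakeword(range(12), lambda frame: 0.9))   # True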
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- Do you need to use a wake-word, or will you have an always-listening
system?
- Ponder the privacy issues of submitting chunks of
non-computer-directed conversation to ASR!
- How do you minimize accidental activation?
- Is your wake-word too similar to frequent natural language
words?
- Separating the wake-word from noise
- Recognizing multiple voices
- Recognizing distant voices
- Do you allow multiple wake-words?
Recording
Here, you’re capturing the human’s query (the audio immediately
following the wake-word) acoustically in a reasonable form. This
includes analog-to-digital sampling, as well as some filtering.
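To make this concrete, here's a short sketch using the Python
sounddevice and scipy packages to record 16 kHz mono audio and
high-pass filter out low-frequency noise; the rate, duration, and
cutoff are assumptions you'd tune for your own use case:

# Sketch: capture 5 seconds of audio, then high-pass filter it
import sounddevice as sd
from scipy.signal import butter, lfilter

RATE = 16000   # 16 kHz is a common sampling rate for speech

audio = sd.rec(int(5 * RATE), samplerate=RATE, channels=1)
sd.wait()   # block until the recording finishes

# 4th-order Butterworth high-pass at 100 Hz removes fan/HVAC rumble
b, a = butter(4, 100, btype="highpass", fs=RATE)
filtered = lfilter(b, a, audio[:, 0])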
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- Sampling rate?
- Filtering to remove non-target noise? (e.g. a nearby loud computer
fan)
Transmission to Servers
On most modern virtual assistants, all queries are processed on a
remote server. So, your voice recording must be sent off to
Apple/Amazon/Google/Microsoft/Nuance.
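A sketch of what that hand-off might look like, using Python's requests
library; the endpoint URL and payload format below are entirely
hypothetical:

# Sketch: ship a compressed recording to a (hypothetical) ASR endpoint
import requests

with open("query.flac", "rb") as f:   # FLAC: lossless, smaller than WAV
    resp = requests.post(
        "https://asr.example.com/v1/recognize",   # hypothetical server
        data=f,
        headers={"Content-Type": "audio/flac"},
        timeout=10,   # fail fast on slow data connections
    )
resp.raise_for_status()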
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- Airplane mode?
- Slow data connections?
- Balancing sending high quality audio (e.g. less compressed, higher
sampling rate, larger files) with low data usage
Automatic Speech Recognition Processing
This turns the waveform into the text “Will it rain today”.
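One easy way to experiment with this step is Python's SpeechRecognition
package, which wraps several off-the-shelf ASR services. A minimal
sketch, assuming you've already got a recorded query on disk:

# Sketch: transcribe a recorded query with the SpeechRecognition package
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("query.wav") as source:
    audio = recognizer.record(source)

# Uses Google's free web ASR; other recognize_* backends are available
print(recognizer.recognize_google(audio))   # e.g. "will it rain today"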
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- You’ll want to try sample sentences or queries against an existing
dictation service (e.g. Dictation.io or your phone's built-in ASR)
- What words does it need to be able to recognize?
- Is there technical jargon? Unusual artist names?
- You’ll want to come up with a list of some very frequent vocab
items, along with their ARPAbet pronunciations
- You can use IPA as well if you’re familiar, but for most folks,
ARPAbet will be easier
- Fuzzy matching (e.g. turning "Play songs by the bed sit in for me"
into "Play songs by the Bedsit Infamy")
- Who’s talking?
- Regional or international accents?
- Which language(s) are being used?
- Are there any homophones of great importance?
- What kind of training data would be needed to build (or improve) the
ASR system, if your task is very different from the norm?
Linguistic Processing of the Text Data
This takes the output from the ASR (“will it rain today”) and gathers
linguistic information about the query. So, for instance, you might
extract part-of-speech and syntactic information:
(S (MOD will) (NP it) (VP rain (ADV today)))
You might also get things like co-reference (e.g. “John(1) and I have
a meeting(2) tomorrow. Email him(1) to remind him(1) about it(2)”), or
verb sense information (e.g. “rain” is PropBank sense rain.01
(‘Pure Weather Phenomenon’))
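If you want to inspect this kind of output yourself, here's a minimal
sketch using spaCy (one off-the-shelf parser among several) to pull POS
tags and a dependency parse; the exact labels you get will depend on
the model:

# Sketch: POS tags and a dependency parse with spaCy
# (requires `pip install spacy` plus the en_core_web_sm model)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Will it rain today?")

for token in doc:
    # e.g. Will/AUX/aux, rain/VERB/ROOT, today/NOUN/npadvmod
    print(token.text, token.pos_, token.dep_, token.head.text)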
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- You'll want to try sample sentences or queries against an existing
syntactic parser (e.g. the Stanford Parser or the CMU Phrase Parser) to
check performance on syntax and POS tagging.
- You'll also want to try sample sentences or queries against an online
coreference tool (e.g. Neural Coref).
- What kind of training data would you want to use to build the
language models for this system?
- Emails, twitter, NY Times?
- Would you need to create your own corpus?
- If you’re basing this on a corpus of novel data, please include
a sample of what such data would look like in your write-up!
- Do you expect full, grammatical sentences?
- Are you likely to have odd syntactic constructions?
- Do you know which language(s) will be used?
- Is there unusual jargon?
- Do you expect uncommon word senses (e.g. "neutralize" a person, in
the military context, or "involve" in the medical sense of "structure
is included within a growing tumor")?
Meaning Extraction
This is the component that maps elements of the linguistic
representation to an actionable query. So, this is where we realize that
this query is seeking information about the weather, and specifically,
about rain. The query is made to fit a particular ‘weather request’
frame, with elements like ‘What time period do you want a forecast for?’
or ‘Where do you want to know about?’ or ‘What weather phenomenon are
you interested in?’.
This also involves things like temporal reasoning, device location
detection, and inference based on known data about the user (e.g. "when
this human says 'my wife', he means contact #45ea 'Jessica Styler'").
Elements of the query are then normalized to a searchable format and
fit into that frame.
- Period of forecast: 03FEB2019 15:19:12 - 03FEB2019 23:59:59
- Phenomenon of interest: ‘rain’
- Location of interest: ‘92161’ or ‘32°52’44.9”N 117°14’26.9”W’
For voice commands, this process will look similar,
but instead of a question frame, you might have an action frame. If you
say “Alexa, turn on the living room lights to 40%”, you’d have to parse
down into a different frame, along the lines of:
- Action: Manipulate Lights
- Relevant Service: Philips Hue Lighting
- Main action: Activate Lights
- Relevant Lighting zone: ‘Living Room’ on Hue account
will@savethevowels.org
- Color: ‘Same as previous/unspecified’
- Intensity: 0.4
Or, for an online ordering system, you might parse a command like
“Hey, order me some 36”x34” gray cargo pants from Mountain Khakis”
into…
- Action: Purchase
- Relevant Service: BuySomePants.com
- Main Action: Purchase
- Item Type: Clothing, Pants, Cargo
- Size: 36x34
- Color: gray
- Brand: Mountain Khakis
- Recipient: User
The key thing to do here is to examine the kinds of commands you’ll
need, and then the kinds of information the human is giving you, and
then describe what kinds of things you’ll need to pull from the language
to get it.
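Here's a deliberately naive sketch of that frame-filling process, using
keyword matching; the frame fields, patterns, and defaults are
illustrative assumptions (a real system would use trained intent
classification and slot-filling):

# Sketch: naive keyword-based frame filling for weather queries
import re
from dataclasses import dataclass

@dataclass
class WeatherFrame:
    phenomenon: str = "any"      # rain, snow, wind...
    period: str = "today"        # normalized to a datetime range later
    location: str = "device"     # default: wherever the device is

def parse_weather_query(text):
    frame = WeatherFrame()
    if m := re.search(r"\b(rain|snow|wind|fog)\b", text, re.I):
        frame.phenomenon = m.group(1).lower()
    if m := re.search(r"\b(today|tonight|tomorrow)\b", text, re.I):
        frame.period = m.group(1).lower()
    return frame

print(parse_weather_query("Will it rain today?"))
# WeatherFrame(phenomenon='rain', period='today', location='device')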
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- What queries/commands do you expect?
- What sorts of elements does each type of query/command involve?
- For your project, you’ll want to design frames for a few of the
most common kinds of queries
- Which of these elements are mandatory?
- What do you do when these elements are missing?
- What additional information would you need to ask a human for to
answer or take action?
- What words or phrases help us understand which query/command is
being used?
- How can it tell you’re asking about the weather or adjusting the
lights?
- What are the different sorts of phrasings you might expect a human
to use to make this query/command?
- “Is it going to rain?” “Should I bring an umbrella?” “Do we expect
precipitation?” “What’s tonight’s weather?”
- Could your system handle weird things like “My wife wants to know if
she’s going to want an umbrella?”
- What kind of information would you need to be able to get the answer
from an external (or internal) data source?
- What kind of inference will your system need to do to
properly respond? For example…
- Knowing that ‘today’ means ‘from the time of the query until the end
of this calendar day’
- Knowing that “Should I bring an umbrella?” is asking about the
probability of rain
- Knowing that the question is asking about the location of the
user/device, not some other arbitrary point.
- Knowing that 'Living room lights' means "the lights located in the
'living room' zone of the account-holder's linked Hue account"
- What assumptions are built into the query as phrased?
- Which lights?
- What location?
- When do you want the lights turned on?
- Is the human also interested in whether it's going to snow instead of
rain?
Response Planning
At this stage, the computer needs to figure out what information to
return to the query or what action to take. When asked ‘Will it rain
today?’, do you want it to simply reply ‘Yes’? Or should it frame the
answer in a carrier sentence (e.g. “Yes, it will rain today”)? Should it
provide additional information (e.g. percent likelihood of rain,
approximate time of rain start, expected amount of rain)?
This stage allows you to build a scaffolding of a response, including
variables (marked with $) which stand in for things which will be
retrieved later. Something like "$yesnoanswer, rain is predicted in
$location starting at $precipstarttime today, and they're predicting
$InchesOfRain inches of rain will fall".
Note, though, that you’ll need multiple answer scaffoldings depending
on the response. You wouldn’t want to reply “No, rain is predicted in La
Jolla at never today, and they’re predicting zero inches of rain”.
For commands, you’ll want to figure out both how to carry out the
action (e.g. sending a command to the Philips Hue smart bulb servers to
trigger the lights in ‘Living Room’ associated with this account to turn
on at 40% intensity), and how to respond verbally (e.g. “OK!”, “Sure
thing, I’ve turned on the lights”, or just with silence).
Finally, you might need to follow up. If the person says “Turn on a
light”, you might have to ask them “Which light do you want me to turn
on?”
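In code, you might sketch this as a small decision tree that picks a
template before filling it, leaving $variables for the retrieval step;
the wording and variable names here are just illustrations:

# Sketch: choose a response template before filling it, so you never
# produce "No, rain is predicted at never today"
from string import Template

RAIN = Template("$yesnoanswer, rain is predicted in $location starting "
                "at $precipstarttime today, and they're predicting "
                "$inchesofrain inches of rain.")
NO_RAIN = Template("$yesnoanswer, no rain is expected in $location today.")

def pick_template(rain_expected):
    # A real system would branch further (snow, fog, mixed forecasts...)
    return RAIN if rain_expected else NO_RAIN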
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- How should the system respond to the queries above?
- Create response templates for some of the most likely
queries/commands
- What phrasing(s) should be used?
- What information variables need to be contained in the
response?
- Are there multiple response types depending on the results of the
information requested?
- Think “No rain” vs. “Rain” responses
- It’s not a bad idea to think about this as a decision tree (“If
there is rain, then this response. If snow, then this one. If none, then
this.”)
- What actions would the system need to carry out?
- What information is needed to carry out the proper actions?
- What is the desired outcome from this command?
- What verbal response or confirmation do you need the system to
provide?
Information Retrieval and Command Implementation
This is the 'boring' part which is just computers talking to
computers. Your assistant will take the schematic data produced above
(e.g. location, desired phenomenon, time) and query an external (or
internal) database to fill in the ‘blanks’ in the response above:
$yesnoanswer -> 1 (or 0, if no rain)
$location -> "La Jolla"
$precipstarttime -> 03FEB2019 18:53
$inchesofrain -> 0.07
Or, in the case of a command, it’ll issue API calls to whatever
service to turn on the lights, or start music playing via Spotify, or
what have you. Importantly, this isn’t natural language: This is
computers talking to computers.
For LIGN 6 final projects, you can oversimplify this process, and
just assume that there’s a server someplace which will take your
detailed queries and return detailed answers, or will implement the
requested actions. Just describe the queries and the ideal
responses, but feel free to be a bit handwavey about where you’d get
(e.g.) detailed weather information, the closest restaurant with a
four-star rating, a satellite picture of a certain region, or how you’d
inform Philips to turn on Will's living room lights. I care more about
your natural language processing than the details of getting data from
elsewhere.
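In that handwavey spirit, here's a sketch of filling the response's
blanks from a weather API; the endpoint, parameters, and JSON field
names are all made up for illustration:

# Sketch: fill the template's blanks from a hypothetical weather API
import requests

resp = requests.get(
    "https://weather.example.com/v1/forecast",   # hypothetical endpoint
    params={"zip": "92161", "phenomenon": "rain", "period": "today"},
    timeout=10,
)
forecast = resp.json()   # field names below are invented for this sketch

slots = {
    "yesnoanswer": "Yes" if forecast["rain_expected"] else "No",
    "location": forecast["place_name"],          # e.g. "La Jolla"
    "precipstarttime": forecast["precip_start"], # e.g. "03FEB2019 18:53"
    "inchesofrain": forecast["precip_inches"],   # e.g. 0.07
}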
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- Where could you, in theory, get the data you need to answer queries?
- What information will need to be available to your assistant to
answer these questions?
- What kind of websites exist (or would need to exist) that might have
these data?
- If the data are coming from your system itself, what data would you
need to store or generate to be able to make this function?
- What sort (roughly) of commands or queries would you need to
send?
- What sorts of service(s) will you need to interact with to take
these actions?
Text-to-Speech
You’ll still need to turn your results back into spoken language.
Even if you have a text template for the sentence for the weather
report, you’ll need to turn things like “03FEB2019 18:53” or a binary
“1” response back into natural language text (text analysis). You’ll
also need to think about prosody at this point.
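For instance, a tiny sketch of that text-analysis step, turning a
machine-readable timestamp into something speakable (the input format
matches the examples above; a fuller system would also spell numbers
out for the synthesizer):

# Sketch: turn '03FEB2019 18:53' into speakable text
from datetime import datetime

def speakable_time(stamp):
    dt = datetime.strptime(stamp, "%d%b%Y %H:%M")
    return dt.strftime("%I:%M %p").lstrip("0")   # '6:53 PM'

print(speakable_time("03FEB2019 18:53"))   # 6:53 PM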
Things to consider
- Are the requirements for this step easier/harder/the same as in
existing systems? Do you think an off-the-shelf system could handle your
task?
- What or who do you want the text-to-speech voice to sound like?
Which dialect?
- You'll want to try sample responses in existing Text-to-Speech
systems like Google's TTS or your computer's built-in TTS system to see
where common errors will be made.
- What words would need to be added to your text-to-speech dictionary?
- Cross-check some words in your use-case against CMUDict
- Are there specific person or place or product names you’d want to
have pre-recorded or added to the dictionary?
- Are there pre-recordable phrases that will help improve naturalness
in unit-selection synthesis?
- Think “The National Weather Service has issued a High Surf Warning
for”
- Are there any specific prosodic 'tunes' required to make this sound
reasonable?
- Question intonation? List intonation?
- Are there numbers which you’ll need to handle in a certain way?
- How much detail should be included in date/time playback?
Transmission to Device and Playback
The audio of Alexa’s response will need to be returned to the device
and played back. This is generally uninteresting, but ask yourself if
there are any specific elements of playback which are relevant to your
use case.
Things to consider
- How would you handle a situation where the connectivity is poor?
- Do you need to follow up? Can the command fail silently?
- Is the playback volume important?
- Would you want Alexa to come on full-blast volume if you’re asking
her to turn on the lights at 3am?
Ongoing Interaction
It’s possible that after the above steps, your query or action may be
‘complete’, and your system can shut down. But it also might require
further response from the human. Perhaps there's an expected followup
(because your system asked a clarification question, or because the
process is more interactional).
Things to consider
- Has the interaction ended?
- Will your system begin listening again for a response?
- How long will you keep listening?
Ethical Concerns
Finally, as you’re doing all of this, it’s crucial to think about the
ethical problems raised by these systems and the companies using them.
In your paper, as well as in practice, you'll need to address concerns
related to bias, privacy, harm, and dual use.
Things to consider
- How can you ensure equity? That is…
- Will the system work with different accents and dialects?
- Will the system treat all people equally?
- Will the system’s training data result in inherent bias in the
system?
- Will the system’s decision making favor some groups over
others?
- How can you be conscious of privacy?
- What personally identifiable or sensitive data (PID, for short)
must be stored or collected for the system to function?
- What kinds of PID can you avoid storing or collecting? Put
differently, what steps can you take to proactively anonymize your users
and avoid being responsible for PID?
- Imagine your servers have been completely hacked and owned, and
somebody now has unfettered access to all of your data and is posting it
on the internet. What harm would this cause to your users? What can you
do to reduce the harm of this?
- Note that “improving and hardening security” is not an answer here,
as everybody can be hacked.
- How can you let users know what your privacy practices look like in
a transparent way?
- How can you avoid harm to your users?
- In what way(s) could your system cause direct physical harm
to users? How can this be avoided?
- Along the lines of "The ship navigation system mishears 'Mars' as
'Maw' and warps the occupants into a black hole".
- In what way(s) could your system cause emotional harm to
users? How can this be avoided?
- Think along the lines of the emotional consequences of data loss for
a ‘virtual romantic partner’ app, or a bug in a virtual psychologist
resulting in an erroneously harsh response to vulnerability, or a
children’s virtual assistant accidentally displaying a scene of intense
movie violence.
- In what way(s) could your system cause social harm to
users? How can this be avoided?
- Think about an app accidentally ‘outing’ elements of somebody’s
sexual life or identity in a response to a query, or a virtual assistant
scheduling appointments with friends but failing to inform the user,
resulting in no-shows.
- Does your system have the potential to have a secondary use which
could do harm?
- If you were forced to give exceptional access to a government
(e.g. “Hand us the keys to your servers if you’d like to stay in
business and keep your passport”), what harms could it do?
- Are there other potential negative consequences for society or
civilization of this technology being developed?
Additional Elements
It’s quite possible that your specific domain might have additional
steps or requirements. Please don’t feel obligated to limit yourself to
the framework laid out here. You may want to include additional
information, and you’re welcome to do so!