Doing Terrible things to Speech Recognition Software

This was originally posted on my blog, Notes from a Linguistic Mystic, in 2012.

Yesterday, I wrote a huge review of Dragon Dictate 3, in which I talked about accuracy in use a great deal, and discussed some of the errors that frequently come up in everyday usage.

Today, we’re going to look at the other side of the coin: accuracy when you’re not trying to cooperate, and instead are trying quite hard to break speech recognition. These sentences and phrases are worst-case scenarios that specifically target some of the strategies that modern speech recognition uses to improve accuracy, and each one removes some advantage that the software naturally has. I’ve been running these same sentences through different services for a few years, and using them as examples in my various presentations on speech recognition for undergraduates. Here are my results using Nuance’s Dragon Dictate 3 (trained on my voice), Nuance’s “FlexT9” app (which uses Nuance’s server-side solution, untrained), and Google’s Voice Search Interface (the same one you get by clicking the microphone in Google’s search bar).

We’ll start with a very basic sentence, said quickly and without context. First the phrase, and then each service’s output:
- “These slides are for a phonetics class”
  - Google Voice Search: the slicer for the next class
  - FlexT9: The slides are for phonetics class
  - Dragon: These slides are for phonetics class
No major surprises here, although I’m shocked that FlexT9 (a network-based service) recognized “phonetics”, a very uncommon word.
Here’s a sentence which amuses me, from 2001: A Space Odyssey:
- “Open the pod bay doors, HAL”
  - Google Voice Search: open the pod bay doors hal
  - FlexT9: Open the pod bay doors Hal
  - Dragon: Open the pod bay doors HAL
Dragon got the capitalization! Somebody must’ve trained it specifically to do that. Cute.
How about some untrained medical vocabulary?
- “She’s got Thrombocytopenia”
  - Google Voice Search: she's got from beside a pina
  - FlexT9: She’s got thrombocytopenia
  - Dragon: She’s got thrombocytopenia
Swing-and-a-miss for Google, but Nuance seems to have hidden some odd vocab even in their non-medical versions.
An unusual expression paired with an unusual name:
- “For shizzle, Bashira”
  - Google Voice Search: for shizzle the shear ice
  - FlexT9: For chisel The Shira
  - Dragon: Forces a this year i
I love that Google Voice Search knows “For shizzle” as an expression, and this is the first total strikeout for Dragon so far.
How about a semantically unpredictable sentence, where the words are rather completely unrelated and can’t be guessed by how often they occur together?
- “Tall hamburgers fly sexily under a sultan’s flamingo”
  - Google Voice Search: call hamburgers fly sexually and result in flamingo
  - FlexT9: Call hamburgers fly sexily hunder his sultan’s flamingo
  - Dragon: Tall hamburgers fly sexily under assault’s flamingo
This is particularly interesting. Note that Google changed “under a sultan’s” into “and result in”, a much more common phrase in speech, betraying Google’s “large corpus, find common phrases” bias. It’s also strange that FlexT9 spat out “hunder”, which I’ve been unable to find a definition of.
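To make that “common phrases” bias concrete, here is a minimal sketch of how a bigram language model would rank the two word sequences. This is not Google’s actual system, which I can only guess at. The tiny corpus, the `log_prob` helper, and the add-one smoothing are all assumptions made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for this example. A real recognizer
# would estimate these counts from an enormous amount of text.
corpus = (
    "this can result in errors and result in delays "
    "and result in confusion under a bridge under a tree"
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)


def log_prob(phrase):
    """Add-one smoothed bigram log-probability of a phrase."""
    words = phrase.lower().split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        # Unseen word pairs get a small but non-zero probability.
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


common = log_prob("and result in")
rare = log_prob("under a sultan's")
print(common > rare)  # the frequently seen phrase scores higher
```

When the audio is ambiguous between two candidates, a model like this tips the balance toward whichever word sequence it has seen more often, which is exactly the wrong instinct for a sentence built out of words that never occur together.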
Now, a sentence made up entirely of homophones (words which sound the same as another word) for more common words:
- “I’m gun a wok ewe two classe four Xer size”
  - Google Voice Search: I'm gonna walk you to class for exercise
  - FlexT9: I’m been walking to class for exercise
  - Dragon: I’m going to walk you to class for exercise
Nuance uses similar modeling (I suspect) for both Dragon and FlexT9 (their local and server-based solutions), and both struggle with reduced forms like “gonna” or “wanna”. It’s interesting that Dragon corrects to the more canonical “going to”, whereas FlexT9 fails outright. Google, the most casual of them all, tolerates reduction quite well.

Also, of course, all of them fail to pick up the homophony, but that’s to be expected. “Two” and “to” really are pronounced identically in many cases, as are “walk” and “wok”. Only through context and our knowledge of the language’s grammar can we disambiguate these as speakers, let alone as computer programs.
Now, a homophone with understandable context:
- “I’m gonna take a wok from the chinese restaurant”
  - Google Voice Search: I'm gonna take a walk from the chinese restaurant
  - FlexT9: I get a take a walk from the Chinese Restaurant
  - Dragon: I’m good to take a walk from the Chinese restaurant
Once again, Nuance stumbles on “Gonna”, and none of the services pick up on “wok” given the Chinese restaurant context. To be expected, but quite amusing still.
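For the curious, here is a rough sketch of what “picking up on context” could look like for a homophone pair. It is not how any of these three services actually works. Every count below is invented, and `pick_by_frequency` and `pick_by_context` are hypothetical helpers that use a rough naive-Bayes-style score.

```python
# Hypothetical counts, invented purely for illustration.
WORD_FREQ = {"walk": 5000, "wok": 40}
COOCCURRENCE = {
    ("wok", "restaurant"): 30,
    ("wok", "chinese"): 25,
    ("walk", "restaurant"): 5,
    ("walk", "park"): 200,
}


def pick_by_frequency(candidates):
    """Ignore context entirely and choose the most frequent word."""
    return max(candidates, key=lambda w: WORD_FREQ.get(w, 0))


def pick_by_context(candidates, context):
    """Weight each candidate's frequency by how often it appears
    near each context word (with add-one smoothing)."""
    def score(word):
        freq = WORD_FREQ.get(word, 0)
        total = float(freq)
        for ctx in context:
            total *= (COOCCURRENCE.get((word, ctx), 0) + 1) / (freq + 1)
        return total
    return max(candidates, key=score)


homophones = ["wok", "walk"]
print(pick_by_frequency(homophones))                         # walk
print(pick_by_context(homophones, ["chinese", "restaurant"]))  # wok
print(pick_by_context(homophones, ["park"]))                 # walk
```

With only raw frequency to go on, “walk” wins every time, which matches what all three services produced. The context-weighted version flips to “wok” only when the surrounding words give it a reason to.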
This is a sentence which is tough for many undergrads to write correctly, let alone a computer.
- “They’re going there to check their mail”
  - Google Voice Search: they're going there to check your mail
  - FlexT9: They are going there to check their mail
  - Dragon: They are going there to check their mail
Nuance really hates reduced forms, likely because of their business heritage. Nonetheless, all three services are able to use basic grammatical knowledge to put the proper forms in the proper places (although Google, as always, is reading your mail).
How about speech recognition on dangerous ground? (If you don’t get it, say it quickly aloud)
- “Their deals are Sofa King good”
  - Google Voice Search: their deals are so f****** good (censorship as given)
  - FlexT9: Their deals are so f*cking good (censorship as given)
  - Dragon: Their deals are so fighting good
This is another rough homophone to handle, and our first two contestants fall right into the trap. Dragon 3 will never say the word “fucking”, even though I’ve tried to train it, but for whatever reason, it’s happy to say “ass”.
Now, “Don’t you want to go to the park?”, reduced almost beyond recognition (as in this sound file):
- “[ʊ̃tʃə wɑɾ̃ə kʌɾ̃əðə pɑɹk̚]”
  - Google Voice Search: which wanna come apart
  - FlexT9: But you want on the part
  - Dragon: One of the park
We shouldn’t really expect speech recognition software to deal with this kind of reduction, because, well, humans can hardly deal with it. I am impressed, though, that Dragon caught the “park” at the end.
How about the funny sound we make when we don’t know something (as in this sound file)?
- “[That weird sound we make when we don’t know]”
  - Google Voice Search: ass
  - FlexT9: Hey
  - Dragon: [Dragon didn't acknowledge this as speech, printed nothing]
I think Google is taking this test a bit personally. Well, if it’s going to call me an ass, I’ll do something really mean…
And, finally, my absolute favorite horrible thing to do to speech recognition: the opening of the rap from The Sugarhill Gang’s “Rapper’s Delight”, pronounced at regular speed, by me, according to the transcription below:
- “[ɑsɛdə hɪp hap ðəhɪbi ðe hɪbi təðə hɪp hɪp hɑpʰɑjɨ doʊn stɑp ðə ɹɑkɪn tʰəðə bejŋ bejŋ bʊɡi seɪ ʔʌp dʒʌmp ðə bʊɡi tʰəðə ɹɪðm ʌ ðʌ bʊɡidə bit]”
  - Google Voice Search: hip hoppity hippity hip hop you don't stop rockin the bang bang boogie chapter 8
  - FlexT9: Is that hip hop baby baby to get it off you don’t stop rocking the bang bang boogie say up jumped over the rhythm of the Bebi
  - Dragon: Is it hip hop debut his populist operon of angling receptor and rhythm of the validity
Lol.

Conclusion of a silly experiment

Failures here shouldn’t be held against the speech recognition software at all. These are sentences which I’ve written and designed to break speech recognition software, and as you can see, many of them succeeded. However, unless you often discuss a sultan’s flamingo or recite rap lyrics into your papers, most of these issues won’t show up for you.

I just figured I’d share my amusement, and hopefully you’ve learned just a little bit about what speech recognition is struggling with in this world. Eye hope ewe end joyed!