### NO CLASS FRIDAY - You are legally obligated to take the time to do something kind for yourself --- # Sound, Computers, and ASR ### Will Styler - LIGN 6 --- ### How do acousticians say hello? - They wave! --- ### Today's Plan - Computers and Sound - Turning Signals into Features - Automatic Speech Recognition --- ### We've got a fundamental problem, to start --- ### Computers don't do waves
010001110010101000100101101010101010 --- ### Sound is analog, computers are digital - How do we deal with that? --- ### Quantization ('Sampling')
--- ### Quantization ('Sampling')
--- ### Quantization ('Sampling')
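The idea in these slides can be sketched in a few lines of Python (a toy sketch with made-up values — real converters take tens of thousands of samples per second):

```python
import math

fs = 8       # sampling rate: 8 samples per second (absurdly low, for illustration)
f = 1        # our 'analog' signal: a 1 Hz sine wave
bits = 16    # bit depth: 2**16 = 65,536 possible amplitude levels
levels = 2 ** bits

samples = []
for n in range(fs):                          # one second of audio
    t = n / fs                               # the time of this sample
    amp = math.sin(2 * math.pi * f * t)      # the analog amplitude, in [-1, 1]
    q = round((amp + 1) / 2 * (levels - 1))  # snap it to the nearest level
    samples.append(q)                        # 8 integers now stand in for the wave
```

Those stored integers are all the computer keeps — everything between the samples is thrown away.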
--- ### Analog-to-digital conversion - Sample the wave many times per second - Record the amplitude at each sample - If we sample often enough, the resulting wave faithfully captures the signal --- ### How often do we sample? - This is called the 'Sampling Rate' - Measured in samples per second (Hz) --- ### Sampling Rate
--- ### Sampling Rate
--- ### Sampling Rate (low rate)
--- ### Sampling Rate (awful rate)
--- ### Bad sampling makes for bad waves
--- ## Nyquist Theorem The highest frequency captured by a sampled signal is one half the sampling rate --- ### Sampling Rates (Shpongle - 'Nothing is something worth doing') 44,100 Hz
22,050 Hz
11,025 Hz
6000 Hz
--- ### Sampling Rates (Shpongle - 'Nothing is something worth doing') 44,100 Hz
6000 Hz
3000 Hz
1500 Hz
800 Hz
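---

### Nyquist, in code

Here's a toy sketch of why the theorem matters (the rates are just for illustration): a 7000 Hz tone sampled at 8000 Hz gives *exactly* the same samples as a 1000 Hz tone, so the high frequency 'aliases' down and is lost.

```python
import math

fs = 8000              # sampling rate: the Nyquist limit is fs / 2 = 4000 Hz
f_high = 7000          # a tone above the Nyquist limit
f_alias = fs - f_high  # where it falsely reappears: 1000 Hz

# At these sample times, the two tones are indistinguishable
for n in range(16):
    t = n / fs
    s_high = math.cos(2 * math.pi * f_high * t)
    s_alias = math.cos(2 * math.pi * f_alias * t)
    assert abs(s_high - s_alias) < 1e-9
```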
--- ### Different media use different sampling rates - CDs are at 44,100 Hz - Radio was historically lower than this - DVDs are at 48,000 Hz - High-End Audio DVDs are at 96,000 Hz - Some people want 192,000 Hz - Likely they are dolphins --- ### The 'Bit Depth' controls how much detail we store about each amplitude - 16 bits gives 65,536 levels, which is the default in modern machines --- ### Here's a talk about this I did which goes into more detail - Covers compression, bit depth, mp3, and more -
- Also LIGN 168! --- ### AD Conversion now yields a signal that the computer can read - ... but how does it interpret it? --- ### Well, much like the rest of us!
--- ### There are more problems - We're going to use Neural Networks - Or, historically, hidden Markov models - ... but what are the algorithms looking at? --- ### Putting in the waveform itself was historically a poor choice - It's cheap and easy - NNs weren't amazing at estimating frequency-based effects - Recent approaches are changing that (cf. [Wav2Vec](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)) - Important parts of the signal live only in frequency band info - We want to be able to give it all the information we can, in the most useful format! --- ### Why not linguistically useful features?
--- ### Linguistically useful features: benefits - They reflect speech-specific understanding - They treat speech as "special" - They reflect articulatory facts - They're efficient - Optimal informativeness per feature - They're very transparent - We know what each of them means --- ### Linguistically useful features: downsides - Slow to extract - Require specialized algorithms to extract - They treat speech as "special" --- ### For research, linguistically useful features are great - ... but in production, we don't care --- ### We don't need transparent or minimal features - We're plugging them into a black box - We're happy to plug in hundreds of features, if need be - We'd just as soon turn that sound into a boring matrix --- ### Let's get that algorithm a Matrix - Algorithms love Matrices --- # Mel-Frequency Cepstral Coefficients (MFCCs) --- ### We're not going deep here - This is a lot of signal processing - We're going to teach the idea, not the practice --- ### MFCCs
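The 'Mel' part is a perceptual frequency scale. A common formula for it (O'Shaughnessy's, with the usual 2595 and 700 constants) fits in one line:

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz onto the perceptual mel scale."""
    return 2595 * math.log10(1 + f_hz / 700)

# Roughly linear below 1000 Hz, logarithmic above: equal mel steps
# pack the high frequencies together, much as our ears do
```

`hz_to_mel(1000)` comes out near 1000 mels, and the gap from 7000 to 8000 Hz spans far fewer mels than the gap from 1000 to 2000 Hz.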
--- ### MFCC Process - 1: Create a spectrogram - 2: Extract the most useful bands for speech (in Mels) - 3: Look at the frequencies of this banded signal (repeating the Fourier Transform process) - 4: Simplify this into a smaller number of coefficients using DCT - Usually 12 or 13 --- ### MFCC Input
--- ### MFCC Output
--- ### So, the sound becomes a matrix of features - Many rows (representing time during the signal) - N columns (usually 13) with coefficients which tell us the spectral shape - It's black-boxy, but we don't care. - We've created a Matrix ---
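### Step 4, sketched

The DCT step is the most opaque part of the process, so here's a toy sketch in plain Python, with made-up band energies (a real system derives twenty-some of these per frame from a mel filterbank):

```python
import math

# Hypothetical log mel-band energies for one short frame of speech
log_mel = [4.2, 4.0, 3.5, 3.1, 2.8, 2.6, 2.5, 2.4]

def dct2(x, n_coeffs):
    """DCT-II: squeeze the band pattern into a few coefficients."""
    n_bands = len(x)
    return [
        sum(x[m] * math.cos(math.pi * k * (m + 0.5) / n_bands)
            for m in range(n_bands))
        for k in range(n_coeffs)
    ]

frame_mfccs = dct2(log_mel, 4)  # real systems keep 12 or 13
```

The 0th coefficient is just the overall energy of the frame; the later ones describe progressively finer wiggles in the spectral shape.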
--- ### Now we've got a matrix representing the sound - ... which captures frequency information, according to our perceptual needs --- ### It's Neural Network time!
--- ... Wait, hold on. - ### What are we actually recognizing? --- ### What are we recognizing in speech recognition? - We need to give the NN labeled data - [Chunk of Sound MFCCed] == [Labeled Linguistic Info] - (for Many many many many tokens) - What level do we want to recognize at? --- ### Possible levels of recognition - Sentences? - Words? - Letters? - Phones? - Diphones? --- ### Sentences - Why are sentences a bad idea? --- ### Words
"Noise" --- ### Word Recognition Pros - Handles larger patterns of coarticulation - Captures word specific effects - Robust to short duration noise - Word annotation is *way* cheaper --- ### Word Recognition Cons - What about novel words? - Training data becomes much more sparse - Can we really learn nothing about "boy" from "soy"? --- ### Grapheme-based Recognition - You could use the orthography itself as the 'pronunciation dictionary' and recognize letters ('graphemes') - Mapping straight from letters to speech signal - This is actually happening now! - [Here's one example](https://aclanthology.org/2020.sltu-1.7.pdf) and [another](http://www.interspeech2020.org/uploadfile/pdf/Wed-2-8-8.pdf) - Here's another production system you can play with: [HuggingFace2](https://huggingface.co/docs/transformers/model_doc/wav2vec2#transformers.Wav2Vec2ForCTC) --- ### Grapheme-based Pros - The data are much easier to get - Subtitles, transcripts, etc - More able to handle new words and names - It can guess how 'Haligtree' or 'Maliketh' sound without dictionary entries - **You don't need dictionaries to map from words to phones!** --- ### Grapheme-based Cons - Grapheme-to-phone conversion is very language specific - It's often roughly and thoroughly arbitrary - Some languages' writing systems have less mutual information with spoken language - It throws away data for many homograph differences (e.g. record, villa, does) --- ### Phones
--- ### Phone Recognition Pros - The most basic unit, so training data is rich - Can (theoretically) work for any language - Can still capture unknown words - "Fuzzy matching" --- ### Phone Recognition Cons - Annotation is brutally expensive - Coarticulation is problematic - Phone-level recognition is overkill for many contexts --- ### Diphones
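A diphone is a pair of adjacent phones — the transition between them, rather than the phones themselves. Splitting a phone string into diphones is simple (a sketch; the `#` boundary marker for silence at the edges is our own convention here):

```python
def to_diphones(phones):
    """Turn a phone sequence into diphones: every adjacent pair,
    with '#' standing in for silence at the edges."""
    padded = ["#"] + phones + ["#"]
    return [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]
```

So 'cat' as `["k", "ae", "t"]` becomes `["#k", "kae", "aet", "t#"]` — every token in the training data now teaches the model about transitions, not just steady states.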
--- ### Diphone Recognition Pros - Coarticulation becomes a feature, not a bug - Still very basic, so every recording provides training data - Can still (theoretically) work for any language - ... but patterns of coarticulation differ - Can still capture unknown words via fuzzy matching --- ### Diphone Recognition Cons - Still stupidly expensive to annotate - Still overkill in many contexts --- ### In practice, many systems use diphones - [CMU's Sphinx does](https://cmusphinx.github.io/) - As do many others - Triphones are often a possibility --- ### ... but modern systems are often going waveform-to-grapheme - This is absolutely wild --- ### So, we can now train a system - Capture sounds and annotate them as diphones or words - MFCC them, or read in the waveform alongside word labels, and feed them into a neural network as training data - Then later, feed new data in and get back a list of phones (or words), which you can use to predict which words were intended! --- ### That's a tricky step right there - Why? --- ### Your ASR system is only as good as your dictionary and/or training data - "For shizzle, Bashira" - "Mel Frequency Cepstral Coefficient" - "Differentiating Theta and Eth" - "Take Caminito Santa Fe, then Mira Mesa into La Jolla" --- ### Users have very specific matches they expect --- ### "Hey Siri play songs by the Bedsit Infamy" -
--- ### "Hey Siri play songs by the Bedsit Infamy"
--- ### How do we test the system? --- ### Like this
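The usual score is Word Error Rate (WER): the substitutions, insertions, and deletions needed to turn the system's output into the reference transcript, divided by the reference's length. A minimal sketch in Python:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete everything
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert everything
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

A perfect transcript scores 0.0, one wrong word in four scores 0.25, and yes, WER can exceed 1.0 if the system hallucinates enough extra words.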
--- ### Wrapping Up - Computers can learn to do the wave - MFCCs turn beautiful sounds into opaque, useful matrices - Speech Recognition often uses diphones - You're only as good as your dictionary --- ## For next time - **NO CLASS FRIDAY** - Why is speech recognition so damned hard? ---
Thank you!