### Text-to-speech system designers don't need resumes - They just let their work speak for itself --- ### From Last Time --- ### Tacotron 2 - Uses an RNN to go directly from text to speech - [The Tacotron 2 TTS System](https://arxiv.org/abs/1712.05884) - [Samples](https://google.github.io/tacotron/publications/tacotron2/index.html) --- ### Amazon's Neural TTS - [Samples](https://developer.amazon.com/blogs/alexa/post/7ab9665a-0536-4be2-aaad-18281ec59af8/varying-speaking-styles-with-neural-text-to-speech) --- ### Microsoft RobuTrans - [Robutrans](https://ojs.aaai.org//index.php/AAAI/article/view/6337) uses a transformer architecture for TTS - Feeds in text, turns that into 'linguistic features', which map into mel spectra - There's a step dedicated to handling prosodic issues - There's a final WaveNet step which turns the mel spectra into audio samples --- ### Neural Networks are winning - If you can afford them, they produce great results - Less hand-annotation - Better generalization to new words - You'll likely be playing with Neural TTS - Apple is even [marketing using Neural Text-to-Speech](https://www.shacknews.com/article/112048/apple-announces-neural-tts-support-for-siri-at-wwdc-2019) - [Here's a great overview on Neural Text-To-Speech](https://arxiv.org/pdf/2106.15561.pdf) --- ### They're not without problems - They require *lots* of training data and time - Much harder for low-resource languages - NN TTS models can do strange things like skipping, duplicating, or continuously repeating words if the NN state becomes misaligned - You can even get infinite strings of repeating words - You can also get strange 'blendings' and 'slurrings' which are unlike any word - These problems can be harder for humans to 'recover' from - Missing, duplicated, or slurred words are harder for humans to understand than a 'mispronunciation' - Real-time synthesis is hard with 'autoregressive' models (where inference or prediction depends on the prior state) --- ### Let's try a Neural TTS 
system
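---
### The autoregressive bottleneck, sketched

Why is real time hard for autoregressive models? Each output frame depends on the previous one, so generation is inherently serial. A minimal sketch of that dependency (a toy loop, not any real model's code):

```python
def generate_frames(n_frames, step_fn):
    """Autoregressive generation: each new frame is computed from the
    previous one, so the n_frames steps can't run in parallel."""
    frames = [0.0]  # initial "go" frame
    for _ in range(n_frames):
        frames.append(step_fn(frames[-1]))
    return frames[1:]

# Toy stand-in for a network step mapping one frame to the next
print(generate_frames(5, lambda prev: prev + 1.0))
```

A second of audio can mean hundreds of mel frames, each waiting on the last, which is why non-autoregressive architectures are attractive for real-time use.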
--- # Text-to-Speech Synthesis is hard ### Will Styler - LIGN 6 --- > It should be abundantly clear that TTS systems are far from perfect. There are a number of issues, ranging from text analysis, to names, to numbers, to combining units; and worst of all, it's fiendishly difficult to get the prosody right.
--- (The majority of today's TTS samples are from Apple's 'say' command, which is behind the times, but illustrative!) --- ### Today's plan - Text analysis is hard - Fluidly combining units is hard - Prosody is hard to get right --- # Text Analysis is hard --- ### Detecting the end of sentences > It's difficult to even detect something like the end of a sentence. Although periods, exclamation marks, and question marks provide good information, there are situations (e.g. the word e.g.) where periods serve other purposes. And we'll often end sentences by trailing off, blank lines, etc.
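---
### A naive splitter, sketched

The abbreviation problem shows up in even a toy sentence splitter. A hedged sketch (the abbreviation list and rule are simplifications, not a real system's):

```python
# Toy list: a real system needs a much larger abbreviation lexicon.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "mr.", "ms.", "vs."}

def split_sentences(text):
    """Break after . ! ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing-off text with no final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("We saw Dr. Smith. It went well!"))
```

This immediately fails on sentence-final abbreviations ("I work for the U.S.") and on trailing-off sentences, which is exactly the slide's point.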
--- ### Acronyms and Initialisms > Initialisms are read aloud as a series of letters, like the CIA, UCSD, NSA, and FYI. Acronyms are pronounced, like NASA, DARPA, FAFSA, or RAV4. And some have very specific pronunciations, like NAACP or AAA. Apple
IBM
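---
### One heuristic, sketched

Deciding letter-by-letter vs. word-like reading often comes down to an exception list plus heuristics. A toy sketch (the acronym list here is illustrative, not exhaustive):

```python
# Known pronounce-as-a-word acronyms; everything else is spelled out.
ACRONYMS = {"NASA", "DARPA", "FAFSA"}

def render_uppercase(token):
    """Return the form handed to pronunciation: a lowercase word for
    acronyms, space-separated letters for initialisms."""
    if token in ACRONYMS:
        return token.lower()   # treat as an ordinary word: "nasa"
    return " ".join(token)     # "CIA" -> "C I A"

print(render_uppercase("NASA"), "|", render_uppercase("CIA"))
```

Cases like "NAACP" ("N double-A C P") and "RAV4" still need item-specific entries, just as the slide says.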
--- ### Numbers > Numbers are hard because we read numbers differently depending on their function. You're born in 1999, your pin number is 1999, you might have 1999 grains of rice in a cooker, but January 25 is 25 days after the 1.
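---
### Context-dependent verbalization, sketched

The 1999 example can be made concrete. A toy normalizer (the context labels and coverage are simplifications; a real system has to infer the context from surrounding text):

```python
ONES = ("zero one two three four five six seven eight nine ten eleven "
        "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
        "nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

def two_digit(n):
    """Verbalize 0-99."""
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")

def read_number(token, context):
    """The same digit string verbalizes differently by function."""
    if context == "year" and len(token) == 4:
        # 1999 -> "nineteen ninety-nine"
        return two_digit(int(token[:2])) + " " + two_digit(int(token[2:]))
    if context == "digits":
        # PINs, phone numbers: "one nine nine nine"
        return " ".join(ONES[int(d)] for d in token)
    return token  # cardinals etc. left to a fuller verbalizer

print(read_number("1999", "year"))
print(read_number("1999", "digits"))
```

Even this sketch breaks on years like 2005, on ordinals ("the 1st"), and on units: number reading is genuinely hard.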
--- ### Homographs are hard! - "We could lead in lead removal." - Noun vs. verb - "The wedding dress sewer fell into the sewer" - "The plumbing contractor is unionized." - "The acetic acid is unionized." --- ### The Lexicon - Preparedness across many domains - Medical, Legal, Military, International Places and Concepts - Miscellaneous Technical Jargon - Local street names - Code switching (switching between languages) --- ### Jargon > Adenocarcinoma in Tubulovillous Adenoma bona fide certiorari de jure collusion RICO ex post facto CVN AWACS Escapement Tourbillon Remontoir d'Egalite Apple
IBM
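---
### POS-based disambiguation, sketched

For homographs like 'lead', the usual move is to pick a pronunciation by part-of-speech tag. A toy table (ARPAbet-style forms for illustration; a real system uses a tagger plus a full lexicon):

```python
# Pronunciations keyed by (word, POS tag), ARPAbet-style strings.
PRONUNCIATIONS = {
    ("lead", "VB"): "L IY D",  # verb: "We could lead..."
    ("lead", "NN"): "L EH D",  # noun: "...in lead removal"
}

def pronounce(word, pos_tag):
    """Look up a POS-disambiguated pronunciation; fall back to the word.
    Note: 'sewer' (drain) vs. 'sewer' (one who sews) are both nouns,
    so POS alone can't fix every homograph."""
    return PRONUNCIATIONS.get((word.lower(), pos_tag), word)

print(pronounce("lead", "VB"), "|", pronounce("lead", "NN"))
```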
--- ### Placenames > On my map is Lebon Drive, Gilman Drive, Miramar Road, Muir Lane, Caminito Santa Fe, Soledad Mountain Road, San Joaquin Drive, Arcadia Road, and I'm now in La Jolla and thinking of Moscow, Guangzhou and Darjeeling. Apple
IBM
--- ### Codeswitching > Mañana me voy a Walmart to buy some calcetines y un poco del Chocolate that you really like Apple
--- ### Names are super hard - Spelling is arbitrary and variable - Names from around the world - 1.5 million names in 72 million households (1987 Donnelly list) - 20%+ of tokens in newswire --- ### Let's check some names
--- ### Oof. - ... but this gets at a truth --- ### Language tasks that are hard for humans are often even harder for machines! - Humans are bad at names too - We know some subset of names common in our region - Spelling or pronounced variants still cause problems - "Alycia" - "Andres" vs. "Andries" --- ### Katelyn, Caitlin, Caitlyn, Kaetlin, Katelin, Katelynn, Kate Lynn, Caitlynn, Kaeytlynn
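---
### Folding spelling variants, sketched

One way to see that those spellings share a pronunciation is a crude phonetic key. This is a toy folding (not Soundex or Metaphone, just an illustration):

```python
def name_key(name):
    """Crude phonetic key: lowercase, fold c->k, drop vowels/y/spaces,
    collapse repeated letters."""
    s = name.lower().replace(" ", "").replace("c", "k")
    s = "".join(ch for ch in s if ch not in "aeiouy")
    out = []
    for ch in s:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

variants = ["Katelyn", "Caitlin", "Caitlyn", "Kaetlin", "Katelin",
            "Katelynn", "Kate Lynn", "Caitlynn", "Kaeytlynn"]
print({name_key(v) for v in variants})  # all nine collapse to one key
```

Going the other direction, from one key back to the right spelling and pronunciation, is the hard part for TTS and ASR alike.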
--- ### Text Analysis is hard - The writing system is awful - The proper pronunciation isn't always clear - Technical, Local, and field-specific jargon is everywhere - Place names are hard - Names are nearly impossible --- ... but wait, there's more! --- # Unit Selection is Hard --- ### Concatenative Synthesis Synthesis has lots of problems --- ### Concatenative Synthesis Database Creation (Step by step) --- ### Step 1) Record a LOT of speech from an actual human (tens to hundreds of hours) ---
--- ### Step 2) Divide into segments at the desired unit levels - Segmenting sounds is hard, remember? --- ### Step 3) Capture important phonetic information about each segment (e.g. pitch, duration, syllable position, etc.) - So you've got all the problems of phonetic research, too! --- ### Step 4) Add additional words to the database as needed. - What if the voice actor leaves the company? Stops working? --- ### ... and Synthesis is hard too --- ### Step 1) Text Modeling (for prosody and phones) to determine the plan - Uh, yeah. Tough. --- ### Step 2) Choose the optimal chunks - You don't always have the optimal chunks - The "San" from San Diego may be subtly different from the "San" in "San Ysidro" - Many criteria for optimal fit --- ### Step 3) Concatenate (combine end-to-end) the chunks - Different volumes? - Different pitches? - Even just stitching them together is non-trivial - What's the spacing between items? --- ### Step 4) Modify the prosody and duration where needed - Pitch correction isn't perfect, even with a perfect model of pitch - We're still missing other prosodic elements --- ### We've all heard bad unit selection
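---
### Unit selection as search, sketched

Steps 2 and 3 of synthesis are usually framed as a shortest-path search: each candidate unit has a *target cost* (fit to the specification) and a *join cost* (smoothness with its neighbor), minimized by dynamic programming. A toy sketch (the cost functions here are placeholders for real acoustic measures):

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi-style unit selection: one candidate per target position,
    minimizing total target cost + join cost between neighbors."""
    # best[i][c] = (cumulative cost, backpointer) for candidate c at i
    best = [{c: (target_cost(targets[0], c), None) for c in candidates[0]}]
    for i in range(1, len(targets)):
        row = {}
        for c in candidates[i]:
            prev, cost = min(
                ((p, best[i - 1][p][0] + join_cost(p, c))
                 for p in candidates[i - 1]),
                key=lambda pc: pc[1])
            row[c] = (cost + target_cost(targets[i], c), prev)
        best.append(row)
    # Backtrace from the cheapest final candidate
    choice = min(best[-1], key=lambda c: best[-1][c][0])
    path = [choice]
    for i in range(len(targets) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```

With numbers standing in for units, `select_units([0, 1], [[0.1, 0.5], [0.9, 1.5]], lambda t, c: abs(t - c), lambda a, b: 0.0)` picks the closest candidate at each slot. The bad unit selection we've all heard is what happens when no candidate has a low cost.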
--- ### Neural Networks fix some of this - It's generating, rather than combining - But still, difficult! --- # Prosody is hard --- ### Emotional prosody - "Did you hear John's back in the hospital?" - "I'm really, really excited about the LIGN 6 final project!!" - "My wife decided she wants to go to a steakhouse tonight." - The risks of incorrect emotion are very high - Do we want to simulate this? --- ### Computers get judgemental about donuts ---
--- ### Meaning Differences from prosody - "I think I'll come tomorrow" - "Bill is coming if he's allowed" - "John should know that" - "I really like eating at Taco Bell. It is the peak of gourmet cuisine." --- ### Getting the timing right is hard
--- ### Text to Speech remains a hard problem - ... but we've come a LONG way
--- ### Even the free tools are quite good - [Festival](http://www.cstr.ed.ac.uk/projects/festival/) is the best known free TTS --- (Just in case you want to implement TTS for your final project) - Speaking of which... --- # Let's talk Projects --- ### Important Dates - Week 6 - **Proposal Due** - Finals Week - **Final Project Due** --- ### Topics - 1: Build a system - 2: Implement an NLP tool on your local machine and solve a small problem - 3: Choose your own adventure - All of these are graded according to [the final project rubric](https://wstyler.ucsd.edu/6/l6_final_rubric.html) --- ### Self Grading - **You will grade your own project using the rubric!** - Details are up on the rubric - We'll check that you were honest and accurate in your evaluation --- ### Build-a-System - Come up with a creative 'Virtual Assistant' type product - Build for a specific domain, use case, or situation - Describe the easy and hard parts of each element (ASR, TTS, Parsing, Semantics, Interaction) - Think outside the box here - More creative ideas are more fun to write and grade --- ### Implement a tool - Pick a tool for ASR, TTS, Parsing/Tagging, Semantic Analysis, or similar, and implement it - Get it running *on your machine* (or our linux server) - Throw some data at it - Do a basic analysis of how well or poorly it works - ### Come talk to me in office hours to plan this out before the proposal --- ### Choose your own adventure - Align this course's goals with your own work or interests - If there's a way to make this interesting for you, I want that. - ### Come to office hours to pitch your idea(s) before the proposal --- ### Group work is *highly* encouraged! - Pairing Linguistically experienced folks with computationally experienced folks? - People of similar interests - Mutual learning! 
- You'll submit just one final paper, declaring who did what - **Check the Discord to see who's interested in what** --- ### Wrapping Up - Every element of TTS is difficult - Text modeling - Breadth of vocabulary - Unit concatenation - Prosody - Start thinking about your final projects --- ### For next time We're shifting gears into text, and talking about how to look at Linguistic data ---
Thank you!