LIGN 6 - Text-to-speech

### Opening Music

Patricia Taxxon - sd_bbb

---

# Text-to-Speech Synthesis

### Will Styler - LIGN 6

---

### Today's plan

- Speech Synthesis Tasks

- Text-to-Phone Modeling

- What are the ways text-to-speech is done?

---

### What is Text-to-Speech?

---

### Text-to-Speech (TTS)

- A means of turning written input into spoken output without using a human

- Also called "Speech Synthesis"

---

### TTS has two stages

- Text Analysis: "How is this chunk of text pronounced?"

- Sound Synthesis: "Let's turn that into an acoustic signal for playback"

---

### "PG&E will file schedules on April 20."

- <img class="r-stretch" src="comp/tts_phones.jpg">

- <img class="r-stretch" src="comp/tts_wave.jpg">

- (Thanks to Julia Hirschberg for this annotated chunk)

---

... but first...

- ### What are we trying to do?

---

# Text-to-Speech Tasks

---

### Many different tasks

- Different inputs

- Different speech scenarios

- Different timeframes

---

### Phrase Playback Tasks

> "Please place the item back in the bagging area"

> "Please scan your next item, or press 'Pay now'"

> "An attendant has been notified to assist you"

---

### Phrase Playback

- Not "real" text to speech

- Playing back a series of recorded sound files

- Fixed vocabulary

- No attempts made at prosody

---

### Domain Specific Synthesis

> ...HIGH SURF THURSDAY AFTERNOON THROUGH FRIDAY...

> A long period west to northwest swell will bring high surf
Thursday afternoon through Friday. The peak swell and surf will
occur Thursday night into early Friday morning, with the highest
surf in southern San Diego County. Minor coastal flooding will
occur during periods of high tides.

---

### Domain-Specific Systems

- Combining a series of pre-recorded snippets

- Has a fixed vocabulary of 'chunks', recombined into different orders

- You could record the entire set of possible phrases

- 50 states, 3007 counties, 19354 'incorporated places' in the US
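
---

### Sketch: chunks vs. whole phrases

A minimal sketch of why domain-specific systems recombine chunks instead of recording every phrase. The slot inventory and the headline template are hypothetical, invented for illustration; a real forecast system would have far more of both.

```python
from itertools import product

# Hypothetical slot inventory for a forecast headline (not from a real system).
SLOTS = {
    "hazard": ["HIGH SURF", "DENSE FOG", "HIGH WIND"],
    "start": ["WEDNESDAY", "THURSDAY", "FRIDAY"],
    "part": ["MORNING", "AFTERNOON", "NIGHT"],
    "end": ["THURSDAY", "FRIDAY", "SATURDAY"],
}
# Slot names get filled in; anything else is a fixed recorded word.
TEMPLATE = ["hazard", "start", "part", "THROUGH", "end"]


def snippet_plan(choices):
    """Return the ordered list of pre-recorded snippets to play back."""
    return [choices.get(item, item) for item in TEMPLATE]


def recordings_needed():
    """Compare one recording per chunk with one recording per whole phrase."""
    chunks = {chunk for options in SLOTS.values() for chunk in options}
    chunks.add("THROUGH")
    whole_phrases = list(product(*SLOTS.values()))
    return len(chunks), len(whole_phrases)
```

Even in this tiny toy, `recordings_needed()` shows the chunk inventory staying small while the number of whole phrases multiplies with every added slot.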
	
---

### Arbitrary Text Systems

> Alaina Rutkowska posted a great close-up photo of the tube of an ice-cream cone worm. The tube is made of sand grains, carefully selected and fitted together, and bound with a special adhesive. The worm has golden bristles used to rake through sediment so it can pick up yummy bits with little tentacles.

---

### Arbitrary Text Systems

- Must be able to reproduce any written sentence

- Must be able to cope with any textual input

- Word's not in the dictionary?  Godspeed.

- This is most common, and *really hard* to do.

---

### 'Vocaloid' Singing Systems

- Specialized Text-to-Speech systems which can serve as 'artificial singers'

- 'Vocaloid' is the best known software for this

- Popularized by Hatsune Miku, a singer who doesn't strictly speaking exist

- Strong emphasis on pitch matching and the ability to produce arbitrary melodies

---

### So, how do we do synthesis?

- Step one is...

---

# Text-to-Phone Modeling

---

### Text-to-Phone Processing

- "What sequence of phones do I need to produce this line of text?"

- "What's the appropriate prosody for this sentence?"

- "How is this given word actually pronounced?"

---

### Text-to-Phone by Dictionary

- "Here's how 134,000 spelled forms are pronounced"
	
- See [CMUDict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict)

- <img class="r-stretch" src="comp/cmudict.jpg">
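
---

### Sketch: dictionary lookup

A minimal sketch of text-to-phone by dictionary. The three entries are hand-typed in the ARPABET style that CMUDict uses and are only an illustration; the function names are made up for this slide.

```python
# Toy pronunciation dictionary in ARPABET style (digits mark vowel stress).
# Hand-typed for illustration; not a copy of the real CMUDict file.
TOY_DICT = {
    "FILE": ["F", "AY1", "L"],
    "APRIL": ["EY1", "P", "R", "AH0", "L"],
    "SCHEDULES": ["S", "K", "EH1", "JH", "UH0", "L", "Z"],
}


def lookup(word, dictionary=TOY_DICT):
    """Return the phone list for a word, or None if it is not in the dictionary."""
    return dictionary.get(word.upper())


def text_to_phones(text, dictionary=TOY_DICT):
    """Look up each whitespace-separated word, keeping None for unknown words."""
    words = [w.strip(".,!?\"'") for w in text.split()]
    return [(w, lookup(w, dictionary)) for w in words if w]
```

Anything that comes back as `None` (here, "PG&E" or "20") is exactly the case where a dictionary alone is not enough.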

---

### Grapheme-to-Phone Modeling

- Rule-based models to convert letters to phones

- "Phonics"

- Depends on the writing system being regular

- 🤣
	
- This is now being more successfully done by huge NNs
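
---

### Sketch: letter-to-phone rules

A minimal sketch of rule-based grapheme-to-phone conversion, trying the longest spelling pattern first. The rule table is a hypothetical toy covering a handful of English spellings, which is exactly why it breaks on irregular words.

```python
# Hypothetical toy rules: spelling chunk -> ARPABET-style phones.
RULES = {
    "sh": ["SH"], "ch": ["CH"], "th": ["TH"], "ee": ["IY"], "oo": ["UW"],
    "a": ["AE"], "e": ["EH"], "i": ["IH"], "o": ["AA"], "u": ["AH"],
    "b": ["B"], "c": ["K"], "d": ["D"], "f": ["F"], "g": ["G"], "h": ["HH"],
    "k": ["K"], "l": ["L"], "m": ["M"], "n": ["N"], "p": ["P"], "r": ["R"],
    "s": ["S"], "t": ["T"], "w": ["W"],
}
MAX_LEN = max(len(grapheme) for grapheme in RULES)


def g2p(word):
    """Convert a spelled word to phones, preferring the longest matching rule."""
    word = word.lower()
    phones, i = [], 0
    while i < len(word):
        for size in range(MAX_LEN, 0, -1):
            chunk = word[i:i + size]
            if chunk in RULES:
                phones.extend(RULES[chunk])
                i += len(chunk)
                break
        else:
            i += 1  # skip letters the toy rules do not cover
    return phones
```

`g2p("ship")` comes out fine, but `g2p("tough")` treats every letter as regular and produces something no English speaker would say.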

---

### Prosodic Modeling

- "Given the spelling, syntax, and punctuation, what's the right tune?"

- Rule-based approaches (e.g. "Rise at the end when you see a ?")

- Modeling approaches ("Given these linguistic facts, choose the likeliest value for this word")

- Neural Networks ("Just look at the data and figure it out")
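
---

### Sketch: a rule-based tune

A minimal sketch of the rule-based approach, choosing a sentence-final contour from punctuation alone. The contour labels are hypothetical names for this slide, and the rule is deliberately crude: English wh-questions, for example, often end in a fall.

```python
def final_contour(sentence):
    """Pick a sentence-final pitch contour from the closing punctuation alone."""
    # Ignore trailing whitespace and closing quotes or brackets.
    stripped = sentence.rstrip().rstrip("\"')")
    if stripped.endswith("?"):
        return "rise"  # "Rise at the end when you see a ?"
    if stripped.endswith("!"):
        return "emphatic-fall"  # hypothetical label for exclamations
    return "fall"
```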

---

... Once you've got the phones, and a prosodic plan, you can move into...

---

## Synthesizing Speech

---

### Text-to-Speech Methods

- Articulatory Synthesis

- Concatenative Synthesis

- Neural Network Synthesis

---

### Articulatory Synthesis

- Reproduce sounds by reproducing speech gestures

- "Virtual tongue"

- Can be implemented in hardware or in software

---

### Hardware Articulatory Synthesis

---

### Articulatory Synthesis: Pros

- *Zero* speech recording required

- Any voice is possible

- Coarticulatory stuff comes for free

---

### Articulatory Synthesis: Cons

- *Really* complicated to model

- We barely understand humans enough to do this
	
- Complex models needed for each word

- ... or at least for each combination of phones
	
- There are many things we don't model well yet

- Robots could do fine with a single speaker

---

### Articulatory Synthesis isn't used outside research

- Creating stimuli for perception experiments with careful control

- For TTS, this would be *insane*

---

### Concatenative Synthesis

---

## Concatenation

Combining multiple elements together in a one-after-the-other fashion

- Can be used for text, sound, images, etc.

---

### Concatenative Synthesis

- Take units of real recorded speech and stitch them together according to our needs

- Units can be of any length
	
- Smooth over the prosody after-the-fact

---

### Concatenative Synthesis Units can be of any length!

- Phones, diphones, triphones, syllables, words, phrases

- We grab the largest available chunks that fit

- New words can be constructed from existing (di)phones

- This is called *unit selection*

- Some systems are phone-level only
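
---

### Sketch: grabbing the largest chunks

A minimal sketch of "grab the largest available chunks that fit", done as a greedy longest-match over the phone sequence. The inventory is a hypothetical toy; real systems also weigh context and prosody, not just length.

```python
# Hypothetical inventory of recorded units, keyed by their phone sequence.
INVENTORY = {
    ("HH", "AY"), ("S", "ER", "F"),
    ("HH",), ("AY",), ("S",), ("ER",), ("F",), ("T",),
}


def pick_units(phones, inventory=INVENTORY):
    """Greedily cover the phone sequence with the longest available units."""
    units, i = [], 0
    longest = max(len(unit) for unit in inventory)
    while i < len(phones):
        for size in range(min(longest, len(phones) - i), 0, -1):
            candidate = tuple(phones[i:i + size])
            if candidate in inventory:
                units.append(candidate)
                i += size
                break
        else:
            raise KeyError(f"no recorded unit covers phone {phones[i]!r}")
    return units
```

A sequence the database has seen as big chunks comes back as two units; a new word falls back to single phones.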

---

### Domain-Specific Systems

> ...HIGH SURF THURSDAY AFTERNOON THROUGH FRIDAY...

> A long period west to northwest swell will bring high surf
Thursday afternoon through Friday. The peak swell and surf will
occur Thursday night into early Friday morning, with the highest
surf in southern San Diego County. Minor coastal flooding will
occur during periods of high tides.

---

### Arbitrary Text Systems

---

### Concatenative Synthesis Database Creation

- Step 1) Record a LOT of speech from an actual human (tens to hundreds of hours)

- Step 2) Divide into segments at the desired unit levels

- Step 3) Capture important phonetic information about each segment (e.g. pitch, duration, syllable position, etc)

- Step 4) Add additional words to the database as needed.

- **All of this happens offline, before the system is deployed**
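
---

### Sketch: what a database entry might hold

A minimal sketch of Steps 2 and 3: each segment is stored with the phonetic information needed to choose it later. The field names and the example file names are assumptions for illustration, not the layout of any particular system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded segment plus the facts we want to search on later."""
    phones: tuple            # e.g. ("HH", "AY")
    source_file: str         # hypothetical recording file name
    start_s: float           # where the segment starts in that file
    end_s: float             # where it ends
    mean_f0_hz: float        # average pitch over the segment
    syllable_position: str   # e.g. "onset", "nucleus", "coda"

    @property
    def duration_s(self):
        return self.end_s - self.start_s


def build_index(units):
    """Group units by phone sequence so synthesis can look up candidates fast."""
    index = defaultdict(list)
    for unit in units:
        index[unit.phones].append(unit)
    return dict(index)
```

Building this index is part of the offline work; at synthesis time the system only reads from it.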

---

### Concatenative Synthesis: Synthesis Process

- Step 1) Text Modeling (for prosody and phones) to determine the plan

- Step 2) Choose the optimal chunks

	- You want long units (e.g. use the largest available chunks)
	
	- You want it to match the target context and prosody
	
	- You want good transitions
	
	- Use a decision tree process to do this
	
- Step 3) Concatenate (combine end-to-end) the chunks
	
- Step 4) Modify the prosody and duration where needed

- Not all systems bother with this
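
---

### Sketch: choosing chunks by cost

A minimal sketch of Step 2, scoring each candidate on how well it matches the target and how smoothly it joins the previous chunk. Note the swap: this uses a simple greedy cost sum rather than the decision tree process named above, and the cost formulas and dictionary keys are assumptions for illustration.

```python
def target_cost(candidate, target):
    """How far the recorded unit is from the planned pitch and duration."""
    pitch_miss = abs(candidate["f0_hz"] - target["f0_hz"]) / target["f0_hz"]
    duration_miss = abs(candidate["dur_s"] - target["dur_s"]) / target["dur_s"]
    return pitch_miss + duration_miss


def join_cost(previous, candidate):
    """Penalize a pitch jump at the seam between two chunks."""
    if previous is None:
        return 0.0
    return abs(previous["f0_hz"] - candidate["f0_hz"]) / 100.0


def choose_units(targets, candidates_per_slot):
    """Greedy left-to-right pick of the cheapest candidate for each slot."""
    chosen, previous = [], None
    for target, candidates in zip(targets, candidates_per_slot):
        best = min(
            candidates,
            key=lambda c: target_cost(c, target) + join_cost(previous, c),
        )
        chosen.append(best)
        previous = best
    return chosen
```

Greedy picking can paint itself into a corner; a system that cares about quality would more likely search over the whole sequence at once.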

---

### Modifying Prosody

- This usually uses PSOLA (Pitch-Synchronous Overlap-Add), duplicating or dropping pitch cycles

- Modifies existing speech to hit length and pitch targets

- You've seen PSOLA before...
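
---

### Sketch: duplicating cycles

A minimal sketch of the "duplicating cycles" idea for hitting a length target. It only repeats or drops whole pitch cycles; real PSOLA also windows each cycle and overlap-adds them at new spacings to change pitch, which is left out here. The function name is made up for this slide.

```python
def stretch_cycles(cycles, target_count):
    """Repeat or drop whole pitch cycles so that target_count cycles remain."""
    if not cycles or target_count <= 0:
        return []
    n = len(cycles)
    # Map each output slot back to an evenly spaced input cycle.
    return [cycles[(k * n) // target_count] for k in range(target_count)]
```

Asking for more cycles than you have lengthens the sound by repetition; asking for fewer shortens it by skipping.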

---

### Concatenative Synthesis: Pros

- Can be very lightweight (diphone-based systems)

- Can handle most words acceptably

- Especially with good spelling dictionaries and databases
	
- Failures are usually plausible

- "Oh, look how it mispronounced 'Caminito'" rather than missing words
	
- Generally does the job

---

### Concatenative Synthesis: Cons

- You're only as good as your dictionaries, database, and spelling rules

- Names like "Jelena Krivokapic" may not be in any of them

- Very easy to seem "disjoint" and disfluent

- All prosodic changes must be explicitly modeled

---

## Neural Network TTS

---

### Using Neural Networks for TTS

- Feed the diphone/character sequence in, get back a likely acoustic signal

- This will generate a voice which matches (roughly) the input training voice

- Style transfer is possible too!

- Training the model on a generic voice

- Then learning the variation associated with another voice as a style embedding

- Then applying the variation to the pre-existing model

---

### Neural Network Text-to-Speech Style Transfer Examples

---

### You can make a model of anybody these days...

(Credit to Erick Amaro!)

---

### Tacotron 2

- Uses a recurrent (RNN) sequence-to-sequence network to go from characters to mel spectrograms, then a WaveNet vocoder to turn those into audio

- [The Tacotron 2 TTS System](https://arxiv.org/abs/1712.05884)

- [Samples](https://google.github.io/tacotron/publications/tacotron2/index.html)

---

### Amazon's Neural TTS

- [Samples](https://developer.amazon.com/blogs/alexa/post/7ab9665a-0536-4be2-aaad-18281ec59af8/varying-speaking-styles-with-neural-text-to-speech)

---

### Microsoft RobuTrans

- [RobuTrans](https://ojs.aaai.org//index.php/AAAI/article/view/6337) uses a transformer architecture for TTS

- Feeds in text, turns that into 'linguistic features', which map into mel-spectra

- There's a step dedicated to handling prosodic issues

- There's a final WaveNet step which turns the mel spectra into audio samples

---

### Neural Networks are winning

- If you can afford them, they produce great results

- Less hand-annotation

- Better at generalizing by analogy to words they haven't seen

- You'll likely be playing with Neural TTS

- Apple is even [marketing using Neural Text-to-Speech](https://www.shacknews.com/article/112048/apple-announces-neural-tts-support-for-siri-at-wwdc-2019)

- [Here's a great overview on Neural Text-To-Speech](https://arxiv.org/pdf/2106.15561.pdf)

---

### They're not without problems

- They require *lots* of training data and time

- Much harder for low-resource languages

- NN TTS models can do strange things, like skipping, duplicating, or continuously repeating words, if the NN state becomes misaligned

	- You can even get infinite strings of repeating words
	
	- You can also get strange 'blendings' and 'slurrings' which are unlike any word
	
- These problems are sometimes harder for humans to 'recover' from

- Missing, duplicated, or slurred words are harder for human understanding than a 'mispronunciation'

- Real-time synthesis is hard with 'autoregressive' models (where each prediction depends on the prior state)
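
---

### Sketch: why autoregressive models can get stuck

A minimal sketch of an autoregressive loop, where every frame is predicted from the one before it, so the steps cannot run in parallel. The two "models" are hypothetical stand-ins, and the step cap is one simple guard assumed here, not something specific to any system named in these slides.

```python
def generate(step_fn, max_steps=1000):
    """Run an autoregressive loop: each frame is predicted from the previous one."""
    frames, previous = [], None
    for _ in range(max_steps):
        frame = step_fn(previous)
        if frame is None:  # stand-in for a learned "stop" signal
            return frames, True
        frames.append(frame)
        previous = frame
    return frames, False  # hit the cap: the model never decided to stop


def healthy(previous):
    """Toy model that counts upward and stops after five frames."""
    nxt = 0 if previous is None else previous + 1
    return nxt if nxt < 5 else None


def stuck(previous):
    """Toy model whose state never advances, so it repeats forever."""
    return 7
```

With `stuck`, only the cap ends the loop, which is the toy version of a voice repeating the same word indefinitely.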

---

### Let's try a Neural TTS system

<https://cloud.google.com/text-to-speech/>

---

### Wrapping Up

- TTS systems involve both modeling text and producing sound

- There are several ways to do TTS

- Articulatory Synthesis is neat, but unneeded

- Concatenative Synthesis works quite well

- Neural Text-to-Speech systems are getting amazing

---

### For next time

We're going to DISCUSS why TTS systems are reeeeeally hard to get right.  Also, fyi, I worked for Jelena Krivokapic in 2017

---

<huge>Thank you!</huge>