Using the HTK and the Penn Phonetics Lab Forced Aligner on Mac OS X

This was originally posted on my blog, Notes from a Linguistic Mystic, in 2014.

EDIT: According to Mitch Ohriner in November 2019, these instructions still work, albeit with the following two changes. Thanks Mitch!

In “strarr.c” in htk/HTKLib I had to replace Line 21 with #include <stdlib.h>

This is probably because of the permissions structure of my school-owned computer, but I had to use sudo make install instead of make install to finish up the HTK stuff (so that it would elicit a password instead of just saying “permission denied”).

And according to Kechun Li, in February 2023:

I also hope to report some small changes that I had to do, in case this is helpful to future users. First, in Step 0.5, the link for ‘these instructions’ did not work. But there is plenty of content on how to install the X-code command line tools so I managed to do it. Second, in Step 1, I need to change another line of code in one of HTK files. In htk/HTKLib/strarr.c, I need to change #include <malloc.h> to #include <malloc/malloc.h>, otherwise, there will be an error about this file can’t be found. This is the solution I found in this post.

Finally, I’m now recommending Jian Zhu’s Charsiu Neural Forced Aligner over P2FA, as it’s much more performant, and in many cases doesn’t require text at all.

As part of my dissertation, I’m having to record a large number of subjects and do analyses on their speech. The biggest problem with doing that is that in order to do the analyses automatically, you need to time-align the words, creating files which tell your analysis software (in this case, Praat) where each sentence/word/sound starts and ends.

The fastest way to do this automatically is using what’s called “forced alignment”, and the current best forced aligner for English for phonetic use is the Penn Phonetics Lab Forced Aligner. In this post, I’ll describe how I got it working on my Mac running Mavericks (10.9), in a step-by-step sort of way.

There are four basic steps involved:

1. Install HTK (the hard part!)
2. Install the Penn Phonetics Lab Forced Aligner (henceforth P2FA)
3. Install Sox (which is required by P2FA)
4. Set it up for your data and run it to get aligned textgrids

Disclaimer

This post is up as a public service. I’ve done my absolute best to be comprehensive and clear, but your system/install/issue may vary, and they might update any of these tools at any time, and this post may not change when they do. I’m also mid-dissertation, so I’m unable to offer personal assistance setting up P2FA to commenters or by email.

Feel free to leave a comment if you have a question or issue, and maybe somebody can help, but nothing’s guaranteed. In short, the Linguistic Mystic is not responsible for any troubles, your mileage may vary, good luck and godspeed.

Step 0.5: Xcode Command Line Tools

If you’re doing anything code-y on a Mac, you need Xcode for the compilers and other useful tools it has.

Download Xcode from the Mac App Store (it’s free).
Follow these instructions to install the Xcode command line tools.

Step 1: Installing HTK

This is the hardest and most terrifying part if you’re not used to compiling and installing command-line tools. We’ll take it step by step, though.

The P2FA readme is very specific that you need version 3.4 of HTK, so let’s install that. The manual isn’t terribly helpful for a Mac install, so we’ll have to go this alone.

Go over to http://htk.eng.cam.ac.uk and register. It’s free and only takes a minute.
Download HTK 3.4 from this page. Since you’re on a Mac, grab HTK-3.4.tar.gz.
On your Mac, go to wherever the file downloaded to, and double-click the .tar.gz file to expand it. This will create a folder called “htk”, and for the rest of this tutorial, I’m going to pretend it’s on your desktop.
Open up Terminal.app (/Applications/Utilities/Terminal.app).
- Any time you see a command inside a code block, that means “type the command into a terminal exactly”
Enter the command cd ~/Desktop/htk
Run ./configure -build=i686-apple-macos LDFLAGS=-L/opt/X11/lib CFLAGS='-I/opt/X11/include -I/usr/include/malloc' to configure the software for OS X. A major hat-tip to this post for helping me with that command.

At this point, the HTK manual says you should be able to make all && make install. But it’s not that easy. If you run that command, you’ll get a couple of errors which look like:

esignal.c:1184:25: error: use of undeclared identifier 'ARCH'
    architecture = ARCH;

Translation: “I’m trying to prep the file HTKLib/esignal.c, but nobody told me what system architecture this code is gonna be run on. Unless I know that, I can’t build!” This is actually a problem with the way HTK is written, but luckily, we can fix it by manually specifying that the architecture is “darwin” (which it always is, for OS X). A major hat-tip to this post for helping me figure out some of these issues.
Open the HTKLib/esignal.c file in a code-friendly text editor. You can use Xcode, or my personal favorite TextMate 2.
Find and change the below lines:

Change Line 974: if (strcmp(architecture, ARCH) == 0) /* native architecture */

To: if (strcmp(architecture, "darwin") == 0) /* native architecture */

Change Line 1184: architecture = ARCH;

To: architecture = "darwin";
Now, let’s build! Run make all in the terminal window. Some warnings will pop up, but we don’t care.
Now we’ll install it. Run make install
Just to test, run LMerge. It’ll pop up a message about USAGE, and that’s fine. That just tells us it installed OK.
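If you'd rather not make the two esignal.c edits by hand, they can be sketched as a small script. Treat this as a sketch under assumptions, not part of HTK: the names patch_arch and patch_file are made up for illustration, and it assumes the two lines read exactly as quoted above. It matches on the text rather than on line numbers, since those may differ in your copy.

```python
from pathlib import Path

# The two edits described above, as (original text, replacement text) pairs.
ARCH_FIXES = [
    ('strcmp(architecture, ARCH)', 'strcmp(architecture, "darwin")'),
    ('architecture = ARCH;', 'architecture = "darwin";'),
]


def patch_arch(source: str) -> str:
    """Return the C source with both ARCH references replaced by "darwin"."""
    for old, new in ARCH_FIXES:
        source = source.replace(old, new)
    return source


def patch_file(path: str) -> None:
    """Patch a file in place, keeping a .bak copy of the original next to it."""
    target = Path(path)
    raw = target.read_bytes()
    Path(str(target) + ".bak").write_bytes(raw)
    # latin-1 maps every byte to one character, so the round trip is lossless.
    patched = patch_arch(raw.decode("latin-1"))
    target.write_bytes(patched.encode("latin-1"))
```

From inside the htk folder, calling patch_file("HTKLib/esignal.c") would apply both changes before you run make all. Running it a second time changes nothing, because the original text is no longer there to match.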

Whew. HTK is installed. That was the tough part. Now let’s install P2FA.

Step 2: Installing Penn Phonetics Lab Forced Aligner

This part’s easier!

Download P2FA.
Double-click the .tgz file to open it up, giving a “p2fa” folder.
Move that folder someplace easy to find.

Done!

Step 3: Installing Sox

P2FA depends on Sox to work. The easiest way to get Sox, by far, is using Homebrew, so we’ll do that. Homebrew is a great little program for easily and quickly installing all sorts of fun command-line tools. I love it.

Open your terminal back up.
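Run ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)" (this command came straight from the Homebrew homepage when this was written; the installer command has changed since then, so copy the current one from the Homebrew homepage).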
Once that’s done, install sox using brew install sox

Done!
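Before moving on, it can be worth confirming that the pieces from Steps 1 and 3 are actually reachable from the terminal. Here is a minimal sketch using Python's standard shutil.which; the function name missing_tools is hypothetical, and the tool list only contains LMerge (the HTK tool used as a test in Step 1) and sox.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # LMerge is the HTK tool used as a smoke test in Step 1; sox comes from Step 3.
    missing = missing_tools(["LMerge", "sox"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("HTK and sox both look installed.")
```

The which parameter is only there so the lookup can be swapped out; in normal use you would leave it alone.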

Step 4: Setting P2FA up for your data and running it

Unfortunately, P2FA needs a particular format for your data to work. In my case, I had a bunch of files of people saying the exact same things, as prompted by a script. So, my sound files started with:

The word is men

The word is mint

… and so forth for the other 318 words which they read in the script.

To force-align data like this:

Make sure that any extraneous talk is trimmed out, such that the speech actually matches the script.
Create a text file which it will be aligned against. To capture the above, this file would look like:

{SL} {NS} sp THE sp WORD sp IS sp MEN {SL} {NS} sp THE sp WORD sp IS sp MINT {SL} {NS}

… and then goes on to do the same thing for all the other ‘the word is (word)’ sentences. The {SL} stands for “silence”, and covers the silence after they finish the sentence. The {NS} means “noise”, which is there to pick up the click of the keyboard as they advance the slide. Then, each sp (small pause) is in case the person pauses again between words. In P2FA, these “small pauses” can be present or not, and they should be sprinkled liberally throughout your data. All words need to be capitalized.
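Typing out 320 of these by hand gets old fast, so the pattern above can be generated instead. This is a rough sketch assuming every sentence uses the same carrier phrase; build_transcript and its carrier parameter are names invented for this example, not anything P2FA provides.

```python
def build_transcript(words, carrier=("THE", "WORD", "IS")):
    """Build a P2FA-style transcript for a list of carrier-sentence targets."""
    tokens = ["{SL}", "{NS}"]  # leading silence and noise
    for word in words:
        for item in (*carrier, word):
            tokens += ["sp", item.upper()]  # optional small pause before each word
        tokens += ["{SL}", "{NS}"]  # silence and keyboard click after the sentence
    return " ".join(tokens)
```

Calling build_transcript(["men", "mint"]) reproduces the example line above, and the result can be written straight to a text file for the aligner.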

Save the file you’ve created. I’ll call it “alignscript.txt” in other examples.
Make sure that all words are included in the dictionary. It’ll yell at you at runtime if you’ve asked it to align a word which isn’t in the dictionary, so, if you’re aligning non-words (or even odd, new words), you’ll need to add them. Let’s say you want to add “neighed”:
1. To add a new word to the dictionary, open the “model” folder, and then open “dict” in your text editor.
2. Find the line for a word which rhymes with your new word, like “made”:
  
  MADE M EY1 D
3. Modify the sounds for the new word:
  
  NEIGHED N EY1 D
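To find out which words need adding before the aligner yells at you, you can compare your transcript against the dictionary ahead of time. The sketch below assumes the dictionary has one entry per line with the word first, as in the MADE example above; dictionary_words and missing_words are hypothetical helper names.

```python
def dictionary_words(dict_text):
    """Collect the headwords (first field of each non-empty line) from dictionary text."""
    return {line.split()[0] for line in dict_text.splitlines() if line.strip()}


def missing_words(transcript, known):
    """Return transcript words not in `known`, skipping sp and {SL}/{NS}-style markers."""
    missing = []
    for token in transcript.split():
        if token == "sp" or token.startswith("{"):
            continue  # pause and noise markers are not dictionary words
        if token not in known and token not in missing:
            missing.append(token)
    return missing
```

In practice you would read the text of the “dict” file and your transcript file, then pass them through these two functions; anything returned by missing_words needs a new dictionary line like the NEIGHED one above.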
Downsample the file you’re looking to align using Praat. I’ve had great luck using the suggested 11,025 Hz sampling rate. Save this as a .wav file.
- Remember, you can always use the TextGrid with the full-quality file later; the downsampled file is just temporary for alignment.
Run the aligner, modifying the paths in the command to fit where you’ve got P2FA and your data. The command is: python /path/to/align.py sound_file.wav alignscript.txt output_name.TextGrid

So, for my actual work, if I wanted to align the recording session file for a subject named “sarah”: python ~/data/p2fa/align.py ~/data/sarah_session.wav ~/data/alignscript.txt ~/data/sarah_session.TextGrid
Go have coffee. For a 15 minute recording, it takes around 10 minutes for the forced aligner to run on my (fairly recent) Mac.
Open up the newly-generated .TextGrid file in Praat alongside the sound file and see how it did.
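If you have a whole folder of subjects who all read the same script, the align command above can be looped over every recording. This is a sketch, not something that ships with P2FA: align_command and align_all are made-up names, it assumes the interpreter is called python on your system, and it assumes you want each TextGrid saved next to its .wav file.

```python
import subprocess
from pathlib import Path


def align_command(align_py, wav, transcript, interpreter="python"):
    """Build the argument list for one aligner run; output sits next to the .wav."""
    wav = Path(wav)
    textgrid = wav.with_suffix(".TextGrid")
    return [interpreter, str(align_py), str(wav), str(transcript), str(textgrid)]


def align_all(align_py, wav_dir, transcript):
    """Run the aligner on every .wav in wav_dir, one after another."""
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        # check=True stops the loop if the aligner exits with an error.
        subprocess.run(align_command(align_py, wav, transcript), check=True)
```

One thing to watch: subprocess does not expand ~ the way the shell does, so pass full paths or wrap them in Path(...).expanduser() first.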

P2FA in Practice

So far, I’ve been really impressed with the results. It’s pretty good, with only one major error (missed word or complete mis-identification) in every two files. Individual sounds are missed more regularly (where it’ll cut off the /z/ in “meds” or the /n/ in “plan”). Vowel boundaries are off by 10 ms or so in around 1/3 of tokens.

I’ve been hand-correcting the data because I care a lot about those boundaries, but if I just wanted a measure at the center of the vowel, I wouldn’t even bother, as the vowel’s center is quite reliably in the center of the vowel span. Regardless of these issues, using P2FA with hand-correction, I’m able to beautifully annotate data in around 1/4 of the time it takes to do it by hand. It’s an absolutely excellent tool, and I would recommend it to anybody.

So, I hope this was helpful, good luck, and good alignment!