### "I know the project is due today but I'm just getting started and I have questions..."
--- # Automatic Part-of-Speech Tagging ### Will Styler - LIGN 6 --- ### Today's Plan - Why computers can't POS tag like humans - Creating a corpus for POS-tagging use - Part-of-Speech Ambiguity - How does HMM-based POS tagging work? - POS Tagging is hard --- ### We've talked about parts of speech already --- ### Lexical Categories - **Nouns**: bike, car, cat, dog, tofu, dude, bling - **Verbs:** go, eat, talk, walk, yeet - **Adjectives:** lit, sweet, hot, cool, awesome - **Adverbs:** well, fast, slowly, easily - **Pre/postpositions:** with, from, on, in - **Determiners:** the, a, that, this, those - **Pronouns:** she, he, him, her, it, I, you, they - **Conjunctions:** and, or, whenever, while - **Numeral:** one, twice, third - **Interjection:** ouch, tsk, damnit! --- ### ... but these are linguistic, human categories - We understand the functional distinction between an adverb and a preposition - We can talk within a certain language, but understand when the rules change - We know the *semantics* of a given word - We know that a pipe is an object, and that it has a function that could be verby --- ### We also gave you 'tests' to use - "Can you make it plural? If so, it's a noun!" - "Can you inflect it? If so, it's a verb!" - "If you can use a comparative construction, it's probably an adjective!" - "Pronouns can substitute for noun phrases" - "Is this a relationship a squirrel can have with a tree? Then probably Preposition!" --- ### ... but a computer can't use *any* of these tests - "Sure, rotates is the plural of 'rotate', so it's a noun" - "I treed, therefore, tree is a verb" - "This slide is computerer than the last one" - "What the heck is a noun phrase, anyways?" - "Squirrel? Tree? Huh?" --- ### So, we can't teach computers to do POS tagging in the same way that we teach humans to! 
--- # Preparing for POS Tagging --- ### Before we can automate it, we need to do it with humans - This is always going to be the case --- ### Determining the best tagset - This is partly language-specific - What POS categories exist - What additional detail would be helpful in prediction - Partly based on what corpora are available - Use this tagset, or annotate 12 million words? --- ### For English...
(Table from Jurafsky and Martin 'Speech and Language Processing' 3e) --- ### Annotating a corpus for POS tags - Teach some annotators the POS-tagging system - Run a sample POS-tagging system to get suggestions - Have the annotators hand correct them --- > On/IN an/DT exceptionally/RB hot/JJ evening/NN early/RB in/IN July/NNP a/DT young/JJ man/NN came/VBD out/RP of/IN the/DT garret/NN in/IN which/WDT he/PRP lodged/VBN and/CC walked/VBD slowly/RB ,/, as/RB though/IN in/IN hesitation/NN ,/, towards/IN a/DT bridge/NN ./.
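--- A tagged corpus in this `word/TAG` format is easy to process by machine. Here is a minimal sketch in plain Python (the sentence is a shortened version of the example above; splitting on the *last* slash is a defensive assumption, in case a word itself contains `/`):

```python
from collections import Counter

# Part of the hand-corrected sentence above, in word/TAG format
tagged = ("On/IN an/DT exceptionally/RB hot/JJ evening/NN early/RB in/IN "
          "July/NNP a/DT young/JJ man/NN came/VBD out/RP of/IN the/DT garret/NN")

# Split each token on its *last* '/', so words containing '/' still parse
pairs = [token.rsplit("/", 1) for token in tagged.split()]

# Count how often each tag occurs -- the raw material for tag probabilities
tag_counts = Counter(tag for word, tag in pairs)

print(pairs[:3])        # → [['On', 'IN'], ['an', 'DT'], ['exceptionally', 'RB']]
print(tag_counts["DT"]) # → 3  ('an', 'a', 'the')
```

Counts like these, once normalized into probabilities, are exactly what an automatic tagger is trained on.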
--- All example tagging from today comes from [the Stanford Parser](http://nlp.stanford.edu:8080/parser/index.jsp) --- ### There are many tagged corpora already out there - You don't need to do this. - Which is good. - POS tagging is *super boring* --- ### Once you have a tagset and a corpus, you can use... --- # Automatic POS Tagging --- ## POS Ambiguity How much uncertainty there is about the part of speech of a given word --- ### Some words are *certain* in terms of POS - 'Funniest' - 'hesitantly' - 'Sharon' - Around 85% of word *types* are *unambiguous* in terms of POS - ... but around 65% of *tokens* in running text are ambiguous :( --- ### Some words are only a bit ambiguous in POS - 'in' - 'a' - 'between' - 'Marshall' - 'Demonstrated' --- ### Some words are very ambiguous in POS - 'sink' - 'that' - 'lift' - 'will' --- ### Some words have many parts of speech - earnings growth took a back/JJ seat - a small building in the back/NN - a clear majority of senators back/VBP the bill - Dave began to back/VB toward the door - enable the country to buy back/RP debt - I was twenty-one back/RB then --- ### POS tagging is about resolving this ambiguity --- ### The Stupid Approach: 'Most Frequent Tag' - "Let the tag of word X be the most likely tag of word X in our corpus" - Tagging is just a lookup table - 'fly' is most frequently a verb - Therefore, every instance of 'fly' is a verb - This provides a 'baseline' performance - "If we take the dumbest possible approach, what performance do we get?" --- ### Most Frequent Tag Accuracy - Accuracy here is 'percentage of tags correctly labeled' - Most Frequent Tag gets 92% accuracy on WSJ data! - If we want to use something more complicated, it has to do better than this. - If you can't beat the dumbest approach, you've got a problem --- ### Slightly more intelligent: Word form features - Capitalization - 'I showed Will my will, prepared by Green.' - Prefixes and suffixes are helpful.
- 'Ungerplinked' - 'Flabertibly' - 'Skwerking' - X-Y constructions are usually adjectives - "New-found" - "46-year" - "Under-utilized" --- ### ... but words come in sequences. We should use that! --- # HMM-based POS Tagging --- ## Hidden Markov Model A machine learning process which models a series of **observations**, with the assumption that there's some 'hidden' **state** which helps to predict the observations --- ### One major assumption of HMMs - **The probability of the current state is based ONLY on the previous state** - The model does not have long-term 'memory' - The model cannot look ahead - This is a left-to-right walk through the data --- ### HMMs for POS Tagging - **Observations:** The series of words in the text - **States:** The parts of speech of those words - 'Look at the sequence of words, to help predict which part of speech corresponds to this word' --- ### How do we use HMMs for POS tagging? - 1: Calculate the probabilities of parts-of-speech (and sequences) from a corpus - 2: Tokenize the input data - 3: Using the input, decide the most likely sequence of parts-of-speech --- ### We need to know two types of probabilities - **Observation probability:** The probability that a word has a given tag - e.g. "How likely is 'will' to be a modal verb?" - **Transition probability:** The probability of one POS, given the prior one - e.g. "How likely is a modal verb to follow a pronoun?" --- ### To get observation probabilities... - Count the number of instances of "will" in the corpus - Count the number of times that it's a modal verb - Count the number of times it's a noun - Count the number of times it's a proper noun - ... and so on ...
- Turn these numbers into P(modal|will) (and so on) --- ### Observation probability gets at the idea of 'POS Ambiguity' - Words that have little ambiguity will have high probabilities for one category - Words that have lots of ambiguity may have nearly equivalent probabilities across several categories --- ### To get transition probabilities... - Count the number of instances of modal in the corpus - Count the number of times modal follows pronoun - Count the number of times modal follows noun - Count the number of times modal follows verb - ... and so on ... - Turn these numbers into P(modal|previous pronoun) (and so on) --- ### Transition probabilities get at the idea that syntax involves sequences of word types - How likely is a determiner to be followed by a noun? - REALLY likely - How likely is a preposition to be followed by a determiner? - Reasonably likely - How likely is a preposition to be followed by a proper noun? - Likely-ish - How likely is a modal verb (e.g. 'will') to be followed by a noun? - Really unlikely --- ### Now we know the probabilities! - Then we tokenize - Then... --- ### We decode the HMM - "Given this sequence of words, what's the most likely sequence of POS tags?" - This uses the Viterbi Algorithm - Which we're not going into! --- ### HMM Decoding: The Basic Idea - We know the probability of a given state (POS tag) given each word - We know the probability of a given state (POS tag) given the prior state (POS tag) - We can calculate the most probable state for each word in light of those two facts - **What is the most likely string of states that gets us through the entire sentence?** ---
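This 'basic idea' is what the Viterbi algorithm computes. Below is a toy sketch in plain Python, not the full textbook algorithm: the two-tag model and every probability are invented for illustration, not estimated from a real corpus. (Note that in a formal HMM the emission probability runs as P(word|tag).)

```python
# Toy HMM: two tags and made-up probabilities (NOT from a real corpus)
tags = ["MD", "NN"]
start = {"MD": 0.3, "NN": 0.7}                     # P(tag at sentence start)
trans = {("MD", "MD"): 0.1, ("MD", "NN"): 0.9,     # P(tag | previous tag)
         ("NN", "MD"): 0.4, ("NN", "NN"): 0.6}
emit  = {("MD", "will"): 0.8, ("NN", "will"): 0.2, # P(word | tag): the
         ("MD", "dust"): 0.1, ("NN", "dust"): 0.9} # emission probability

def viterbi(words):
    # best[tag] = (probability, path) of the best tag sequence ending in tag
    best = {t: (start[t] * emit[(t, words[0])], [t]) for t in tags}
    for word in words[1:]:
        # For each tag, keep only the highest-probability path reaching it
        best = {t: max((best[prev][0] * trans[(prev, t)] * emit[(t, word)],
                        best[prev][1] + [t]) for prev in tags)
                for t in tags}
    # Return the complete path with the highest total probability
    return max(best.values())[1]

print(viterbi(["will", "dust"]))  # → ['MD', 'NN']
```

Because the transition term rewards likely tag *sequences*, the same word can be decoded with a different tag depending on its neighbors.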
--- ### So, we have the *most likely* set of POS tags - Both with respect to individual words' probabilities - ... and with respect to the likely sequence of tags - This gives us the best of both worlds! - Cool! --- ### One consequence of HMM-based tagging - Word order matters! --- > the/DT three/CD cute/JJ cats/NNS made/VBN will/MD sit/VB back/RP in/IN awe/NN - > sit/VB cute/JJ three/CD awe/NN the/DT will/NN back/RB made/VBN in/IN cats/NNS - 'will' goes from modal to noun - 'back' goes from particle to adverb --- ### How does HMM-based POS tagging perform? - Baseline ("Most Frequent Class"): ~92% accuracy - Hidden Markov Model POS Tagging: ~97% accuracy - **That's pretty good!** - This is one of the 'flagship' tasks for HMMs - Other approaches exist - Neural Networks didn't win, for once! - (Well, OK, they might win by a few decimal places) --- ... Why only 97% accuracy? --- # POS Tagging is hard --- ## Use-mention distinctions --- ### Not all words are being used, when being used - You can have words that show up in uninformative contexts - Words that are being mentioned, rather than used, are hard to POS-tag --- ### 'She said 'bear' was her favorite word.' - > She/PRP said/VBD `/`` bear/NN '/'' was/VBD her/PRP$ favorite/JJ word/NN ./. --- ### 'Roger texted me 'back'' - > Roger/NNP texted/VBD me/PRP `/`` back/VBP '/'' --- ### 'I bought the The Pianist DVD' - > I/PRP bought/VBD the/DT The/NNP Pianist/NNP DVD/NN - > I/PRP bought/VBD the/DT the/DT pianist/NN DVD/NN --- ## Ambiguous Sentences --- ### Some sentences are actually ambiguous in POS tagging - Not all ambiguities of POS are resolvable by humans --- ### 'Maria was entertaining last night' - > Maria/NNP was/VBD entertaining/JJ last/JJ night/NN --- ### 'I saw the official take from the store.' - > I/PRP saw/VBD the/DT official/NN take/VBP from/IN the/DT store/NN ./. --- ### 'You should ask a Smith.' - > You/PRP should/MD ask/VB a/DT Smith/NNP ./. --- ### 'I hate bridging gaps.'
- > I/PRP hate/VBP bridging/VBG gaps/NNS ./. --- ## Rare or unknown words --- ### Rare or unknown words - Capitalization and morphology are the best tools - You can rely mostly on the transition probabilities within the model - "Well, I know that the last thing was a modal 'will', so 'gerfleeble' is probably a verb!" --- ### 'yeet' - > yeet/NN --- ### 'yeeting' - > yeeting/NN --- ### 'yeeted' - > yeeted/JJ --- ### 'I yeet when I throw empty cans' - > I/PRP yeet/VBP when/WRB I/PRP throw/VBP empty/JJ cans/NNS --- ### 'lit' - > lit/UH --- ### 'That phonetics lab meeting was lit' - > That/DT phonetics/NNS lab/NN meeting/NN was/VBD lit/JJ --- ### 'I'm studying English Lit' - > I/PRP 'm/VBP studying/VBG English/NNP Lit/NNP --- ### 'They lit the beacon of Amon Din to summon the Rohirrim' - > They/PRP lit/VBD the/DT beacon/NN of/IN Amon/NNP Din/NNP to/TO summon/VB the/DT Rohirrim/NNP --- ## Homonyms --- ### Homonyms are (always) a problem - Is 'saw' a past tense verb, a noun, or a present tense verb? --- ### 'I saw the sign' - > I/PRP saw/VBD the/DT sign/NN --- ### 'I saw the sign whenever I need to test the cutting feel of a new blade' - > I/PRP saw/VBD the/DT sign/NN whenever/WRB I/PRP need/VBP to/TO test/VB the/DT cutting/VBG feel/NN of/IN a/DT new/JJ blade/NN --- ### 'I bought a saw' - > I/PRP bought/VBD a/DT saw/NN --- # POS Tagging is crucial --- ### POS Tagging is very helpful - Helps disambiguate word senses - Helps identify verbs vs. the things the verbs are acting on - Provides the basis for syntactic parsing! --- ### Wrapping up - Computers can't use meaning or language intuitions to POS-tag - POS-tagged data is valuable - Words can be more or less ambiguous in terms of POS tags - HMMs work great for POS Tagging - But POS tagging is still hard! --- ### For Next Time - We'll talk about syntactic parsing ---
Thank you!