POS Tagging and HMM
Text Mining Team, adapted from Heng Ji
What is Part-of-Speech (POS)?
- Generally speaking, word classes (= POS):
- Verb, Noun, Adjective, Adverb, Article, …
- We can also include inflection:
- Verbs: Tense, number, …
- Nouns: Number, proper/common, …
- Adjectives: comparative, superlative, …
- …
- 8 (ish) traditional parts of speech:
- Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
- Also called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
- There is a lot of debate within linguistics about the number, nature, and universality of these; we'll completely ignore this debate.
- N    noun         chair, bandwidth, pacing
- V    verb         study, debate, munch
- ADJ  adjective    purple, tall, ridiculous
- ADV  adverb       unfortunately, slowly
- P    preposition  of, by, to
- PRO  pronoun      I, me, mine
- DET  determiner   the, a, that, those
POS Tagging
- The process of assigning a part-of-speech or lexical class marker to each word in a collection.
- Example (WORD/tag): the/DET koala/N put/V the/DET keys/N on/P the/DET table/N
Penn TreeBank POS Tag Set
- Penn Treebank: hand-annotated corpus of the Wall Street Journal, 1M words
- 46 tags
- Some particularities:
- to/TO not disambiguated
- Auxiliaries and verbs not distinguished
Why is POS tagging useful?
- Speech synthesis:
- How to pronounce “lead”?
- INsult inSULT
- OBject obJECT
- OVERflow overFLOW
- DIScount disCOUNT
- CONtent conTENT
- Stemming for information retrieval
- A search for "aardvarks" can also return "aardvark"
- Parsing, speech recognition, etc.
- Possessive pronouns (my, your, her) are likely to be followed by nouns
- Personal pronouns (I, you, he) are likely to be followed by verbs
- Need to know if a word is an N or V before you can parse
- Information extraction
- Finding names, relations, etc.
- Closed class: a small fixed membership
- Prepositions: of, in, by, …
- Auxiliaries: may, can, will, had, been, …
- Pronouns: I, you, she, mine, his, them, …
- Usually function words (short common words which play a role in grammar)
- Open class: new ones can be created all the time
- English has 4: Nouns, Verbs, Adjectives, Adverbs
- Many languages have these 4, but not all!
- Nouns
- Proper nouns (Boulder, Granby, Eli Manning)
- English capitalizes these.
- Common nouns (the rest).
- Count nouns and mass nouns
- Count: have plurals, get counted: goat/goats, one goat, two goats
- Mass: don’t get counted (snow, salt, communism) (*two snows)
- Adverbs: tend to modify things
- Unfortunately, John walked home extremely slowly yesterday
- Directional/locative adverbs (here, home, downhill)
- Degree adverbs (extremely, very, somewhat)
- Manner adverbs (slowly, slinkily, delicately)
- Verbs
- In English, verbs have morphological affixes (eat/eats/eaten)
Closed Class Words:
- prepositions: on, under, over, …
- particles: up, down, on, off, …
- determiners: a, an, the, …
- pronouns: she, who, I, …
- conjunctions: and, but, or, …
- auxiliary verbs: can, may, should, …
- numerals: one, two, three, third, …
English Particles and Conjunctions
(example tables not reproduced here)
POS Tagging: Choosing a Tagset
- There are many parts of speech and potential distinctions we could draw
- To do POS tagging, we need to choose a standard set of tags to work with
- Could pick a very coarse tagset: N, V, Adj, Adv
- More commonly used is a finer-grained set, the "Penn TreeBank tagset": 45 tags (PRP$, WRB, WP$, VBG, …)
- Even more fine-grained tagsets exist
Using the Penn Tagset
- The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
- Prepositions and subordinating conjunctions are marked IN ("although/IN I/PRP…")
- Except the preposition/complementizer "to", which is just marked "TO"
POS Tagging
- Words often have more than one POS: back
- The back door = JJ
- On my back = NN
- Win the voters back = RB
- Promised to back the bill = VB
- The POS tagging problem is to determine the POS tag for a particular instance of a word.
How Hard is POS Tagging? Measuring Ambiguity
Current Performance
- How many tags are correct?
- About 97% currently
- But baseline is already 90%
- Baseline algorithm (see the sketch after this list):
- Tag every word with its most frequent tag
- Tag unknown words as nouns
- How well do people do?
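A minimal sketch of that baseline, assuming a training corpus available as (word, tag) pairs; the function and variable names here are illustrative, not from any particular library:

```python
# Baseline tagger: most frequent tag per word, unknown words -> NN.
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    counts = defaultdict(Counter)          # word -> Counter of its tags
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # Keep only the most frequent tag for each word
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words, most_frequent_tag):
    # Unknown words default to NN (noun), per the baseline above
    return [(w, most_frequent_tag.get(w, "NN")) for w in words]

corpus = [("the", "DT"), ("back", "NN"), ("back", "JJ"), ("back", "NN")]
model = train_baseline(corpus)
print(baseline_tag(["the", "back", "door"], model))
# [('the', 'DT'), ('back', 'NN'), ('door', 'NN')]
```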
Quick Test: Agreement?
- the students went to class
- plays well with others
- fruit flies like a banana
Tagset: DT (determiner: the, this, that), NN (noun), VB (verb), P (preposition), ADV (adverb)
Quick Test
- the/DT students/NN went/VB to/P class/NN
- plays/VB well/ADV with/P others/NN
- fruit flies like a banana is ambiguous:
- fruit/NN flies/NN like/VB a/DT banana/NN
- fruit/NN flies/VB like/P a/DT banana/NN
How to do it? History
- 1960s: Brown Corpus created (EN-US), 1 million words tagged
- 1970s: Greene and Rubin, rule-based tagging: ~70%
- 1980s: LOB Corpus created (EN-UK), 1 million words; HMM tagging with CLAWS: 93%-95%; rule-based taggers: 95%+
- 1990s: Penn Treebank (tagged by CLAWS); DeRose/Church efficient trigram HMM tagger handling sparse data: 95%+ (Kempe); Transformation-Based Tagging (Eric Brill), rule-based: 96%+; neural networks: 96%+
- 2000s: tree-based statistics (Helmut Schmid): 96%+; combined methods: 98%+
- Along the way, POS tagging became separated from other NLP tasks

Two Methods for POS Tagging
1. Rule-based tagging (e.g., ENGTWOL)
2. Stochastic tagging
- HMM (Hidden Markov Model) tagging
- MEMMs (Maximum Entropy Markov Models)

Rule-Based Tagging
- Start with a dictionary
- Assign all possible tags to words from the dictionary
- Write rules by hand to selectively remove tags, leaving the correct tag for each word
Rule-based taggers
- Early POS taggers were all hand-coded
- Most of these (Harris, 1962; Greene and Rubin, 1971), and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture:
- Stage 1: look up each word in a lexicon to get a list of potential POSs
- Stage 2: apply rules which certify or disallow tag sequences
- Rules were originally handwritten; more recently, machine learning methods can be used
Start With a Dictionary
- she: PRP
- promised: VBN, VBD
- to: TO
- back: VB, JJ, RB, NN
- the: DT
- bill: NN, VB
- Etc… for the ~100,000 words of English with more than 1 tag
Assign Every Possible Tag
                  NN
                  RB
     VBN          JJ         VB
PRP  VBD      TO  VB    DT   NN
She  promised to  back  the  bill

Write Rules to Eliminate Tags
- Rule: eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"
                  NN
                  RB
                  JJ         VB
PRP  VBD      TO  VB    DT   NN
She  promised to  back  the  bill
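A minimal sketch of this two-stage process in Python, using the toy dictionary above and the single rule from this slide (a real rule-based tagger like ENGTWOL has hundreds of such constraints):

```python
# Two-stage rule-based tagging sketch: dictionary lookup, then pruning.
DICTIONARY = {
    "she": ["PRP"], "promised": ["VBN", "VBD"], "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"], "the": ["DT"], "bill": ["NN", "VB"],
}

def tag_candidates(words):
    # Stage 1: assign every possible tag from the dictionary
    return [list(DICTIONARY[w.lower()]) for w in words]

def apply_rules(words, candidates):
    # Stage 2: hand-written constraints prune impossible tags.
    # Rule from the slide: eliminate VBN if VBD is an option
    # when VBN|VBD follows "<start> PRP".
    if len(words) > 1 and "PRP" in candidates[0]:
        if "VBN" in candidates[1] and "VBD" in candidates[1]:
            candidates[1].remove("VBN")
    return candidates

words = ["She", "promised", "to", "back", "the", "bill"]
print(apply_rules(words, tag_candidates(words)))
# 'promised' -> ['VBD']; 'back' stays ambiguous until more rules apply
```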
POS tagging of biomedical text (example):
The/DT involvement/NN of/IN ion/NN channels/NNS in/IN B/NN and/CC T/NN lymphocyte/NN activation/NN is/VBZ supported/VBN by/IN many/JJ reports/NNS of/IN changes/NNS in/IN ion/NN fluxes/NNS and/CC membrane/NN …
Machine Learning
- Training: a machine learning algorithm is trained on hand-tagged text ("We demonstrate that …")
- Tagging: the trained model tags unseen text ("We/PRP demonstrate/VBP that …")
Goal of POS Tagging
- We want the best set of tags for a sequence of words (a sentence)
- W — a sequence of words
- T — a sequence of tags
- Our goal: $\hat{T} = \arg\max_T P(T \mid W)$
- Example: P( (NN NN P DET ADJ NN) | (heat oil in a large pot) )
- Rich models often require vast amounts of data
- Naive idea: count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string. But there are far too many possible combinations; most sentences never occur in any training corpus.
POS Tagging as Sequence Classification
- We are given a sentence (an "observation" or "sequence of observations"):
- Secretariat is expected to race tomorrow
- What is the best sequence of tags that corresponds to this sequence of observations?
- Probabilistic view:
- Consider all possible sequences of tags
- Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words $w_1 \ldots w_n$
Getting to HMMs
- We want, out of all sequences of n tags $t_1 \ldots t_n$, the single tag sequence such that $P(t_1 \ldots t_n \mid w_1 \ldots w_n)$ is highest
- Hat ^ means "our estimate of the best one"
- $\arg\max_x f(x)$ means "the x such that f(x) is maximized"
Getting to HMMs
- This equation is guaranteed to give us the best tag sequence
- But how do we make it operational? How do we compute this value?
- Intuition of Bayesian classification:
- Use Bayes' rule to transform this equation into a set of other probabilities that are easier to compute
Reminder: Apply Bayes' Theorem (1763)

$P(T \mid W) = \frac{P(W \mid T)\, P(T)}{P(W)}$

posterior = likelihood × prior / marginal likelihood
Our goal: to maximize the posterior!

How to Count

$\hat{T} = \arg\max_T P(T \mid W) = \arg\max_T \frac{P(W \mid T)\, P(T)}{P(W)} = \arg\max_T P(W \mid T)\, P(T)$

(P(W) is the same for every tag sequence, so it can be dropped from the argmax.)
- P(W|T) and P(T) can be counted from a large hand-tagged corpus, then smoothed to get rid of the zeroes
Count P(W|T) and P(T)
- Assume each word in the sequence depends only on its corresponding tag:

$P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$

Count P(T)
- By the chain rule, P(T) is a product of the probabilities of the N-grams that make it up:

$P(t_1, \ldots, t_n) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_1 t_2) \cdots P(t_n \mid t_1, \ldots, t_{n-1})$

- Make a Markov assumption and use N-grams over tags: each tag depends only on the previous tag (bigram case):

$P(t_1, \ldots, t_n) \approx P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1})$
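A minimal sketch of estimating these probabilities by counting over a hand-tagged corpus; variable names are illustrative, and a real system would smooth the counts as noted above:

```python
# Count emission P(w_i | t_i) and bigram transition P(t_i | t_{i-1})
# from tagged sentences. End-of-sentence transitions are ignored for
# brevity, and no smoothing is applied (unseen events get probability 0).
from collections import Counter

def count_probs(tagged_sentences):
    emit, trans, tag_count = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"                      # sentence-start pseudo-tag
        tag_count[prev] += 1
        for word, tag in sent:
            emit[(tag, word)] += 1
            trans[(prev, tag)] += 1
            tag_count[tag] += 1
            prev = tag

    def p_emit(word, tag):                # relative-frequency P(word | tag)
        return emit[(tag, word)] / tag_count[tag]

    def p_trans(tag, prev):               # relative-frequency P(tag | prev)
        return trans[(prev, tag)] / tag_count[prev]

    return p_emit, p_trans

corpus = [[("fish", "noun"), ("sleep", "verb")], [("fish", "verb")]]
p_emit, p_trans = count_probs(corpus)
print(p_emit("fish", "noun"), p_trans("noun", "<s>"))   # 1.0 0.5
```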
Part-of-speech tagging with Hidden Markov Models

$P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \frac{P(w_1 \ldots w_n \mid t_1 \ldots t_n)\, P(t_1 \ldots t_n)}{P(w_1 \ldots w_n)}$

$P(w_1 \ldots w_n \mid t_1 \ldots t_n)\, P(t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$
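To make the "consider all possible tag sequences" view concrete, here is a brute-force sketch that scores every sequence with this factorization and keeps the best; it is exponential in sentence length, which is why the Viterbi algorithm below is used in practice. The probabilities are the toy numbers from the fish/sleep example that follows:

```python
# Brute-force decoding under the HMM factorization: enumerate all
# tag sequences, score each, keep the best. Exponential -- for
# illustration only; Viterbi (below) does this efficiently.
from itertools import product

TAGS = ["noun", "verb"]
P_TRANS = {("start", "noun"): 0.8, ("start", "verb"): 0.2,
           ("noun", "noun"): 0.1, ("noun", "verb"): 0.8,
           ("verb", "noun"): 0.2, ("verb", "verb"): 0.1}
P_EMIT = {("noun", "fish"): 0.8, ("noun", "sleep"): 0.2,
          ("verb", "fish"): 0.5, ("verb", "sleep"): 0.5}

def best_tags(words):
    best = (-1.0, None)
    for tags in product(TAGS, repeat=len(words)):   # all 2^n sequences
        p, prev = 1.0, "start"
        for w, t in zip(words, tags):
            p *= P_TRANS[(prev, t)] * P_EMIT[(t, w)]
            prev = t
        best = max(best, (p, tags))
    return best

print(best_tags(["fish", "sleep"]))   # (0.256, ('noun', 'verb'))
```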
Analyzing "fish sleep."
A Simple POS HMM with states start, noun, verb, end; transition probabilities:
- start → noun: 0.8; start → verb: 0.2
- noun → verb: 0.8; noun → noun: 0.1; noun → end: 0.1
- verb → noun: 0.2; verb → verb: 0.1; verb → end: 0.7

Word Emission Probabilities P(word | state)
- A two-word language: "fish" and "sleep"
- Suppose in our training corpus,
- "fish" appears 8 times as a noun and 5 times as a verb
- "sleep" appears twice as a noun and 5 times as a verb
- Emission probabilities:
- Noun: P(fish | noun) = 0.8; P(sleep | noun) = 0.2
- Verb: P(fish | verb) = 0.5; P(sleep | verb) = 0.5
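These emission probabilities are just relative frequencies of the counts above; a quick check of the arithmetic:

```python
# Emission probabilities as relative frequencies of (tag, word) counts.
counts = {("noun", "fish"): 8, ("noun", "sleep"): 2,
          ("verb", "fish"): 5, ("verb", "sleep"): 5}
for tag in ("noun", "verb"):
    total = sum(c for (t, _), c in counts.items() if t == tag)  # 10 each
    for (t, word), c in counts.items():
        if t == tag:
            print(f"P({word} | {tag}) = {c}/{total} = {c / total}")
# P(fish | noun) = 8/10 = 0.8    P(sleep | noun) = 2/10 = 0.2
# P(fish | verb) = 5/10 = 0.5    P(sleep | verb) = 5/10 = 0.5
```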
Viterbi Probabilities
Fill in a trellis with one column per token (1: fish, 2: sleep, 3: end) and one row per state (noun, verb), starting from P(start) = 1.

Token 1: fish
- noun: 1 × P(start→noun) × P(fish|noun) = 1 × 0.8 × 0.8 = 0.64
- verb: 1 × P(start→verb) × P(fish|verb) = 1 × 0.2 × 0.5 = 0.1

Token 2: sleep (if 'fish' is a verb)
- verb: 0.1 × P(verb→verb) × P(sleep|verb) = 0.1 × 0.1 × 0.5 = 0.005
- noun: 0.1 × P(verb→noun) × P(sleep|noun) = 0.1 × 0.2 × 0.2 = 0.004

Token 2: sleep (if 'fish' is a noun)
- verb: 0.64 × P(noun→verb) × P(sleep|verb) = 0.64 × 0.8 × 0.5 = 0.256
- noun: 0.64 × P(noun→noun) × P(sleep|noun) = 0.64 × 0.1 × 0.2 = 0.0128

Take the maximum and set back pointers:
- verb: max(0.005, 0.256) = 0.256 (best predecessor: noun)
- noun: max(0.004, 0.0128) = 0.0128 (best predecessor: noun)

Token 3: end. Take the maximum and set back pointers:
- end: max(0.256 × P(verb→end), 0.0128 × P(noun→end)) = max(0.256 × 0.7, 0.0128 × 0.1) = 0.1792 (best predecessor: verb)

Decode by following the back pointers: fish = noun, sleep = verb
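A minimal Viterbi decoder for this HMM, reproducing the trellis computation above:

```python
# Viterbi decoding for the fish/sleep HMM: dynamic programming over
# a trellis of (state, token) cells with back pointers.
P_TRANS = {("start", "noun"): 0.8, ("start", "verb"): 0.2,
           ("noun", "noun"): 0.1, ("noun", "verb"): 0.8, ("noun", "end"): 0.1,
           ("verb", "noun"): 0.2, ("verb", "verb"): 0.1, ("verb", "end"): 0.7}
P_EMIT = {("noun", "fish"): 0.8, ("noun", "sleep"): 0.2,
          ("verb", "fish"): 0.5, ("verb", "sleep"): 0.5}
STATES = ["noun", "verb"]

def viterbi(words):
    # trellis[i][s] = (best probability of being in state s after
    #                  token i, back pointer to the best predecessor)
    trellis = [{"start": (1.0, None)}]
    for w in words:
        col = {}
        for s in STATES:
            col[s] = max((p * P_TRANS[(prev, s)] * P_EMIT[(s, w)], prev)
                         for prev, (p, _) in trellis[-1].items())
        trellis.append(col)
    # Transition into the end state, then follow back pointers
    prob, state = max((p * P_TRANS[(prev, "end")], prev)
                      for prev, (p, _) in trellis[-1].items())
    tags = [state]
    for col in reversed(trellis[2:]):
        tags.append(col[tags[-1]][1])
    return list(reversed(tags)), prob

print(viterbi(["fish", "sleep"]))   # (['noun', 'verb'], 0.1792)
```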
Markov Chain for a Simple Name Tagger
(Diagram not reproduced: a Markov chain with states START, PER, LOC, and X, transition probabilities between them, and per-state emission probabilities, e.g. PER emits George: 0.3, W.: 0.3, Bush: 0.3, Iraq: 0.1; X emits discussed: 0.7, W.: 0.3; the end-of-sentence marker $ is emitted with probability 1.0.)

Exercise
- Tag names in the following sentence: George W. Bush discussed Iraq.
POS taggers
- Brill's tagger
- http://www.cs.jhu.edu/~brill/
- TnT tagger
- http://www.coli.uni-saarland.de/~thorsten/tnt/
- Stanford tagger
- http://nlp.stanford.edu/software/tagger.shtml
- SVMTool
- http://www.lsi.upc.es/~nlp/SVMTool/
- GENIA tagger
- More complete list at:
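For quick hands-on experiments, NLTK (not one of the taggers listed above) ships an off-the-shelf Penn-Treebank-style tagger; a minimal usage sketch:

```python
# Tag a sentence with NLTK's default English POS tagger
# (Penn Treebank tagset). Requires: pip install nltk
import nltk

nltk.download("punkt")                       # tokenizer model
nltk.download("averaged_perceptron_tagger")  # tagger model

tokens = nltk.word_tokenize("The grand jury commented on a number of other topics.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ...]
```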