POS Tagging and HMM


Text Mining Team, adapted from Heng Ji

Outline: POS Tagging and HMM

  What is Part-of-Speech (POS)

  • Generally speaking, Word Classes (= POS):
  • Verb, Noun, Adjective, Adverb, Article, …
  • We can also include inflection:
  • Verbs: Tense, number, …
  • Nouns: Number, proper/common, …
  • Adjectives: comparative, superlative, …
Parts of Speech

  • 8 (ish) traditional parts of speech:
  • Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
  • Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
  • Lots of debate within linguistics about the number, nature, and universality of these. We’ll completely ignore this debate.

  • N    noun         chair, bandwidth, pacing
  • V    verb         study, debate, munch
  • ADJ  adjective    purple, tall, ridiculous
  • ADV  adverb       unfortunately, slowly
  • P    preposition  of, by, to
  • PRO  pronoun      I, me, mine
  • DET  determiner   the, a, that, those

  POS Tagging

  • The process of assigning a part-of-speech or lexical class marker to each word in a collection.

  WORD/tag: the/DET koala/N put/V the/DET keys/N on/P the/DET table/N

Penn TreeBank POS Tag Set

  • Penn Treebank: hand-annotated corpus of Wall Street Journal, 1M words
  • 45 tags
  • Some particularities:
  • to/TO not disambiguated (preposition vs. infinitive marker)
  • Auxiliaries and verbs not distinguished
Penn Treebank Tagset
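  One way to browse the full tag set, assuming NLTK is installed along with its 'tagsets' data package:

```python
# Browse the Penn Treebank tagset interactively, assuming NLTK is
# installed and the 'tagsets' data package has been downloaded.
import nltk

nltk.download("tagsets", quiet=True)
nltk.help.upenn_tagset("VB")    # definition and examples for VB
nltk.help.upenn_tagset("NN.*")  # all noun tags, matched by regex
```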

  Why is POS tagging useful?

  • Speech synthesis:
  • How to pronounce “lead”?
  • INsult inSULT
  • OBject obJECT
  • OVERflow overFLOW
  • DIScount disCOUNT
  • CONtent conTENT
  • Stemming for information retrieval
  • A search for “aardvarks” can also match “aardvark”
  • Parsing, speech recognition, etc.
  • Possessive pronouns (my, your, her) are likely to be followed by nouns
  • Personal pronouns (I, you, he) are likely to be followed by verbs
  • Need to know if a word is an N or V before you can parse
  • Information extraction
  • Finding names, relations, etc.
Open and Closed Classes

  • Closed class: a small fixed membership
  • Prepositions: of, in, by, …
  • Auxiliaries: may, can, will, had, been, …
  • Pronouns: I, you, she, mine, his, them, …
  • Usually function words (short common words which play a role in grammar)
  • Open class: new ones can be created all the time
  • English has 4: Nouns, Verbs, Adjectives, Adverbs
  • Many languages have these 4, but not all!
Open Class Words

  • Nouns
  • Proper nouns (Boulder, Granby, Eli Manning)
  • English capitalizes these.
  • Common nouns (the rest).
  • Count nouns and mass nouns
  • Count: have plurals, get counted: goat/goats, one goat, two goats
  • Mass: don’t get counted (snow, salt, communism) (*two snows)
  • Adverbs: tend to modify things
  • Unfortunately, John walked home extremely slowly yesterday
  • Directional/locative adverbs (here, home, downhill)
  • Degree adverbs (extremely, very, somewhat)
  • Manner adverbs (slowly, slinkily, delicately)
  • Verbs
  • In English, have morphological affixes (eat/eats/eaten)
Examples: Closed Class Words

  • prepositions: on, under, over, …
  • particles: up, down, on, off, …
  • determiners: a, an, the, …
  • pronouns: she, who, I, …
  • conjunctions: and, but, or, …
  • auxiliary verbs: can, may, should, …
  • numerals: one, two, three, third, …
Prepositions from CELEX

  English Particles

  Conjunctions

POS Tagging: Choosing a Tagset

  • There are so many parts of speech and potential distinctions we could draw
  • To do POS tagging, we need to choose a standard set of tags to work with
  • Could pick a very coarse tagset: N, V, Adj, Adv
  • The more commonly used set is finer grained: the “Penn TreeBank tagset”, 45 tags (PRP$, WRB, WP$, VBG, …)
  • Even more fine-grained tagsets exist

Using the Penn Tagset

  The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

  • Prepositions and subordinating conjunctions are marked IN (“although/IN I/PRP…”)
  • Except the preposition/complementizer “to”, which is just marked “TO”

  POS Tagging

  • Words often have more than one POS: back
  • The back door = JJ
  • On my back = NN
  • Win the voters back = RB
  • Promised to back the bill = VB
  • The POS tagging problem is to determine the POS tag for a particular instance of a word.
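  A quick way to check these four readings against a real tagger, assuming NLTK and its default English model are installed:

```python
# Tag "back" in the four contexts above, assuming NLTK is installed
# with its default English perceptron tagger and tokenizer data.
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("punkt", quiet=True)

for sent in ["The back door", "On my back",
             "Win the voters back", "Promised to back the bill"]:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
# Expect "back" as JJ, NN, RB, and VB respectively, though a
# statistical tagger may occasionally choose differently.
```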


How Hard is POS Tagging? Measuring Ambiguity

  Current Performance

  • How many tags are correct?
  • About 97% currently
  • But baseline is already 90%
  • Baseline algorithm:
  • Tag every word with its most frequent tag
  • Tag unknown words as nouns
  • How well do people do? (Reported human inter-annotator agreement is also roughly 96-97%.)
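  A minimal sketch of that baseline, assuming the training data is a list of (word, tag) pairs; the helper names are illustrative:

```python
# Most-frequent-tag baseline: tag each word with its most frequent
# training tag, and tag unknown words as nouns ("NN"), as described
# above. The corpus format (a list of (word, tag) pairs) is assumed.
from collections import Counter, defaultdict

def train_baseline(tagged_words):
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    # Keep only each word's single most frequent tag.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_freq_tag):
    # Unknown words default to noun ("NN").
    return [(w, most_freq_tag.get(w.lower(), "NN")) for w in words]

# Toy usage with made-up training pairs:
model = train_baseline([("the", "DT"), ("back", "NN"), ("back", "NN"),
                        ("back", "VB"), ("bill", "NN")])
print(tag_baseline(["the", "back", "door"], model))
# [('the', 'DT'), ('back', 'NN'), ('door', 'NN')]
```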

  Quick Test: Agreement?

  • the students went to class
  • plays well with others
  • fruit flies like a banana

  DT: determiner (the, this, that)   NN: noun   VB: verb   P: preposition   ADV: adverb

Quick Test

  • the students went to class → DT NN VB P NN
  • plays well with others → VB ADV P NN, or NN NN P DT
  • fruit flies like a banana → NN NN VB DT NN, or NN VB P DT NN, or NN NN P DT NN, or NN VB VB DT NN

How To Do It? History

  [Timeline figure, 1960-2000: Brown Corpus created (EN-US) and tagged, 1M words; Greene and Rubin rule-based tagging, ~70%; LOB Corpus created (EN-UK), 1M words, later tagged; POS tagging separated from other NLP; HMM tagging (CLAWS), 93%-95%; DeRose/Church trigram tagger; Penn Treebank (tagged by CLAWS); rule-based taggers, 95%+; transformation-based tagging (Eric Brill), 96%+; neural networks, 96%+; efficient HMMs with sparse data, 95%+ (Kempe); tree-based statistics (Helmut Schmid), 96%+; combined methods, 98%+; British National Corpus tagged.]

Two Methods for POS Tagging

  1. Rule-based tagging (ENGTWOL)
  2. Stochastic
     • HMM (Hidden Markov Model) tagging
     • MEMMs (Maximum Entropy Markov Models)

Rule-Based Tagging

  • Start with a dictionary
  • Assign all possible tags to words from the dictionary
  • Write rules by hand to selectively remove tags
  • Leaving the correct tag for each word

  Rule-based taggers

  • Early POS taggers were all hand-coded
  • Most of these (Harris, 1962; Greene and Rubin, 1971), and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture:
  • Stage 1: look up the word in the lexicon to get a list of potential POSs
  • Stage 2: apply rules which certify or disallow tag sequences
  • Rules were originally handwritten; more recently, Machine Learning methods can be used

Start With a Dictionary

  • she: PRP
  • promised: VBN, VBD
  • to: TO
  • back: VB, JJ, RB, NN
  • the: DT
  • bill: NN, VB
  • Etc… for the ~100,000 words of English with more than 1 tag

Assign Every Possible Tag

  She   promised   to   back            the   bill
  PRP   VBN, VBD   TO   NN, RB, JJ, VB  DT    NN, VB

Write Rules to Eliminate Tags

  Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”:

  She   promised   to   back            the   bill
  PRP   VBD        TO   NN, RB, JJ, VB  DT    NN, VB
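  A toy sketch of the two stages on this example; the dictionary follows the slide, while the rule encoding is an assumption for illustration:

```python
# Toy two-stage rule-based tagger for the example above. The lexicon
# entries follow the slide; the rule encoding is an illustrative
# assumption, not ENGTWOL's actual formalism.
lexicon = {
    "she": {"PRP"}, "promised": {"VBN", "VBD"}, "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"}, "the": {"DT"}, "bill": {"NN", "VB"},
}

def rule_tag(words):
    # Stage 1: assign every possible tag from the dictionary.
    tags = [set(lexicon[w.lower()]) for w in words]
    # Stage 2: eliminate VBN if VBD is also an option and the word
    # follows a sentence-initial pronoun ("<start> PRP").
    if len(tags) > 1 and "PRP" in tags[0] and {"VBN", "VBD"} <= tags[1]:
        tags[1].discard("VBN")
    return tags

print(rule_tag(["She", "promised", "to", "back", "the", "bill"]))
# "promised" keeps only VBD; "back" and "bill" stay ambiguous.
```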

POS Tagging Example (biomedical text)

  The/DT involvement/NN of/IN ion/NN channels/NNS in/IN B/NN and/CC T/NN lymphocyte/NN activation/NN is/VBZ
  supported/VBN by/IN many/JJ reports/NNS of/IN changes/NNS in/IN ion/NN fluxes/NNS and/CC membrane/NN …


Machine Learning

  [Pipeline figure: a Machine Learning algorithm is trained on tagged text; given unseen text such as “We demonstrate that …”, it outputs “We/PRP demonstrate/VBP that …”]

Goal of POS Tagging

  • We want the best set of tags for a sequence of words (a sentence)
  • W — a sequence of words
  • T — a sequence of tags

  Our goal: \hat{T} = \arg\max_T P(T \mid W)

  • Example: P( (NN NN P DET ADJ NN) | (heat oil in a large pot) )
But, the Sparse Data Problem …

  • Rich models often require vast amounts of data
  • Naive approach: count up instances of the string “heat oil in a large pot” in the training corpus, and pick the most common tag assignment to the string
  • But there are too many possible combinations: with 45 tags, a 6-word string already has 45^6 ≈ 8.3 × 10^9 possible tag sequences, and most word sequences never occur in training data

  POS Tagging as Sequence Classification

  • We are given a sentence (an “observation” or “sequence of observations”):
  • Secretariat is expected to race tomorrow
  • What is the best sequence of tags that corresponds to this sequence of observations?
  • Probabilistic view:
  • Consider all possible sequences of tags
  • Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1 … w_n

Getting to HMMs

  • We want, out of all sequences of n tags t_1 … t_n, the single tag sequence such that P(t_1 … t_n | w_1 … w_n) is highest
  • Hat ^ means “our estimate of the best one”
  • argmax_x f(x) means “the x such that f(x) is maximized”

      \hat{t}_{1:n} = \arg\max_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n)

  Getting to HMMs

  • This equation is guaranteed to give us the best tag sequence
  • But how to make it operational? How to compute this value?
  • Intuition of Bayesian classification:
  • Use Bayes’ rule to transform this equation into a set of other probabilities that are easier to compute


Reminder: Apply Bayes’ Theorem (1763)

  posterior = (likelihood × prior) / marginal likelihood

      P(T \mid W) = \frac{P(W \mid T)\, P(T)}{P(W)}

  Our goal: to maximize it!

How to Count

      \hat{T} = \arg\max_T P(T \mid W)
              = \arg\max_T \frac{P(W \mid T)\, P(T)}{P(W)}
              = \arg\max_T P(W \mid T)\, P(T)

  (P(W) is the same for every candidate T, so it can be dropped from the argmax.)

  • P(W|T) and P(T) can be counted from a large hand-tagged corpus; smooth them to get rid of the zeroes

Count P(W|T) and P(T)

  • Assume each word in the sequence depends only on its corresponding tag:

      P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)

Count P(T)

  • By the chain rule, P(T) is a product of the probabilities of the N-grams that make it up:

      P(t_1, \ldots, t_n) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_1 t_2) \cdots P(t_n \mid t_1, \ldots, t_{n-1})

  • Make a Markov assumption and use N-grams over tags; with bigrams, each tag depends only on the previous tag (its “history”):

      P(t_1, \ldots, t_n) \approx P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1})
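  A minimal sketch of estimating both tables by maximum-likelihood counts; the corpus format (a list of sentences of (word, tag) pairs) is an assumed convention, and a real system would also smooth the counts:

```python
# Estimate emission P(w|t) and bigram transition P(t|t_prev) tables
# by maximum likelihood from a hand-tagged corpus. The corpus format
# is an assumption; real systems would smooth the resulting counts.
from collections import Counter, defaultdict

def estimate(tagged_sents, start="<s>"):
    emit_c, trans_c = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev = start
        for word, tag in sent:
            emit_c[tag][word.lower()] += 1   # count(t, w)
            trans_c[prev][tag] += 1          # count(t_prev, t)
            prev = tag
    # P(w|t) = count(t, w) / count(t); P(t|t_prev) likewise.
    emit = {t: {w: n / sum(c.values()) for w, n in c.items()}
            for t, c in emit_c.items()}
    trans = {p: {t: n / sum(c.values()) for t, n in c.items()}
             for p, c in trans_c.items()}
    return emit, trans

emit, trans = estimate([[("fish", "noun"), ("sleep", "verb")]])
print(emit["noun"], trans["<s>"])   # {'fish': 1.0} {'noun': 1.0}
```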

Part-of-Speech Tagging with Hidden Markov Models

      P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \frac{P(w_1 \ldots w_n \mid t_1 \ldots t_n)\, P(t_1 \ldots t_n)}{P(w_1 \ldots w_n)} \approx \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})

  where the P(w_i | t_i) are the word (emission) probabilities and the P(t_i | t_{i-1}) are the tag transition probabilities.
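  Under this factorization, scoring one candidate tag sequence is a running product. A small sketch, reusing the fish/sleep numbers introduced in the next section (the "start" state is an assumption):

```python
# Score a (words, tags) pair as the product of transition and
# emission probabilities. Toy tables reuse the fish/sleep example
# that follows; the "start" state is an assumed convention.
trans = {"start": {"noun": 0.8, "verb": 0.2},
         "noun":  {"noun": 0.1, "verb": 0.8},
         "verb":  {"noun": 0.2, "verb": 0.1}}
emit = {"noun": {"fish": 0.8, "sleep": 0.2},
        "verb": {"fish": 0.5, "sleep": 0.5}}

def score(words, tags):
    p, prev = 1.0, "start"
    for w, t in zip(words, tags):
        p *= trans[prev][t] * emit[t][w]
        prev = t
    return p

print(score(["fish", "sleep"], ["noun", "verb"]))
# 0.8*0.8 * 0.8*0.5 = 0.256
```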

Analyzing “Fish sleep.”

A Simple POS HMM

  [State diagram: start → noun → verb → end. Transition probabilities, as read off the diagram and consistent with the Viterbi trace below:
   P(noun|start) = 0.8, P(verb|start) = 0.2;
   P(noun|noun) = 0.1, P(verb|noun) = 0.8, P(end|noun) = 0.1;
   P(noun|verb) = 0.2, P(verb|verb) = 0.1, P(end|verb) = 0.7]

Word Emission Probabilities P(word | state)

  • A two-word language: “fish” and “sleep”

  • Suppose in our training corpus,
  • “fish” appears 8 times as a noun and 5 times as a verb
  • “sleep” appears twice as a noun and 5 times as a verb
  • Emission probabilities:
  • Noun: P(fish | noun) = 8/10 = 0.8, P(sleep | noun) = 2/10 = 0.2
  • Verb: P(fish | verb) = 5/10 = 0.5, P(sleep | verb) = 5/10 = 0.5

Viterbi Probabilities

  Fill in a table of Viterbi probabilities, one column per token (fish, sleep, end), taking the maximum over predecessors at each step and setting back pointers.

  Initialization: start = 1

  Token 1: fish
  • noun: 0.8 × 0.8 = 0.64   (start → noun, emit fish)
  • verb: 0.2 × 0.5 = 0.1    (start → verb, emit fish)

  Token 2: sleep (if “fish” is a verb)
  • verb: 0.1 × 0.1 × 0.5 = 0.005   (verb → verb, emit sleep)
  • noun: 0.1 × 0.2 × 0.2 = 0.004   (verb → noun, emit sleep)

  Token 2: sleep (if “fish” is a noun)
  • verb: 0.64 × 0.8 × 0.5 = 0.256    (noun → verb, emit sleep)
  • noun: 0.64 × 0.1 × 0.2 = 0.0128   (noun → noun, emit sleep)

  Take the maximum and set back pointers: verb = 0.256, noun = 0.0128

  Token 3: end. Take the maximum and set back pointers; the verb path wins (0.256 × 0.7 > 0.0128 × 0.1).

  Decode by following the back pointers: fish = noun, sleep = verb
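  The trace above can be reproduced in a few lines. Below is a minimal sketch, assuming the transition and emission tables reconstructed from the diagram; the function is illustrative, not a particular library's API:

```python
# A minimal Viterbi decoder for the "fish sleep" HMM above. Tables are
# read off the state diagram; this is a sketch, not a library API.
START, END = "start", "end"

trans = {                                   # P(next_state | state)
    START:  {"noun": 0.8, "verb": 0.2},
    "noun": {"noun": 0.1, "verb": 0.8, END: 0.1},
    "verb": {"noun": 0.2, "verb": 0.1, END: 0.7},
}
emit = {                                    # P(word | state)
    "noun": {"fish": 0.8, "sleep": 0.2},
    "verb": {"fish": 0.5, "sleep": 0.5},
}

def viterbi(words):
    states = list(emit)
    # best[s] = probability of the best path ending in state s
    best = {s: trans[START][s] * emit[s][words[0]] for s in states}
    backptrs = []
    for w in words[1:]:
        new_best, ptr = {}, {}
        for s in states:
            # Pick the predecessor that maximizes the path probability.
            prev = max(states, key=lambda p: best[p] * trans[p].get(s, 0.0))
            new_best[s] = best[prev] * trans[prev].get(s, 0.0) * emit[s][w]
            ptr[s] = prev
        best = new_best
        backptrs.append(ptr)
    # Fold in the transition to END, then follow the back pointers.
    last = max(states, key=lambda s: best[s] * trans[s].get(END, 0.0))
    tags = [last]
    for ptr in reversed(backptrs):
        tags.append(ptr[tags[-1]])
    return list(reversed(tags))

print(viterbi(["fish", "sleep"]))   # ['noun', 'verb'], matching the trace
```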

Markov Chain for a Simple Name Tagger

  [State diagram: states START, PER, LOC, X, and an end state $ (emitting $ with probability 1.0); arcs carry transition probabilities between 0.1 and 0.5. Emission probabilities include PER: George 0.3, W. 0.3, Bush 0.3, Iraq 0.1; X: W. 0.3, discussed 0.7.]

Exercise

  • Tag the names in the following sentence: George W. Bush discussed Iraq.


  POS taggers

  • Brill’s tagger
  • http://www.cs.jhu.edu/~brill/
  • TnT tagger
  • http://www.coli.uni-saarland.de/~thorsten/tnt/
  • Stanford tagger
  • http://nlp.stanford.edu/software/tagger.shtml
  • SVMTool
  • http://www.lsi.upc.es/~nlp/SVMTool/
  • GENIA tagger
  • More complete list at: