POS Tagging and HMM


Text Mining Team, adapted from Heng Ji

Outline: POS Tagging and HMM

  What is Part-of-Speech (POS)

  • Generally speaking, Word Classes (= POS):
  • Verb, Noun, Adjective, Adverb, Article, …
  • We can also include inflection:
  • Verbs: Tense, number, …
  • Nouns: Number, proper/common, …
  • Adjectives: comparative, superlative, …
Parts of Speech

  • 8 (ish) traditional parts of speech:
  • Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
  • Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
  • Lots of debate within linguistics about the number, nature, and universality of these. We’ll completely ignore this debate.

  • N    noun         chair, bandwidth, pacing
  • V    verb         study, debate, munch
  • ADJ  adjective    purple, tall, ridiculous
  • ADV  adverb       unfortunately, slowly
  • P    preposition  of, by, to
  • PRO  pronoun      I, me, mine
  • DET  determiner   the, a, that, those

  POS Tagging

  • The process of assigning a part-of-speech or lexical class marker to each word in a collection.

  WORD/tag: the/DET koala/N put/V the/DET keys/N on/P the/DET table/N

Penn TreeBank POS Tag Set

  • Penn Treebank: hand-annotated corpus of Wall Street Journal, 1M words
  • 45 tags
  • Some particularities:
  • to/TO not disambiguated (preposition vs. infinitive marker)
  • Auxiliaries and verbs not distinguished
Penn Treebank Tagset
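  One way to browse the full tag set, assuming NLTK is installed along with its 'tagsets' data package:

```python
# Browse the Penn Treebank tagset interactively, assuming NLTK is
# installed and the 'tagsets' data package has been downloaded.
import nltk

nltk.download("tagsets", quiet=True)
nltk.help.upenn_tagset("VB")    # definition and examples for VB
nltk.help.upenn_tagset("NN.*")  # all noun tags, matched by regex
```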

  Why is POS tagging useful?

  • Speech synthesis:
  • How to pronounce “lead”?
  • INsult inSULT
  • OBject obJECT
  • OVERflow overFLOW
  • DIScount disCOUNT
  • CONtent conTENT
  • Stemming for information retrieval
  • A search for “aardvarks” can also match “aardvark”
  • Parsing, speech recognition, etc.
  • Possessive pronouns (my, your, her) are likely to be followed by nouns
  • Personal pronouns (I, you, he) are likely to be followed by verbs
  • Need to know if a word is an N or V before you can parse
  • Information extraction
  • Finding names, relations, etc.
Open and Closed Classes

  • Closed class: a small fixed membership
  • Prepositions: of, in, by, …
  • Auxiliaries: may, can, will, had, been, …
  • Pronouns: I, you, she, mine, his, them, …
  • Usually function words (short common words which play a role in grammar)
  • Open class: new ones can be created all the time
  • English has 4: Nouns, Verbs, Adjectives, Adverbs
  • Many languages have these 4, but not all!
Open Class Words

  • Nouns
  • Proper nouns (Boulder, Granby, Eli Manning)
  • English capitalizes these.
  • Common nouns (the rest).
  • Count nouns and mass nouns
  • Count: have plurals, get counted: goat/goats, one goat, two goats
  • Mass: don’t get counted (snow, salt, communism) (*two snows)
  • Adverbs: tend to modify things
  • Unfortunately, John walked home extremely slowly yesterday
  • Directional/locative adverbs (here, home, downhill)
  • Degree adverbs (extremely, very, somewhat)
  • Manner adverbs (slowly, slinkily, delicately)
  • Verbs
  • In English, have morphological affixes (eat/eats/eaten)
Examples: Closed Class Words

  • prepositions: on, under, over, …
  • particles: up, down, on, off, …
  • determiners: a, an, the, …
  • pronouns: she, who, I, …
  • conjunctions: and, but, or, …
  • auxiliary verbs: can, may, should, …
  • numerals: one, two, three, third, …
Prepositions from CELEX

  English Particles

  Conjunctions

POS Tagging: Choosing a Tagset

  • There are so many parts of speech and potential distinctions we could draw
  • To do POS tagging, we need to choose a standard set of tags to work with
  • Could pick a very coarse tagset: N, V, Adj, Adv
  • The more commonly used set is finer grained: the “Penn TreeBank tagset”, 45 tags (PRP$, WRB, WP$, VBG, …)
  • Even more fine-grained tagsets exist

Using the Penn Tagset

  The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

  • Prepositions and subordinating conjunctions are marked IN (“although/IN I/PRP…”)
  • Except the preposition/complementizer “to”, which is just marked “TO”

  POS Tagging

  • Words often have more than one POS: back
  • The back door = JJ
  • On my back = NN
  • Win the voters back = RB
  • Promised to back the bill = VB
  • The POS tagging problem is to determine the POS tag for a particular instance of a word.
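  A quick way to check these four readings against a real tagger, assuming NLTK and its default English model are installed:

```python
# Tag "back" in the four contexts above, assuming NLTK is installed
# with its default English perceptron tagger and tokenizer data.
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("punkt", quiet=True)

for sent in ["The back door", "On my back",
             "Win the voters back", "Promised to back the bill"]:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
# Expect "back" as JJ, NN, RB, and VB respectively, though a
# statistical tagger may occasionally choose differently.
```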


How Hard is POS Tagging? Measuring Ambiguity

  Current Performance

  • How many tags are correct?
  • About 97% currently
  • But baseline is already 90%
  • Baseline algorithm:
  • Tag every word with its most frequent tag
  • Tag unknown words as nouns
  • How well do people do? (Reported human inter-annotator agreement is also roughly 96-97%.)
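  A minimal sketch of that baseline, assuming the training data is a list of (word, tag) pairs; the helper names are illustrative:

```python
# Most-frequent-tag baseline: tag each word with its most frequent
# training tag, and tag unknown words as nouns ("NN"), as described
# above. The corpus format (a list of (word, tag) pairs) is assumed.
from collections import Counter, defaultdict

def train_baseline(tagged_words):
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    # Keep only each word's single most frequent tag.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_freq_tag):
    # Unknown words default to noun ("NN").
    return [(w, most_freq_tag.get(w.lower(), "NN")) for w in words]

# Toy usage with made-up training pairs:
model = train_baseline([("the", "DT"), ("back", "NN"), ("back", "NN"),
                        ("back", "VB"), ("bill", "NN")])
print(tag_baseline(["the", "back", "door"], model))
# [('the', 'DT'), ('back', 'NN'), ('door', 'NN')]
```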

  Quick Test: Agreement?

  • the students went to class
  • plays well with others
  • fruit flies like a banana

  DT: determiner (the, this, that)   NN: noun   VB: verb   P: preposition   ADV: adverb

Quick Test

  • the students went to class → DT NN VB P NN
  • plays well with others → VB ADV P NN, or NN NN P DT
  • fruit flies like a banana → NN NN VB DT NN, or NN VB P DT NN, or NN NN P DT NN, or NN VB VB DT NN

How To Do It? History

  [Timeline figure, 1960-2000: Brown Corpus created (EN-US) and tagged, 1M words; Greene and Rubin rule-based tagging, ~70%; LOB Corpus created (EN-UK), 1M words, later tagged; POS tagging separated from other NLP; HMM tagging (CLAWS), 93%-95%; DeRose/Church trigram tagger; Penn Treebank (tagged by CLAWS); rule-based taggers, 95%+; transformation-based tagging (Eric Brill), 96%+; neural networks, 96%+; efficient HMMs with sparse data, 95%+ (Kempe); tree-based statistics (Helmut Schmid), 96%+; combined methods, 98%+; British National Corpus tagged.]

Two Methods for POS Tagging

  1. Rule-based tagging (ENGTWOL)
  2. Stochastic
     • HMM (Hidden Markov Model) tagging
     • MEMMs (Maximum Entropy Markov Models)

Rule-Based Tagging

  • Start with a dictionary
  • Assign all possible tags to words from the dictionary
  • Write rules by hand to selectively remove tags
  • Leaving the correct tag for each word

  Rule-based taggers

  • Early POS taggers were all hand-coded
  • Most of these (Harris, 1962; Greene and Rubin, 1971), and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture:
  • Stage 1: look up the word in the lexicon to get a list of potential POSs
  • Stage 2: apply rules which certify or disallow tag sequences
  • Rules were originally handwritten; more recently, Machine Learning methods can be used

Start With a Dictionary

  • she: PRP
  • promised: VBN, VBD
  • to: TO
  • back: VB, JJ, RB, NN
  • the: DT
  • bill: NN, VB
  • Etc… for the ~100,000 words of English with more than 1 tag

Assign Every Possible Tag

  She   promised   to   back            the   bill
  PRP   VBN, VBD   TO   NN, RB, JJ, VB  DT    NN, VB

Write Rules to Eliminate Tags

  Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”:

  She   promised   to   back            the   bill
  PRP   VBD        TO   NN, RB, JJ, VB  DT    NN, VB
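  A toy sketch of the two stages on this example; the dictionary follows the slide, while the rule encoding is an assumption for illustration:

```python
# Toy two-stage rule-based tagger for the example above. The lexicon
# entries follow the slide; the rule encoding is an illustrative
# assumption, not ENGTWOL's actual formalism.
lexicon = {
    "she": {"PRP"}, "promised": {"VBN", "VBD"}, "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"}, "the": {"DT"}, "bill": {"NN", "VB"},
}

def rule_tag(words):
    # Stage 1: assign every possible tag from the dictionary.
    tags = [set(lexicon[w.lower()]) for w in words]
    # Stage 2: eliminate VBN if VBD is also an option and the word
    # follows a sentence-initial pronoun ("<start> PRP").
    if len(tags) > 1 and "PRP" in tags[0] and {"VBN", "VBD"} <= tags[1]:
        tags[1].discard("VBN")
    return tags

print(rule_tag(["She", "promised", "to", "back", "the", "bill"]))
# "promised" keeps only VBD; "back" and "bill" stay ambiguous.
```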

POS Tagging Example (biomedical text)

  The/DT involvement/NN of/IN ion/NN channels/NNS in/IN B/NN and/CC T/NN lymphocyte/NN activation/NN is/VBZ
  supported/VBN by/IN many/JJ reports/NNS of/IN changes/NNS in/IN ion/NN fluxes/NNS and/CC membrane/NN …


Machine Learning

  [Pipeline figure: a Machine Learning algorithm is trained on tagged text; given unseen text such as “We demonstrate that …”, it outputs “We/PRP demonstrate/VBP that …”]

Goal of POS Tagging

  • We want the best set of tags for a sequence of words (a sentence)
  • W — a sequence of words
  • T — a sequence of tags

  Our goal: \hat{T} = \arg\max_T P(T \mid W)

  • Example: P( (NN NN P DET ADJ NN) | (heat oil in a large pot) )
But, the Sparse Data Problem …

  • Rich models often require vast amounts of data
  • Naive approach: count up instances of the string “heat oil in a large pot” in the training corpus, and pick the most common tag assignment to the string
  • But there are too many possible combinations: with 45 tags, a 6-word string already has 45^6 ≈ 8.3 × 10^9 possible tag sequences, and most word sequences never occur in training data

  POS Tagging as Sequence Classification

  • We are given a sentence (an “observation” or “sequence of observations”):
  • Secretariat is expected to race tomorrow
  • What is the best sequence of tags that corresponds to this sequence of observations?
  • Probabilistic view:
  • Consider all possible sequences of tags
  • Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1 … w_n

Getting to HMMs

  • We want, out of all sequences of n tags t_1 … t_n, the single tag sequence such that P(t_1 … t_n | w_1 … w_n) is highest
  • Hat ^ means “our estimate of the best one”
  • argmax_x f(x) means “the x such that f(x) is maximized”

      \hat{t}_{1:n} = \arg\max_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n)

  Getting to HMMs

  • This equation is guaranteed to give us the best tag sequence
  • But how to make it operational? How to compute this value?
  • Intuition of Bayesian classification:
  • Use Bayes’ rule to transform this equation into a set of other probabilities that are easier to compute


Reminder: Apply Bayes’ Theorem (1763)

  posterior = (likelihood × prior) / marginal likelihood

      P(T \mid W) = \frac{P(W \mid T)\, P(T)}{P(W)}

  Our goal: to maximize it!

How to Count

      \hat{T} = \arg\max_T P(T \mid W)
              = \arg\max_T \frac{P(W \mid T)\, P(T)}{P(W)}
              = \arg\max_T P(W \mid T)\, P(T)

  (P(W) is the same for every candidate T, so it can be dropped from the argmax.)

  • P(W|T) and P(T) can be counted from a large hand-tagged corpus; smooth them to get rid of the zeroes

Count P(W|T) and P(T)

  • Assume each word in the sequence depends only on its corresponding tag:

      P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)

Count P(T)

  • By the chain rule, P(T) is a product of the probabilities of the N-grams that make it up:

      P(t_1, \ldots, t_n) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_1 t_2) \cdots P(t_n \mid t_1, \ldots, t_{n-1})

  • Make a Markov assumption and use N-grams over tags; with bigrams, each tag depends only on the previous tag (its “history”):

      P(t_1, \ldots, t_n) \approx P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1})
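  A minimal sketch of estimating both tables by maximum-likelihood counts; the corpus format (a list of sentences of (word, tag) pairs) is an assumed convention, and a real system would also smooth the counts:

```python
# Estimate emission P(w|t) and bigram transition P(t|t_prev) tables
# by maximum likelihood from a hand-tagged corpus. The corpus format
# is an assumption; real systems would smooth the resulting counts.
from collections import Counter, defaultdict

def estimate(tagged_sents, start="<s>"):
    emit_c, trans_c = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev = start
        for word, tag in sent:
            emit_c[tag][word.lower()] += 1   # count(t, w)
            trans_c[prev][tag] += 1          # count(t_prev, t)
            prev = tag
    # P(w|t) = count(t, w) / count(t); P(t|t_prev) likewise.
    emit = {t: {w: n / sum(c.values()) for w, n in c.items()}
            for t, c in emit_c.items()}
    trans = {p: {t: n / sum(c.values()) for t, n in c.items()}
             for p, c in trans_c.items()}
    return emit, trans

emit, trans = estimate([[("fish", "noun"), ("sleep", "verb")]])
print(emit["noun"], trans["<s>"])   # {'fish': 1.0} {'noun': 1.0}
```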

Part-of-Speech Tagging with Hidden Markov Models

      P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \frac{P(w_1 \ldots w_n \mid t_1 \ldots t_n)\, P(t_1 \ldots t_n)}{P(w_1 \ldots w_n)} \approx \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})

  where the P(w_i | t_i) are the word (emission) probabilities and the P(t_i | t_{i-1}) are the tag transition probabilities.
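  Under this factorization, scoring one candidate tag sequence is a running product. A small sketch, reusing the fish/sleep numbers introduced in the next section (the "start" state is an assumption):

```python
# Score a (words, tags) pair as the product of transition and
# emission probabilities. Toy tables reuse the fish/sleep example
# that follows; the "start" state is an assumed convention.
trans = {"start": {"noun": 0.8, "verb": 0.2},
         "noun":  {"noun": 0.1, "verb": 0.8},
         "verb":  {"noun": 0.2, "verb": 0.1}}
emit = {"noun": {"fish": 0.8, "sleep": 0.2},
        "verb": {"fish": 0.5, "sleep": 0.5}}

def score(words, tags):
    p, prev = 1.0, "start"
    for w, t in zip(words, tags):
        p *= trans[prev][t] * emit[t][w]
        prev = t
    return p

print(score(["fish", "sleep"], ["noun", "verb"]))
# 0.8*0.8 * 0.8*0.5 = 0.256
```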

Analyzing “Fish sleep.”

A Simple POS HMM

  [State diagram: start → noun → verb → end. Transition probabilities, as read off the diagram and consistent with the Viterbi trace below:
   P(noun|start) = 0.8, P(verb|start) = 0.2;
   P(noun|noun) = 0.1, P(verb|noun) = 0.8, P(end|noun) = 0.1;
   P(noun|verb) = 0.2, P(verb|verb) = 0.1, P(end|verb) = 0.7]

Word Emission Probabilities P(word | state)

  • A two-word language: “fish” and “sleep”

  • Suppose in our training corpus,
  • “fish” appears 8 times as a noun and 5 times as a verb
  • “sleep” appears twice as a noun and 5 times as a verb
  • Emission probabilities:
  • Noun: P(fish | noun) = 8/10 = 0.8, P(sleep | noun) = 2/10 = 0.2
  • Verb: P(fish | verb) = 5/10 = 0.5, P(sleep | verb) = 5/10 = 0.5

Viterbi Probabilities

  Fill in a table of Viterbi probabilities, one column per token (fish, sleep, end), taking the maximum over predecessors at each step and setting back pointers.

  Initialization: start = 1

  Token 1: fish
  • noun: 0.8 × 0.8 = 0.64   (start → noun, emit fish)
  • verb: 0.2 × 0.5 = 0.1    (start → verb, emit fish)

  Token 2: sleep (if “fish” is a verb)
  • verb: 0.1 × 0.1 × 0.5 = 0.005   (verb → verb, emit sleep)
  • noun: 0.1 × 0.2 × 0.2 = 0.004   (verb → noun, emit sleep)

  Token 2: sleep (if “fish” is a noun)
  • verb: 0.64 × 0.8 × 0.5 = 0.256    (noun → verb, emit sleep)
  • noun: 0.64 × 0.1 × 0.2 = 0.0128   (noun → noun, emit sleep)

  Take the maximum and set back pointers: verb = 0.256, noun = 0.0128

  Token 3: end. Take the maximum and set back pointers; the verb path wins (0.256 × 0.7 > 0.0128 × 0.1).

  Decode by following the back pointers: fish = noun, sleep = verb
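  The trace above can be reproduced in a few lines. Below is a minimal sketch, assuming the transition and emission tables reconstructed from the diagram; the function is illustrative, not a particular library's API:

```python
# A minimal Viterbi decoder for the "fish sleep" HMM above. Tables are
# read off the state diagram; this is a sketch, not a library API.
START, END = "start", "end"

trans = {                                   # P(next_state | state)
    START:  {"noun": 0.8, "verb": 0.2},
    "noun": {"noun": 0.1, "verb": 0.8, END: 0.1},
    "verb": {"noun": 0.2, "verb": 0.1, END: 0.7},
}
emit = {                                    # P(word | state)
    "noun": {"fish": 0.8, "sleep": 0.2},
    "verb": {"fish": 0.5, "sleep": 0.5},
}

def viterbi(words):
    states = list(emit)
    # best[s] = probability of the best path ending in state s
    best = {s: trans[START][s] * emit[s][words[0]] for s in states}
    backptrs = []
    for w in words[1:]:
        new_best, ptr = {}, {}
        for s in states:
            # Pick the predecessor that maximizes the path probability.
            prev = max(states, key=lambda p: best[p] * trans[p].get(s, 0.0))
            new_best[s] = best[prev] * trans[prev].get(s, 0.0) * emit[s][w]
            ptr[s] = prev
        best = new_best
        backptrs.append(ptr)
    # Fold in the transition to END, then follow the back pointers.
    last = max(states, key=lambda s: best[s] * trans[s].get(END, 0.0))
    tags = [last]
    for ptr in reversed(backptrs):
        tags.append(ptr[tags[-1]])
    return list(reversed(tags))

print(viterbi(["fish", "sleep"]))   # ['noun', 'verb'], matching the trace
```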

Markov Chain for a Simple Name Tagger

  [State diagram: states START, PER, LOC, X, and an end state $ (emitting $ with probability 1.0); arcs carry transition probabilities between 0.1 and 0.5. Emission probabilities include PER: George 0.3, W. 0.3, Bush 0.3, Iraq 0.1; X: W. 0.3, discussed 0.7.]

Exercise

  • Tag the names in the following sentence: George W. Bush discussed Iraq.


  POS taggers

  • Brill’s tagger
  • http://www.cs.jhu.edu/~brill/
  • TnT tagger
  • http://www.coli.uni-saarland.de/~thorsten/tnt/
  • Stanford tagger
  • http://nlp.stanford.edu/software/tagger.shtml
  • SVMTool
  • http://www.lsi.upc.es/~nlp/SVMTool/
  • GENIA tagger
  • More complete list at: