Training Hidden Markov Models
        JJ    IN    VB   VBN    TO   NNP   PRP    NN    RB   VBG    DT
  JJ   0.0   0.0   0.0   0.0   0.0   0.0   0.0   1.0   1.0   0.0   0.0
  IN   0.0   0.0   0.0   0.0   0.0   1.0   0.0   2.0   0.0   0.0   4.0
  VB   0.0   3.0   0.0   0.0   3.0   3.0   0.0   1.0   1.0   0.0  14.0
  VBN  0.0   0.0   0.0   0.0   0.0   1.0   0.0   0.0   0.0   0.0   0.0
  TO   0.0   0.0   2.0   0.0   0.0   1.0   0.0   0.0   0.0   0.0   2.0
  NNP  0.0   1.0  16.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
  PRP  0.0   0.0   2.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
  NN   0.0   3.0   5.0   1.0   2.0   0.0   0.0   1.0   0.0   0.0   0.0
  RB   0.0   0.0   0.0   0.0   0.0   1.0   1.0   0.0   0.0   0.0   0.0
  VBG  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   1.0
  DT   1.0   0.0   1.0   0.0   0.0   0.0   0.0  25.0   0.0   0.0   0.0

Table 9.3: Transition counts from the first tag (shown in the row) to the
second tag (shown in the column). We see that the transition from NNP to VB
is common.
In order to train a Markov model to tag parts of speech, we start by
building, in the method build_words_and_tags, the following two-dimensional
array that is used to count transitions; part of this array was seen in
Table 9.3:

  tagToTagTransitionCount[uniqueTagCount][uniqueTagCount]

where the first index is the index of tag[n] and the second index is the
index of tag[n+1]. We will see later how to calculate the values in this
array and how the values in this two-dimensional array are used to
calculate the probabilities of transitioning from one tag to another.
First, however, we simply use this array for counting transitions between
pairs of tags. The purpose of the training process is to fill this array
with values based on the hand-tagged training file:

  training_data/markov/tagged_text.txt

That looks like this:
John/NNP chased/VB the/DT dog/NN down/RP the/DT street/NN ./.
I/PRP saw/VB John/NNP dog/VB Mary/NNP and/CC later/RB Mary/NNP throw/VB
the/DT ball/NN to/TO John/NNP on/IN the/DT street/NN ./.
The method build_words_and_tags parses this text file and fills the
uniqueTags and uniqueWords collections.
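The parsing code for build_words_and_tags is not listed in this section.
As a rough sketch only (not the book's actual implementation), the step
might look like the following, assuming whitespace-separated tokens of the
form word/TAG; the class name TagFileParser and the lowercasing of words
(which matches the lowercase entries in Table 9.5) are my assumptions:

import java.util.ArrayList;
import java.util.List;

public class TagFileParser {
  // hypothetical sketch: split word/TAG tokens into parallel
  // word and tag lists, collecting unique values as we go:
  public static void parse(String trainingText,
                           List<String> wordList, List<String> tagList,
                           List<String> uniqueWords,
                           List<String> uniqueTags) {
    for (String token : trainingText.split("\\s+")) {
      int slash = token.lastIndexOf('/');
      if (slash < 1) continue;  // skip malformed tokens
      String word = token.substring(0, slash).toLowerCase();
      String tag = token.substring(slash + 1);
      wordList.add(word);
      tagList.add(tag);
      if (!uniqueWords.contains(word)) uniqueWords.add(word);
      if (!uniqueTags.contains(tag)) uniqueTags.add(tag);
    }
  }
}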
The method train_model starts by filling the tag-to-tag transition count
array (see Table 9.3):

  tagToTagTransitionCount[][]
The element tagToTagTransitionCount[indexTag0][indexTag1] is incremented
whenever we find a transition from tag[n] to tag[n+1] in the input training
text. The example program writes a spreadsheet-style CSV file for this and
the other two-dimensional arrays, which is useful for viewing intermediate
training results in any spreadsheet program. We normalize the data seen in
Table 9.3 by dividing each element by the number of times the row's tag
occurs in the training text. The normalized data can be seen in Table 9.4;
for example, the NNP to VB transition count of 16, divided by the 21
occurrences of NNP, becomes the 0.76 in the NNP row of Table 9.4. The code
for this first step is:
// start by filling in the tag to tag transition count matrix:
tagToTagTransitionCount =
    new float[uniqueTagCount][uniqueTagCount];
p("tagCount=" + tagCount);
p("uniqueTagCount=" + uniqueTagCount);
for (int i = 0; i < uniqueTagCount; i++) {
  for (int j = 0; j < uniqueTagCount; j++) {
    tagToTagTransitionCount[i][j] = 0;
  }
}
String tag1 = (String) tagList.get(0);
int index1 = uniqueTags.indexOf(tag1);  // inefficient
int index0;
for (int i = 0, size1 = wordList.size() - 1; i < size1; i++) {
  index0 = index1;
  tag1 = (String) tagList.get(i + 1);
  index1 = uniqueTags.indexOf(tag1);  // inefficient
  tagToTagTransitionCount[index0][index1]++;
}
WriteCSVfile(uniqueTags, uniqueTags,
             tagToTagTransitionCount, "tag_to_tag");

Note that all calls to the utility method WriteCSVfile are for debugging
only: if you use this example on a large training set (i.e., a large corpus
of hand-tagged text such as the Treebank), these 2D arrays of transition
and probability values will be very large, and viewing them in a
spreadsheet program is a convenient way to inspect them.
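WriteCSVfile itself is not listed in this section. Judging only from its
call sites (row labels, column labels, a 2D float array, and an output
name), a hypothetical minimal version might look like this; the class name
and the ".csv" suffix are my assumptions:

import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

public class CSVDebugWriter {
  // write row/column labels and a 2D value array as CSV:
  public static void WriteCSVfile(List<String> rowLabels,
                                  List<String> colLabels,
                                  float[][] values, String fileName) {
    try (FileWriter out = new FileWriter(fileName + ".csv")) {
      for (String col : colLabels) out.write("," + col);  // header row
      out.write("\n");
      for (int i = 0; i < rowLabels.size(); i++) {
        out.write(rowLabels.get(i));  // row label
        for (int j = 0; j < colLabels.size(); j++) {
          out.write("," + values[i][j]);
        }
        out.write("\n");
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}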
Then the method train_model calculates the probabilities of transitioning
from tag[N] to tag[M]; see Table 9.4.
        JJ    IN    VB   VBN    TO   NNP   PRP    NN    RB   VBG    DT
  JJ  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.50  0.00  0.00
  IN  0.00  0.00  0.00  0.00  0.00  0.14  0.00  0.29  0.00  0.00  0.57
  VB  0.00  0.11  0.00  0.00  0.11  0.11  0.00  0.04  0.04  0.00  0.52
  VBN 0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00
  TO  0.00  0.00  0.40  0.00  0.00  0.20  0.00  0.00  0.00  0.00  0.40
  NNP 0.00  0.05  0.76  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  PRP 0.00  0.00  0.67  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  NN  0.00  0.10  0.16  0.03  0.06  0.00  0.00  0.03  0.00  0.00  0.00
  RB  0.00  0.00  0.00  0.00  0.00  0.33  0.33  0.00  0.00  0.00  0.00
  VBG 0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
  DT  0.04  0.00  0.04  0.00  0.00  0.00  0.00  0.93  0.00  0.00  0.00

Table 9.4: The data in Table 9.3 normalized to give the probability of the
tag shown in the row transitioning to the tag shown in the column.
Here is the code for calculating these transition probabilities:

// now calculate the probabilities of transitioning
// from tag[N] to tag[M]:
probabilityTag1ToTag2 =
    new float[uniqueTagCount][uniqueTagCount];
for (int i = 0; i < uniqueTagCount; i++) {
  int count =
      ((Integer) tags.get((String) uniqueTags.get(i))).intValue();
  p("tag: " + uniqueTags.get(i) + ", count=" + count);
  for (int j = 0; j < uniqueTagCount; j++) {
    // the small constant keeps probabilities non-zero:
    probabilityTag1ToTag2[i][j] =
        0.0001f + tagToTagTransitionCount[i][j] / (float) count;
  }
}
WriteCSVfile(uniqueTags, uniqueTags,
             probabilityTag1ToTag2,
             "test_data/markov/prob_tag_to_tag");
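As an optional sanity check (my addition, not part of the example
program), each row of probabilityTag1ToTag2 should sum to roughly 1.0 over
the full tag set, since every occurrence of a tag except the last one in
the training text contributes exactly one outgoing transition; small
deviations come from the 0.0001f smoothing term, and note that the printed
tables show only part of the tag set, so their rows do not sum to 1:

// assumes the variables defined in train_model above:
for (int i = 0; i < uniqueTagCount; i++) {
  float rowSum = 0.0f;
  for (int j = 0; j < uniqueTagCount; j++) {
    rowSum += probabilityTag1ToTag2[i][j];
  }
  System.out.println("row " + uniqueTags.get(i) +
                     " sums to " + rowSum);
}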
Finally, in the method train_model we complete the training by defining
the array

  probabilityWordGivenTag[uniqueWordCount][uniqueTagCount]

which gives the probability of a given tag emitting a given word in the
input training text; sample values are shown in Table 9.5.
            JJ    IN    VB   VBN    TO   NNP   PRP    NN    RB
  went    0.00  0.00  0.07  0.00  0.00  0.00  0.00  0.00  0.00
  mary    0.00  0.00  0.00  0.00  0.00  0.52  0.00  0.00  0.00
  played  0.00  0.00  0.07  0.00  0.00  0.00  0.00  0.00  0.00
  river   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.03  0.00
  leave   0.00  0.00  0.07  0.00  0.00  0.00  0.00  0.03  0.00
  dog     0.00  0.00  0.04  0.00  0.00  0.00  0.00  0.23  0.00
  away    0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.33
  chased  0.00  0.00  0.11  1.00  0.00  0.00  0.00  0.00  0.00
  at      0.00  0.14  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  tired   0.50  0.00  0.04  0.00  0.00  0.00  0.00  0.00  0.00
  good    0.50  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  had     0.00  0.00  0.04  0.00  0.00  0.00  0.00  0.00  0.00
  throw   0.00  0.00  0.07  0.00  0.00  0.00  0.00  0.03  0.00
  from    0.00  0.14  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  so      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.33
  stayed  0.00  0.00  0.04  0.00  0.00  0.00  0.00  0.00  0.00
  absense 0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.03  0.00
  street  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.06  0.00
  john    0.00  0.00  0.00  0.00  0.00  0.48  0.00  0.00  0.00
  ball    0.00  0.00  0.04  0.00  0.00  0.00  0.00  0.06  0.00
  on      0.00  0.29  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  cat     0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.32  0.00
  later   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.33
  she     0.00  0.00  0.00  0.00  0.00  0.00  0.33  0.00  0.00
  of      0.00  0.14  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  with    0.00  0.29  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  saw     0.00  0.00  0.19  0.00  0.00  0.00  0.00  0.03  0.00

Table 9.5: Probabilities of words having specific tags. Only a few tags
are shown in this table.
Here is the code for this last training step:

// now calculate the probability of a word, given
// a preceding tag:
probabilityWordGivenTag =
    new float[uniqueWordCount][uniqueTagCount];
for (int i = 0; i < uniqueWordCount; i++) {
  String word = (String) uniqueWords.get(i);
  for (int j = 0; j < uniqueTagCount; j++) {
    String tag = (String) uniqueTags.get(j);
    int countTagOccurence =
        ((Integer) tags.get(tag)).intValue();
    // count how often this word is labeled with this tag
    // in the training text:
    float wordWithTagOccurence = 0;
    for (int n = 0, sizem1 = wordList.size() - 1;
         n < sizem1; n++) {
      String testWord = (String) wordList.get(n);
      String testTag = (String) tagList.get(n);
      if (testWord.equals(word) && testTag.equals(tag)) {
        wordWithTagOccurence++;
      }
    }
    probabilityWordGivenTag[i][j] =
        wordWithTagOccurence / (float) countTagOccurence;
  }
}
WriteCSVfile(uniqueWords, uniqueTags,
             probabilityWordGivenTag,
             "test_data/markov/prob_word_given_tag");
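To preview how the two trained arrays fit together (a sketch of my own,
not code from the example program): a tagger can score a candidate tag for
a word, given the previous tag, by multiplying the transition probability
by the word-given-tag probability. Using the values in Tables 9.4 and 9.5,
scoring NN for the word "dog" after DT gives 0.93 * 0.23, or about 0.21:

// hypothetical usage sketch combining the two trained arrays:
int prevTagIndex = uniqueTags.indexOf("DT");
int candidateTagIndex = uniqueTags.indexOf("NN");
int wordIndex = uniqueWords.indexOf("dog");
float score =
    probabilityTag1ToTag2[prevTagIndex][candidateTagIndex] *
    probabilityWordGivenTag[wordIndex][candidateTagIndex];
System.out.println("score for dog/NN after DT: " + score);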