8.2 Interactive Command Line Use of Weka

Relative absolute error                  0      %
Root relative squared error              0      %
Total Number of Instances                9

This output shows results for testing on the original training data, so the classification is perfect. In practice, you will test on separate data sets.

=== Confusion Matrix ===

 a b c   <-- classified as
 3 0 0 | a = buy
 0 3 0 | b = sell
 0 0 3 | c = hold

The confusion matrix shows the predictions (columns) for each data sample (rows). Here we see the original data: three buy, three sell, and three hold samples. The following output shows random sampling testing:

=== Stratified cross-validation ===

Correctly Classified Instances        4      44.4444 %
Incorrectly Classified Instances      5      55.5556 %
Kappa statistic                       0.1667
Mean absolute error                   0.3457
Root mean squared error               0.4513
Relative absolute error              75.5299 %
Root relative squared error          92.2222 %
Total Number of Instances             9

With random sampling, we see in the confusion matrix that the three buy recommendations are still perfect, but that all three sell recommendations are wrong (one classified as buy and two as hold) and that two of what should have been hold recommendations are classified as buy recommendations.

=== Confusion Matrix ===

 a b c   <-- classified as
 3 0 0 | a = buy
 1 0 2 | b = sell
 2 0 1 | c = hold

8.3 Embedding Weka in a Java Application

The example in this section is partially derived from documentation at the web site http://weka.sourceforge.net/wiki. This example loads the training ARFF data file seen at the beginning of this chapter and loads a similar ARFF file for testing that is equivalent to the original training file except that small random changes have been made to the numeric attribute values in all samples. A decision tree model is trained and tested on the new test ARFF data.

import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.Remove;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

public class WekaStocks {

  public static void main(String[] args) throws Exception {

We start by creating a new training instance by supplying a reader for the stock training ARFF file and setting the class attribute to be the last attribute in each sample:

    Instances training_data = new Instances(
        new BufferedReader(
            new FileReader(
                "test_data/stock_training_data.arff")));
    training_data.setClassIndex(training_data.numAttributes() - 1);

We want to test with separate data so we open a separate examples ARFF file to test against:

    Instances testing_data = new Instances(
        new BufferedReader(
            new FileReader(
                "test_data/stock_testing_data.arff")));
    testing_data.setClassIndex(training_data.numAttributes() - 1);

The method toSummaryString returns a summary of a set of training or testing instances, which we print:

    String summary = training_data.toSummaryString();
    int number_samples = training_data.numInstances();
    int number_attributes_per_sample =
        training_data.numAttributes();
    System.out.println("Number of attributes in model = " +
                       number_attributes_per_sample);
    System.out.println("Number of samples = " + number_samples);
    System.out.println("Summary: " + summary);
    System.out.println();

Now we create a new classifier (a J48 classifier in this case) and we see how to optionally filter (remove) samples.
We build a classifier using the training data and then test it using the separate test data set:

    // a classifier for decision trees:
    J48 j48 = new J48();
    // filter for removing samples:
    Remove rm = new Remove();
    // remove first attribute:
    rm.setAttributeIndices("1");
    // filtered classifier:
    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(rm);
    fc.setClassifier(j48);
    // train using stock_training_data.arff:
    fc.buildClassifier(training_data);
    // test using stock_testing_data.arff:
    for (int i = 0; i < testing_data.numInstances(); i++) {
      double pred =
          fc.classifyInstance(testing_data.instance(i));
      System.out.print("given value: " +
          testing_data.classAttribute().value(
              (int) testing_data.instance(i).classValue()));
      System.out.println(". predicted value: " +
          testing_data.classAttribute().value((int) pred));
    }
  }
}

This example program produces the following output (some output not shown due to page width limits):

Number of attributes in model = 4
Number of samples = 9
Summary:
Relation Name:  stock
Num Instances:  9
Num Attributes: 4

    Name                        Type  Nom   Int   Real ...
  1 percent_change_since_open   Num         11%   89%  ...
  2 percent_change_from_day_l   Num         22%   78%  ...
  3 percent_change_from_day_h   Num          0%  100%  ...
  4 action                      Nom   100%              ...

given value: hold. predicted value: hold
given value: sell. predicted value: sell
given value: buy. predicted value: buy
given value: hold. predicted value: buy
given value: sell. predicted value: sell
given value: buy. predicted value: buy
given value: hold. predicted value: hold
given value: sell. predicted value: buy
given value: buy. predicted value: buy
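The summary statistics and confusion matrices shown earlier in this section can also be computed directly from Java code using Weka's weka.classifiers.Evaluation class. The following is a minimal sketch rather than part of the example program above (the class name WekaEvaluationExample is mine, and it assumes the same stock training ARFF file). Note that with only nine samples we use three cross-validation folds because Weka cannot use more folds than there are instances:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class WekaEvaluationExample {

  public static void main(String[] args) throws Exception {
    Instances data = new Instances(
        new BufferedReader(
            new FileReader(
                "test_data/stock_training_data.arff")));
    data.setClassIndex(data.numAttributes() - 1);
    // stratified cross-validation of a J48 decision tree; a
    // fixed random seed makes the fold assignment repeatable:
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(new J48(), data, 3, new Random(1));
    // print summary statistics and a confusion matrix in the
    // same format as the command line output shown earlier:
    System.out.println(eval.toSummaryString());
    System.out.println(eval.toMatrixString());
  }
}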

8.4 Suggestions for Further Study

Weka is well documented in the book Data Mining: Practical Machine Learning Tools and Techniques, Second Edition [Witten and Frank, 2005]. Additional documentation can be found at weka.sourceforge.net/wiki/index.php.

9 Statistical Natural Language Processing

We will cover a wide variety of techniques for processing text in this chapter. The part of speech tagger, text categorization, clustering, spelling, and entity extraction examples are all derived from either my open source projects or my commercial projects. I wrote the Markov model example code for an earlier edition of this book. I am not offering you a very formal view of Statistical Natural Language Processing in this chapter; rather, I collected Java code that I have been using for years on various projects and simplified it to hopefully make it easier for you to understand and modify for your own use. The web site http://nlp.stanford.edu/links/statnlp.html is an excellent resource for both papers (when you need more theory) and additional software for Statistical Natural Language Processing. For Python programmers I can recommend the statistical NLP toolkit NLTK (nltk.sourceforge.net), which includes an online book and is licensed under the GPL.

9.1 Tokenizing, Stemming, and Part of Speech Tagging Text

Tokenizing text is the process of splitting a string containing text into individual tokens. Stemming is the reduction of words to abbreviated word roots that allow for easy comparison for equality of similar words. Tagging is identifying what part of speech each word is in input text. Tagging is complicated by many words having different parts of speech depending on context (examples: “bank the airplane,” “the river bank,” etc.). You can find the code in this section in the code ZIP file for this book in the files src/com/knowledgebooks/nlp/fasttag/FastTag.java and src/com/knowledgebooks/nlp/util/Tokenizer.java. The required data files are in the directory test_data in the files lexicon.txt (for processing English text) and lexicon_medpost.txt (for processing medical text). The FastTag project can also be found on my open source web page: http://www.markwatson.com/opensource

We will also look at a public domain word stemmer that I frequently use in this section.
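To make the idea of tokenization concrete before we look at the real Tokenizer class, here is a minimal sketch. The class name SimpleTokenizer and its tokenize method are hypothetical, and this code handles far fewer special cases than the Tokenizer class used in the rest of this chapter:

import java.util.ArrayList;
import java.util.List;

public class SimpleTokenizer {

  // Split input text into word and punctuation tokens by first
  // putting spaces around punctuation characters and then
  // splitting on whitespace:
  public static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<String>();
    String spaced = text.replaceAll("([.,!?;:()\"])", " $1 ");
    for (String token : spaced.trim().split("\\s+")) {
      tokens.add(token);
    }
    return tokens;
  }

  public static void main(String[] args) {
    // prints: [The, river, bank, flooded, .]
    System.out.println(tokenize("The river bank flooded."));
  }
}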