Weka (http://www.cs.waikato.ac.nz/ml/weka/) is data mining software from the University of Waikato. In Slovenia, the Bioinformatics Laboratory has developed the well-known Orange (http://orange.biolab.si/). Both tools offer a GUI and a library for programmatic access. The main difference is that Weka is Java-based and Orange is Python-based.
Here I will give a short example of how to use Weka from Java. The Java file is accessible here: Weka.java. All you need to do is put weka.jar on the classpath, then compile and run Weka.java (of course, you need to have a c:\temp folder or choose another one).
For classification problems we normally have to identify features first. The standard attribute types in Weka are numeric, nominal, string, date and relational. A relational attribute can hold a whole dataset. Some functions for data preprocessing are also available. Here we define some attributes:
[codesyntax lang=”java”]
//1.ATTRIBUTES
//imports: weka.core.Attribute, weka.core.FastVector

//numeric
Attribute attr = new Attribute("my-numeric");
System.out.println(attr.isNumeric());

//nominal
FastVector myNomVals = new FastVector();
for (int i = 0; i < 10; i++)
    myNomVals.addElement("value_" + i);
Attribute attr1 = new Attribute("my-nominal", myNomVals);
System.out.println(attr1.isNominal());

//string
Attribute attr2 = new Attribute("my-string", (FastVector) null);
System.out.println(attr2.isString());

//date
Attribute attr3 = new Attribute("my-date", "dd-MM-yyyy");
System.out.println(attr3.isDate());

//whole relation can also be an attr
//Attribute attr4 = new Attribute("my-relation", new Instances(...));
[/codesyntax]
Once we have the attributes, we can form the dataset, also known as a relation (reading from and writing to files comes later):
[codesyntax lang=”java”]
//2.create dataset
//imports: weka.core.Instances
FastVector attrs = new FastVector();
attrs.addElement(attr);
attrs.addElement(attr1);
attrs.addElement(attr2);
attrs.addElement(attr3);
Instances dataset = new Instances("my_dataset", attrs, 0);
[/codesyntax]
Now the structure of the relation is defined. There are several ways to fill the dataset; here are a few of them:
[codesyntax lang=”java”]
//3.add instances
//imports: weka.core.Instance
//(Weka 3.6 API; from 3.7 on Instance is an interface: use new DenseInstance(...) and Utils.missingValue())

//first instance
double[] attValues = new double[dataset.numAttributes()];
attValues[0] = 55;
attValues[1] = dataset.attribute("my-nominal").indexOfValue("value_5");
attValues[2] = dataset.attribute("my-string").addStringValue("Slavko");
attValues[3] = dataset.attribute("my-date").parseDate("7-6-1987");
dataset.add(new Instance(1.0, attValues));

//second instance
attValues = new double[dataset.numAttributes()];
attValues[0] = Instance.missingValue();
attValues[1] = dataset.attribute(1).indexOfValue("value_9");
attValues[2] = dataset.attribute(2).addStringValue("Marinka");
attValues[3] = dataset.attribute(3).parseDate("23-4-1989");
dataset.add(new Instance(1.0, attValues));

//third instance
Instance example = new Instance(4);
example.setValue(attr, 16);
example.setValue(attr1, "value_7");
example.setValue(attr2, "Mirko");
example.setValue(attr3, attr3.parseDate("1-1-1988"));
dataset.add(example);
[/codesyntax]
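The `double[]` used for the first two instances works because every attribute value ends up as a double: a numeric value as it is, a nominal (or string) value as the index of its label, and a date as a timestamp. The sketch below mimics that encoding with the standard library only, so it runs without weka.jar. `ValueEncoding` and its methods are hypothetical names, and the epoch-milliseconds representation of dates is my understanding of `parseDate`, not something the example above states.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.List;
import java.util.TimeZone;

// Hypothetical helper (not part of Weka) that mimics the value-to-double encoding.
final class ValueEncoding {

    // Nominal: the position of the label in the declared list of values.
    static double encodeNominal(List<String> labels, String label) {
        int index = labels.indexOf(label);
        if (index < 0) {
            throw new IllegalArgumentException("Undeclared nominal value: " + label);
        }
        return index;
    }

    // Date: milliseconds since the epoch. UTC is fixed here only to make the
    // result reproducible; that choice is an assumption of this sketch.
    static double encodeDate(String pattern, String text) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("value_0", "value_1", "value_2");
        double[] row = {
            55,                                   // numeric: stored as-is
            encodeNominal(labels, "value_2"),     // nominal: index of the label
            encodeDate("dd-MM-yyyy", "7-6-1987")  // date: epoch milliseconds
        };
        System.out.println(Arrays.toString(row));
    }
}
```

Seen this way, `indexOfValue`, `addStringValue` and `parseDate` in the example are simply the steps that turn a readable value into the double that is stored in the instance.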
At this point the dataset is in memory. We can use it (the class attribute still needs to be set), print it to stdout, or save it to a file and read it back:
[codesyntax lang=”java”]
//4.output dataset
//imports: java.io.File, weka.core.converters.ArffSaver, weka.core.converters.ArffLoader
System.out.println(dataset);

//5.save dataset
String file = "C:\\temp\\weka_test.arff";
ArffSaver saver = new ArffSaver();
saver.setInstances(dataset);
saver.setFile(new File(file));
saver.writeBatch();

//6.read dataset
ArffLoader loader = new ArffLoader();
loader.setFile(new File(file));
dataset = loader.getDataSet();
[/codesyntax]
Since we have one string attribute, we need to preprocess it, because very few classifiers support strings. We can do this with filters, for example by converting it to a word vector with StringToWordVector:
[codesyntax lang=”java”]
//7.preprocess strings (almost no classifier supports them)
//imports: weka.filters.Filter, weka.filters.unsupervised.attribute.StringToWordVector
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(dataset);
dataset = Filter.useFilter(dataset, filter);
System.out.println(dataset);
[/codesyntax]
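To make the idea of a word vector concrete, here is a minimal sketch of the transformation using only the standard library: every distinct word becomes its own attribute, and every string becomes a row of 0/1 presence flags. It illustrates the concept only and is not Weka's implementation; `WordVectorSketch`, the whitespace tokenizer and the sample strings are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical illustration of a bag-of-words transformation (not Weka code).
final class WordVectorSketch {

    // Split a string into lower-cased words on whitespace.
    static List<String> tokenize(String doc) {
        String trimmed = doc.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // The vocabulary is the sorted set of distinct words over all documents;
    // each entry plays the role of one new attribute.
    static List<String> vocabulary(List<String> documents) {
        TreeSet<String> words = new TreeSet<>();
        for (String doc : documents) {
            words.addAll(tokenize(doc));
        }
        return new ArrayList<>(words);
    }

    // One row per document: 1 if the word occurs in it, 0 otherwise.
    static int[] toVector(List<String> vocab, String doc) {
        int[] vector = new int[vocab.size()];
        for (String word : tokenize(doc)) {
            int index = vocab.indexOf(word);
            if (index >= 0) {
                vector[index] = 1;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("red apple", "green apple");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (String doc : docs) {
            System.out.println(doc + " -> " + Arrays.toString(toVector(vocab, doc)));
        }
    }
}
```

With single-word strings such as the names used in this post, each name simply becomes one attribute of its own.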
We have the data. The next step is building a classifier. Weka contains many well-known classifiers such as naive Bayes, decision trees and perceptrons. I like SVMs, and I use LibSVM with Weka. Weka already has a built-in wrapper for the LibSVM API, so the only thing you need to do is add libsvm.jar to the classpath and use LibSVM as the classifier instance.
Saving classifiers and loading them back is also very easy. The only thing to be aware of is the class index! You must set it before training the classifier. The best practice is to always make the class attribute the last one (in this example the nominal attribute at index 1 serves as the class).
[codesyntax lang=”java”]
//8.build classifier
//imports: java.io.*, weka.classifiers.Classifier, weka.classifiers.trees.J48
dataset.setClassIndex(1);
Classifier classifier = new J48();
classifier.buildClassifier(dataset);

//9.save classifier (to its own file, so the ARFF file is not overwritten)
String modelFile = "C:\\temp\\weka_test.model";
OutputStream os = new FileOutputStream(modelFile);
ObjectOutputStream objectOutputStream = new ObjectOutputStream(os);
objectOutputStream.writeObject(classifier);
objectOutputStream.close();

//10. read classifier back
InputStream is = new FileInputStream(modelFile);
ObjectInputStream objectInputStream = new ObjectInputStream(is);
classifier = (Classifier) objectInputStream.readObject();
objectInputStream.close();
[/codesyntax]
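Saving and restoring a classifier is plain Java object serialization, so the same round trip can be tried without Weka. The sketch below uses a hypothetical `ThresholdModel` as a stand-in for a trained classifier and an in-memory byte array instead of a file, so that it runs anywhere; swapping in `FileOutputStream` and `FileInputStream` gives the file-based version. The try-with-resources blocks make sure the streams are closed even when an exception is thrown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical sketch of a save/load round trip (names are not from Weka).
final class ModelRoundTrip {

    // Stand-in for a trained classifier; any Serializable object behaves the same way.
    static final class ThresholdModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double threshold;

        ThresholdModel(double threshold) {
            this.threshold = threshold;
        }

        int classify(double x) {
            return x >= threshold ? 1 : 0;
        }
    }

    // Serialize the model into a byte array.
    static byte[] save(Serializable model) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Deserialize an object from a byte array; the caller casts it back.
    static Object load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ThresholdModel model = new ThresholdModel(0.5);
        ThresholdModel restored = (ThresholdModel) load(save(model));
        System.out.println(restored.classify(0.7));
    }
}
```

As with the cast to `Classifier` in the example, the restored object comes back as a plain `Object`, so the class it was saved from must be on the classpath when it is read.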
Usually we need to know how good the classifications are. Weka supports a number of evaluation tools, such as cross-validation and various measures. Here we resample our dataset, split it into a training and a test set, and output some results.
[codesyntax lang=”java”]
//11.evaluate
//imports: java.util.Random, weka.classifiers.Evaluation

//resample if needed
dataset = dataset.resample(new Random(42));

//split to 70:30 learn and test set
double percent = 70.0;
int trainSize = (int) Math.round(dataset.numInstances() * percent / 100);
int testSize = dataset.numInstances() - trainSize;
Instances train = new Instances(dataset, 0, trainSize);
Instances test = new Instances(dataset, trainSize, testSize);
train.setClassIndex(1);
test.setClassIndex(1);

//retrain on the training part only, so the test set stays unseen
classifier.buildClassifier(train);

//do eval
Evaluation eval = new Evaluation(train); //trainset
eval.evaluateModel(classifier, test); //testset
System.out.println(eval.toSummaryString());
System.out.println(eval.weightedFMeasure());
System.out.println(eval.weightedPrecision());
System.out.println(eval.weightedRecall());
[/codesyntax]
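The weighted measures printed above are, as far as I understand them, per-class values averaged with each class weighted by its number of actual instances. The following sketch computes precision, recall and the weighted F-measure from a small confusion matrix using only the standard library. `WeightedMetrics` and the sample matrix are hypothetical, and the weighting scheme is my assumption about what the weighted measures mean, not a copy of Weka's code.

```java
// Hypothetical sketch of class-weighted evaluation measures (not Weka code).
// The confusion matrix is indexed as cm[actualClass][predictedClass].
final class WeightedMetrics {

    // Number of instances whose true class is c.
    static int actualCount(int[][] cm, int c) {
        int count = 0;
        for (int value : cm[c]) {
            count += value;
        }
        return count;
    }

    // Correct predictions of c divided by all predictions of c.
    static double precision(int[][] cm, int c) {
        int predicted = 0;
        for (int[] row : cm) {
            predicted += row[c];
        }
        return predicted == 0 ? 0.0 : (double) cm[c][c] / predicted;
    }

    // Correct predictions of c divided by all instances that really are c.
    static double recall(int[][] cm, int c) {
        int actual = actualCount(cm, c);
        return actual == 0 ? 0.0 : (double) cm[c][c] / actual;
    }

    // Harmonic mean of precision and recall for class c.
    static double fMeasure(int[][] cm, int c) {
        double p = precision(cm, c);
        double r = recall(cm, c);
        return p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    // Average of the per-class F-measures, weighted by class frequency.
    static double weightedFMeasure(int[][] cm) {
        double weightedSum = 0.0;
        int total = 0;
        for (int c = 0; c < cm.length; c++) {
            int actual = actualCount(cm, c);
            weightedSum += fMeasure(cm, c) * actual;
            total += actual;
        }
        return total == 0 ? 0.0 : weightedSum / total;
    }

    public static void main(String[] args) {
        int[][] cm = { {3, 1}, {1, 5} };
        System.out.println("precision(0) = " + precision(cm, 0));
        System.out.println("recall(1)    = " + recall(cm, 1));
        System.out.println("weighted F   = " + weightedFMeasure(cm));
    }
}
```

Keep in mind that with only three instances, as in this post, the test set holds a single instance, so any such measure is just a smoke test of the code path.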
When classifying new instances, we must remember to translate the classifier's result into a class attribute value: for classification it returns only the index of the value!
[codesyntax lang=”java”]
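//12.classify
//imports: java.util.Arrays
Instance first = dataset.firstInstance();

//result: index of the predicted class value
double predicted = classifier.classifyInstance(first);
System.out.println(predicted);

//classified result value: map the predicted index back to its label
System.out.println(dataset.attribute(dataset.classIndex()).value((int) predicted));

//class probabilities (Arrays.toString prints the values instead of the array reference)
System.out.println(Arrays.toString(classifier.distributionForInstance(first)));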
[/codesyntax]
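The relation between the returned index, the class distribution and the readable label can be sketched without Weka. In the snippet below, `PredictionLabel`, the label list and the sample distribution are hypothetical; it assumes that the predicted class is the one with the highest probability, which is the usual convention but not something the example above guarantees for every classifier.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: turn a class distribution into a readable label.
final class PredictionLabel {

    // Index of the largest probability; ties go to the lowest index.
    static int argMax(double[] distribution) {
        if (distribution.length == 0) {
            throw new IllegalArgumentException("Empty distribution");
        }
        int best = 0;
        for (int i = 1; i < distribution.length; i++) {
            if (distribution[i] > distribution[best]) {
                best = i;
            }
        }
        return best;
    }

    // Look up the label that belongs to the most probable class index.
    static String labelFor(List<String> labels, double[] distribution) {
        if (labels.size() != distribution.length) {
            throw new IllegalArgumentException("One probability per label is required");
        }
        return labels.get(argMax(distribution));
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("value_0", "value_1", "value_2");
        double[] distribution = {0.1, 0.7, 0.2};
        System.out.println(argMax(distribution));
        System.out.println(labelFor(labels, distribution));
    }
}
```

Looking up the label through the class attribute, as in the example above, is the Weka counterpart of the `labels.get(...)` call here.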
I hope this example was useful to you. I tried to show how to use Weka for some quick tasks.
Thanks for the code.
But I got an error in
new Instance(1.0, attValues).
Because Instance is an interface. How can I create an object for it?
I am using the latest weka.jar.
Sorry, I found the solution. I had been using the Weka source code, and that was the only cause of the problem. Now I am using the JAR and there is no error.