Quick intro to Weka – Slavko Žitnik, PhD

Weka (http://www.cs.waikato.ac.nz/ml/weka/) is data mining software from the University of Waikato. In Slovenia, the Bioinformatics Laboratory has developed the well-known Orange (http://orange.biolab.si/). Both tools have a GUI and a library for programmatic access. The main difference is that Weka is Java-based and Orange is Python-based.

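Here I will give a short example of how to use Weka from Java. The Java file is accessible here: Weka.java. All you need to do is put weka.jar on the classpath, then compile and run Weka.java (you need a C:\temp folder, or change the path in the code). The code targets the Weka 3.6 API; in Weka 3.7 and later, Instance is an interface and new DenseInstance(...) takes the place of new Instance(...).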

For classification problems we first have to define the features. Weka's standard attribute types are numeric, nominal, string, date and relational. A relational attribute can hold a whole dataset as its value. Some functions for data preprocessing are also available. Here we define some attributes:

[codesyntax lang=”java”]

//1.ATTRIBUTES
//numeric
Attribute attr = new Attribute("my-numeric");
System.out.println(attr.isNumeric());

//nominal
FastVector myNomVals = new FastVector();
for (int i=0; i<10; i++)
	myNomVals.addElement("value_"+i);
Attribute attr1 = new Attribute("my-nominal", myNomVals);
System.out.println(attr1.isNominal());

//string
Attribute attr2 = new Attribute("my-string", (FastVector)null);
System.out.println(attr2.isString());

//date
Attribute attr3 = new Attribute("my-date", "dd-MM-yyyy");
System.out.println(attr3.isDate());

//whole relation can also be an attr
//Attribute attr4 = new Attribute("my-relation", new Instances(...));

[/codesyntax]

Once we have the attributes, we can form the dataset, a.k.a. the relation (reading from and writing to files comes later):

[codesyntax lang=”java”]

//2.create dataset
FastVector attrs = new FastVector();
attrs.addElement(attr);
attrs.addElement(attr1);
attrs.addElement(attr2);
attrs.addElement(attr3);
Instances dataset = new Instances("my_dataset", attrs, 0);

[/codesyntax]

Now we have defined the relation structure. There are several ways to fill the dataset; here we present a few of them:

[codesyntax lang=”java”]

//3.add instances
//first instance
double[] attValues = new double[dataset.numAttributes()];
	attValues[0] = 55;
	attValues[1] = dataset.attribute("my-nominal").indexOfValue("value_5");
	attValues[2] = dataset.attribute("my-string").addStringValue("Slavko");
	attValues[3] = dataset.attribute("my-date").parseDate("7-6-1987");
dataset.add(new Instance(1.0, attValues));

//second instance
attValues = new double[dataset.numAttributes()];
	attValues[0] = Instance.missingValue();
	attValues[1] = dataset.attribute(1).indexOfValue("value_9");
	attValues[2] = dataset.attribute(2).addStringValue("Marinka");
	attValues[3] = dataset.attribute(3).parseDate("23-4-1989");
dataset.add(new Instance(1.0, attValues));

//third instance
Instance example = new Instance(4);
	example.setValue(attr, 16);
	example.setValue(attr1, "value_7");
	example.setValue(attr2, "Mirko");
	example.setValue(attr3, attr3.parseDate("1-1-1988"));
dataset.add(example);

[/codesyntax]
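The double[] arrays above work because Weka keeps every attribute value as a double: a numeric value as itself, a nominal value as the index of its label, a date as milliseconds since the epoch, and a missing value as NaN. Below is a minimal plain-Java sketch of that encoding, runnable without weka.jar. The class and method names (ValueEncodingSketch, encodeNominal, encodeDate, decodeDate) are made up for illustration and are not part of the Weka API, and the fixed UTC time zone is an assumption made only to keep the result reproducible.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;

// Hypothetical helper, not part of Weka: packs one row into doubles.
class ValueEncodingSketch {

    // A nominal value is stored as the position of its label in the label list.
    static double encodeNominal(List<String> labels, String label) {
        int index = labels.indexOf(label);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown label: " + label);
        }
        return index;
    }

    // A date is stored as milliseconds since 1970-01-01 (UTC is assumed here).
    static double encodeDate(String pattern, String text) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse date: " + text, e);
        }
    }

    // Turns the stored milliseconds back into text using the same pattern.
    static String decodeDate(String pattern, double millis) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(new Date((long) millis));
    }

    // A missing value is represented by NaN.
    static double missing() {
        return Double.NaN;
    }

    public static void main(String[] args) {
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            labels.add("value_" + i);
        }
        double[] row = new double[4];
        row[0] = 55;                                    // numeric
        row[1] = encodeNominal(labels, "value_5");      // nominal -> index
        row[2] = encodeDate("dd-MM-yyyy", "7-6-1987");  // date -> milliseconds
        row[3] = missing();                             // missing -> NaN
        System.out.println(row[0] + " " + row[1] + " "
                + decodeDate("dd-MM-yyyy", row[2]) + " " + row[3]);
    }
}
```

This also explains why a nominal or string value cannot be written into the array directly: it first has to be looked up (or registered) in the attribute to obtain its index.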

At this point we have the dataset in memory. We can use it (the class attribute still needs to be set), print it to stdout, or save it to a file and read it back:

[codesyntax lang=”java”]

//4.output dataset
System.out.println(dataset);

//5.save dataset
String file = "C:\\temp\\weka_test.arff";
ArffSaver saver = new ArffSaver();
saver.setInstances(dataset);
saver.setFile(new File(file));
saver.writeBatch();

//6.read dataset
ArffLoader loader = new ArffLoader();
loader.setFile(new File(file));
dataset = loader.getDataSet();

[/codesyntax]
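It helps to know what the saver writes: an ARFF file is plain text with a @relation line, one @attribute line per attribute, and a @data section holding one comma-separated row per instance, where ? marks a missing value. The sketch below builds such text by hand for a shortened version of our dataset. It is only an illustration of the layout (ArffLayoutSketch and its methods are hypothetical names); the exact quoting and number formatting that ArffSaver produces may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the ARFF layout; Weka's ArffSaver does this for you.
class ArffLayoutSketch {

    // Header: relation name, one declaration per attribute, then the data marker.
    static String header(String relation, List<String> attributeDeclarations) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String declaration : attributeDeclarations) {
            sb.append("@attribute ").append(declaration).append('\n');
        }
        sb.append("\n@data\n");
        return sb.toString();
    }

    // One data row: values separated by commas, null written as '?' (missing).
    static String row(String... values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(values[i] == null ? "?" : values[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String arff = header("my_dataset", Arrays.asList(
                "my-numeric numeric",
                "my-nominal {value_5,value_9}",  // label list shortened for the example
                "my-string string",
                "my-date date dd-MM-yyyy"))
                + row("55", "value_5", "Slavko", "07-06-1987") + "\n"
                + row(null, "value_9", "Marinka", "23-04-1989") + "\n";
        System.out.print(arff);
    }
}
```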

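As we have one string attribute, we need to preprocess it, because very few classifiers support string attributes. We can accomplish this with filters, for example by converting the strings into word attributes with StringToWordVector (the StringToNominal filter would instead turn the string attribute into a single nominal one):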

[codesyntax lang=”java”]

//7.preprocess strings (almost no classifier supports them)
StringToWordVector filter = new StringToWordVector();
filter.setInputFormat(dataset);
dataset = Filter.useFilter(dataset, filter);
System.out.println(dataset);

[/codesyntax]
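To see what a word-vector transformation does conceptually, here is a minimal plain-Java sketch: it collects a vocabulary from all texts and replaces each text with a 0/1 vector marking which vocabulary words occur in it. This is an illustration of the idea only, not Weka's implementation; WordVectorSketch and its methods are hypothetical names, and splitting on whitespace is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration of a word-vector transformation; not Weka code.
class WordVectorSketch {

    // Collects the sorted set of distinct words over all texts.
    static List<String> vocabulary(List<String> texts) {
        TreeSet<String> words = new TreeSet<>();
        for (String text : texts) {
            for (String word : text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return new ArrayList<>(words);
    }

    // Marks with 1.0 every vocabulary word that occurs in the text, 0.0 otherwise.
    static double[] toVector(List<String> vocabulary, String text) {
        double[] vector = new double[vocabulary.size()];
        for (String word : text.trim().split("\\s+")) {
            int index = vocabulary.indexOf(word);
            if (index >= 0) {
                vector[index] = 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("Slavko", "Marinka", "Mirko");
        List<String> vocab = vocabulary(texts);
        System.out.println(vocab);
        for (String text : texts) {
            System.out.println(text + " -> " + Arrays.toString(toVector(vocab, text)));
        }
    }
}
```

With single-word strings such as the names in our dataset, every distinct value ends up as its own attribute, so the filter pays off mainly on longer texts.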

We have the data. The next step is building a classifier. Weka contains many well-known classifiers such as naive Bayes, decision trees, perceptrons, etc. I like SVMs and I use LibSVM with Weka. Weka already has a built-in LibSVM wrapper, so all you need to do is add libsvm.jar to the classpath and use LibSVM as the classifier instance. The example below uses the J48 decision tree, which needs no extra jar.

Saving classifiers and loading them back is also very easy. The only thing to be aware of is the class index! You must set it before training the classifier. Best practice is to keep the class attribute as the last one; in this example the nominal attribute at index 1 serves as the class.

[codesyntax lang=”java”]

//8.build classifier
dataset.setClassIndex(1);
Classifier classifier = new J48();
classifier.buildClassifier(dataset);

//9.save classifier (to its own file, so the ARFF file is not overwritten)
String modelFile = "C:\\temp\\weka_test.model";
OutputStream os = new FileOutputStream(modelFile);
ObjectOutputStream objectOutputStream = new ObjectOutputStream(os);
objectOutputStream.writeObject(classifier);
objectOutputStream.close();

//10. read classifier back
InputStream is = new FileInputStream(modelFile);
ObjectInputStream objectInputStream = new ObjectInputStream(is);
classifier = (Classifier) objectInputStream.readObject();
objectInputStream.close();

[/codesyntax]
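Saving a classifier this way is ordinary Java object serialization, so the same pattern can be tried without Weka. The sketch below is a minimal round trip under stated assumptions: ToyModel is a made-up stand-in for a trained classifier, and ModelIoSketch with its save, load and roundTrip methods are hypothetical helper names. It uses try-with-resources so the streams are closed even when an exception is thrown.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical illustration of saving and loading a serializable model.
class ModelIoSketch {

    // Made-up stand-in for a trained classifier.
    static class ToyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double threshold;

        ToyModel(double threshold) {
            this.threshold = threshold;
        }
    }

    // Writes the object to the file; the stream is closed automatically.
    static void save(Object model, File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the object back; the caller casts it to the expected type.
    static Object load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Saves to a temporary file and loads the copy back.
    static Object roundTrip(Object model) {
        try {
            File file = File.createTempFile("model", ".bin");
            file.deleteOnExit();
            save(model, file);
            return load(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ToyModel copy = (ToyModel) roundTrip(new ToyModel(0.5));
        System.out.println(copy.threshold);
    }
}
```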

Usually we need to know how good the classifications are. Weka supports a number of evaluation tools, such as cross-validation and various measures. Here we will resample our dataset, split it into a training set and a test set, and output some results.

[codesyntax lang=”java”]

//11.evaluate

//resample if needed
dataset = dataset.resample(new Random(42));

//split 70:30 into train and test set
double percent = 70.0;
int trainSize = (int) Math.round(dataset.numInstances() * percent / 100);
int testSize = dataset.numInstances() - trainSize;
Instances train = new Instances(dataset, 0, trainSize);
Instances test = new Instances(dataset, trainSize, testSize);
train.setClassIndex(1);
test.setClassIndex(1);

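//do eval (retrain on the training part only, so the test set stays unseen)
classifier.buildClassifier(train);
Evaluation eval = new Evaluation(train); //trainset
eval.evaluateModel(classifier, test); //testset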
System.out.println(eval.toSummaryString());
System.out.println(eval.weightedFMeasure());
System.out.println(eval.weightedPrecision());
System.out.println(eval.weightedRecall());

[/codesyntax]
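The weighted measures printed above are per-class values averaged with each class weighted by its number of actual instances. A minimal plain-Java sketch of that calculation from a confusion matrix follows; WeightedMetricsSketch and its methods are hypothetical names, rows are assumed to be actual classes and columns predicted classes, and a measure with a zero denominator is taken as 0, which is an assumption of this sketch.

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical illustration: support-weighted precision, recall and F-measure.
class WeightedMetricsSketch {

    // Number of instances whose actual class is c (row sum).
    static int rowSum(int[][] m, int c) {
        int sum = 0;
        for (int value : m[c]) {
            sum += value;
        }
        return sum;
    }

    // Number of instances predicted as class c (column sum).
    static int colSum(int[][] m, int c) {
        int sum = 0;
        for (int[] row : m) {
            sum += row[c];
        }
        return sum;
    }

    static double precision(int[][] m, int c) {
        int predicted = colSum(m, c);
        return predicted == 0 ? 0.0 : (double) m[c][c] / predicted;
    }

    static double recall(int[][] m, int c) {
        int actual = rowSum(m, c);
        return actual == 0 ? 0.0 : (double) m[c][c] / actual;
    }

    static double fMeasure(int[][] m, int c) {
        double p = precision(m, c);
        double r = recall(m, c);
        return p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    // Averages a per-class measure, weighting each class by its actual count.
    static double weighted(int[][] m, IntToDoubleFunction metric) {
        double total = 0;
        double sum = 0;
        for (int c = 0; c < m.length; c++) {
            int support = rowSum(m, c);
            total += support;
            sum += support * metric.applyAsDouble(c);
        }
        return total == 0 ? 0.0 : sum / total;
    }

    static double weightedPrecision(int[][] m) {
        return weighted(m, c -> precision(m, c));
    }

    static double weightedRecall(int[][] m) {
        return weighted(m, c -> recall(m, c));
    }

    static double weightedFMeasure(int[][] m) {
        return weighted(m, c -> fMeasure(m, c));
    }

    public static void main(String[] args) {
        int[][] confusion = {{8, 2}, {1, 9}};  // made-up counts: rows actual, columns predicted
        System.out.println(weightedPrecision(confusion));
        System.out.println(weightedRecall(confusion));
        System.out.println(weightedFMeasure(confusion));
    }
}
```

Note that with this weighting the weighted recall equals the overall accuracy, since each class contributes exactly its number of correctly classified instances.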

When classifying new instances, remember to map the classifier's result back to a class attribute value: classifyInstance returns only the index of the value!

[codesyntax lang=”java”]

//12.classify
//result (index of the predicted class value)
double predicted = classifier.classifyInstance(dataset.firstInstance());
System.out.println(predicted);
//classified result value
System.out.println(dataset.classAttribute().value((int) predicted));
//class probabilities (one per class value)
System.out.println(java.util.Arrays.toString(classifier.distributionForInstance(dataset.firstInstance())));

[/codesyntax]
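The index-to-label step can be shown in isolation. The sketch below, in plain Java without Weka, maps a prediction returned as a double back to its label and picks the most probable class from a probability distribution. PredictionSketch, labelOf and argMax are hypothetical names, and the labels and probabilities in main are made-up sample values.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of turning prediction indices into labels.
class PredictionSketch {

    // The prediction is a class index stored as a double; cast it to look up the label.
    static String labelOf(List<String> labels, double prediction) {
        return labels.get((int) prediction);
    }

    // Index of the largest probability; the first one wins on ties.
    static int argMax(double[] distribution) {
        int best = 0;
        for (int i = 1; i < distribution.length; i++) {
            if (distribution[i] > distribution[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("value_0", "value_1", "value_2");
        double[] distribution = {0.1, 0.7, 0.2};  // made-up class probabilities
        int index = argMax(distribution);
        System.out.println(index + " -> " + labelOf(labels, index));
    }
}
```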

I hope this example was useful to you. I tried to show how to use Weka for some quick tasks.

5 Comments

Nancy October 2, 2011

Thanks for the share!
Nancy.R

Jessica October 9, 2011

Thanks for the share! Very useful info, looking to communicate!

Webmaster of best elliptical machine

Samantha October 9, 2011

Digged google 30 mins, findally i get it, thanks!

SAN August 9, 2012

Thanks for the code.

But i got error in
new Instance(1.0, attValues).
Because Instance is a Interface. How can i create object for that.
I am using latest weka.jar.

SAN August 10, 2012

Sorry dude. I got the solution. I used source code. so only that problem . Now i am using Jar. Now no error .
