{"id":568,"date":"2011-09-25T12:15:21","date_gmt":"2011-09-25T10:15:21","guid":{"rendered":"https:\/\/blog.zitnik.si\/?p=568"},"modified":"2015-08-05T15:37:51","modified_gmt":"2015-08-05T13:37:51","slug":"quick-intro-to-weka","status":"publish","type":"post","link":"https:\/\/blog.zitnik.si\/?p=568","title":{"rendered":"Quick intro to Weka"},"content":{"rendered":"<p>Weka (<a href=\"http:\/\/www.cs.waikato.ac.nz\/ml\/weka\/\" target=\"_blank\">http:\/\/www.cs.waikato.ac.nz\/ml\/weka\/<\/a>) is data mining software from the University of Waikato. In Slovenia, the Bioinformatics Laboratory has developed another well-known tool, Orange (<a href=\"http:\/\/orange.biolab.si\/\" target=\"_blank\">http:\/\/orange.biolab.si\/<\/a>). Both tools offer a GUI and a library for programmatic access. The main difference is that Weka is Java-based and Orange is Python-based.<\/p>\n<p>Here I will give a short example of how to use Weka from Java. The Java file is available here: <a href=\"http:\/\/zitnik.si\/temp\/Weka.java\" target=\"_blank\">Weka.java<\/a>. All you need to do is put weka.jar on the classpath, then compile and run Weka.java (you need a C:\\temp folder, or you can change the path in the code).<\/p>\n<p>For classification problems we normally have to identify the features first. In Weka, the standard attribute types are numeric, nominal, string, date and relational. A relational attribute can hold a whole dataset as its value. Weka also provides some functions for data preprocessing. 
Here we define some attributes:<\/p>\n<p>[codesyntax lang=&#8221;java&#8221;]<\/p>\n<pre>\/\/1. ATTRIBUTES\r\n\/\/numeric\r\nAttribute attr = new Attribute(\"my-numeric\");\r\nSystem.out.println(attr.isNumeric());\r\n\r\n\/\/nominal: the allowed values are listed up front\r\nFastVector myNomVals = new FastVector();\r\nfor (int i=0; i&lt;10; i++)\r\n\tmyNomVals.addElement(\"value_\"+i);\r\nAttribute attr1 = new Attribute(\"my-nominal\", myNomVals);\r\nSystem.out.println(attr1.isNominal());\r\n\r\n\/\/string: a null FastVector makes it a string attribute\r\nAttribute attr2 = new Attribute(\"my-string\", (FastVector)null);\r\nSystem.out.println(attr2.isString());\r\n\r\n\/\/date: the second argument is the date format\r\nAttribute attr3 = new Attribute(\"my-date\", \"dd-MM-yyyy\");\r\nSystem.out.println(attr3.isDate());\r\n\r\n\/\/a whole relation can also be an attribute\r\n\/\/Attribute attr4 = new Attribute(\"my-relation\", new Instances(...));<\/pre>\n<p>[\/codesyntax]<\/p>\n<p>Once we have the attributes, we can form the dataset, also called a relation (reading from and writing to files comes later):<\/p>\n<p>[codesyntax lang=&#8221;java&#8221;]<\/p>\n<pre>\/\/2. create dataset\r\nFastVector attrs = new FastVector();\r\nattrs.addElement(attr);\r\nattrs.addElement(attr1);\r\nattrs.addElement(attr2);\r\nattrs.addElement(attr3);\r\n\/\/the last argument is the initial capacity\r\nInstances dataset = new Instances(\"my_dataset\", attrs, 0);<\/pre>\n<p>[\/codesyntax]<\/p>\n<p>Now we have defined the relation structure. 
There are several ways to fill the dataset; here we present a few of them:<\/p>\n<p>[codesyntax lang=&#8221;java&#8221;]<\/p>\n<pre>\/\/3. add instances\r\n\/\/first instance: every value is stored internally as a double\r\ndouble[] attValues = new double[dataset.numAttributes()];\r\nattValues[0] = 55;\r\n\/\/nominal and string values are stored as the index of the value\r\nattValues[1] = dataset.attribute(\"my-nominal\").indexOfValue(\"value_5\");\r\nattValues[2] = dataset.attribute(\"my-string\").addStringValue(\"Slavko\");\r\nattValues[3] = dataset.attribute(\"my-date\").parseDate(\"7-6-1987\");\r\ndataset.add(new Instance(1.0, attValues)); \/\/1.0 is the instance weight\r\n\r\n\/\/second instance: attributes looked up by index, first value missing\r\nattValues = new double[dataset.numAttributes()];\r\nattValues[0] = Instance.missingValue();\r\nattValues[1] = dataset.attribute(1).indexOfValue(\"value_9\");\r\nattValues[2] = dataset.attribute(2).addStringValue(\"Marinka\");\r\nattValues[3] = dataset.attribute(3).parseDate(\"23-4-1989\");\r\ndataset.add(new Instance(1.0, attValues));\r\n\r\n\/\/third instance: values set through the Attribute objects\r\nInstance example = new Instance(4);\r\nexample.setValue(attr, 16);\r\nexample.setValue(attr1, \"value_7\");\r\nexample.setValue(attr2, \"Mirko\");\r\nexample.setValue(attr3, attr3.parseDate(\"1-1-1988\"));\r\ndataset.add(example);<\/pre>\n<p>[\/codesyntax]<\/p>\n<p>At this point the dataset is in memory. We can use it (the class attribute still has to be set), print it to stdout, or save it to a file and read it back:<\/p>\n<p>[codesyntax lang=&#8221;java&#8221;]<\/p>\n<pre>\/\/4. output dataset (printed in ARFF format)\r\nSystem.out.println(dataset);\r\n\r\n\/\/5. save dataset\r\nString file = \"C:\\\\temp\\\\weka_test.arff\";\r\nArffSaver saver = new ArffSaver();\r\nsaver.setInstances(dataset);\r\nsaver.setFile(new File(file));\r\nsaver.writeBatch();\r\n\r\n\/\/6. read dataset\r\nArffLoader loader = new ArffLoader();\r\nloader.setFile(new File(file));\r\ndataset = loader.getDataSet();<\/pre>\n<p>[\/codesyntax]<\/p>\n<p>As we have one string attribute, we need to preprocess it, because very few classifiers support string attributes. 
We can accomplish this with filters, for example by converting the string attribute to a word vector with StringToWordVector:<\/p>\n<p>[codesyntax lang=&#8221;java&#8221;]<\/p>\n<pre>\/\/7. preprocess strings (almost no classifier supports them)\r\n\/\/StringToWordVector replaces each string attribute with numeric word attributes\r\nStringToWordVector filter = new StringToWordVector();\r\nfilter.setInputFormat(dataset);\r\ndataset = Filter.useFilter(dataset, filter);\r\nSystem.out.println(dataset);<\/pre>\n<p>[\/codesyntax]<\/p>\n<p>We have the data. The next step is building a classifier. Weka contains a lot of well-known classifiers, such as naive Bayes, decision trees and perceptrons. I like SVMs and I use LibSVM with Weka. Weka already has a built-in LibSVM wrapper, so all you need to do is add libsvm.jar to the classpath and use LibSVM as the classifier instance.<\/p>\n<p>Saving classifiers and loading them back is also very easy. The only thing to be aware of is the class index! You must set it before training the classifier. Best practice is to always make the class attribute the last one.<\/p>\n<p>[codesyntax lang=&#8221;java&#8221;]<\/p>\n<pre>\/\/8. build classifier\r\n\/\/the class attribute must be set before training\r\ndataset.setClassIndex(1);\r\nClassifier classifier = new J48();\r\nclassifier.buildClassifier(dataset);\r\n\r\n\/\/9. save classifier\r\n\/\/separate file (name chosen here for illustration) so the ARFF file is not overwritten\r\nString modelFile = \"C:\\\\temp\\\\weka_test.model\";\r\nOutputStream os = new FileOutputStream(modelFile);\r\nObjectOutputStream objectOutputStream = new ObjectOutputStream(os);\r\nobjectOutputStream.writeObject(classifier);\r\n\/\/close the stream so everything is flushed before the file is read back\r\nobjectOutputStream.close();\r\n\r\n\/\/10. read classifier back\r\nInputStream is = new FileInputStream(modelFile);\r\nObjectInputStream objectInputStream = new ObjectInputStream(is);\r\nclassifier = (Classifier) objectInputStream.readObject();\r\nobjectInputStream.close();<\/pre>\n<p>[\/codesyntax]<\/p>\n<p>Usually we need to know how good the classifications are. Weka supports a number of evaluation tools, such as cross-validation (CV) and different measures. 
Here we will resample our dataset, split it into a training set and a test set, and output some results.<\/p>\n<p>[codesyntax lang=&#8221;java&#8221;]<\/p>\n<pre>\/\/11. evaluate\r\n\r\n\/\/resample if needed (random sampling with replacement, fixed seed)\r\ndataset = dataset.resample(new Random(42));\r\n\r\n\/\/split into 70:30 training and test set\r\ndouble percent = 70.0;\r\nint trainSize = (int) Math.round(dataset.numInstances() * percent \/ 100);\r\nint testSize = dataset.numInstances() - trainSize;\r\nInstances train = new Instances(dataset, 0, trainSize);\r\nInstances test = new Instances(dataset, trainSize, testSize);\r\ntrain.setClassIndex(1);\r\ntest.setClassIndex(1);\r\n\r\n\/\/retrain on the training split only, otherwise the model has already seen the whole dataset\r\nclassifier.buildClassifier(train);\r\n\r\n\/\/do eval\r\nEvaluation eval = new Evaluation(train); \/\/training set\r\neval.evaluateModel(classifier, test); \/\/test set\r\nSystem.out.println(eval.toSummaryString());\r\nSystem.out.println(eval.weightedFMeasure());\r\nSystem.out.println(eval.weightedPrecision());\r\nSystem.out.println(eval.weightedRecall());<\/pre>\n<p>[\/codesyntax]<\/p>\n<p>When classifying new instances, remember to map the classifier&#8217;s result back to a class attribute value: for classification problems, classifyInstance returns only the index of the value!<\/p>\n<p>[codesyntax lang=&#8221;java&#8221;]<\/p>\n<pre>\/\/12. classify\r\n\/\/raw result: index of the predicted class value\r\ndouble predicted = classifier.classifyInstance(dataset.firstInstance());\r\nSystem.out.println(predicted);\r\n\/\/predicted class value as a label\r\nSystem.out.println(dataset.classAttribute().value((int) predicted));\r\n\/\/class distribution, one entry per class value\r\nSystem.out.println(java.util.Arrays.toString(classifier.distributionForInstance(dataset.firstInstance())));<\/pre>\n<p>[\/codesyntax]<\/p>\n<p>I hope this example was useful to you. I tried to show how to use Weka for some quick tasks.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Weka (http:\/\/www.cs.waikato.ac.nz\/ml\/weka\/) is data mining software from the University of Waikato. 
In Slovenia, the Bioinformatics Laboratory has developed another well-known tool, Orange (http:\/\/orange.biolab.si\/). Both tools offer a GUI and a library for programmatic access. The main difference is that Weka is Java-based and Orange is Python-based. Here I&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21],"tags":[],"class_list":["post-568","post","type-post","status-publish","format-standard","hentry","category-research","entry"],"_links":{"self":[{"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=\/wp\/v2\/posts\/568","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=568"}],"version-history":[{"count":9,"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=\/wp\/v2\/posts\/568\/revisions"}],"predecessor-version":[{"id":792,"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=\/wp\/v2\/posts\/568\/revisions\/792"}],"wp:attachment":[{"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=568"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=568"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.zitnik.si\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=568"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}