
Data extraction using Alchemy API

As I am very interested in information extraction from textual data from a research perspective, I have tried the services of Alchemy API.

Alchemy API is known as one of the best natural language processing services and offers a number of APIs for entity extraction, sentiment analysis, keyword extraction, concept tagging, relation extraction, text categorization, author extraction, language detection, text extraction, microformats parsing, feed detection and linked data support. I am mostly interested in entity extraction and relation extraction, as they are the key subtasks of information extraction.

The company offers AlchemyAPI as a service, on-premise services and custom-built services. They also offer 1,000 requests per day in the free tier (this is what I use).

To access the services, you first need to register for an access key and sign an agreement that you will not misuse the services. The services can then be accessed using prepared SDKs (currently available for Node.js, Python, Java, Android, Perl, C# and PHP) or directly via HTTP requests, as the API is very straightforward.
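
As a minimal sketch of the HTTP interface, an entity-extraction request can be built with the standard library alone. The endpoint and parameter names below are my reading of the AlchemyAPI documentation at the time, and YOUR_API_KEY is a placeholder for the key obtained at registration:

```python
import urllib.parse
import urllib.request

# Assumed endpoint and parameter names from the AlchemyAPI docs;
# replace YOUR_API_KEY with your registered access key.
ENDPOINT = "http://access.alchemyapi.com/calls/text/TextGetRankedNamedEntities"

def build_entity_request(api_key, text):
    """Build a ready-to-send POST request for entity extraction (XML output)."""
    params = urllib.parse.urlencode({
        "apikey": api_key,
        "text": text,
        "outputMode": "xml",
    }).encode("utf-8")
    return urllib.request.Request(ENDPOINT, data=params)

req = build_entity_request("YOUR_API_KEY",
                           "Facebook was co-founded by Mark Zuckerberg.")
# response = urllib.request.urlopen(req).read()  # uncomment to actually call the service
```

The actual call is left commented out, so the snippet only prepares the request.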

Entity extraction should normally consist of named entity recognition and coreference resolution. Their service does not return coreference clusters, but it performs entity disambiguation and connects entities to known linked data sources such as DBpedia, Freebase, YAGO and OpenCyc. An example of extracted data in the XML format:

<entity>
    <type>Company</type>
    <relevance>0.451904</relevance>
    <count>1</count>
    <text>Facebook</text>
    <disambiguated>
        <name>Facebook</name>
        <subType>Website</subType>
        <subType>VentureFundedCompany</subType>
        <website>http://www.facebook.com/</website>
        <dbpedia>http://dbpedia.org/resource/Facebook</dbpedia>
        <freebase>http://rdf.freebase.com/ns/m.02y1vz</freebase>
        <yago>http://yago-knowledge.org/resource/Facebook</yago>
        <crunchbase>http://www.crunchbase.com/company/facebook</crunchbase>
    </disambiguated>
</entity>
<entity>
    <type>Person</type>
    <relevance>0.44154</relevance>
    <count>1</count>
    <text>Blair Garrou</text>
</entity>
<entity>
    <type>OperatingSystem</type>
    <relevance>0.43957</relevance>
    <count>1</count>
    <text>Android</text>
</entity>
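
Such a response is easy to consume with the standard library. A small sketch that pulls out entity texts, types and DBpedia links from a fragment like the one above (wrapped in an assumed root element, since the fragment lists several entities):

```python
import xml.etree.ElementTree as ET

# A trimmed response fragment, wrapped in a <results> root for parsing.
sample = """<results>
  <entity>
    <type>Company</type>
    <text>Facebook</text>
    <disambiguated>
      <name>Facebook</name>
      <dbpedia>http://dbpedia.org/resource/Facebook</dbpedia>
    </disambiguated>
  </entity>
  <entity>
    <type>Person</type>
    <text>Blair Garrou</text>
  </entity>
</results>"""

def extract_entities(xml_text):
    """Return (text, type, dbpedia-link-or-None) triples for each <entity>."""
    root = ET.fromstring(xml_text)
    out = []
    for e in root.iter("entity"):
        dbpedia = e.findtext("disambiguated/dbpedia")  # None when not disambiguated
        out.append((e.findtext("text"), e.findtext("type"), dbpedia))
    return out
```

Entities without a <disambiguated> block, like Blair Garrou above, simply come back without a linked data URI.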

For relation extraction, an example XML result looks as follows:

<relation>
    <subject>
        <text>Madonna</text>
    </subject>
    <action>
        <text>enjoys</text>
        <lemmatized>enjoy</lemmatized>
        <verb>
            <text>enjoy</text>
            <tense>present</tense>
        </verb>
    </action>
    <object>
        <text>tasty Pepsi</text>
        <sentiment>
            <type>positive</type>
            <score>0.069017</score>
        </sentiment>
        <sentimentFromSubject>
            <type>positive</type>
            <score>0.0624976</score>
        </sentimentFromSubject>
        <entities>
            <entity>
                <type>Company</type>
                <text>Pepsi</text>
            </entity>
        </entities>
    </object>
</relation>
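
A relation like this flattens naturally into a (subject, action, object) triple plus the object sentiment, again with just the standard library (the sample below is a trimmed copy of the response above):

```python
import xml.etree.ElementTree as ET

sample = """<relation>
  <subject><text>Madonna</text></subject>
  <action>
    <text>enjoys</text>
    <lemmatized>enjoy</lemmatized>
  </action>
  <object>
    <text>tasty Pepsi</text>
    <sentiment><type>positive</type><score>0.069017</score></sentiment>
  </object>
</relation>"""

def parse_relation(xml_text):
    """Return (subject, action lemma, object, object sentiment) from one <relation>."""
    r = ET.fromstring(xml_text)
    return (
        r.findtext("subject/text"),
        r.findtext("action/lemmatized"),
        r.findtext("object/text"),
        r.findtext("object/sentiment/type"),
    )
```

For the Madonna example this yields ("Madonna", "enjoy", "tasty Pepsi", "positive").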

I have also tried some tools available from research groups, but AlchemyAPI seems to work with very high precision on real-life data. In the field of NLP, AlchemyAPI also represents one of the best and most comprehensive NLP suites.

Informatics 2013 Conference

This year we had a bilateral project with the Faculty of Electrical Engineering in Belgrade, Serbia. My Serbian friend Bojan Furlan and I were working on Intelligent Question Routing Systems, which predict the best users to answer a given question. The result of the cooperation was a joint paper that we published at the Informatics 2013 conference.

The conference was held in Spišská Nová Ves, Slovakia, and I decided to go there by car.

On 2nd November I went to Budapest, where I met some friends and did some sightseeing. On Monday (4th November) Bojan came by train and we continued to Slovakia. As we set the GPS to use the shortest path, we drove through Drožnjava, where the road is not in very good condition, and there was also thick fog. Due to some accommodation problems we were moved from the Metropol hotel to the Renesance hotel, where a friend from Ukraine joined us in the same room.

In Spišská everything was closed at 9 o’clock and there was nothing to do in the evenings. Otherwise the city is nice, and they also have some shopping centres – e.g. Tesco, Madaras, …. One afternoon I went there to buy some shoes and clothes.

The conference social events were really calm and ended early – at around 9 pm. On Tuesday we went to see Kežmarok castle, which is very nice, with many collections from the 15th century until today. There I also tried the Slovakian national spirit – Borovička.

The conference programme was oriented towards general informatics and programming languages. One of the keynote speakers was Prof. Dr. Andreas Bollin from the University of Klagenfurt, whose talk was titled “Evolution before Birth? – A Closer Look on Software Deterioration”. He presented some ideas on formal models and future directions. The second keynote, given by Prof. Dr. Zoltán Fülöp, concerned the definitions of languages from a mathematical point of view.

I also remember a few interesting presentations, especially the one on website fragment processing – “Isomorphic Mapping of DOM Trees for Cluster-Based Page Segmentation”. The idea was to represent the webpage as tree-structured HTML and eliminate redundancy to encode the structure. Another talk concerned on-the-fly decisions about how much data to send to clients in order to reduce the amount of server-side processing – “Benchmark-based Optimization of Computational Capacity Distribution in a Client-server Web Application”. There exist frameworks that can benchmark clients, so that when a request is sent to the server, the server knows the capabilities – e.g. the processing power – of the client. The third interesting talk was about the performance evaluation of Micro instances on Amazon EC2 – “Performance of a Java Web Application Running on Amazon EC2 Micro Instance”.

Our paper was presented by Bojan Furlan, and I recorded the presentation, which is available below:

(If you do not see the video, you can download it from http://zitnik.si/temp/informatics2013/IQRS_Informatic2013_BojanFurlan.mp4.)

Presentation:

Some pictures from Budapest, Spisska, Kežmarok castle and Košice:

Machine Learning for everyone – PredictionIO

PredictionIO (http://prediction.io/) is an open-source machine learning (ML) server. Its goal is to make personalization and recommendation algorithms more accessible to programmers without ML knowledge. It includes a recommendation engine and a similarity engine, which can be instantiated, configured and evaluated via a web-based GUI.

Due to the limited number of integrated ML methods, I do not think this product should yet be called a “machine learning server”. As I was curious how the system works, I tested it, and in this post I review how to install and use the server.

1. Installation

First we need to install the server and its dependencies. I was using Mac OS X Mavericks (10.9, GM):

  • We need to install MongoDB (http://www.mongodb.org). At the time of writing, version 2.4.6 was available. To run the database, we need to create a db folder and run the service:
$ mkdir /data/db
$ ./mongod
  • Then we need to download Apache Hadoop (http://hadoop.apache.org/) and add it to PATH.
  • The PredictionIO server and MongoDB connector should be installed as described in the getting started guide:
git clone https://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
git checkout r1.1.0
./sbt publish-local
 
git clone https://github.com/PredictionIO/PredictionIO.git
cd PredictionIO
bin/build.sh
bin/package.sh

2. Run the server

After we have packaged the distribution, we can run the server from dist/target/PredictionIO-<version>. First we need to run the setup script ./bin/setup.sh and then start everything with ./bin/start-all.sh.

The server is accessible only to registered users, who can be added using the command ./bin/users. After that, we can log in to the server via the default port: http://localhost:9000/.


Later, if we see the message “This feature will be available soon.”, we need to run the setup script again and restart the server.

3. Write an example application

Firstly, we create an application. The result of this step is an App Key, which is used by our script. Then we create an engine – we chose the recommendation engine. We need to define item types and some basic recommendation parameters. Afterwards we select a recommendation algorithm and set its parameters.

The main idea is to have a set of users and a set of different items to predict new items for new or existing users.


Secondly, we need to populate the database from our program, and then we can call functions to get new predictions. We published our sample code on GitHub (https://github.com/szitnik/prediction-io-Test). The key idea was to have 4 users and their friendships (modelled as the “view” action) to predict new possible friendships.
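
A rough sketch of what the population step looks like over PredictionIO's REST API. The pio_-prefixed field names, the /users.json, /items.json and /actions/u2i.json paths, and the default API port 8000 are my recollection of the 0.x documentation, so treat them as assumptions and see our GitHub repository for the actual code:

```python
import urllib.parse
import urllib.request

# Assumptions: the API server listens on port 8000 and the 0.x REST API uses
# pio_-prefixed form fields; APP_KEY is the App Key created in the web GUI.
API = "http://localhost:8000"
APP_KEY = "YOUR_APP_KEY"

def payload(**fields):
    """URL-encode a request body with the mandatory app key added."""
    fields["pio_appkey"] = APP_KEY
    return urllib.parse.urlencode(fields).encode("utf-8")

def add_user(uid):
    return urllib.request.Request(API + "/users.json", data=payload(pio_uid=uid))

def add_item(iid, itypes="person"):
    return urllib.request.Request(API + "/items.json",
                                  data=payload(pio_iid=iid, pio_itypes=itypes))

def record_view(uid, iid):
    # Friendships are modelled as "view" actions, as described above.
    return urllib.request.Request(API + "/actions/u2i.json",
                                  data=payload(pio_action="view",
                                               pio_uid=uid, pio_iid=iid))

# e.g. urllib.request.urlopen(add_user("Mirco")) once the server is running
```

The helpers only build the requests; sending them requires a running server, so the urlopen call is shown as a comment.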


After we inserted the data, the system calculated all possibilities and stored them in the MongoDB database:

13/10/21 16:22:50 INFO mongodb.MongoDbCollector: Putting key for output: null { "uid" : "1_XWoman" , "iid" : "1_xwoman" , "score" : 0.8630746441103142 , "itypes" : [ "person"] , "algoid" : 2 , "modelset" : true}
13/10/21 16:22:50 INFO mongodb.MongoDbCollector: Putting key for output: null { "uid" : "1_XWoman" , "iid" : "1_mirco" , "score" : 0.7472373332042158 , "itypes" : [ "person"] , "algoid" : 2 , "modelset" : true}
...
13/10/21 16:22:50 INFO mongodb.MongoDbCollector: Putting key for output: null { "uid" : "1_Mirco" , "iid" : "1_jurcek" , "score" : 0.8126793796567733 , "itypes" : [ "person"] , "algoid" : 2 , "modelset" : true}
13/10/21 16:22:50 INFO mongodb.MongoDbCollector: Putting key for output: null { "uid" : "1_Mirco" , "iid" : "1_mirco" , "score" : 0.6338884061675459 , "itypes" : [ "person"] , "algoid" : 2 , "modelset" : true}
13/10/21 16:22:50 INFO mongodb.MongoDbCollector: Putting key for output: null { "uid" : "1_Mirco" , "iid" : "1_xwoman" , "score" : 0.5107068611450168 , "itypes" : [ "person"] , "algoid" : 2 , "modelset" : true}
...
13/10/21 16:22:50 INFO mongodb.MongoDbCollector: Putting key for output: null { "uid" : "1_Johan" , "iid" : "1_mirco" , "score" : 0.9135908855406835 , "itypes" : [ "person"] , "algoid" : 2 , "modelset" : true}
13/10/21 16:22:50 INFO mongodb.MongoDbCollector: Putting key for output: null { "uid" : "1_Johan" , "iid" : "1_johan" , "score" : 0.7490173941424351 , "itypes" : [ "person"] , "algoid" : 2 , "modelset" : true}

After that we ran some queries to get new friend recommendations:

Retrieve top 1 recommendations for user Mirco
Recommendations: jurcek
Retrieve top 1 recommendations for user XWoman
Recommendations: mirco

The web-based GUI also supports parameter tuning and some evaluation methods for recommendation algorithms.

To conclude, I believe the PredictionIO project is a nice start towards bringing ML methods closer to a large group of programmers (although I do not agree with the ML naming here :)).