Natural Language Processing for the Slovene Language

I have looked into text processing tools for the Slovene language. As far as I could find, there are no published named entity recognizers, relation extractors or coreference resolution systems.

The good news is that some work has been done on lemmatization (sl. lematizacija) and POS tagging (sl. oblikoslovno označevanje), both of which will be very important for text preprocessing. A good entry point for more information is http://bos.zrc-sazu.si/. The site publishes some theory, references to scientific articles and research projects. For me, the important resources are the tagged datasets and the trained models.

For Slovene, there are three main datasets worth mentioning (tagged according to the TEI P5 standard):

  • JOS datasets: These consist of two datasets, jos100k and jos1M, with 100,000 and 1,000,000 hand-checked linguistic annotations, respectively.
  • FidaPLUS dataset: It contains a part of the MULTEXT-East texts and is the reference dataset for the Slovene language, as it is very representative.
  • MULTEXT-East: This is a spin-off of the MULTEXT project. It contains morphological annotations for mostly Eastern European languages: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Lithuanian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene and Ukrainian.
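To get a feeling for how such tagged data could be consumed during preprocessing, here is a minimal Python sketch. It assumes that each token is a TEI w element carrying lemma and msd attributes inside an s (sentence) element. I have not checked this against the actual corpus files, so treat the element names, the attribute names and the sample tags as illustrative assumptions, not as the real schema of any of the datasets above.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace; the element/attribute names below (s, w, lemma, msd)
# are ASSUMED for illustration and may differ in the real corpus files.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample sentence; the msd tags are illustrative only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s>
      <w lemma="to" msd="Pd-nsn">To</w>
      <w lemma="biti" msd="Va-r3s-n">je</w>
      <w lemma="stavek" msd="Ncmsn">stavek</w>
      <c>.</c>
    </s>
  </p></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return a list of sentences, each a list of (form, lemma, msd) tuples."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter(f"{{{TEI_NS}}}s"):
        # Only word elements are collected; punctuation (<c>) is skipped.
        tokens = [
            (w.text, w.get("lemma"), w.get("msd"))
            for w in s.iter(f"{{{TEI_NS}}}w")
        ]
        sentences.append(tokens)
    return sentences


if __name__ == "__main__":
    for sentence in read_tokens(SAMPLE):
        for form, lemma, msd in sentence:
            print(form, lemma, msd, sep="\t")
```

For a real corpus file the same function could be fed the file contents, although for files the size of a million-token corpus an incremental parser such as ET.iterparse would probably be the better choice.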

Trained models for a lemmatizer and POS tagger can be found here: http://oznacevalnik.slovenscina.eu/Vsebine/Sl/ProgramskaOprema/Oblikoslovni.aspx. Unfortunately (I am a Unix user now :(), the software is written in .NET, but it comes with training data and some examples. Maybe I could get the source code from the developers and port it to Java, or I will reimplement it on my own. On videolectures.net there is a nice presentation of the work done on this project. Another lemmatizer can be found here: http://lemmatise.ijs.si/Software. It may be the same as the previous one, as it was written by researchers from the same department.

There is also a WordNet-like lexicon for Slovene, named sloWNet 2.2. It currently contains 20,000 synsets along with 17,000 literals.

More information can also be found on the site of the Jožef Stefan Institute's research group: http://nl.ijs.si/. Commercial tools are available from the company Amebis. Most of their products are rule-based and, from my point of view, can only be used as helpers for research.
