Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger

Super User

5 years ago

[This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Parts of Speech (POS) tagging is a crucial part in natural language processing. It consists of labelling each word in a text document with a certain category like noun, verb, adverb, pronoun, … . At BNOSAC, we use it on a dayly basis in order to select only nouns before we do topic detection or in specific NLP flows. For R users working with different languages, the number of POS tagging options is small and all have up or downsides. The following taggers are commonly used.

The Stanford Part-Of-Speech Tagger which is terribly slow, the language set is limited to English/French/German/Spanish/Arabic/Chinese (no Dutch). R packages for this are available at http://datacube.wu.ac.at.
Treetagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger) contains more languages but is only usable for non-commercial purposes (can be used based on the koRpus R package)
OpenNLP is faster and allows to do POS tagging for Dutch, Spanish, Polish, Swedish, English, Danish, German but no French or Eastern-European languages. R packages for this are available at http://datacube.wu.ac.at.
Package pattern.nlp (https://github.com/bnosac/pattern.nlp) allows Parts of Speech tagging and lemmatisation for Dutch, French, English, German, Spanish, Italian but needs Python installed which is not always easy to request at IT departments
SyntaxNet and Parsey McParseface (https://github.com/tensorflow/models/tree/master/syntaxnet) have good accuracy for POS tagging but need tensorflow installed which might be too much installation hassle in a corporate setting not to mention the computational resources needed.

Comes in RDRPOSTagger which BNOSAC released at https://github.com/bnosac/RDRPOSTagger. It has the following features:

Easily installable in a corporate environment as a simple R package based on rJava
Covering more than 40 languages:
UniversalPOS annotation for languages: Ancient_Greek, Ancient_Greek-PROIEL, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Czech-CAC, Czech-CLTT, Danish, Dutch, Dutch-LassySmall, English, English-LinES, Estonian, Finnish, Finnish-FTB, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Kazakh, Latin, Latin-ITTB, Latin-PROIEL, Latvian, Norwegian, Old_Church_Slavonic, Persian, Polish, Portuguese, Portuguese-BR, Romanian, Russian-SynTagRus, Slovenian, Slovenian-SST, Spanish, Spanish-AnCora, Swedish, Swedish-LinES, Tamil, Turkish. Prepend the UD_ to the language if you want to used these models.
MORPH annotation for languages: Bulgarian, Czech, Dutch, French, German, Portuguese, Spanish, Swedish
POS annotation for languages: English, French, German, Hindi, Italian, Thai, Vietnamese
Fast tagging as the Single Classification Ripple Down Rules are easy to execute and hence are quick on larger text volumes
Competitive accuracy in comparison to state-of-the-art POS and morphological taggers
Cross-platform running on Windows/Linux/Mac
It allows to do the Morphological, POS tagging and universal POS tagging of sentences

The Ripple Down Rules a basic binary classification trees which are built on top of the Universal Dependencies datasets available at http://universaldependencies.org. The methodology of this is explained in detail at the paper ‘ A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging’ available at http://content.iospress.com/articles/ai-communications/aic698. If you just want to apply POS tagging on your text, you can go ahead as follows:

library(RDRPOSTagger)
rdr_available_models()

## POS annotation
x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, x = x)

## MORPH/POS annotation
x <- c("Dus godvermehoeren met pus in alle puisten , zei die schele van Van Bukburg .", 
       "Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont",
       "  ", "", NA)
tagger <- rdr_model(language = "Dutch", annotation = "MORPH")
rdr_pos(tagger, x = x)

## Universal POS tagging annotation
tagger <- rdr_model(language = "UD_Dutch", annotation = "UniversalPOS")
rdr_pos(tagger, x = x)

## This gives the following output
 sentence.id word.id             word word.type
           1       1              Dus       ADV
           1       2   godvermehoeren      VERB
           1       3              met       ADP
           1       4              pus      NOUN
           1       5               in       ADP
           1       6             alle      PRON
           1       7          puisten      NOUN
           1       8                ,     PUNCT
           1       9              zei      VERB
           1      10              die      PRON
           1      11           schele       ADJ
           1      12              van       ADP
           1      13              Van     PROPN
           1      14          Bukburg     PROPN
           1      15                .     PUNCT
           2       1               Er       ADV
           2       2              was       AUX
           2       3             toen     SCONJ
           2       4              dat     SCONJ
           2       5           liedje      NOUN
           2       6              van       ADP
           2       7 tietenkonttieten      VERB
           2       8             kont     PROPN
           2       9           tieten      VERB
           2      10     kontkontkont     PROPN
           2      11                .     PUNCT
           3       0             <NA>      <NA>
           4       0             <NA>      <NA>
           5       0             <NA>      <NA>

The function rdr_pos requests as input a vector of sentences. If you need to transform you text data to sentences, just use tokenize_sentences from the tokenizers package.

Good luck with text mining.
If you need our help for a text mining project. Let us know, we’ll be glad to get you started.

To leave a comment for the author, please follow the link and comment on their blog: bnosac :: open analytical helpers.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.