Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger
Part-of-speech (POS) tagging is a crucial step in natural language processing. It consists of labelling each word in a text document with a category such as noun, verb, adverb or pronoun. At BNOSAC, we use it on a daily basis to select only nouns before doing topic detection, or in specific NLP flows. For R users working with different languages, the number of POS tagging options is small, and each has its upsides and downsides. The following taggers are commonly used.
- The Stanford Part-Of-Speech Tagger, which is terribly slow and whose language set is limited to English/French/German/Spanish/Arabic/Chinese (no Dutch). R packages for this are available at http://datacube.wu.ac.at.
- TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger) covers more languages but is only usable for non-commercial purposes (it can be used through the koRpus R package).
- OpenNLP is faster and supports POS tagging for Dutch, Spanish, Polish, Swedish, English, Danish and German, but no French or Eastern-European languages. R packages for this are available at http://datacube.wu.ac.at.
- Package pattern.nlp (https://github.com/bnosac/pattern.nlp) provides POS tagging and lemmatisation for Dutch, French, English, German, Spanish and Italian, but needs Python installed, which is not always easy to request from IT departments.
- SyntaxNet and Parsey McParseface (https://github.com/tensorflow/models/tree/master/syntaxnet) have good accuracy for POS tagging but need TensorFlow installed, which might be too much installation hassle in a corporate setting, not to mention the computational resources needed.
Enter RDRPOSTagger, which BNOSAC released at https://github.com/bnosac/RDRPOSTagger. It has the following features:
- Easily installable in a corporate environment as a simple R package based on rJava
- Covering more than 40 languages:
  - UniversalPOS annotation for languages: Ancient_Greek, Ancient_Greek-PROIEL, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Czech-CAC, Czech-CLTT, Danish, Dutch, Dutch-LassySmall, English, English-LinES, Estonian, Finnish, Finnish-FTB, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Kazakh, Latin, Latin-ITTB, Latin-PROIEL, Latvian, Norwegian, Old_Church_Slavonic, Persian, Polish, Portuguese, Portuguese-BR, Romanian, Russian-SynTagRus, Slovenian, Slovenian-SST, Spanish, Spanish-AnCora, Swedish, Swedish-LinES, Tamil, Turkish. Prepend UD_ to the language name (for example UD_Dutch) if you want to use these models.
  - MORPH annotation for languages: Bulgarian, Czech, Dutch, French, German, Portuguese, Spanish, Swedish
  - POS annotation for languages: English, French, German, Hindi, Italian, Thai, Vietnamese
- Fast tagging, as the Single Classification Ripple Down Rules are easy to execute and hence quick on larger text volumes
- Competitive accuracy in comparison to state-of-the-art POS and morphological taggers
- Cross-platform running on Windows/Linux/Mac
- Supports morphological tagging, POS tagging and universal POS tagging of sentences
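The package is hosted on GitHub, so one plausible way to install it is through the remotes package. This is a sketch and not taken from the original post: the use of remotes::install_github is an assumption about how you want to install, and because the package is based on rJava, a working Java installation is needed on the machine.

```r
# Assumption: install straight from the GitHub repository named above.
# rJava (a dependency) requires Java to be installed and discoverable by R.
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("bnosac/RDRPOSTagger")

# Load the package to confirm the installation worked
library(RDRPOSTagger)
```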
The Ripple Down Rules are basic binary classification trees which are built on top of the Universal Dependencies datasets available at http://universaldependencies.org. The methodology is explained in detail in the paper 'A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging', available at http://content.iospress.com/articles/ai-communications/aic698. If you just want to apply POS tagging to your text, you can proceed as follows:
    library(RDRPOSTagger)
    rdr_available_models()

    ## POS annotation
    x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
    tagger <- rdr_model(language = "English", annotation = "POS")
    rdr_pos(tagger, x = x)

    ## MORPH/POS annotation
    x <- c("Dus godvermehoeren met pus in alle puisten , zei die schele van Van Bukburg .",
           "Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont",
           " ", "", NA)
    tagger <- rdr_model(language = "Dutch", annotation = "MORPH")
    rdr_pos(tagger, x = x)

    ## Universal POS tagging annotation
    tagger <- rdr_model(language = "UD_Dutch", annotation = "UniversalPOS")
    rdr_pos(tagger, x = x)

    ## This gives the following output
    sentence.id  word.id  word              word.type
    1            1        Dus               ADV
    1            2        godvermehoeren    VERB
    1            3        met               ADP
    1            4        pus               NOUN
    1            5        in                ADP
    1            6        alle              PRON
    1            7        puisten           NOUN
    1            8        ,                 PUNCT
    1            9        zei               VERB
    1            10       die               PRON
    1            11       schele            ADJ
    1            12       van               ADP
    1            13       Van               PROPN
    1            14       Bukburg           PROPN
    1            15       .                 PUNCT
    2            1        Er                ADV
    2            2        was               AUX
    2            3        toen              SCONJ
    2            4        dat               SCONJ
    2            5        liedje            NOUN
    2            6        van               ADP
    2            7        tietenkonttieten  VERB
    2            8        kont              PROPN
    2            9        tieten            VERB
    2            10       kontkontkont      PROPN
    2            11       .                 PUNCT
    3            0        <NA>              <NA>
    4            0        <NA>              <NA>
    5            0        <NA>              <NA>
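The introduction mentions keeping only nouns before topic detection. A minimal sketch of that filtering step follows, assuming the result of rdr_pos is a data frame with the columns shown in the output above (sentence.id, word.id, word, word.type). The data frame `tagged` is a hand-made stand-in holding a few rows copied from that output, so the snippet runs without the tagger installed.

```r
# Hypothetical stand-in for the rdr_pos() result: a few rows copied
# from the UniversalPOS output above.
tagged <- data.frame(
  sentence.id = c(1, 1, 1, 1, 1, 2, 2, 2),
  word.id     = c(3, 4, 5, 6, 7, 4, 5, 6),
  word        = c("met", "pus", "in", "alle", "puisten", "dat", "liedje", "van"),
  word.type   = c("ADP", "NOUN", "ADP", "PRON", "NOUN", "SCONJ", "NOUN", "ADP"),
  stringsAsFactors = FALSE
)

# Keep only the nouns; %in% also drops rows where word.type is NA,
# which is what empty or missing sentences produce.
nouns <- tagged[tagged$word.type %in% "NOUN", ]

# Group the remaining words per sentence, e.g. as input for topic detection
nouns_by_sentence <- split(nouns$word, nouns$sentence.id)
print(nouns_by_sentence)
```

Using `%in%` rather than `==` avoids NA rows leaking into the subset when some inputs were empty strings or NA.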
The function rdr_pos expects a vector of sentences as input. If you need to split your text data into sentences first, just use tokenize_sentences from the tokenizers package.
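As a rough sketch of that workflow, the snippet below splits a short text into sentences and passes them to the tagger. The sample text is made up for illustration. tokenize_sentences returns a list with one element per input document, so the result is flattened with unlist before tagging.

```r
library(tokenizers)
library(RDRPOSTagger)

# Made-up example text containing several sentences
text <- "Text mining is fun. Tagging words is the first step. Nouns often carry the topic."

# tokenize_sentences() returns a list (one element per document);
# unlist() turns it into the character vector of sentences rdr_pos() expects
sentences <- unlist(tokenize_sentences(text))

tagger <- rdr_model(language = "English", annotation = "POS")
tags <- rdr_pos(tagger, x = sentences)
head(tags)
```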
Good luck with text mining.
If you need our help with a text mining project, let us know. We'll be glad to get you started.