Multilingual NLTK for POS Tagging and Lemmatization

I recently started working with NLP and have tried NLTK and TextBlob for analyzing texts. I would like to develop an app that analyzes reviews written by travelers, so I have to handle a lot of text in different languages. I need to do two main operations: POS tagging and lemmatization. I have seen that NLTK lets you choose the right language for sentence tokenization, like this:

tokenizer = nltk.data.load('tokenizers/punkt/PY3/italian.pickle')

I haven't yet found the right way to set the language for POS tagging and lemmatization. How can I set the correct corpus/dictionary for non-English texts such as Italian, French, Spanish or German? I also see that it is possible to import the "TreeBank" or "WordNet" modules, but I don't understand how to use them. Otherwise, where can I find the respective corpora?

Can you give me some suggestions or references? Please bear in mind that I'm not an NLTK expert.

Many Thanks.



Solution 1:[1]

If you are looking for another multilingual POS tagger, you might want to try RDRPOSTagger: a robust, easy-to-use and language-independent toolkit for POS and morphological tagging. See experimental results including performance speed and tagging accuracy on 13 languages in this paper. RDRPOSTagger now supports pre-trained POS and morphological tagging models for Bulgarian, Czech, Dutch, English, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese. RDRPOSTagger also supports the pre-trained Universal POS tagging models for 40 languages.

In Python, you can utilize the pre-trained models for tagging a raw unlabeled text corpus as:

python RDRPOSTagger.py tag PATH-TO-PRETRAINED-MODEL PATH-TO-LEXICON PATH-TO-RAW-TEXT-CORPUS

Example: python RDRPOSTagger.py tag ../Models/POS/German.RDR ../Models/POS/German.DICT ../data/GermanRawTest

If you would like to program with RDRPOSTagger, please follow code lines 92-98 of the RDRPOSTagger.py module in the pSCRDRTagger package. Here is an example:

r = RDRPOSTagger()
r.constructSCRDRtreeFromRDRfile("../Models/POS/German.RDR")  # Load POS tagging model for German
DICT = readDictionary("../Models/POS/German.DICT")  # Load a German lexicon
r.tagRawSentence(DICT, "Die Reaktion des deutschen Außenministers zeige , daß dieser die außerordentlich wichtige Rolle Irans in der islamischen Welt erkenne .")

r = RDRPOSTagger()
r.constructSCRDRtreeFromRDRfile("../Models/POS/French.RDR") # Load POS tagging model for French
DICT = readDictionary("../Models/POS/French.DICT") # Load a French lexicon
r.tagRawSentence(DICT, "Cette annonce a fait l' effet d' une véritable bombe . ")

Solution 2:[2]

There is no option that you can pass to NLTK's POS-tagging and lemmatizing functions that will make them process other languages.

One solution would be to get a training corpus for each language and train your own POS taggers with NLTK, then figure out a lemmatizing solution, maybe dictionary-based, for each language.
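The dictionary-based lemmatizing idea can be sketched in a few lines of plain Python. This is only a minimal illustration, not an NLTK feature: the `LEXICONS` table, its sample entries and the `lemmatize` helper are hypothetical, and a real system would load a full form-to-lemma lexicon for each language from a file.

```python
# Toy lexicons mapping an inflected form to its lemma.
# The entries are hypothetical samples, not taken from any real resource.
LEXICONS = {
    "it": {"camere": "camera", "erano": "essere", "pulite": "pulito"},
    "fr": {"chambres": "chambre", "étaient": "être", "propres": "propre"},
}


def lemmatize(tokens, lang):
    """Look up each token in the lexicon for `lang`.

    Lookup is case-insensitive; tokens missing from the lexicon are
    returned unchanged. Raises KeyError for an unconfigured language.
    """
    lexicon = LEXICONS[lang]
    return [lexicon.get(token.lower(), token) for token in tokens]


italian_lemmas = lemmatize(["Le", "camere", "erano", "pulite"], "it")
french_lemmas = lemmatize(["Les", "chambres", "étaient", "propres"], "fr")
```

A plain lookup like this cannot disambiguate forms that map to several lemmas depending on their part of speech, which is one reason to prefer a tool that tags and lemmatizes together.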

That might be overkill though, as there is already a one-stop solution for both tasks in Italian, French, Spanish and German (and many other languages): TreeTagger. It is not as state-of-the-art as the POS taggers and lemmatizers available for English, but it still does a good job.

What you want is to install TreeTagger on your system and be able to call it from Python. Here is a GitHub repo by miotto that lets you do just that.

The following snippet shows you how to test that you set up everything correctly. As you can see, I am able to POS-tag and lemmatize in one function call, and I can do it just as easily in English and in French.

>>> import os
>>> os.environ['TREETAGGER'] = "/opt/treetagger/cmd" # Or wherever you installed TreeTagger
>>> from treetagger import TreeTagger
>>> tt_en = TreeTagger(encoding='utf-8', language='english')
>>> tt_en.tag('Does this thing even work?')
[[u'Does', u'VBZ', u'do'], [u'this', u'DT', u'this'], [u'thing', u'NN', u'thing'], [u'even', u'RB', u'even'], [u'work', u'VB', u'work'], [u'?', u'SENT', u'?']]
>>> tt_fr = TreeTagger(encoding='utf-8', language='french')
>>> tt_fr.tag(u'Mon Dieu, faites que ça marche!')
[[u'Mon', u'DET:POS', u'mon'], [u'Dieu', u'NOM', u'Dieu'], [u',', u'PUN', u','], [u'faites', u'VER:pres', u'faire'], [u'que', u'KON', u'que'], [u'\xe7a', u'PRO:DEM', u'cela'], [u'marche', u'NOM', u'marche'], [u'!', u'SENT', u'!']]
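Once you have that output, a small helper can turn the [word, tag, lemma] triples into something easier to work with, for example to keep only the lemmas of nouns and verbs in a review. The sketch below does not call TreeTagger; it copies the French output shown above into a literal list. The names `TaggedToken`, `to_tokens` and `content_lemmas` are made up for this example, and skipping entries that do not have exactly three fields is a defensive assumption about the wrapper's output.

```python
from collections import namedtuple

# One analysed token: surface form, POS tag and lemma.
TaggedToken = namedtuple("TaggedToken", ["word", "pos", "lemma"])


def to_tokens(tagged):
    """Convert [word, pos, lemma] triples into TaggedToken tuples.

    Entries that do not have exactly three fields are skipped.
    """
    return [TaggedToken(*triple) for triple in tagged if len(triple) == 3]


def content_lemmas(tagged, tag_prefixes):
    """Return the lemmas of tokens whose POS tag starts with a given prefix."""
    prefixes = tuple(tag_prefixes)
    return [t.lemma for t in to_tokens(tagged) if t.pos.startswith(prefixes)]


# The French result from the session above, written as a plain list.
french_output = [
    ["Mon", "DET:POS", "mon"], ["Dieu", "NOM", "Dieu"], [",", "PUN", ","],
    ["faites", "VER:pres", "faire"], ["que", "KON", "que"],
    ["ça", "PRO:DEM", "cela"], ["marche", "NOM", "marche"], ["!", "SENT", "!"],
]

# French tags in this output use NOM for nouns and VER:... for verbs.
lemmas = content_lemmas(french_output, ["NOM", "VER"])
```

Tag names differ between languages (the English output above uses NN and VB), so the prefixes have to be chosen per language.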

Since this question gets asked a lot (and since the installation process is not super straight-forward, IMO), I will write a blog post on the matter and update this answer with a link to it as soon as it is done.

EDIT: Here is the above-mentioned blog post.

Solution 3:[3]

I quite like using spaCy for multilingual NLP. They have trained models for Catalan, Chinese, Danish, Dutch, English, French, German, Greek, Italian, Japanese, Lithuanian, Macedonian, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian and Spanish.

You would simply load a different model depending on the language you're working with:

import spacy
nlp_DE = spacy.load("de_core_news_sm")
nlp_FR = spacy.load("fr_core_news_sm")
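To get POS tags and lemmas for the four languages in the question, you could wrap the loading in a small dispatcher, roughly like this. It is a sketch under a few assumptions: the helper names `model_name_for` and `tag_and_lemmatize` are invented for this example, the Italian and Spanish model names follow the same naming pattern as the two above, and each model has to be installed first (for example with `python -m spacy download it_core_news_sm`).

```python
# Language code -> spaCy model name. The German and French names are the ones
# used above; the Italian and Spanish names are assumed to follow the pattern.
MODEL_NAMES = {
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "it": "it_core_news_sm",
    "es": "es_core_news_sm",
}

_loaded_models = {}  # cache so that each model is loaded only once


def model_name_for(lang):
    """Return the configured model name for a language code."""
    try:
        return MODEL_NAMES[lang]
    except KeyError:
        raise ValueError(f"No model configured for language {lang!r}") from None


def tag_and_lemmatize(text, lang):
    """Return (text, coarse POS tag, lemma) for every token in `text`."""
    import spacy  # imported here so the mapping is usable without spaCy

    name = model_name_for(lang)
    if name not in _loaded_models:
        _loaded_models[name] = spacy.load(name)
    doc = _loaded_models[name](text)
    return [(token.text, token.pos_, token.lemma_) for token in doc]
```

With the Italian model installed, calling `tag_and_lemmatize("Le camere erano pulite.", "it")` should give one tuple per token; `token.pos_` holds the coarse Universal POS tag and `token.lemma_` the lemma.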

spaCy is not as accurate as TreeTagger or HanoverTagger, but it is very easy to use and produces usable results that are much better than NLTK's.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 Justin Schmidt