NLTK: find German nouns

I want to extract all German nouns from a German text, in lemmatized form, using NLTK.

I also looked at spaCy, but I would much prefer NLTK, because my English pipeline already runs with the performance and data structure I need.

I have the following working code for English:

import nltk
from nltk.stem import WordNetLemmatizer

#germanText='Jahrtausendelang ging man davon aus, dass auch der Sender einen geheimen Schlüssel, und zwar den gleichen wie der Empfänger, benötigt.'

text='For thousands of years it was assumed that the sender also needed a secret key, the same as the recipient.'

tokens = nltk.word_tokenize(text)
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']

print(tokens)

This prints the expected result: ['year', 'sender', 'key', 'recipient']

Now I tried to do this for German:

import nltk
from nltk.stem import WordNetLemmatizer

germanText='Jahrtausendelang ging man davon aus, dass auch der Sender einen geheimen Schlüssel, und zwar den gleichen wie der Empfänger, benötigt.'
#text='For thousands of years it was assumed that the sender also needed a secret key, the same as the recipient.'

tokens = nltk.word_tokenize(germanText, language='german')
tokens = [tok.lower() for tok in tokens]
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(tok) for tok in tokens]
tokens = [word for (word, pos) in nltk.pos_tag(tokens) if pos[0] == 'N']

print(tokens)

But I get the wrong result: ['jahrtausendelang', 'man', 'davon', 'au', 'der', 'sender', 'einen', 'geheimen', 'zwar', 'den', 'gleichen', 'wie', 'der', 'empfänger', 'benötigt']

Neither the lemmatization nor the noun extraction worked.

What is the proper way to adapt this code to other languages?

I also looked at other approaches, such as:

from nltk.stem.snowball import GermanStemmer
stemmer = GermanStemmer()  # the class is already German-specific; no language argument is needed
tokenGer = [stemmer.stem(tok) for tok in tokens]  # stem() takes one word at a time

But this would mean starting over from scratch.



Solution 1:[1]

I found a way using the HanoverTagger (HanTa):

import nltk
from HanTa import HanoverTagger as ht

tagger = ht.HanoverTagger('morphmodel_ger.pgz')
# germanText is the string defined in the question
words = nltk.word_tokenize(germanText, language='german')
print(tagger.tag_sent(words))
# keep only the tokens tagged as common nouns (NN)
tokens = [word for (word, x, pos) in tagger.tag_sent(words, taglevel=1) if pos == 'NN']

This gives the expected result: ['Jahrtausendelang', 'Sender', 'Schlüssel', 'Empfänger']
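Since the question asks for the nouns in lemmatized form, it may help to pick the middle element of each tuple (the one unpacked as `x` above) instead of the surface word, assuming `tag_sent` returns `(word, lemma, pos)` triples. Below is a minimal sketch of that filtering step. The helper name `noun_lemmas`, its parameters, and the `sample` triples are hypothetical and hand-written for illustration; they are not actual tagger output, so the snippet runs without HanTa installed.

```python
from typing import Iterable, List, Sequence, Tuple

# Assumed shape of one tagged token: (word, lemma, pos)
TaggedToken = Tuple[str, str, str]


def noun_lemmas(
    tagged: Iterable[TaggedToken],
    noun_tags: Sequence[str] = ("NN",),
    lowercase: bool = False,
) -> List[str]:
    """Return the lemma of every token whose POS tag is in noun_tags."""
    lemmas = [lemma for (_word, lemma, pos) in tagged if pos in noun_tags]
    # Optionally lowercase, to mirror the output style of the English pipeline
    return [lemma.lower() for lemma in lemmas] if lowercase else lemmas


# Hand-written sample triples (hypothetical, NOT real tagger output)
sample = [
    ("der", "der", "ART"),
    ("Sender", "Sender", "NN"),
    ("geheimen", "geheim", "ADJA"),
    ("Schlüssel", "Schlüssel", "NN"),
    ("Empfängern", "Empfänger", "NN"),
]

print(noun_lemmas(sample))
print(noun_lemmas(sample, lowercase=True))
```

With the real tagger this would be called as `noun_lemmas(tagger.tag_sent(words, taglevel=1))`. Adding `'NE'` to `noun_tags` would also keep proper nouns, assuming the tagger uses that tag for them.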

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Veritas_in_Numeris