How to lemmatise a dataframe column in Python

How can I lemmatise a dataframe column? My CSV file "train.csv" looks like this:

id  tweet
1   retweet if you agree
2   happy birthday your majesty
3   essential oils are not made of chemicals

I performed the following:

import pandas as pd
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

train_data = pd.read_csv('train.csv', error_bad_lines=False)
print(train_data)

# Removing stop words
stop = stopwords.words('english')
test = pd.DataFrame(train_data['tweet'])
test.columns = ['tweet']

test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
print(test['tweet_without_stopwords'])

# TOKENIZATION
tt = TweetTokenizer()
test['tokenised_tweet'] = test['tweet_without_stopwords'].apply(tt.tokenize)
print(test)

output:

0 retweet if you agree ... [retweet, agree]
1 happy birthday your majesty ... [happy, birthday, majesty]
2 essential oils are not made of chemicals ... [essential, oils, made, chemicals]


I tried the following to lemmatise, but I'm getting this error: TypeError: unhashable type: 'list'


lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in test['tokenised_tweet']]]
print(lemmatized)



Solution 1:[1]

Iterating over test['tokenised_tweet'] yields one whole list per row, so lmtzr.lemmatize receives a list instead of a string; that is what raises TypeError: unhashable type: 'list'. I would do the calculation on the dataframe itself:

changing:

lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in test['tokenised_tweet']]]
print(lemmatized)

to:
lmtzr = WordNetLemmatizer()
test['lemmatize'] = test['tokenised_tweet'].apply(
                    lambda lst:[lmtzr.lemmatize(word) for word in lst])

full code:

from io import StringIO

import pandas as pd
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# first run may require:
# nltk.download('stopwords'); nltk.download('wordnet')

data = StringIO(
"""id;tweet
1;retweet if you agree
2;happy birthday your majesty
3;essential oils are not made of chemicals"""
)
test = pd.read_csv(data, sep=";")

# Removing stop words
stop = stopwords.words('english')

test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
print(test['tweet_without_stopwords'])

# TOKENIZATION
tt = TweetTokenizer()
test['tokenised_tweet'] = test['tweet_without_stopwords'].apply(tt.tokenize)
print(test)

# LEMMATIZATION: apply works row by row, so lemmatize always gets a single word
lmtzr = WordNetLemmatizer()
test['lemmatize'] = test['tokenised_tweet'].apply(
                    lambda lst: [lmtzr.lemmatize(word) for word in lst])
print(test['lemmatize'])

output:

0                    [retweet, agree]
1          [happy, birthday, majesty]
2    [essential, oil, made, chemical]
Name: lemmatize, dtype: object
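
Note in the output that nouns are lemmatized ("oils" -> "oil") but "made" passes through unchanged: WordNetLemmatizer treats every word as a noun unless you pass a pos argument. The sketch below is not part of the original answer, and the helper names (to_wordnet_pos, lemmatize_with_pos, the lemmatize_pos column) are my own; it shows one common way to feed nltk.pos_tag results into the lemmatizer, assuming the test dataframe from the full code above:

import nltk
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

# first run may require: nltk.download('averaged_perceptron_tagger')

lmtzr = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags (from pos_tag) to WordNet POS constants
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # same default lemmatize() uses on its own

def lemmatize_with_pos(tokens):
    return [lmtzr.lemmatize(tok, to_wordnet_pos(tag))
            for tok, tag in pos_tag(tokens)]

test['lemmatize_pos'] = test['tokenised_tweet'].apply(lemmatize_with_pos)
print(test['lemmatize_pos'])

With POS information the last row typically comes out as [essential, oil, make, chemical], with "made" lemmatized as a verb.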

Solution 2:[2]

Just for future reference, not to revive an old thread.

Here is how I have done it, it could be improved but it works:

import nltk

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

# lemmatize each row, then rejoin the tokens into a single string
df['Summary'] = df['Summary'].apply(lemmatize_text)
df['Summary'] = df['Summary'].apply(lambda x: " ".join(x))


Change the names of the DataFrame columns to your choosing; basically, this tokenizes each of the texts, lemmatizes the tokens, and rejoins them once finished. A runnable example is sketched below.
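
For instance, a minimal self-contained run (the df and its Summary column here are a toy stand-in, since the snippet above doesn't define them):

import pandas as pd

df = pd.DataFrame({'Summary': ['essential oils are not made of chemicals']})
df['Summary'] = df['Summary'].apply(lemmatize_text)
df['Summary'] = df['Summary'].apply(lambda x: " ".join(x))
print(df['Summary'][0])  # essential oil are not made of chemical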

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Bernardo stearns reisen
[2] Solution 2: Run 4ever