How to lemmatise a dataframe column in Python
How can I lemmatise a dataframe column? The CSV file "train.csv" looks like this:
id tweet
1 retweet if you agree
2 happy birthday your majesty
3 essential oils are not made of chemicals
I performed the following:
import pandas as pd
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
train_data = pd.read_csv('train.csv', error_bad_lines=False)  # note: error_bad_lines was removed in pandas 2.0; use on_bad_lines='skip' there
print(train_data)
# Removing stop words
stop = stopwords.words('english')
test = pd.DataFrame(train_data['tweet'])
test.columns = ['tweet']
test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
print(test['tweet_without_stopwords'])
# TOKENIZATION
tt = TweetTokenizer()
test['tokenised_tweet'] = test['tweet_without_stopwords'].apply(tt.tokenize)
print(test)
output:
0 retweet if you agree ... [retweet, agree]
1 happy birthday your majesty ... [happy, birthday, majesty]
2 essential oils are not made of chemicals ... [essential, oils, made, chemicals]
I tried the following to lemmatise, but I'm getting this error: TypeError: unhashable type: 'list'
lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in test['tokenised_tweet']]]
print(lemmatized)
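(For context: the error happens because the comprehension iterates over the Series itself, so each word is a whole row's token list; lemmatize() then tries to use that list as a dictionary key internally, which is what raises unhashable type: 'list'. A quick check against the same test dataframe:
for word in test['tokenised_tweet']:
    print(type(word))  # <class 'list'> -- a whole row of tokens, not a single word
)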
Solution 1:[1]
I would do the calculation on the dataframe itself:
changing:
lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in test['tokenised_tweet']]]
print(lemmatized)
to:
lmtzr = WordNetLemmatizer()
test['lemmatize'] = test['tokenised_tweet'].apply(
    lambda lst: [lmtzr.lemmatize(word) for word in lst])
full code:
from io import StringIO

import pandas as pd
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

data = StringIO(
"""id;tweet
1;retweet if you agree
2;happy birthday your majesty
3;essential oils are not made of chemicals"""
)
test = pd.read_csv(data, sep=";")
# Removing stop words
stop = stopwords.words('english')
test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
print(test['tweet_without_stopwords'])
# TOKENIZATION
tt = TweetTokenizer()
test['tokenised_tweet'] = test['tweet_without_stopwords'].apply(tt.tokenize)
print(test)
lmtzr = WordNetLemmatizer()
test['lemmatize'] = test['tokenised_tweet'].apply(
    lambda lst: [lmtzr.lemmatize(word) for word in lst])
print(test['lemmatize'])
output:
0 [retweet, agree]
1 [happy, birthday, majesty]
2 [essential, oil, made, chemical]
Name: lemmatize, dtype: object
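Note that WordNetLemmatizer.lemmatize() treats every word as a noun by default, which is why oils becomes oil but made is left unchanged. If verbs matter for your data, pass a part-of-speech tag as the second argument, e.g.:
lmtzr = WordNetLemmatizer()
print(lmtzr.lemmatize('made'))       # made  (default pos='n', noun)
print(lmtzr.lemmatize('made', 'v'))  # make  (lemmatised as a verb)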
Solution 2:[2]
Just for future reference, not to revive an old thread.
Here is how I have done it; it could be improved, but it works:
import nltk

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df['Summary'] = df['Summary'].apply(lemmatize_text)
df['Summary'] = df['Summary'].apply(lambda x: " ".join(x))
Change the DF column names to suit your data; basically, this tokenizes each text, lemmatizes the tokens, and rejoins them once finished.
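For example, run against a hypothetical one-row dataframe (the Summary column name is just illustrative), the whole pipeline looks like this:
import nltk
import pandas as pd

# hypothetical example data; substitute your own column name
df = pd.DataFrame({'Summary': ['essential oils are not made of chemicals']})

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df['Summary'] = df['Summary'].apply(lemmatize_text)
df['Summary'] = df['Summary'].apply(lambda x: " ".join(x))
print(df['Summary'][0])  # essential oil are not made of chemical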
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Bernardo stearns reisen |
Solution 2 | Run 4ever |