Why is my word lemmatization not working as expected?
Hi Stack Overflow community! Long-time reader but first-time poster. I'm currently trying my hand at NLP, and after reading a few forum posts on this topic I still can't get the lemmatizer to work properly (function pasted below). Comparing my original text with the preprocessed text, all of the cleaning steps work as expected except the lemmatization. I've even tried specifying the part of speech 'v' so that words aren't treated as nouns by default and I get the base form of each verb (e.g. turned -> turn, are -> be, reading -> read), but that doesn't seem to make any difference.
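Oddly, when I call the lemmatizer directly on single words with pos 'v', I do get the base forms I listed above (just quick checks I ran on my own sample words):

from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
print(wnl.lemmatize('turned', 'v'))   # turn
print(wnl.lemmatize('are', 'v'))      # be
print(wnl.lemmatize('reading', 'v'))  # read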
Appreciate another set of eyes and feedback - thanks!
# key imports
import pandas as pd
import numpy as np
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import WordNetLemmatizer
import contractions
# cleaning functions
def to_lower(text):
    '''
    Convert text to lowercase
    '''
    return text.lower()
def remove_punct(text):
    return ''.join(c for c in text if c not in punctuation)
def remove_stopwords(text):
    '''
    Removes stop words which don't carry meaning (ex: is, the, a, etc.)
    '''
    additional_stopwords = ['app']
    stop_words = set(stopwords.words('english')) - set(['not', 'out', 'in'])
    stop_words = stop_words.union(additional_stopwords)
    return ' '.join([w for w in word_tokenize(text) if w not in stop_words])
def fix_contractions(text):
    '''
    Expands contractions
    '''
    return contractions.fix(text)
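The helpers above all behave as I expect when I spot-check them individually (the sample strings here are made up):

print(fix_contractions("can't stop won't stop"))   # cannot stop will not stop
print(remove_punct('hello, world!'))               # hello world
print(remove_stopwords('this is not in the app'))  # not in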
# preprocessing pipeline
def preprocess(text):
    # convert to lower case
    lower_text = to_lower(text)
    sentence_tokens = sent_tokenize(lower_text)
    word_list = []
    for each_sent in sentence_tokens:
        # fix contractions
        clean_text = fix_contractions(each_sent)
        # remove punctuation
        clean_text = remove_punct(clean_text)
        # filter out stop words
        clean_text = remove_stopwords(clean_text)
        # get base form of word
        wnl = WordNetLemmatizer()
        for part_of_speech in ['v']:
            lemmatized_word = wnl.lemmatize(clean_text, part_of_speech)
        # split the sentence into word tokens
        word_tokens = word_tokenize(lemmatized_word)
        for i in word_tokens:
            word_list.append(i)
    return word_list
# lemmatization not working as expected -- base forms are never returned
# ex: 'turned' stays 'turned' instead of the base form 'turn'
# ex: 'running' stays 'running' instead of the base form 'run'
sample_data = posts_with_text['post_text'].head(5)
print(sample_data)
sample_data.apply(preprocess)
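The smallest repro I've managed so far: a single token lemmatizes fine, but passing a whole multi-word string (which is what clean_text is by the time it reaches lemmatize) comes back unchanged. The sample sentence below is made up:

wnl = WordNetLemmatizer()
print(wnl.lemmatize('turned', 'v'))           # turn
print(wnl.lemmatize('turned out fine', 'v'))  # turned out fine (unchanged)

Not sure if that's what's tripping up my pipeline.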
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow