Why is my word lemmatization not working as expected?
Hi Stack Overflow community! Long-time reader but first-time poster. I'm currently trying my hand at NLP, and after reading a few forum posts on this topic I still can't get the lemmatizer to work properly (function pasted below). Comparing my original text with the preprocessed text, all of the cleaning steps work as expected except the lemmatization. I've even tried specifying the part of speech 'v' so that words aren't treated as nouns by default and I get the base form of each verb (e.g. turned -> turn, are -> be, reading -> read), but that doesn't seem to make any difference.
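Oddly, when I call the lemmatizer directly on single words with pos 'v', I do get the base forms I listed above (just quick checks I ran on my own sample words):

from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
print(wnl.lemmatize('turned', 'v'))   # turn
print(wnl.lemmatize('are', 'v'))      # be
print(wnl.lemmatize('reading', 'v'))  # read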
Appreciate another set of eyes and feedback - thanks!
# key imports
import pandas as pd
import numpy as np
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.stem import WordNetLemmatizer
import contractions
# cleaning functions
def to_lower(text):
    '''
    Convert text to lowercase
    '''
    return text.lower()
def remove_punct(text):
    return ''.join(c for c in text if c not in punctuation)
def remove_stopwords(text):
    '''
    Removes stop words which don't carry meaning (ex: is, the, a, etc.)
    '''
    additional_stopwords = ['app']
    stop_words = set(stopwords.words('english')) - set(['not', 'out', 'in'])
    stop_words = stop_words.union(additional_stopwords)
    return ' '.join([w for w in word_tokenize(text) if w not in stop_words])
def fix_contractions(text):
    '''
    Expands contractions
    '''
    return contractions.fix(text)
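The helpers above all behave as I expect when I spot-check them individually (the sample strings here are made up):

print(fix_contractions("can't stop won't stop"))   # cannot stop will not stop
print(remove_punct('hello, world!'))               # hello world
print(remove_stopwords('this is not in the app'))  # not in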
# preprocessing pipeline
def preprocess(text):
    # convert to lower case
    lower_text = to_lower(text)
    sentence_tokens = sent_tokenize(lower_text)
    word_list = []
    for each_sent in sentence_tokens:
        # fix contractions
        clean_text = fix_contractions(each_sent)
        # remove punctuation
        clean_text = remove_punct(clean_text)
        # filter out stop words
        clean_text = remove_stopwords(clean_text)
        # get base form of word
        wnl = WordNetLemmatizer()
        for part_of_speech in ['v']:
            lemmatized_word = wnl.lemmatize(clean_text, part_of_speech)
        # split the sentence into word tokens
        word_tokens = word_tokenize(lemmatized_word)
        for i in word_tokens:
            word_list.append(i)
    return word_list
# lemmatization not working as expected -- base forms are never returned
# ex: 'turned' stays 'turned' instead of the base form 'turn'
# ex: 'running' stays 'running' instead of the base form 'run'
sample_data = posts_with_text['post_text'].head(5)
print(sample_data)
sample_data.apply(preprocess)
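The smallest repro I've managed so far: a single token lemmatizes fine, but passing a whole multi-word string (which is what clean_text is by the time it reaches lemmatize) comes back unchanged. The sample sentence below is made up:

wnl = WordNetLemmatizer()
print(wnl.lemmatize('turned', 'v'))           # turn
print(wnl.lemmatize('turned out fine', 'v'))  # turned out fine (unchanged)

Not sure if that's what's tripping up my pipeline.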
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow