How do I remove nonsensical or incomplete words from a corpus?

I am using some text for NLP analysis. I have cleaned the text by removing non-alphanumeric characters, blanks, duplicate words, and stopwords, and I have also performed stemming and lemmatization:

from nltk.tokenize import word_tokenize
import nltk.corpus
import re
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import pandas as pd

data_df = pd.read_csv('path/to/file/data.csv')

stopwords = nltk.corpus.stopwords.words('english') 

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# Function to remove duplicate words from a sentence, preserving order
def unique_list(l):
    ulist = []
    for x in l:
        if x not in ulist:
            ulist.append(x)
    return ulist

for i in range(len(data_df)):

    # Convert to lower case, split into individual words using word_tokenize
    sentence = word_tokenize(data_df['O_Q1A'][i].lower())  # or: data_df['O_Q1A'][i].split(' ')

    # Remove stopwords
    filtered_sentence = [w for w in sentence if w not in stopwords]

    # Remove duplicate words from sentence
    filtered_sentence = unique_list(filtered_sentence)

    # Remove punctuation (anything that is not a word character or whitespace)
    junk_free_sentence = []
    for word in filtered_sentence:
        junk_free_sentence.append(re.sub(r"[^\w\s]", " ", word)) # Remove punctuation, but don't strip whitespace just yet
        #junk_free_sentence.append(re.sub(r"[^a-z]", " ", word)) # Keep only alphabetic characters

    # Stem the junk free sentence
    stemmed_sentence = []
    for w in junk_free_sentence:
        stemmed_sentence.append(stemmer.stem(w))

    # Lemmatize the stemmed sentence
    lemmatized_sentence = []
    for w in stemmed_sentence:
        lemmatized_sentence.append(lemmatizer.lemmatize(w))

    # Assign with .loc to avoid pandas chained-assignment issues
    data_df.loc[i, 'O_Q1A'] = ' '.join(lemmatized_sentence)

But when I display the top words (according to some criteria), I still get junk like:

ask
much
thank
work
le
know
via
sdh
n
sy
t
n t
recommend
never

Out of these top words, only five are sensible (ask, know, recommend, thank and work). What more do I need to do to retain only meaningful words?



Solution 1:[1]

The default NLTK stoplist is a minimal one, and it certainly doesn't contain words like 'ask' and 'much', because they are not generally nonsensical. These words are only irrelevant to you; they may not be to others. For your problem, you can always apply your own custom stopword filter after using NLTK's. A simple example:

from nltk.corpus import stopwords

def removeStopWords(text):
    # Select English stopwords
    cachedStopWords = set(stopwords.words("english"))
    # Add custom words (extend this tuple with your own)
    cachedStopWords.update(('ask', 'much', 'thank'))
    # Remove stop words
    new_text = ' '.join([word for word in text.split() if word not in cachedStopWords])
    return new_text
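
For instance, with the custom words above (the sample sentence here is just an illustration):

print(removeStopWords("please ask if you need much more help"))
# -> 'please need help'  ('ask' and 'much' are custom; 'if', 'you' and 'more' are NLTK defaults)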

Alternatively, you can edit the NLTK stopwords list directly; it is essentially a plain text file stored in the NLTK data directory.
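
For example, you can locate that file from Python (a minimal sketch; the printed path depends on where your NLTK data is installed):

import nltk

# Print the location of the English stopwords file inside the NLTK data directory
print(nltk.data.find('corpora/stopwords/english'))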

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Vivek Jain