How do I remove nonsensical or incomplete words from a corpus?
I am using some text for NLP analyses. I have cleaned the text by removing non-alphanumeric characters, blanks, duplicate words and stopwords, and have also performed stemming and lemmatization:
from nltk.tokenize import word_tokenize
import nltk.corpus
import re
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import pandas as pd

data_df = pd.read_csv('path/to/file/data.csv')
stopwords = nltk.corpus.stopwords.words('english')
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# Remove duplicate words from a sentence, keeping the first occurrence
def unique_list(l):
    ulist = []
    for x in l:
        if x not in ulist:
            ulist.append(x)
    return ulist

for i in range(len(data_df)):
    # Convert to lower case and split into individual words using word_tokenize
    sentence = word_tokenize(data_df['O_Q1A'][i].lower())
    # Remove stopwords
    filtered_sentence = [w for w in sentence if w not in stopwords]
    # Remove duplicate words from the sentence
    filtered_sentence = unique_list(filtered_sentence)
    # Replace non-word characters with spaces, but don't remove whitespace just yet
    junk_free_sentence = []
    for word in filtered_sentence:
        junk_free_sentence.append(re.sub(r"[^\w\s]", " ", word))
        # junk_free_sentence.append(re.sub(r"^[a-z]+$", " ", word))  # earlier attempt to keep only alphabetic tokens
    # Stem the junk-free sentence
    stemmed_sentence = []
    for w in junk_free_sentence:
        stemmed_sentence.append(stemmer.stem(w))
    # Lemmatize the stemmed sentence
    lemmatized_sentence = []
    for w in stemmed_sentence:
        lemmatized_sentence.append(lemmatizer.lemmatize(w))
    data_df.loc[i, 'O_Q1A'] = ' '.join(lemmatized_sentence)
But when I display the top words (ranked by some criterion), I still get some junk like:
ask
much
thank
work
le
know
via
sdh
n
sy
t
n t
recommend
never
Out of these top words, only five are sensible (ask, know, recommend, thank and work). What more do I need to do to retain only meaningful words?
Solution 1:[1]
The default NLTK stoplist is a minimal one, and it certainly doesn't contain words like 'ask' and 'much', because they are not generally nonsensical. These words are merely irrelevant to you; they may not be to others. For your problem, you can always apply your own custom stopword filter after using NLTK. A simple example:
from nltk.corpus import stopwords

def removeStopWords(text):
    # Select the English stopwords
    cachedStopWords = set(stopwords.words("english"))
    # Add custom, domain-specific words to filter out
    cachedStopWords.update(('ask', 'much', 'thank', 'etc.'))
    # Remove the stopwords
    new_str = ' '.join([word for word in text.split() if word not in cachedStopWords])
    return new_str
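For example (assuming the stopwords corpus has already been fetched with nltk.download('stopwords')):

print(removeStopWords('thank you very much for the work'))
# 'work' -- 'thank' and 'much' are caught by the custom entries, the rest by the default list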
Alternatively, you can edit the NLTK stopwords list itself, which is essentially a plain text file (one word per line) stored in the NLTK data directory.
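If you want to locate that file on your machine, nltk.data.find resolves corpus resources to their on-disk paths (the exact location depends on where nltk_data was downloaded); a minimal sketch:

import nltk

# Resolve the path of the English stopword file; raises LookupError if the
# stopwords corpus has not been downloaded yet.
path = nltk.data.find('corpora/stopwords/english')
print(path)  # e.g. /home/user/nltk_data/corpora/stopwords/english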
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Vivek Jain |