How to extract only English words from a big text corpus using nltk?
I want to remove all non-dictionary English words from a text corpus. I have removed stopwords, tokenized, and count-vectorized the data. Now I need to extract only the English words and attach them back to the dataframe.
import string
from sklearn.feature_extraction.text import CountVectorizer

data['Clean_addr'] = data['Adj_Addr'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: "".join([item for item in x if not item.isdigit()]))
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: "".join([item for item in x if item not in string.punctuation]))
# new_stop_words is a custom stopword list defined elsewhere
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: ' '.join([item for item in x.split() if item not in new_stop_words]))

cv = CountVectorizer(max_features=200, analyzer='word')
cv_addr = cv.fit_transform(data.pop('Clean_addr'))
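For the last step mentioned above, attaching the vectorized output back to the dataframe, a minimal sketch, assuming the cv and cv_addr objects from the snippet and a recent scikit-learn (older versions use get_feature_names() instead of get_feature_names_out()):

import pandas as pd

# Turn the sparse bag-of-words matrix into a DataFrame whose columns are
# the vectorizer's vocabulary, then join it back onto the original data.
bow = pd.DataFrame(cv_addr.toarray(),
                   columns=cv.get_feature_names_out(),
                   index=data.index)
data = data.join(bow)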
Sample dump of the file I am using:
Solution 1:[1]
After you first tokenize your text corpus, you could instead stem the word tokens:
import nltk
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language="english")
- stemmer is the algorithm instance executing the stemming
- stemming is just the process of breaking a word down into its root
- passing the argument 'english' selects the porter2 stemming algorithm; more precisely, this 'english' argument maps to nltk.stem.snowball.EnglishStemmer (the porter2 stemmer is considered to be better than the original porter stemmer); a quick check of its output is sketched below
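As a quick sanity check of what the stemmer does (a minimal sketch; exact stems can vary slightly across NLTK versions):

# e.g. 'running' -> 'run', 'cats' -> 'cat', 'generously' -> 'generous'
print([stemmer.stem(w) for w in ["running", "cats", "generously"]])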
stems = [stemmer.stem(t) for t in tokenized]
Above, I define a list comprehension, which executes as follows:
- the list comprehension loops over our tokenized input list tokenized (tokenized can also be any other iterable input instance)
- the list comprehension's action is to perform the .stem method on each tokenized word, using the SnowballStemmer instance stemmer
- the list comprehension then collects only the set of English stems; i.e., it is a list that should collect only stemmed English word tokens
Caveat: the list comprehension could conceivably include certain identical inflected words from other languages that English descends from, because porter2 would mistakenly treat them as English words.
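Because of that caveat, stemming alone does not filter out non-English tokens. If the goal is strictly dictionary-English words, one option is to combine the stemmer with a membership test against NLTK's word list. A minimal sketch, assuming nltk.download('words') has been run and tokenized is defined as above:

import nltk

english_vocab = set(w.lower() for w in nltk.corpus.words.words())
# Stem only the tokens that appear in the English word list
english_stems = [stemmer.stem(t) for t in tokenized if t.lower() in english_vocab]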
Solution 2:[2]
Down To The Essence
I had a VERY similar need. Your question appeared in my search. I felt I needed to look further, and I found THIS. I did a bit of modification for my specific needs (only English words from TONS of technical data sheets: no numbers, test standards, values, units, etc.). After much pain with other approaches, the below worked. I hope it can be a good launching point for you and others.
import nltk
from nltk.corpus import stopwords

# nltk.download('words') and nltk.download('stopwords') may be needed first
words = set(nltk.corpus.words.words())
stop_words = stopwords.words('english')

file_name = 'Full path to your file'
with open(file_name, 'r') as f:
    text = f.read()
text = text.replace('\n', ' ')

# Keep only tokens that are dictionary words, are not stopwords,
# and are longer than one character
new_text = " ".join(w for w in nltk.wordpunct_tokenize(text)
                    if w.lower() in words
                    and w.lower() not in stop_words
                    and len(w.lower()) > 1)
print(new_text)
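To apply the same filter row by row to a dataframe column, as the question asks, one possible adaptation (column names taken from the question):

def keep_english(text):
    # Keep tokens that are dictionary words, not stopwords, longer than one character
    return " ".join(w for w in nltk.wordpunct_tokenize(text)
                    if w.lower() in words
                    and w.lower() not in stop_words
                    and len(w.lower()) > 1)

data['Clean_addr'] = data['Clean_addr'].apply(keep_english)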
Solution 3:[3]
I used the pyenchant library to do this.
import enchant
import nltk
from tqdm import tqdm

d = enchant.Dict("en_US")

def get_eng_words(data):
    eng = []
    for sample in tqdm(data):
        sentence = ''
        word_tokens = nltk.word_tokenize(sample)
        for word in word_tokens:
            if d.check(word):  # keep only words the en_US dictionary recognizes
                if sentence == '':
                    sentence = word
                else:
                    sentence = sentence + " " + word
        print(sentence)
        eng.append(sentence)
    return eng
To save it back into your dataframe, just do this:

sentences = get_eng_words(df['column'])
df['column'] = sentences  # a plain list assigns cleanly by position

Hope it helps anyone!
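Note that this solution assumes pyenchant and the NLTK tokenizer models are available; if not, a one-time setup along these lines is needed:

# In a shell: pip install pyenchant
# Then, once, in Python:
import nltk
nltk.download('punkt')  # tokenizer models used by nltk.word_tokenize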
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Thom Ives |
| Solution 3 | reisen Inaba |