How to extract only English words from a big text corpus using nltk?

I want to remove all non-dictionary English words from a text corpus. I have removed stopwords, tokenized, and count-vectorized the data. I need to extract only the English words and attach them back to the dataframe.

import string
from sklearn.feature_extraction.text import CountVectorizer

# Lowercase, then strip digits, punctuation, and stopwords (new_stop_words is defined elsewhere)
data['Clean_addr'] = data['Adj_Addr'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: "".join([item for item in x if not item.isdigit()]))
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: "".join([item for item in x if item not in string.punctuation]))
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: ' '.join([item for item in x.split() if item not in new_stop_words]))
cv = CountVectorizer(max_features=200, analyzer='word')
cv_addr = cv.fit_transform(data.pop('Clean_addr'))

Sample Dump of the File I am using

https://www.dropbox.com/s/allhfdxni0kfyn6/Test.csv?dl=0



Solution 1:[1]

After you first tokenize your text corpus, you could stem the word tokens instead:

import nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language="english")

stems = [stemmer.stem(t) for t in tokenized]  

Above, I define a list comprehension, which executes as follows (a runnable sketch appears after the caveat below):

  1. The list comprehension loops over our tokenized input list tokenized
    • (tokenized can also be any other iterable input instance)
  2. Its action is to call the .stem method on each tokenized word, using the SnowballStemmer instance stemmer
  3. It then collects the resulting stems
    • i.e., it is a list that should contain only stemmed English word tokens


Caveat: the list comprehension could conceivably include certain identical inflected words from other languages that English descends from, because Porter2 would mistakenly treat them as English words.
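A minimal, self-contained sketch of the steps above; the sample sentence and the use of nltk.word_tokenize are my own additions, not part of the original answer:

import nltk
from nltk.stem.snowball import SnowballStemmer

# nltk.download('punkt')  # required once for word_tokenize

stemmer = SnowballStemmer(language="english")

text = "The offices were relocated to the northern district"
tokenized = nltk.word_tokenize(text.lower())

# Stem every token; stemming normalises words but does not
# by itself discard non-English tokens (see the caveat above).
stems = [stemmer.stem(t) for t in tokenized]
print(stems)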

Solution 2:[2]

Down To The Essence

I had a VERY similar need. Your question appeared in my search. I felt I needed to look further, and I found the approach below. I modified it a bit for my specific needs (only English words from TONS of technical data sheets: no numbers, test standards, values, units, etc.). After much pain with other approaches, the code below worked. I hope it can be a good launching point for you and others.

import nltk
from nltk.corpus import stopwords

# nltk.download('words') and nltk.download('stopwords') are required once
words = set(nltk.corpus.words.words())
stop_words = stopwords.words('english')


file_name = 'Full path to your file'
with open(file_name, 'r') as f:
    text = f.read()
    text = text.replace('\n', ' ')

new_text = " ".join(w for w in nltk.wordpunct_tokenize(text)
                    if w.lower() in words
                    and w.lower() not in stop_words
                    and len(w.lower()) > 1)

print(new_text)
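If the goal is to attach the filtered text back to the DataFrame from the question, the same filter can be applied per row. A rough sketch, assuming the sample CSV and the column names ('Adj_Addr', 'Clean_addr') from the question:

import nltk
import pandas as pd
from nltk.corpus import stopwords

# nltk.download('words') and nltk.download('stopwords') are required once
words = set(nltk.corpus.words.words())
stop_words = set(stopwords.words('english'))

def keep_english(text):
    # Keep lowercase dictionary words that are not stopwords and are longer than one character
    return " ".join(w for w in nltk.wordpunct_tokenize(str(text))
                    if w.lower() in words
                    and w.lower() not in stop_words
                    and len(w) > 1)

data = pd.read_csv('Test.csv')  # the sample dump linked in the question
data['Clean_addr'] = data['Adj_Addr'].apply(keep_english)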

Solution 3:[3]

I used the pyenchant library to do this.

import enchant
import nltk
from tqdm import tqdm

d = enchant.Dict("en_US")

def get_eng_words(data):
    eng = []
    for sample in tqdm(data):
        sentence = ''
        word_tokens = nltk.word_tokenize(sample)
        for word in word_tokens:
            # Keep the word only if the en_US dictionary recognises it
            if d.check(word):
                if sentence == '':
                    sentence = sentence + word
                else:
                    sentence = sentence + " " + word
        # print(sentence)  # uncomment to inspect each cleaned row
        eng.append(sentence)
    return eng

To save it back into the DataFrame, just do this:

import pandas as pd

sentences = get_eng_words(df['column'])
df['column'] = pd.Series(sentences, index=df.index)  # align on the original index so rows do not become NaN
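Note that pyenchant is a separate package (pip install pyenchant) and, on Linux, it also needs the system enchant library. A quick sanity check that the backend and the en_US dictionary are available:

import enchant

d = enchant.Dict("en_US")
print(d.check("address"))   # True: recognised English word
print(d.check("addres"))    # False: not in the dictionary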

Hope it helps!

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution | Source
Solution 1 |
Solution 2 | Thom Ives
Solution 3 | reisen Inaba