pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python
I am trying to delete stop words via Spark; the code is as follows:
from nltk.corpus import stopwords
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)

word_list = ["ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "what", "the", "fuck", "is", "this", "world", "too", "who", "who's", "whom", "yours", "yourself", "yourselves"]
wordlist = spark.createDataFrame([word_list]).rdd

def stopwords_delete(word_list):
    filtered_words = []
    print(word_list)
    for word in word_list:
        print(word)
        if word not in stopwords.words('english'):
            filtered_words.append(word)

filtered_words = wordlist.map(stopwords_delete)
print(filtered_words)
and I get the following error:

pickle.PicklingError: args[0] from __newobj__ args has the wrong class

I don't know why; can somebody help me? Thanks in advance.
Solution 1:[1]
This has to do with how the stop words module is shipped to the executors. As a workaround, import the stopwords library within the function itself; please see the similar issue linked below. I had the same issue and this workaround fixed it.
def stopwords_delete(word_list):
    from nltk.corpus import stopwords
    filtered_words = []
    print(word_list)
As a permanent fix, I would recommend from pyspark.ml.feature import StopWordsRemover.
Solution 2:[2]
Probably it's just because you are evaluating stopwords.words('english') on the executor for every word. Define it once, outside the function, and this will work.
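To illustrate the idea, here is a pure-Python sketch; a small hardcoded set stands in for stopwords.words('english'), which would otherwise require the nltk corpus data to be downloaded:

```python
# Build the stopword collection once, on the driver. Using a set also makes
# each membership test O(1) instead of rescanning a list for every word.
# (Hardcoded subset here; in real code: frozenset(stopwords.words('english')).)
STOPWORDS = frozenset(["ourselves", "out", "over", "own", "the", "is", "this", "too", "who"])

def stopwords_delete(word_list):
    # STOPWORDS is captured by the closure and serialized once with the function.
    return [w for w in word_list if w not in STOPWORDS]

print(stopwords_delete(["the", "world", "is", "too", "big"]))  # ['world', 'big']
```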
Solution 3:[3]
You are using map over an RDD that has only one row, with each word as a column. So the entire row of the RDD, which is of type Row, is passed to the stopwords_delete function, and the for loop inside it tries to match the Row against the stopwords and fails. Try this instead:
filtered_words=stopwords_delete(wordlist.flatMap(lambda x:x).collect())
print(filtered_words)
I got this output as filtered_words,
["shan't", "she'd", 'fuck', 'world', "who's"]
Also, include a return statement in your function.
Alternatively, you could replace the stopwords_delete function with a list comprehension:
filtered_words = wordlist.flatMap(lambda x:[i for i in x if i not in stopwords.words('english')]).collect()
Solution 4:[4]
The problem is related to the stopwords.words('english') line; you need to evaluate it in a stable way (for example, once on the driver, outside the mapped function).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Abhishek Gupta |
| Solution 3 | |
| Solution 4 | Zakaria_b |