ValueError: RDD is empty -- PySpark (Windows Standalone)
I am trying to create an RDD, but Spark is not creating it and throws back the error pasted below:
data = records.map(lambda r: LabeledPoint(extract_label(r), extract_features(r)))
first_point = data.first()
Py4JJavaError Traceback (most recent call last)
<ipython-input-19-d713906000f8> in <module>()
----> 1 first_point = data.first()
2 print "Raw data: " + str(first[2:])
3 print "Label: " + str(first_point.label)
4 print "Linear Model feature vector:\n" + str(first_point.features)
5 print "Linear Model feature vector length: " + str(len (first_point.features))
C:\spark\python\pyspark\rdd.pyc in first(self)
1313 ValueError: RDD is empty
1314 """
-> 1315 rs = self.take(1)
1316 if rs:
1317 return rs[0]
C:\spark\python\pyspark\rdd.pyc in take(self, num)
1295
1296 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1297 res = self.context.runJob(self, takeUpToNumLeft, p)
...
Any help will be greatly appreciated.
Thank you, Innocent
Solution 1:[1]
Your records is empty. You can verify that by calling records.first().
Calling first on an empty RDD raises an error, but collect does not. For example,
records = sc.parallelize([])
records.map(lambda x: x).collect()
# []
records.map(lambda x: x).first()
# ValueError: RDD is empty
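The same distinction can be seen in plain Python, without a Spark cluster (this analogy is mine, not part of the original answer): slicing an empty list returns an empty list, the way collect() and take(1) return empty results, while indexing element 0 would raise.

```python
records = []  # stands in for an empty RDD

# collect()-style: mapping over an empty collection yields an empty list
mapped = [x for x in records]   # analogous to records.map(...).collect()

# first()-style: records[0] on an empty list would raise IndexError,
# so guard with a take(1)-style slice before indexing
head = records[:1]              # analogous to records.take(1) -> []
first_point = head[0] if head else None
```

With records = [], this leaves mapped == [], head == [], and first_point as None; with any non-empty list, first_point is its first element.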
Solution 2:[2]
I faced this issue as well with the first() action. I checked and found that the RDD was empty, which is why the error was raised. Make sure the RDD has at least one record to process before calling first().
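One way to make sure there is at least one record is to apply the same guard that rdd.py itself uses in the traceback above (rs = self.take(1); if rs: return rs[0]). Here is a minimal pure-Python sketch of that pattern; the take-style callables are hypothetical stand-ins for an RDD, not the Spark API:

```python
def first_or_none(take):
    """Return the first record, or None if the source is empty.

    `take` is any callable mimicking RDD.take(n): it returns a list of
    at most n elements. This mirrors how RDD.first() is implemented as
    take(1) plus an emptiness check, but returns None instead of raising.
    """
    rs = take(1)
    return rs[0] if rs else None

# Empty source, like the question's `records`
empty = lambda n: [][:n]
print(first_or_none(empty))   # None, instead of ValueError: RDD is empty

# Non-empty source
data = lambda n: [("label", [1.0, 2.0])][:n]
print(first_or_none(data))    # ('label', [1.0, 2.0])
```

The design choice is simply to turn the hard failure (ValueError) into a sentinel the caller can test for before continuing the pipeline.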
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | shuaiyuancn
Solution 2 |