Seeing an error that says: 'numpy.ndarray' object has no attribute 'map'
I am selecting a subset of data from a larger dataframe.
dataset = df.select('RatingScore',
                    'CategoryScore',
                    'CouponBin',
                    'TTM',
                    'Price',
                    'Spread',
                    'Coupon',
                    'WAM',
                    'DV')
dataset = dataset.fillna(0)
dataset.show(5,True)
dataset.printSchema()
Now, I feed that into my KMeans model:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
import numpy as np
data_array=np.array(dataset)
#data_array = np.array(dataset.select('RatingScore', 'CategoryScore', 'CouponBin', 'TTM', 'Price', 'Spread', 'Coupon', 'WAM', 'DV').collect())
# Build the model (cluster the data)
clusters = KMeans.train(data_array, 2, maxIterations=10, initializationMode="random")
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = data_array.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
This line: clusters = KMeans.train(data_array, 2, maxIterations=10, initializationMode="random")
Throws this error: AttributeError: 'numpy.ndarray' object has no attribute 'map'
From the code, you can see that I tried to create the array two different ways. Neither worked. If I try to feed in the items straight from the subset dataframe, I get this error:
AttributeError: 'DataFrame' object has no attribute 'map'
What am I missing here?
Solution 1:[1]
I think there are two ways:
- convert the pandas.DataFrame into a spark_df.rdd, as suggested in other similar situations
- convert the pandas.DataFrame into multiple pandas.Series, according to its official doc
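A minimal sketch of the first option, assuming `dataset` is the Spark DataFrame built in the question and that every selected column is numeric. `KMeans.train` from `pyspark.mllib` expects an RDD and calls `.map` on whatever it is given, which would explain why a NumPy array or a DataFrame fails on that line. The helper names `row_to_point`, `distance`, and `cluster` are hypothetical, chosen only for illustration.

```python
from math import sqrt


def row_to_point(row):
    """Turn one Row (or any iterable record) into a list of floats."""
    return [float(value) for value in row]


def distance(point, center):
    """Euclidean distance between a point and a cluster center."""
    return sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))


def cluster(dataset, k=2):
    """Train KMeans on a Spark DataFrame; return (model, summed distances).

    Assumes `dataset` holds only numeric, non-null columns
    (the question already calls fillna(0)).
    """
    # Imported lazily so the helpers above can be used without Spark installed.
    from pyspark.mllib.clustering import KMeans

    # The DataFrame's underlying RDD is what KMeans.train can work with.
    points = dataset.rdd.map(row_to_point).cache()

    model = KMeans.train(points, k, maxIterations=10,
                         initializationMode="random")

    # Same quantity the question computes: for each point, the distance
    # to its assigned center, summed over all points.
    centers = model.clusterCenters
    wssse = points.map(
        lambda p: distance(p, centers[model.predict(p)])
    ).sum()
    return model, wssse
```

With the DataFrame from the question this would be called as `model, wssse = cluster(dataset)`. The key change is that both the training and the evaluation step operate on `dataset.rdd` rather than on `np.array(dataset)`.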
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Hanchen |