Seeing an error that says: 'numpy.ndarray' object has no attribute 'map'

I am selecting a subset of data from a larger dataframe.

dataset = df.select('RatingScore',
             'CategoryScore',
             'CouponBin',
             'TTM',
             'Price',
             'Spread',
             'Coupon', 
             'WAM', 
             'DV')

dataset = dataset.fillna(0)
dataset.show(5,True)
dataset.printSchema()

Now, I feed that into my KMeans model:

from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
import numpy as np

data_array=np.array(dataset)

#data_array = np.array(dataset.select('RatingScore', 'CategoryScore', 'CouponBin', 'TTM', 'Price', 'Spread', 'Coupon', 'WAM', 'DV').collect())

# Build the model (cluster the data)
clusters = KMeans.train(data_array, 2, maxIterations=10, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = data_array.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

This line: clusters = KMeans.train(data_array, 2, maxIterations=10, initializationMode="random")

Throws this error: AttributeError: 'numpy.ndarray' object has no attribute 'map'

From the code, you can see that I tried to create the array two different ways. Neither worked. If I try to feed in the items straight from the subset DataFrame, I get this error:

AttributeError: 'DataFrame' object has no attribute 'map'

What am I missing here?



Solution 1:[1]

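KMeans.train from pyspark.mllib expects an RDD, and the errors show that neither a numpy.ndarray nor the DataFrame has a map method. Note that dataset here is a Spark DataFrame (it comes from df.select and supports show and printSchema), not a pandas.DataFrame. I think there are two ways:

  1. convert the Spark DataFrame into an RDD (for example through its .rdd attribute), as suggested in other similar situations
  2. if the data is actually held in a pandas.DataFrame, work with its columns as pandas.Series, which do have a map method according to the official pandas docs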
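
The first option could look roughly like the sketch below. It assumes dataset is the Spark DataFrame from the question, with numeric columns and nulls already filled. The helper names row_to_point, distance and train_and_score are made up for illustration and are not part of PySpark. The evaluation step keeps the question's approach of summing each point's distance to its assigned center.

```python
from math import sqrt


def row_to_point(row):
    """Convert one row (a Spark Row behaves like a tuple) to a list of floats."""
    return [float(value) for value in row]


def distance(point, center):
    """Euclidean distance between two equal-length numeric sequences."""
    return sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))


def train_and_score(dataset, k=2):
    """Hypothetical helper: `dataset` is assumed to be a Spark DataFrame
    of numeric columns with nulls already filled."""
    # Imported lazily so the two helpers above work without Spark installed.
    from pyspark.mllib.clustering import KMeans

    # DataFrame.rdd is an RDD of Row objects, and an RDD does have .map().
    points = dataset.rdd.map(row_to_point).cache()

    # Train on the RDD rather than on a NumPy array or the DataFrame itself.
    clusters = KMeans.train(points, k, maxIterations=10,
                            initializationMode="random")

    # Distance from a point to the center of the cluster it is assigned to.
    def error(point):
        center = clusters.centers[clusters.predict(point)]
        return distance(point, center)

    # Sum of those distances over every point in the RDD.
    total_error = points.map(error).reduce(lambda x, y: x + y)
    return clusters, total_error
```

With this in place, the call would be something like clusters, total_error = train_and_score(dataset), and the map/reduce step runs on the RDD instead of on data_array.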

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Hanchen