Spark: calculate standard deviation row-wise
I need to calculate the standard deviation row-wise, assuming that I already have a column with the calculated mean per row. I tried this
SD= (reduce(sqrt((add, (abs(col(x)-col("mean"))**2 for x in df.columns[3:])) / n))).alias("SD")
dfS = df.withColumn("SD",SD)
dfS.select("stddev").show()
but I got the following error
AttributeError: 'builtin_function_or_method' object has no attribute '_get_object_id'
Solution 1:[1]
Your code is completely mixed up (in its current state it won't even raise the exception you described in the question). sqrt should be placed outside the reduce call:
from pyspark.sql.functions import col, sqrt
from operator import add
from functools import reduce
df = spark.createDataFrame([("_", "_", 2, 1, 2, 3)], ("_1", "_2", "mean"))
cols = df.columns[3:]
sd = sqrt(
reduce(add, ((col(x) - col("mean")) ** 2 for x in cols)) / (len(cols) - 1)
)
sd
# Column<b'SQRT((((POWER((_4 - mean), 2) + POWER((_5 - mean), 2)) + POWER((_6 - mean), 2)) / 2))'>
df.withColumn("sd", sd).show()
# +---+---+----+---+---+---+---+
# | _1| _2|mean| _4| _5| _6| sd|
# +---+---+----+---+---+---+---+
# | _| _| 2| 1| 2| 3|1.0|
# +---+---+----+---+---+---+---+
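Note that Solution 1 divides by len(cols) - 1, i.e. it computes the sample standard deviation. As a quick sanity check (not part of the original answer), the same row evaluated with Python's standard-library statistics module, which also divides by n - 1, reproduces the 1.0 in the output above:

```python
import statistics

# The row values from the example; statistics.stdev uses the same
# n - 1 (sample) convention as the Spark expression above.
row = [1, 2, 3]
print(statistics.stdev(row))  # 1.0
```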
Solution 2:[2]
You can use a simple udf with NumPy's built-in std function. Make sure to return a Python float rather than a NumPy data type, as the latter will cause an error.
from pyspark.sql.functions import udf
import numpy as np
from pyspark.sql.types import DoubleType
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
data = [(200, 500),
        (300, 450),
        (0, 450)]
columns = ["num1","num2"]
df= spark.createDataFrame(data=data, schema = columns)
def std_func(x, y):
std_val = np.std([x, y])
return float(std_val)
std_func_udf = udf(std_func, DoubleType())
df.withColumn('std', std_func_udf('num1', 'num2')).show()
# +----+----+-----+
# |num1|num2| std|
# +----+----+-----+
# | 200| 500|150.0|
# | 300| 450| 75.0|
# | 0| 450|225.0|
# +----+----+-----+
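One caveat worth knowing (my note, not part of the original answer): np.std defaults to ddof=0, the population standard deviation, whereas Solution 1 divides by n - 1. That is why the UDF yields 150.0 for (200, 500); the sample convention would give a larger value. Passing ddof=1 to np.std inside std_func would match Solution 1:

```python
import numpy as np

values = [200, 500]
# Default ddof=0: population standard deviation, as used by the UDF above.
print(np.std(values))           # 150.0
# ddof=1: sample standard deviation, matching Solution 1's n - 1 convention.
print(np.std(values, ddof=1))   # ~212.13
```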
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow

| Solution   | Source |
|------------|--------|
| Solution 1 |        |
| Solution 2 | Gal_M  |