Why is SparkR dropna not giving me the desired output?
I have applied the following code to the airquality dataset available in R, which has some missing values. I want to omit the rows that have NAs:
library(SparkR)
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
sc <- sparkR.init("local",sparkHome = "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
sqlContext <- sparkRSQL.init(sc)
path<-"/Users/devesh/work/airquality/"
aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv", header="true", inferSchema="true")
head(dropna(aq,how="any"))
  Ozone Solar_R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
The NAs still exist in the output. Am I missing something here?
Solution 1:[1]
Missing values in native R are represented with the logical constant NA. SparkR DataFrames represent missing values with NULL. If you use createDataFrame() to turn a local R data.frame into a distributed SparkR DataFrame, SparkR automatically converts NA to NULL. However, if you create a SparkR DataFrame by reading data from a file with read.df(), the file may contain the literal string "NA" rather than a true missing-value representation. The string "NA" is not automatically converted to NULL, so dropna() will not treat it as a missing value.
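The distinction can be seen in plain R, before Spark is involved at all: the string "NA" is ordinary character data, while only a coerced or logical NA registers as missing.

```r
# A character vector as it might arrive from a CSV file:
x <- c("41", "NA", "12")
is.na(x)            # FALSE FALSE FALSE -- the string "NA" is not missing

# Coercing to numeric turns the unparseable "NA" into a true missing value
# (R emits a "NAs introduced by coercion" warning):
y <- suppressWarnings(as.numeric(x))
is.na(y)            # FALSE TRUE FALSE
```

This is exactly why dropna() does nothing on a DataFrame whose columns hold "NA" as text.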
If you have "NA" strings in your CSV, you can filter them out rather than using dropna():
filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
head(filtered_aq)
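Alternatively, since these columns are meant to be numeric anyway, a sketch of another option (assuming the "NA" strings caused the columns to be inferred as strings) is to cast them with SparkR's cast(); during the cast, Spark converts the unparseable string "NA" to NULL, after which dropna() behaves as expected:

```r
# Cast the affected columns to integer; "NA" strings become NULL
# (assumption based on Spark's cast semantics for invalid input).
aq$Ozone <- cast(aq$Ozone, "integer")
aq$Solar_R <- cast(aq$Solar_R, "integer")
head(dropna(aq, how = "any"))
```

This avoids hand-writing a filter condition for every column that may contain "NA".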
Solution 2:[2]
Here is a different example for reference, removing NA values from a DataFrame built directly from a local data.frame (where NA is converted to NULL automatically):
>data_local <- data.frame(Id=1:4, Age=c(40, 52, 25, NA))
>data <- createDataFrame(sqlContext, data_local)
>head(data)
Id Age
1 1 40
2 2 52
3 3 25
4 4 NA
>head(dropna(data,how="any"))
Id Age
1 1 40
2 2 52
3 3 25
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Community
Solution 2 | Arun Gunalan