Spark DataFrame is untyped vs. DataFrame has schema?
I am a beginner to Spark. While reading about DataFrames, I have often found these two statements:
1) DataFrame is untyped. 2) DataFrame has a schema (like a database table, which has all the information about its attributes: name, type, not null).
Aren't both statements contradictory? First we say the DataFrame is untyped, and at the same time we say the DataFrame has information about all of its columns, i.e. a schema. What am I missing here? If a DataFrame has a schema, then it knows the types of its columns, so how is it untyped?
Solution 1:[1]
DataFrames are dynamically typed, while Datasets and RDDs are statically typed. That means when you define a Dataset or RDD you need to explicitly specify a class that represents the content. This can be useful, because when you go to write transformations on your Dataset, the compiler can check your code for type safety. Take, for example, this Dataset of Pet info. When I use pet.species or pet.name, the compiler knows their types at compile time.
// Assumes an active SparkSession named `spark` (as in spark-shell) for the implicits below
import org.apache.spark.sql.Dataset
import spark.implicits._

case class Pet(name: String, species: String, age: Int, weight: Double)

val data: Dataset[Pet] = Seq(
  Pet("spot", "dog", 2, 50.5),
  Pet("mittens", "cat", 11, 15.5),
  Pet("mickey", "mouse", 1, 1.5)).toDS

println(data.map(x => x.getClass.getSimpleName).first)
// Pet

// The compiler knows pet.species and pet.name are Strings at compile time
val newDataset: Dataset[String] = data.map(pet => s"I have a ${pet.species} named ${pet.name}.")
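To make that compile-time check concrete, here is a minimal sketch (assuming the data Dataset and spark.implicits._ defined above are in scope; the variable name species is just for illustration): a correct field access compiles, while a misspelled field name such as pet.speces is rejected before the job ever runs.

import org.apache.spark.sql.Dataset

// Statically typed: the field access below is verified when the code is compiled.
val species: Dataset[String] = data.map(pet => pet.species)

// A typo would never compile, so it cannot fail at runtime:
//   data.map(pet => pet.speces)
//   error: value speces is not a member of Pet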
When we switch to using a DataFrame, the schema stays the same and the data is still typed (or structured), but this information is only available at runtime. This is called dynamic typing. It prevents the compiler from catching your mistakes, but it can be very useful because it lets you write SQL-like statements and define new columns on the fly, for example appending columns to an existing DataFrame, without needing to define a new class for every little operation. The flip side is that you can define bad operations that result in nulls or, in some cases, runtime errors.
import org.apache.spark.sql.DataFrame

val df: DataFrame = data.toDF
df.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- species: string (nullable = true)
//  |-- age: integer (nullable = false)
//  |-- weight: double (nullable = false)

val newDf: DataFrame = df
  .withColumn("some column", $"age" + $"weight") // numeric + numeric works as expected
  .withColumn("bad column", $"name" + $"age")    // string + numeric silently produces null
newDf.show()
// +-------+-------+---+------+-----------+----------+
// | name|species|age|weight|some column|bad column|
// +-------+-------+---+------+-----------+----------+
// | spot| dog| 2| 50.5| 52.5| null|
// |mittens| cat| 11| 15.5| 26.5| null|
// | mickey| mouse| 1| 1.5| 2.5| null|
// +-------+-------+---+------+-----------+----------+
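And as a small illustration of the "runtime errors" case (again assuming the df defined above is in scope): selecting a column that does not exist compiles without complaint and only fails when Spark resolves the query against the actual schema.

import org.apache.spark.sql.AnalysisException

// The column name is just a string to the compiler, so this compiles fine.
// Spark only discovers the problem when it analyzes the query at runtime.
try {
  df.select($"speces").show() // misspelled column name
} catch {
  case e: AnalysisException => println(s"Runtime failure: ${e.getMessage}")
}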
Solution 2:[2]
Spark checks whether a DataFrame's values align with the given schema at runtime, not at compile time. This is because the elements of a DataFrame are of type Row, and Row cannot be parameterized with a type, so the compiler cannot check the types of its fields. That is why a DataFrame is untyped and not type-safe.
Datasets, on the other hand, check whether types conform to the specification at compile time. That is why Datasets are type-safe.
For more information: https://blog.knoldus.com/spark-type-safety-in-dataset-vs-dataframe/
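A minimal sketch of that Row point, assuming the data Dataset and df DataFrame from Solution 1 are in scope (the variable names rows, names, and typedNames are just for illustration): a DataFrame is only an alias for Dataset[Row], so reading a field goes through a runtime lookup and cast instead of a compiler-checked field access.

import org.apache.spark.sql.{Dataset, Row}

// A DataFrame is just Dataset[Row]; the element type says nothing about the
// columns, so the compiler cannot check them.
val rows: Dataset[Row] = df

// Reading a field from a Row is a runtime lookup plus a cast. A wrong name or
// a wrong type parameter here fails only while the job is running.
val names: Dataset[String] = rows.map(row => row.getAs[String]("name"))

// The typed Dataset reads the same value through a compiler-checked field.
val typedNames: Dataset[String] = data.map(_.name)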
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Bi Rico |
| Solution 2 | Thierry Tchoffo |