Is there a difference between PySpark and SparkSQL? If so, what's the difference?
Long story short, I'm tasked with converting files from SparkSQL to PySpark as my first task at my new job.
However, I'm unable to see many differences beyond syntax. Is SparkSQL an earlier version of PySpark, a component of it, or something different altogether?
And yes, it's my first time using these tools. But I have experience with both Python & SQL, so it doesn't seem to be that difficult a task. I just want a better understanding.
Example of the syntax difference I'm referring to:
from pyspark.sql import functions as F

df = (
    spark.read.table("db.table1").alias("a")
    .filter(F.col("a.field1") == 11)
    .join(
        other=spark.read.table("db.table2").alias("b"),
        on="field2",
        how="left",
    )
)
Versus
df = spark.sql(
    """
    SELECT b.field1,
           CASE WHEN ...
                THEN ...
                ELSE ...
           END AS field2
    FROM db.table1 a
    LEFT JOIN db.table2 b
      ON a.field1 = b.field1
    WHERE a.field1 = {}
    """.format(field1)
)
Solution 1: [1]
From the documentation: PySpark is the Python API for Apache Spark, through which you have access to Spark's components: Spark Core, Spark SQL, Spark Streaming, and Spark MLlib. So SparkSQL is not an earlier version of PySpark; it is one of Spark's modules, and you call it from PySpark through spark.sql().
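As a quick sanity check (a minimal sketch, assuming the db.table1 from your example exists), you can see that both APIs run through the same Catalyst optimizer by comparing the plans that explain() prints:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# The same filter expressed through spark.sql() and through the DataFrame API
sql_df = spark.sql("SELECT * FROM db.table1 WHERE field1 = 11")
api_df = spark.read.table("db.table1").filter(F.col("field1") == 11)

# Both compile to the same optimized plan, so the printed plans should match
sql_df.explain()
api_df.explain()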
Coming to the task you have been assigned, it looks like you've been tasked with translating SQL-heavy code into the DataFrame API, a more PySpark-friendly format; a sketch of that translation follows.
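Here is a minimal sketch of how the SQL example above could map onto the DataFrame API. The when()/otherwise() values are hypothetical stand-ins for the CASE branches elided in your query, and field1 is assumed to hold the value you interpolate with .format():

from pyspark.sql import functions as F

df = (
    spark.read.table("db.table1").alias("a")
    .filter(F.col("a.field1") == field1)
    .join(
        spark.read.table("db.table2").alias("b"),
        on=F.col("a.field1") == F.col("b.field1"),
        how="left",
    )
    .select(
        F.col("b.field1"),
        F.when(F.col("a.field1") == 11, "x")  # hypothetical CASE WHEN ... THEN ...
         .otherwise("y")                      # hypothetical ELSE ...
         .alias("field2"),
    )
)

One practical difference: passing a Column expression to on avoids the string interpolation the SQL version needs, which also sidesteps formatting and quoting bugs.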
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | fuzzy-memory |