At what point should you force a cache in Spark when performing heavy transformations?
Say you have something like this:
big_table1 = spark.table('db.big_table1').cache()
big_table2 = spark.table('db.big_table2').cache()
big_table3 = spark.table('db.big_table3').cache()
# ... etc
And from these tables, you make a number of dfs...
output1 = (
    # transformations here: filtering/joining etc the big tables
)
output2 = (
    # transformations here: filtering/joining etc the big tables
)
# ... etc
Then you want to combine all the outputs:
final_output = (
    output1
    .union(output2)
    # ... etc
)
Then you want to save the results to a table:
(final_output
 .write
 .saveAsTable('db.final_output'))
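(To make the skeleton concrete: a minimal runnable version might look like the sketch below. The column names, filter conditions, and join key are made up purely for illustration; they are not my real logic.)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# cache() is lazy: nothing is materialised yet
big_table1 = spark.table('db.big_table1').cache()
big_table2 = spark.table('db.big_table2').cache()

# Placeholder transformations standing in for the real ones
output1 = (
    big_table1
    .filter(F.col('status') == 'active')    # hypothetical filter
    .join(big_table2, 'id')                 # hypothetical join key
)
output2 = (
    big_table1
    .filter(F.col('status') == 'inactive')
    .join(big_table2, 'id')
)

final_output = output1.union(output2)

(final_output
 .write
 .saveAsTable('db.final_output'))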
As I understand things, caching is lazy, so we need to use an action to force the cache. But at what point in the process above is it best to do that?
Would you do...
final_output.count()
...just before you write to the table?
In that case, Spark would have to go through the whole series of transformations, then union them, then return the count. So would it go: "Ah, you asked me to cache the big_tables. I'll do that first, then I'll use the stuff in memory to help me do all these hairy transformations and create your output."
Or would it go: "Ah, you asked me to cache these big_tables. I'll do these big transformations, get the count, and then I'll put all this stuff in memory in case you ask me again."
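(As an aside, the way I was planning to check which of those two it actually does, assuming explain() and the Spark UI report caching the way I think they do, is a sketch like this:)

final_output.count()            # the action that should populate the caches

print(big_table1.storageLevel)  # the storage level currently set on the cached table
final_output.explain()          # if Spark reads from the cache, the physical plan
                                # shows InMemoryTableScan / InMemoryRelation nodes

# The "Storage" tab in the Spark UI shows how much of each cached
# DataFrame has actually been materialised in memory.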
In other words, would it be better to do...
output1.count()
output2.count()
# ... etc
...or even...
big_table1.count()
big_table2.count()
# ...etc
... upstream, to ensure that everything is cached ahead of time?
Or does it not matter where you force the cache, as long as it happens before you write to the table?
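(To be concrete about that "upstream" option: I mean something like this sketch, where, as I understand it, each count forces a full scan and so materialises the cache before any of the outputs are built:)

big_table1 = spark.table('db.big_table1').cache()
big_table1.count()   # full scan: materialises the cache for big_table1

big_table2 = spark.table('db.big_table2').cache()
big_table2.count()   # same for big_table2

# ... then build output1, output2, etc. from the (now cached) tables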
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow