How to make sure my DataFrame frees its memory?
I have a Spark/Scala job in which I do this:

1. Compute a big DataFrame `df1` and cache it into memory
2. Use `df1` to compute `dfA`
3. Read raw data into `df2` (again, it's big) and cache it

When performing (3), I no longer need `df1`, and I want to make sure its space gets freed. I cached it at (1) because this DataFrame gets used in (2), and caching is the only way to make sure I compute it only once instead of recomputing it each time.

I need to free its space and make sure it gets freed. What are my options?
I thought of these, but they don't seem to be sufficient:
df=null
df.unpersist()
Can you document your answer with a proper Spark documentation link?
Solution 1:[1]
`df.unpersist` should be sufficient, but it won't necessarily free the memory right away. It merely marks the DataFrame for removal.
You can use `df.unpersist(blocking = true)`, which blocks until the DataFrame has actually been removed before continuing.
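As a minimal sketch of the workflow from the question, with an explicit `unpersist(blocking = true)` between steps (2) and (3). The paths and column names (`/data/...`, `key`, `value`) are placeholders, not from the original post:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FreeCachedDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("unpersist-example")
      .getOrCreate()

    // 1: Compute a big DataFrame and cache it so step 2 does not recompute it.
    val df1: DataFrame = spark.read.parquet("/data/big_input")   // placeholder path
      .filter("value IS NOT NULL")
      .cache()

    // 2: Use df1 to compute dfA (df1 is served from the cache here).
    val dfA = df1.groupBy("key").count()
    dfA.write.mode("overwrite").parquet("/data/dfA")             // placeholder path

    // df1 is no longer needed: drop its cached blocks before caching df2.
    // blocking = true waits until all blocks are actually removed.
    df1.unpersist(blocking = true)

    // 3: Read more raw data and cache it; df1's memory is available again.
    val df2 = spark.read.parquet("/data/other_big_input").cache() // placeholder path
    println(df2.count())

    spark.stop()
  }
}
```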
Solution 2:[2]
A user of Spark has no way to manually trigger garbage collection.

Assigning `df = null` is not going to release much memory, because a DataFrame does not hold the data; it is just a description of a computation. The cached blocks live in the executors' block managers, not behind the driver-side reference, so they stay there until they are unpersisted or evicted.

If your application has memory issues, have a look at the Garbage Collection Tuning section of the Spark Tuning guide (https://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning). It has suggestions on where to start and what can be changed to improve GC.
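A short sketch of that point, run against a local session purely for inspection. `getPersistentRDDs` is a developer API used here only to count cached RDDs, and the exact counts printed may vary with the Spark version:

```scala
import org.apache.spark.sql.SparkSession

object NullVsUnpersist {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("null-vs-unpersist")
      .master("local[*]")
      .getOrCreate()

    var df = spark.range(0, 1000000).toDF("id").cache()
    df.count() // materialize the cache

    // The cached blocks are tracked by the block managers,
    // independently of this driver-side variable.
    println(s"cached RDDs after cache():    ${spark.sparkContext.getPersistentRDDs.size}")

    val cached = df // keep a handle so we can still unpersist later
    df = null       // drops only the local reference ...
    println(s"cached RDDs after df = null:  ${spark.sparkContext.getPersistentRDDs.size}")

    cached.unpersist(blocking = true) // ... whereas this actually frees the storage
    println(s"cached RDDs after unpersist:  ${spark.sparkContext.getPersistentRDDs.size}")

    spark.stop()
  }
}
```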
Solution 3:[3]
`df.unpersist(blocking = true)` will solve the issue.
For further explanation, see https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | puhlen |
| Solution 2 | |
| Solution 3 | Ayomal Praveen |