Does PySpark cache a DataFrame by default?

If I read a file in PySpark:

data = spark.read.csv("file.csv")

Then for the life of the Spark session, data is available in memory, correct? So if I call data.show() five times, it will not read from disk five times. Is that correct? If yes, why do I need:

data.cache()


Solution 1 [1]

If I read a file in PySpark: data = spark.read.csv("file.csv"). Then for the life of the Spark session, data is available in memory, correct?

No. Because of Spark's lazy evaluation, spark.read.csv only builds a logical plan; nothing is read from disk until the first action, which in your case is the first call to show().
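
To illustrate, here is a minimal sketch, assuming a local SparkSession; "file.csv" is the placeholder path from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# This only builds a logical plan; the file is not scanned yet
# (aside from lightweight metadata such as the header line).
data = spark.read.csv("file.csv", header=True)

# The first action triggers the actual read and computation.
data.show()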

So if I call data.show() five times, it will not read from disk five times. Is that correct?

No. The DataFrame will be re-evaluated, i.e. read from disk again, on each call to show(). Caching the DataFrame prevents that re-evaluation: once the cache is materialized, subsequent actions read from the cached data instead.
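
Continuing the sketch above, caching would look like this. Note that cache() is itself lazy; the cache is only populated by the first action that runs after it:

data.cache()        # marks the DataFrame for caching; nothing is materialized yet
data.count()        # first action after cache() reads the file and fills the cache
data.show()         # now served from the cached copy, not re-read from disk
data.show()         # same: served from cache
data.unpersist()    # release the cached blocks when finished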

Sources

[1] Stack Overflow. This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
