Why does persist(StorageLevel.MEMORY_AND_DISK) give different results than cache() with HBase?

This question may sound naive, but it is a problem I recently faced in my project and I need a better understanding of it.

df.persist(StorageLevel.MEMORY_AND_DISK)

Whenever we use persist like this on an HBase read, the same data is returned again and again for the subsequent batches of the streaming job, even though the HBase table is updated on every batch run.
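
Simplified, the streaming loop is doing something like this (a sketch only; dstream stands in for the actual DStream, which is not shown here):

dstream.foreachRDD { _ =>
  // df was persisted once, before the loop started, so every batch
  // reads the cached HBase snapshot instead of re-scanning the table
  df.count() // same stale result on every batch, although HBase changed
}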

HBase Read Code:

val df = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> schema))
  .format(dbSetup.dbClass)
  .load()
  .persist(StorageLevel.MEMORY_AND_DISK)

I replaced persist(StorageLevel.MEMORY_AND_DISK) with cache(), and it returned updated records from the HBase table as expected.
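
For reference, the working variant differed only in the final call (same read as above):

val df = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> schema))
  .format(dbSetup.dbClass)
  .load()
  .cache()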

The reason we tried persist(StorageLevel.MEMORY_AND_DISK) is to ensure that in-memory storage does not fill up and that we do not end up redoing all the transformations during the execution of a particular stream.

Spark version: 1.6.3
HBase version: 1.1.2.2.6.4.42-1

Could someone explain this to me and help me get a better understanding?



Solution 1:[1]

As you mentioned, you are looking for the reason "why", so I'm answering; otherwise this question would remain unanswered, since these days there is no rational reason to run Spark 1.6.3 just to see what happens with that specific HBase version.

Internally, Spark calls persist() when you use cache(), and the default storage level differs between RDDs and Datasets (or DataFrames): on RDDs it uses MEMORY_ONLY, and on Datasets, MEMORY_AND_DISK. I can't see your full code, but generally speaking you should not face any difference between the two ways of caching; your issue is most likely a version incompatibility between those components, or simply a bug that was never fixed by Apache.
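
To make that concrete, here is a minimal sketch against the Spark 1.6-era API (it reuses schema and dbSetup.dbClass from your question; the HBaseTableCatalog import assumes the shc connector you appear to be using):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// On a DataFrame the two calls request the same storage level:
// cache() simply delegates to persist(), whose default level for
// Datasets/DataFrames is MEMORY_AND_DISK. Use one or the other.
df.cache()
// df.persist(StorageLevel.MEMORY_AND_DISK) // equivalent on a DataFrame

// The first action materializes the HBase scan; later actions reuse the
// cached blocks. To pick up new HBase rows in the next streaming batch,
// drop the cache and re-read the table:
df.unpersist()
val freshDf = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> schema))
  .format(dbSetup.dbClass)
  .load()
  .persist(StorageLevel.MEMORY_AND_DISK)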

There are several places to check to see what's wrong:

The release notes at https://spark.apache.org/releases/spark-release-1-6-3.html show that maintenance of the code is happening in branch 1.6, so that is the place to find the relevant code: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/CacheManager.scala

Hope it helped.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0. Source: Stack Overflow.

[1] Solution 1: Aramis NSR