Spark: save simple string to text file

I have a Spark job that needs to store the last time it ran to a text file. This has to work both on HDFS and on the local filesystem (for testing).

However, this does not seem to be as straightforward as I expected.

I have tried deleting the directory and got "can't delete" error messages. I have also tried storing a simple string value in a DataFrame, writing it to Parquet, and reading it back again.

This is all so convoluted that it made me take a step back.

What's the best way to just store a string (timestamp of last execution in my case) to a file by overwriting it?

EDIT:

The nasty way I use it now is as follows:

sqlc.read.parquet(lastExecution).map(t => "" + t(0)).collect()(0)

and

sc.parallelize(List(lastExecution)).repartition(1).toDF().write.mode(SaveMode.Overwrite).save(tsDir)


Solution 1:[1]

This sounds like storing simple application/execution metadata. As such, saving a text file shouldn't need to be done by "Spark" (i.e., it shouldn't be done in distributed Spark jobs, by workers).

The ideal place for you to put it is in your driver code, typically after constructing your RDDs. That being said, you wouldn't be using the Spark API to do this, you'd rather be doing something as trivial as using a writer or a file output stream. The only catch here is how you'll read it back. Assuming that your driver program runs on the same computer, there shouldn't be a problem.
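As a rough illustration of the driver-side approach described above, here is a minimal sketch using only the JDK's `java.nio.file` API. The object name `LastRunFile` and its `save`/`load` methods are hypothetical names chosen for this example, and it assumes the parent directory already exists and that the path is on the driver's local filesystem.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Spark: it runs only in the driver JVM.
object LastRunFile {
  // Overwrite the file with the given value.
  // With no open options, Files.write creates the file or truncates an existing one.
  def save(path: Path, value: String): Unit = {
    Files.write(path, value.getBytes(StandardCharsets.UTF_8))
    ()
  }

  // Read the value back, or None if the file does not exist yet
  // (for example, on the very first run of the job).
  def load(path: Path): Option[String] =
    if (Files.exists(path))
      Some(new String(Files.readAllBytes(path), StandardCharsets.UTF_8).trim)
    else None
}
```

The driver could call `LastRunFile.load(...)` before building its RDDs and `LastRunFile.save(...)` once the job has finished, without involving any executors.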

If this value is to be read by workers in future jobs (which is possibly why you want it in HDFS), and you don't want to use the Hadoop API directly, then you will have to ensure that you have only one partition, so that you don't end up with multiple files containing the same trivial value. This does not carry over to local storage, however: the file is written on whichever machine runs the worker executing the task, and managing that would simply be going overboard.
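For comparison, using the Hadoop API directly from the driver might look like the sketch below. It relies on Hadoop's `FileSystem` abstraction, which resolves the filesystem from the path's scheme, so the same code should cover both `hdfs://` and `file://` paths. The object name `LastRunStore` and its methods are hypothetical, and the sketch assumes the Hadoop client libraries are on the classpath (as they normally are in a Spark application).

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils

// Hypothetical helper: intended to be called from the driver, not from tasks.
object LastRunStore {
  // Write the value to `file`, replacing any existing file.
  def save(conf: Configuration, file: String, value: String): Unit = {
    val path = new Path(file)
    val fs = path.getFileSystem(conf) // picks HDFS or local fs from the URI scheme
    val out = fs.create(path, true)   // second argument: overwrite if present
    try out.write(value.getBytes(StandardCharsets.UTF_8))
    finally out.close()
  }

  // Read the value back, or None if the file does not exist yet.
  def load(conf: Configuration, file: String): Option[String] = {
    val path = new Path(file)
    val fs = path.getFileSystem(conf)
    if (!fs.exists(path)) None
    else {
      val in = fs.open(path)
      try {
        val buf = new ByteArrayOutputStream()
        IOUtils.copyBytes(in, buf, 4096, false) // false: leave closing to the finally block
        Some(new String(buf.toByteArray, StandardCharsets.UTF_8).trim)
      } finally in.close()
    }
  }
}
```

Inside a Spark driver, the configuration would typically come from `sc.hadoopConfiguration`, and the path would be something like `hdfs:///some/dir/last-run.txt` in production or a `file:///` path in tests (both paths here are placeholders).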

My preferred option would be to use the driver program and create the file on the machine running the driver (assuming the same machine will be used next time), or, even better, to put the value in a database. If the value is needed in jobs, the driver can simply pass it through.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 ernest_k