Attempting to Cache S3 Files

I have two pipelines. The first reads files from S3, does some processing, and updates the files. The second runs multiple jobs, and for each job I download files from S3 and produce some output. I feel I am wasting a lot of time in the second pipeline on repeated downloads, because I currently do not cache the files even though multiple jobs use them. So I am attempting to cache the S3 files locally.

I did some research and found that s3fs or fsspec can be used. So far I am able to download and open a file from S3 using s3fs, but I am not sure how to cache it locally.

import s3fs
import pandas as pd

FS = s3fs.S3FileSystem()

# Open the object straight from S3; nothing is cached locally here
file = FS.open('s3://my-datasets/something/foo.csv')
# of = fsspec.open("filecache::s3://bucket/key", s3={'anon': True}, filecache={'cache_storage': '/tmp/files'})
df = pd.read_csv(file, sep='|', header=None)
print(df)

As you can see in the code above, I am opening a file from S3 and reading it into a DataFrame. I am wondering whether there is an argument or option I can pass so that the file gets cached.

The alternative approach, of course, is to check whether the file already exists at some local path, use it if it does, and download it if it does not, but I feel there must be a better way to do caching. I am open to any and all suggestions.
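For concreteness, the check-then-download approach I have in mind would look roughly like this (the cache directory and helper name are just placeholders I made up):

import os
import s3fs
import pandas as pd

CACHE_DIR = '/tmp/s3-cache'   # placeholder local cache directory
FS = s3fs.S3FileSystem()

def cached_path(s3_path):
    # Map the S3 path to a local file name and download only if it is not already cached
    local_path = os.path.join(CACHE_DIR, s3_path.replace('s3://', '').replace('/', '_'))
    os.makedirs(CACHE_DIR, exist_ok=True)
    if not os.path.exists(local_path):
        FS.get(s3_path, local_path)
    return local_path

df = pd.read_csv(cached_path('s3://my-datasets/something/foo.csv'), sep='|', header=None)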



Solution 1:[1]

Amazon S3 is an object storage service that can be accessed via authenticated API requests.

Tools such as s3fs present Amazon S3 as a filesystem, but they have to translate that usage into normal S3 API calls. When there are many updates on either side (in S3 itself or on the local s3fs virtual disk), it can take some time for the other side to catch up, and in high-usage scenarios the two can fall out of sync.

The fact that s3fs keeps a cache of files means the cached copies can go out of sync even more quickly, depending on how often it goes back and checks whether the content in S3 has changed.

It is basically adding another layer of complexity between your app and S3. If you can go direct, it will always be more reliable, but it means you might need to implement some of the useful functionality (such as local caching) yourself.

If you're going to use it in production situations, I would recommend creating a testing platform that simulates an appropriate level of usage to confirm that all systems work as expected.
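As a rough illustration of the "go direct" option, the same check-then-download pattern with plain S3 API calls via boto3 could look like the sketch below; the helper name and cache directory are illustrative, not part of the original answer:

import os
import boto3

s3 = boto3.client('s3')

def fetch(bucket, key, local_dir='/tmp/s3-cache'):
    # Download the object with a direct S3 API call unless a local copy already exists
    local_path = os.path.join(local_dir, key.replace('/', '_'))
    os.makedirs(local_dir, exist_ok=True)
    if not os.path.exists(local_path):
        s3.download_file(bucket, key, local_path)
    return local_path

path = fetch('my-datasets', 'something/foo.csv')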

Solution 2:[2]

You can cache S3 files locally with s3fs and fsspec: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally

Both examples in the docs worked fine for me. It seems you actually have the second option as commented-out code in your example. Did that not work for you?
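For reference, a working version of that second option (the URL-chaining form), with your bucket substituted in, would be roughly:

import fsspec
import pandas as pd

# 'filecache::' chains a local whole-file cache in front of the s3 protocol
of = fsspec.open('filecache::s3://my-datasets/something/foo.csv',
                 s3={'anon': False},            # anon=True only makes sense for public buckets
                 filecache={'cache_storage': '/tmp/files'})
with of as file:
    df = pd.read_csv(file, sep='|', header=None)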

In any case, the first example, adapted to your case, would be:

import pandas as pd
import fsspec

# Wrap S3 in a local whole-file cache: the first open downloads the object to
# /tmp/files/ and later opens reuse the local copy
fs = fsspec.filesystem("filecache", target_protocol='s3', cache_storage='/tmp/files/', check_files=True)
with fs.open('s3://my-datasets/something/foo.csv') as file:
    df = pd.read_csv(file, sep='|', header=None)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: John Rotenstein
[2] Solution 2: Darkdragon84