Read CSV files from Azure Blob Storage and store them in a DataFrame

I'm trying to read multiple CSV files from Blob Storage using Python.

The code that I'm using is:

from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(connection_str)
container_client = blob_service_client.get_container_client(container)
blobs_list = container_client.list_blobs(folder_root)
for blob in blobs_list:
    # blob.name must be passed unquoted; the string "blob.name" would
    # look up a blob literally named blob.name.
    blob_client = blob_service_client.get_blob_client(container=container, blob=blob.name)
    stream = blob_client.download_blob().content_as_text()

I'm not sure of the correct way to store the CSV files I read into a pandas DataFrame.

I tried to use:

df = df.append(pd.read_csv(StringIO(stream)))

But this shows me an error.

Any idea how I can do this?



Solution 1:[1]

You could download the file from Blob Storage to a local file, then read the data into a pandas DataFrame from that file. (This uses BlockBlobService from the legacy v2 azure-storage SDK.)

from azure.storage.blob import BlockBlobService
import pandas as pd
import time

STORAGEACCOUNTNAME= <storage_account_name>
STORAGEACCOUNTKEY= <storage_account_key>
LOCALFILENAME= <local_file_name>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

#download from blob
t1=time.time()
blob_service=BlockBlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)
blob_service.get_blob_to_path(CONTAINERNAME,BLOBNAME,LOCALFILENAME)
t2=time.time()
print(("It takes %s seconds to download "+BLOBNAME) % (t2 - t1))

# LOCALFILE is the file path
dataframe_blobdata = pd.read_csv(LOCALFILENAME)



If you want to do the conversion directly, without writing a local file, the following code helps: get_blob_to_text returns the blob's content as a string, so no local file name is needed.

from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content
df = pd.read_csv(StringIO(blobstring))
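Coming back to the asker's original goal of combining multiple CSV blobs into one DataFrame: here is a minimal sketch using the current azure-storage-blob SDK. The function names are my own, and since df.append was removed in pandas 2.0, the frames are combined with pd.concat instead.

```python
from io import StringIO

import pandas as pd


def frames_from_texts(texts):
    """Parse each CSV text and stack the results into a single DataFrame."""
    return pd.concat((pd.read_csv(StringIO(t)) for t in texts), ignore_index=True)


def read_csv_blobs(connection_str, container, prefix):
    """Download every CSV under `prefix` and combine them into one DataFrame."""
    # Imported here so frames_from_texts works without the Azure SDK installed.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(connection_str)
    container_client = service.get_container_client(container)
    texts = (
        container_client.download_blob(blob.name).content_as_text()
        for blob in container_client.list_blobs(name_starts_with=prefix)
    )
    return frames_from_texts(texts)
```

ignore_index=True renumbers the rows so indices from the individual files don't collide.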

Solution 2:[2]

import pandas as pd
data = pd.read_csv('blob_sas_url')

The Blob SAS URL can be found in the Azure portal by right-clicking the blob you want to import and selecting Generate SAS. Click the Generate SAS token and URL button, then copy the SAS URL into the code above in place of blob_sas_url.

Solution 3:[3]

You can now read from Blob Storage directly into a pandas DataFrame (this requires the adlfs package, which provides the abfs:// filesystem for fsspec):

mydata = pd.read_csv(
    f"abfs://{blob_path}",
    storage_options={"connection_string": os.environ["STORAGE_CONNECTION"]},
)

where blob_path is the path to your file, given as {container-name}/{blob-prefix.csv}
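The path format above can be assembled with a tiny helper; this is purely illustrative and the function name is my own.

```python
def abfs_path(container, blob_name):
    # Assemble the "abfs://{container-name}/{blob-name}" URL that
    # pd.read_csv expects in the snippet above.
    return f"abfs://{container}/{blob_name}"
```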

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Sahaj Raj Malla
Solution 3 user3778817