How to load data into a Jupyter Notebook VM from Google Cloud?

I am trying to load a bunch of .csv files stored in my Google Cloud Storage bucket into my Jupyter Notebook. I am using Python 3, and gsutil does not work.

Let's assume I have 6 .csv files in '\bucket1\1'. Does anybody know what I should do?



Solution 1:[1]

You are running a Jupyter Notebook on a Google Cloud VM instance, and you want to load 6 .csv files that you currently have in Cloud Storage into it.

Install the dependencies:

pip install google-cloud-storage
pip install pandas

Run the following script on your Notebook:

from google.cloud import storage
import pandas as pd

bucket_name = "my-bucket-name"

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

# When your files are in a subfolder of the bucket.
my_prefix = "csv/"  # the name of the subfolder
blobs = bucket.list_blobs(prefix=my_prefix, delimiter='/')

for blob in blobs:
    if blob.name != my_prefix:  # ignore the subfolder placeholder itself
        file_name = blob.name.replace(my_prefix, "")
        blob.download_to_filename(file_name)  # download the file to the machine
        df = pd.read_csv(file_name)  # load the data
        print(df)

# When your files are at the first level of the bucket.

blobs = bucket.list_blobs()

for blob in blobs:
    file_name = blob.name
    blob.download_to_filename(file_name) # download the file to the machine
    df = pd.read_csv(file_name) # load the data
    print(df)

Notes:

  • Pandas is the standard Python library for data analysis, so it will make it easier for you to work with the csv files.

  • The code contains 2 alternatives: one if you have the objects inside a subfolder and the other if you have the objects at the first level of the bucket; use the one that applies to your case.

  • The code cycles through all the objects, so you might get errors if the bucket also contains objects that are not CSV files.

  • In case you already have the files on the machine where you are running the Notebook, you can skip the Google Cloud Storage lines and just pass each file's absolute or relative path to the "read_csv" method.

  • For more information, see the Cloud Storage documentation on listing objects and on downloading objects.
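As the note above warns, the loop downloads every object it sees. One simple guard is to filter the object names by extension before downloading. A minimal sketch, using plain strings to stand in for the `blob.name` values that `bucket.list_blobs()` would return (the helper name `pick_csv_names` is illustrative, not part of the client library):

```python
def pick_csv_names(names, prefix=""):
    """Return object names under `prefix` that end in .csv,
    skipping the 'folder' placeholder object itself."""
    return [
        n for n in names
        if n != prefix and n.startswith(prefix) and n.endswith(".csv")
    ]

# Example listing as it might come back from bucket.list_blobs()
names = ["csv/", "csv/a.csv", "csv/b.csv", "csv/readme.txt"]
print(pick_csv_names(names, prefix="csv/"))  # ['csv/a.csv', 'csv/b.csv']
```

In the real loop you would apply the same check to `blob.name` and only call `download_to_filename` when it passes.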

Solution 2:[2]

Another way of loading files from your bucket directly into your Jupyter Notebook works like this:

from google.cloud import storage
import pandas as pd
df = pd.read_csv('gs://name-of-your-bucket/path-to-your-file/name-of-your-file.csv', sep=",")

So here you also make use of the pandas library, specifying the path to your file in Google Cloud Storage directly via "gs://...". Note that reading "gs://" paths with pandas requires the gcsfs package to be installed (pip install gcsfs).
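Since the question mentions 6 files, you may want them all in one DataFrame. Reading each path and concatenating works the same for local paths and "gs://" URLs; a sketch with two small local files standing in for the bucket objects (file names and column name are made up for the demo):

```python
import os
import tempfile

import pandas as pd

# Create two tiny CSV files to stand in for objects in the bucket.
tmp = tempfile.mkdtemp()
for i in range(2):
    pd.DataFrame({"x": [i, i + 1]}).to_csv(
        os.path.join(tmp, f"part{i}.csv"), index=False
    )

# In the real case these would be "gs://bucket1/1/....csv" paths.
paths = sorted(os.path.join(tmp, f"part{i}.csv") for i in range(2))
combined = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
print(combined.shape)  # (4, 1)
```

ignore_index=True renumbers the rows so the combined DataFrame has a clean 0..n-1 index instead of repeating each file's own index.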

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution | Source
Solution 1 |
Solution 2 | lschmidt90