How to load data into a Jupyter Notebook VM from Google Cloud?
I am trying to load a bunch of .csv files stored in Google Cloud Storage into my Jupyter Notebook. I use Python 3, and gsutil
does not work.
Let's assume I have 6 .csv files in '\bucket1\1'. Does anybody know what I should do?
Solution 1:[1]
You are running a Jupyter Notebook on a Google Cloud VM instance, and you want to load 6 .csv files that you currently have in Cloud Storage into it.
Install the dependencies:
pip install google-cloud-storage
pip install pandas
Then run the following script in your notebook:
from google.cloud import storage
import pandas as pd

bucket_name = "my-bucket-name"

storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)

# When your files are in a subfolder of the bucket:
my_prefix = "csv/"  # the name of the subfolder
blobs = bucket.list_blobs(prefix=my_prefix, delimiter="/")

for blob in blobs:
    if blob.name != my_prefix:  # ignore the subfolder placeholder itself
        file_name = blob.name.replace(my_prefix, "")
        blob.download_to_filename(file_name)  # download the file to the machine
        df = pd.read_csv(file_name)  # load the data
        print(df)

# When your files are at the top level of the bucket:
blobs = bucket.list_blobs()

for blob in blobs:
    file_name = blob.name
    blob.download_to_filename(file_name)  # download the file to the machine
    df = pd.read_csv(file_name)  # load the data
    print(df)
Notes:
pandas is a widely used library for data analysis in Python, so it will make it easier for you to work with the csv files once they are downloaded.
The code contains two alternatives: one for when your objects are inside a subfolder and one for when they are at the top level of the bucket; use the one that applies to your case.
The code cycles through all the objects it lists, so you may get errors if the bucket also contains objects that are not csv files.
If the files are already on the machine where you are running the notebook, you can skip the Cloud Storage lines and just pass each file's absolute or relative path to the "read_csv" method.
For more information, see the Cloud Storage documentation on listing objects and on downloading objects.
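As the note above warns, the loop breaks if the bucket holds non-CSV objects. A minimal sketch of one way to guard against that: filter the listed object names by extension before downloading. The bucket name, prefix, and the helper name csv_names are placeholders of my own choosing, not part of the original answer.

```python
# Keep only object names that look like CSV files under a given prefix.
def csv_names(blob_names, prefix=""):
    return [
        name for name in blob_names
        if name.startswith(prefix) and name.lower().endswith(".csv")
    ]

# With the real client, you would apply it roughly like this:
# from google.cloud import storage
# import pandas as pd
# bucket = storage.Client().get_bucket("my-bucket-name")
# names = [blob.name for blob in bucket.list_blobs(prefix="csv/")]
# for name in csv_names(names, prefix="csv/"):
#     local = name.rsplit("/", 1)[-1]
#     bucket.blob(name).download_to_filename(local)
#     df = pd.read_csv(local)

# The filter itself needs no cloud access:
print(csv_names(["csv/", "csv/a.csv", "csv/b.CSV", "csv/notes.txt"], prefix="csv/"))
# → ['csv/a.csv', 'csv/b.CSV']
```

This way the subfolder placeholder object and any stray non-CSV objects are skipped in one place.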
Solution 2:[2]
Another way to download files from your bucket directly into your Jupyter Notebook works like this:
from google.cloud import storage
import pandas as pd
df = pd.read_csv('gs://name-of-your-bucket/path-to-your-file/name-of-your-file.csv', sep=",")
Here you also use the pandas library, but you specify the path to your file in Google Cloud Storage directly via the "gs://..." scheme. Note that reading "gs://" paths with pandas requires the gcsfs package to be installed.
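Since the question mentions 6 files, a sketch of extending this approach to several files: build the "gs://" paths and concatenate them with pandas. The bucket and file names below are placeholders; the demonstration uses local stand-in files so the concatenation pattern itself can be run without cloud credentials.

```python
import os
import tempfile

import pandas as pd

# Against Cloud Storage it would look roughly like this (requires gcsfs):
# paths = [f"gs://name-of-your-bucket/path/{i}.csv" for i in range(1, 7)]
# df = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)

# The same pattern, demonstrated with two local stand-in files:
tmp = tempfile.mkdtemp()
for i, rows in enumerate([["a,b", "1,2"], ["a,b", "3,4"]], start=1):
    with open(os.path.join(tmp, f"{i}.csv"), "w") as f:
        f.write("\n".join(rows))

paths = [os.path.join(tmp, f"{i}.csv") for i in (1, 2)]
df = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
print(len(df))          # → 2
print(list(df.columns)) # → ['a', 'b']
```

ignore_index=True renumbers the rows so the combined frame gets a clean 0..n-1 index instead of repeating each file's index.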
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 | lschmidt90