Extracting Excel files from an FTP server to BigQuery using Cloud Functions
I am working on an automated script that downloads files from an FTP server and stores them in BigQuery.
The problem is that BigQuery only accepts .csv files. For this reason I am looking for a way to also handle .xlsx and .xls files, given that I plan to run this batch code in the cloud.
I mention the latter because one way to convert .xlsx files to .csv is to use something like:
import pandas as pd
data_xls = pd.read_excel('file_on_ftp.xlsx')
data_xls.to_csv('csvfile.csv', encoding='utf-8', index=False)
However, this creates local files somewhere in temporary storage. With Cloud Functions I would then have to monitor whether the file was actually deleted afterwards, which is not reliable if a function instance crashes.
Are there better ways to handle loading .xlsx files into BigQuery, or is this the way to go?
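For completeness, one variant I have considered keeps the conversion entirely in memory, since pandas can read from a bytes buffer and to_csv() without a path argument returns the CSV as a string. A rough sketch of that idea (the helper name is mine, and it assumes openpyxl is installed for .xlsx support):
import io

import pandas as pd

def xlsx_bytes_to_csv_string(xlsx_bytes):
    # Read raw .xlsx bytes (e.g. downloaded from FTP) without touching the local disk.
    data_xls = pd.read_excel(io.BytesIO(xlsx_bytes))
    # With no path argument, to_csv() returns the CSV contents as a string.
    return data_xls.to_csv(index=False)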
Solution 1:[1]
You might be interested in this guide, which was just recently published: "Streaming data from Cloud Storage into BigQuery using Cloud Functions".
The general architecture would be:
- upload the files in question from FTP to Cloud Storage
- your Cloud Function receives the upload event from Cloud Storage
- your Cloud Function loads the file into memory (no storage on disk)
- your Cloud Function streams the data to BigQuery
I'm not sure if #1 suits your need to ensure that the file is not left behind somewhere, but I think it's probably the best possible solution if you are operating under the assumption that the file needs to be uploaded to GCP somewhere (the alternative would be streaming to BigQuery directly from your local machine, or an instance that is under your control).
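A rough sketch of that flow, assuming a background (1st-gen) Cloud Function triggered by the object-finalize event, an existing destination table with a matching schema, and openpyxl available for pandas; the table and function names below are placeholders rather than anything from the guide:
import io

import pandas as pd
from google.cloud import bigquery, storage

BQ_TABLE = 'my-project.my_dataset.my_table'  # placeholder, replace with your table

def gcs_to_bigquery(event, context):
    # Triggered by a google.storage.object.finalize event on the bucket.
    bucket_name = event['bucket']
    object_name = event['name']

    # Load the uploaded workbook into memory only -- nothing is written to disk.
    blob = storage.Client().bucket(bucket_name).blob(object_name)
    data_xls = pd.read_excel(io.BytesIO(blob.download_as_bytes()))

    # Stream the rows into BigQuery; values may need extra cleaning (NaN, timestamps)
    # before they are JSON-serialisable -- this sketch skips that step.
    bq_client = bigquery.Client()
    errors = bq_client.insert_rows_json(BQ_TABLE, data_xls.to_dict(orient='records'))
    if errors:
        raise RuntimeError('BigQuery insert errors: {}'.format(errors))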
Solution 2:[2]
A good method is to extract the file from FTP to GCS through a Cloud Function. As @Dustin mentioned, you can then stream the data from GCS into BigQuery (a sketch of that step follows the function below).
Here is a way to extract files from FTP via Cloud Functions:
# import libraries
from google.cloud import storage
import wget

def importFile(request):
    # set up the storage client
    client = storage.Client()

    # get the target bucket (name without the gs:// prefix)
    bucket = client.get_bucket('BUCKET-NAME')
    blob = bucket.blob('file-name.csv')

    # only copy the file if it does not already exist in Cloud Storage
    if not blob.exists():
        try:
            # for non-public FTP files, embed the credentials in the URL
            link = 'ftp://account:[email protected]/folder/file.csv'
            # save the downloaded file in the /tmp folder of Cloud Functions
            ftpfile = wget.download(link, out='/tmp/destination-file-name.csv')
            blob.upload_from_filename(ftpfile)
            print('Copied file to Google Storage!')
        except BaseException as error:
            # print the error if the download or upload fails
            print('An exception occurred: {}'.format(error))
    else:
        # the file already exists in Google Storage
        print('File already exists in Google Storage')
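One hedged sketch of the follow-on step mentioned above (a batch load job rather than true streaming; the table and bucket names are placeholders):
from google.cloud import bigquery

def load_csv_from_gcs(table_id, gcs_uri):
    # Load a CSV object that already sits in Cloud Storage into a BigQuery table.
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the header row
        autodetect=True,       # let BigQuery infer the schema from the file
    )
    load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
    load_job.result()  # wait for the load job to complete

# Example with placeholder names:
# load_csv_from_gcs('my-project.my_dataset.my_table', 'gs://BUCKET-NAME/file-name.csv')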
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Gidi9 |