GCP Cloud Function to write data to BigQuery runs with success but data doesn't appear in BigQuery table
I am running the following Cloud Function. It runs successfully and indicates that data was loaded to the table, but when I query the BigQuery table no data has been added. I get no errors and no indication that it isn't working.
```python
from google.cloud import bigquery
import pandas as pd

def download_data(event, context):
    df = pd.read_csv('https://covid.ourworldindata.org/data/ecdc/full_data.csv')
    # Create an empty list
    Row_list = []
    # Iterate over each row
    for index, rows in df.iterrows():
        # Create list for the current row
        my_list = [rows.date, rows.location, rows.new_cases, rows.new_deaths, rows.total_cases, rows.total_deaths]
        #print(my_list)

    # append the list to the final list
    Row_list.append(my_list)

    ## Get BigQuery set up
    client = bigquery.Client()
    table_id = "<project_name>.raw.daily_load"
    table = client.get_table(table_id)

    print(client)
    print(table_id)
    print(table)

    errors = client.insert_rows(table, Row_list)  # Make an API request.
    if errors == []:
        print("New rows have been added.")
```
Attempted so far:

- Check data was being pulled -> PASSED, I printed out Row_list and the data is there
- Run locally from my machine -> PASSED, the data appeared when I ran it from a Python terminal
- Print out the table details -> PASSED, see the attached screenshot; it all appears in the logs
- Confirm it is able to find the table -> PASSED, I changed the name of the table to one that didn't exist and it failed

Not sure what is next; any advice would be greatly appreciated.
Solution 1:[1]
This assumes that the App Engine default service account has the default Editor role assigned and that you have a very simple schema for the BigQuery table, for example:
| Field name | Type | Mode |
|---|---|---|
| date | STRING | NULLABLE |
| location | STRING | NULLABLE |
| new_cases | INTEGER | NULLABLE |
| new_deaths | INTEGER | NULLABLE |
| total_cases | INTEGER | NULLABLE |
| total_deaths | INTEGER | NULLABLE |
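If the table does not exist yet, a minimal sketch along these lines could create it; the project, dataset, and table names below are placeholders you would substitute:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Schema matching the six fields the function inserts.
schema = [
    bigquery.SchemaField("date", "STRING"),
    bigquery.SchemaField("location", "STRING"),
    bigquery.SchemaField("new_cases", "INTEGER"),
    bigquery.SchemaField("new_deaths", "INTEGER"),
    bigquery.SchemaField("total_cases", "INTEGER"),
    bigquery.SchemaField("total_deaths", "INTEGER"),
]

table = bigquery.Table("[PROJECT-ID].[DATASET].[TABLE]", schema=schema)
table = client.create_table(table)  # Raises if the table already exists.
```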
The following modification of your code should work for an HTTP-triggered function. Notice that you were not including Row_list.append(my_list) inside the for loop to populate your list with the elements, and that, according to the samples in the documentation, you should be using a list of tuples:
```python
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
table_id = "[PROJECT-ID].[DATASET].[TABLE]"

def download_data(request):
    df = pd.read_csv('https://covid.ourworldindata.org/data/ecdc/full_data.csv')
    # Create an empty list
    Row_list = []
    # Iterate over each row
    for index, rows in df.iterrows():
        # Create a tuple for the current row
        my_list = (rows.date, rows.location, rows.new_cases, rows.new_deaths, rows.total_cases, rows.total_deaths)
        # Append the tuple to the final list
        Row_list.append(my_list)

    ## Get BigQuery set up
    table = client.get_table(table_id)

    errors = client.insert_rows(table, Row_list)  # Make an API request.
    if errors == []:
        print("New rows have been added.")
```
Pair the function with this very simple requirements.txt file:
```
# Function dependencies, for example:
# package>=version
pandas
google-cloud-bigquery
```
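Since this version expects an HTTP trigger, deployment could look something like the following; the function name matches the entry point above, and the runtime is an assumption:

```
gcloud functions deploy download_data \
  --runtime python39 \
  --trigger-http \
  --allow-unauthenticated
```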
Solution 2:[2]
You can directly stream the data from the website to BigQuery using Cloud Functions, but the data should be clean and conform to BigQuery standards, else the insertion will fail. One more point to note is that the dataframe columns must match the table columns for the data to be successfully inserted. I tested this out and saw insertion errors returned by the client when the column names didn't match.
Writing the function
I have created a simple Cloud Function using the documentation and the pandas example. The dependencies that need to be included are google-cloud-bigquery and pandas.
main.py
```python
from google.cloud import bigquery
import pandas as pd

def hello_gcs(event, context):
    df = pd.read_csv('https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv')
    # Rename the columns if necessary so they match the BigQuery table
    # (set_axis(..., inplace=True) was removed in pandas 2.0).
    df.columns = ["Month", "Year_1", "Year_2", "Year_3"]

    table_id = "project.dataset.airtravel"

    ## Get BigQuery set up
    client = bigquery.Client()
    table = client.get_table(table_id)

    errors = client.insert_rows_from_dataframe(table, df)  # Make an API request.
    # insert_rows_from_dataframe returns one error list per inserted chunk,
    # so success means every chunk's list is empty.
    if all(len(chunk) == 0 for chunk in errors):
        print("Data Loaded")
        return "Success"
    else:
        print(errors)
        return "Failed"
```
requirements.txt
```
# Function dependencies, for example:
# package>=version
google-cloud-bigquery
pandas
```
Now you can directly deploy the function.
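The (event, context) signature marks this as a background function; assuming a Cloud Storage trigger, deployment could look roughly like this, with the bucket name as a placeholder:

```
gcloud functions deploy hello_gcs \
  --runtime python39 \
  --trigger-bucket [BUCKET_NAME]
```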
Output

(Screenshot of the resulting BigQuery table omitted.)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Daniel Ocando |
| Solution 2 | |