Is there a way to upload bytes data created with zlib to Google BigQuery?

I want to load string data into BigQuery after compressing it with Python's zlib library.

Here is example code that uses zlib to generate the data:

import zlib    
import pandas as pd

string = 'abs'
df = pd.DataFrame()

data = zlib.compress(bytearray(string, encoding='utf-8'), -1)
df = df.append({'id': 1, 'data': data}, ignore_index=True)  # append returns a new DataFrame, so reassign

I've tried both of the loading methods below (the BigQuery API and pandas_gbq), but each of them gives me an error.

The schema is:

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("id", bigquery.enums.SqlTypeNames.NUMERIC),
        bigquery.SchemaField("data", bigquery.enums.SqlTypeNames.BYTES),
    ],
    write_disposition = "WRITE_APPEND"
)

The two methods I have tried are:

1. BigQuery API
job = bigquery_client.load_table_from_dataframe(
    df, table, job_config=job_config
)

job.result()


2. pandas_gbq
df.to_gbq(destination_table, project_id, if_exists='append')

However, both give similar errors.

1. Error from the BigQuery API:
pyarrow.lib.ArrowInvalid: Got bytestring of length 8 (expected 16)


2. Error from pandas_gbq:
pandas_gbq.gbq.InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.

Is there any way to solve this?

I want to load a Python byte string into a BigQuery BYTES column.

Thank you



Solution 1:[1]

The problem isn't coming from the insertion of your zlib-compressed data. The error occurs when the value of your id key (the DataFrame value 1) is inserted into the NUMERIC column in BigQuery: NUMERIC is represented as a 16-byte decimal on the Arrow side, while the DataFrame stores the id as an 8-byte number, which is what the "Got bytestring of length 8 (expected 16)" message refers to.

The easiest solution is to change the data type of the id column in your BigQuery schema from NUMERIC to INTEGER.
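If you go that route, only the id field in the job configuration needs to change; a minimal sketch (the rest of the code from the question stays the same):

job_config = bigquery.LoadJobConfig(
    schema=[
        # id declared as INTEGER so the plain Python int loads directly
        bigquery.SchemaField("id", bigquery.enums.SqlTypeNames.INTEGER),
        bigquery.SchemaField("data", bigquery.enums.SqlTypeNames.BYTES),
    ],
    write_disposition = "WRITE_APPEND"
)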

However, if you really need your schema to keep the NUMERIC data type, you may convert the id value in your Python code to a decimal.Decimal using the decimal library, as derived from this SO post, before loading it to BigQuery.

You may refer to the sample code below.

from google.cloud import bigquery
import pandas

import zlib   
import decimal

# Construct a BigQuery client object.
client = bigquery.Client()

# Set table_id to the ID of the table to create.
table_id = "my-project.my-dataset.my-table"

string = 'abs'

# compress the UTF-8 encoded string with zlib
data = zlib.compress(bytearray(string, encoding='utf-8'), -1)

# build the record directly; DataFrame.append is deprecated in recent pandas versions
df_2 = pandas.DataFrame([{'id': 1, 'data': data}])

# NUMERIC columns expect decimal.Decimal values rather than plain ints/floats
df_2['id'] = df_2['id'].astype(str).map(decimal.Decimal)

dataframe = pandas.DataFrame(
    df_2,
    # In the loaded table, the column order reflects the order of the
    # columns in the DataFrame.
    columns=[
        "id",
        "data",
    ],
)
job_config = bigquery.LoadJobConfig(
    schema=[
          bigquery.SchemaField("id", bigquery.enums.SqlTypeNames.NUMERIC),
          bigquery.SchemaField("data", bigquery.enums.SqlTypeNames.BYTES),
    ],
    write_disposition = "WRITE_APPEND"
)

job = client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config
)  # Make an API request.
job.result()  # Wait for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)
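To confirm the round trip, a minimal sketch (assuming the same client and table_id as above) that reads the loaded row back and decompresses the BYTES value:

# fetch the loaded rows and decompress the BYTES column back to the original string
for row in client.list_rows(table_id, max_results=10):
    original = zlib.decompress(row["data"]).decode("utf-8")
    print(row["id"], original)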

OUTPUT:

(screenshots of the loaded table in the BigQuery console)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1