Need help creating schema for loading CSV into BigQuery

I am trying to load some CSV files into BigQuery from Google Cloud Storage and am wrestling with schema generation. There is an auto-generate option, but it is poorly documented. If I let BigQuery generate the schema, it does a decent job of guessing data types, but it only sometimes recognizes the first row of the data as a header row; other times it treats the first row as data and generates column names like string_field_N. The first rows of my data are always header rows. Some of the tables have many columns (over 30), and I do not want to mess around with schema syntax, because BigQuery always bombs with an uninformative error message when something (I have no idea what) is wrong with the schema.

So: How can I force it to recognize the first row as a header row? If that isn't possible, how do I get it to spit out the schema it generated in the proper syntax so that I can edit it (for appropriate column names) and use that as the schema on import?



Solution 1:[1]

Schema auto-detection in BigQuery should be able to detect the first row of your CSV file as column names in most cases. One case in which column name detection fails is when the data type is the same throughout your CSV file, header row included. For instance, BigQuery schema auto-detection would not be able to detect header names for the following file, since every field is a string.

headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b

The "Header rows to skip" option in the UI does not fix this shortcoming of schema auto-detection in BigQuery.
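To illustrate why an all-string file is ambiguous, here is a toy heuristic in Python. It is only a sketch of the idea described above, not BigQuery's actual detection algorithm, and the function names are made up for this example.

```python
import csv
import io


def looks_numeric(value: str) -> bool:
    """Return True if the value parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def first_row_looks_like_header(csv_text: str) -> bool:
    """Toy rule: treat the first row as a header only if it contains no
    numbers while the remaining rows contain at least one number."""
    rows = list(csv.reader(io.StringIO(csv_text), skipinitialspace=True))
    if len(rows) < 2:
        return False
    header, body = rows[0], rows[1:]
    if any(looks_numeric(cell) for cell in header):
        return False
    return any(looks_numeric(cell) for row in body for cell in row)


# Every field is text, so nothing distinguishes the header from the data.
all_strings = "headerA, headerB\nrow1a, row1b\nrow2a, row2b\n"
# A numeric column makes the first row stand out as a header.
mixed = "name, count\nrow1a, 1\nrow2a, 2\n"

print(first_row_looks_like_header(all_strings))
print(first_row_looks_like_header(mixed))
```

With the sample file from this answer the rule has no type difference to work with, which matches the behaviour described in the question.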

Solution 2:[2]

If you are following the GCP documentation for Loading CSV Data from Google Cloud Storage, you have the option to skip a given number of rows:

(Optional) An integer indicating the number of header rows in the source data.

The option is called "Header rows to skip" in the Web UI, but it's also available as a CLI flag (--skip_leading_rows) and as a BigQuery API property (skipLeadingRows).
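As a rough sketch of the CLI variant, the Python snippet below assembles a bq load command line that combines --autodetect with --skip_leading_rows. The table name mydataset.mytable and the URI gs://my-bucket/data.csv are placeholders, and the helper function is hypothetical; the command is only built and printed here, not executed.

```python
import shlex
from typing import List


def build_bq_load_command(table: str, gcs_uri: str, skip_rows: int = 1) -> List[str]:
    """Assemble a `bq load` invocation that autodetects the schema and
    skips the given number of leading rows."""
    if skip_rows < 0:
        raise ValueError("skip_rows must be non-negative")
    return [
        "bq", "load",
        "--source_format=CSV",
        "--autodetect",
        f"--skip_leading_rows={skip_rows}",
        table,    # placeholder destination, e.g. dataset.table
        gcs_uri,  # placeholder source file in Cloud Storage
    ]


command = build_bq_load_command("mydataset.mytable", "gs://my-bucket/data.csv")
print(shlex.join(command))
```

The resulting list could be handed to subprocess.run, or the printed line pasted into a shell where the bq tool is installed and authenticated.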

Solution 3:[3]

I would recommend doing 2 things here:

  1. Preprocess your file and store the final layout of the file sans the first row, i.e. the header row.
  2. bq load accepts an additional parameter in the form of a JSON schema file. Use it to explicitly define the table schema and pass the file as a parameter. This gives you the flexibility to alter the schema at any point in time, if required.

Allowing BQ to autodetect schema is not advised.
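For tables with many columns, the two steps above could be scripted rather than typed by hand. The sketch below is one possible approach, not part of the original answer: it reads the header row, emits a JSON schema in which every column is a NULLABLE STRING (an assumption you would adjust per column), and returns the file body without the header. It assumes the header sits on a single line, and the function names are made up.

```python
import csv
import io
import json


def schema_from_header(csv_text: str, default_type: str = "STRING") -> list:
    """Build a BigQuery-style JSON schema from the first CSV row.
    Every column gets the same default type; edit the result as needed."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [
        {"name": name.strip(), "type": default_type, "mode": "NULLABLE"}
        for name in header
    ]


def strip_header(csv_text: str) -> str:
    """Return the CSV text without its first line (assumes the header
    contains no embedded newlines)."""
    _, _, body = csv_text.partition("\n")
    return body


sample = "id,name,weight\n1,apple,2.3\n2,pear,4\n"
schema = schema_from_header(sample)
body = strip_header(sample)

print(json.dumps(schema, indent=2))
print(body, end="")
```

The printed JSON has the same name/type/mode layout that bq load accepts as a schema file, so it can be saved and edited column by column.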

Solution 4:[4]

Yes, you can modify the existing schema: export it as JSON using bq show, then edit the file.

bq show --schema --format=prettyjson project_id:dataset.table > myschema.json

Note that this will result in you creating a new BQ table altogether.
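If the exported schema contains auto-generated names such as string_field_0, the renaming can be done programmatically. The sketch below assumes the exported JSON is a list of objects with name, type and mode keys; the sample values and the helper function are invented for illustration.

```python
import json


def rename_generated_columns(schema: list, header: list) -> list:
    """Replace auto-generated column names with real header names,
    matching them by position. Returns a new list; the input is untouched."""
    if len(schema) != len(header):
        raise ValueError("schema and header have different column counts")
    return [{**field, "name": name} for field, name in zip(schema, header)]


# Stand-in for the contents of myschema.json (values are made up).
exported = json.loads("""
[
  {"name": "string_field_0", "type": "STRING", "mode": "NULLABLE"},
  {"name": "string_field_1", "type": "STRING", "mode": "NULLABLE"}
]
""")

fixed = rename_generated_columns(exported, ["headerA", "headerB"])
print(json.dumps(fixed, indent=2))
```

The result could be written back to myschema.json and passed as the schema argument of bq load, together with --skip_leading_rows=1 so that the header row is not loaded as data.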

Solution 5:[5]

If the column names and the data have the same type throughout the CSV file, BigQuery mistakes the column names for data and adds a self-generated name for each column. I couldn't find any technical way to solve this.
So I took another approach.
If the data is not sensitive, add another column whose column name is a string and whose values are all numbers, e.g. a column named 'Test' in which every value is 0. Upload the file to BigQuery and use this query to drop the column.

ALTER TABLE <table name> DROP COLUMN <Test>

Change <table name> and <Test> according to your table.
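Adding the extra column by hand is tedious for large files. A minimal sketch of doing it with Python's csv module follows; the column name 'Test' and the value 0 come from the example above, while the function name is hypothetical.

```python
import csv
import io


def add_dummy_numeric_column(csv_text: str, name: str = "Test") -> str:
    """Append a column whose header is text and whose values are all 0,
    so that the header row differs in type from the data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(rows[0] + [name])  # header row gets the column name
    for row in rows[1:]:
        writer.writerow(row + ["0"])   # every data row gets a numeric 0
    return out.getvalue()


original = "headerA,headerB\nrow1a,row1b\nrow2a,row2b\n"
patched = add_dummy_numeric_column(original)
print(patched, end="")
```

After loading the patched file, the extra column can be removed with the ALTER TABLE statement shown above.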

Solution 6:[6]

I have a way to get a schema for loading CSV into BigQuery. You just need to edit the column values, for example:

weight|total|summary
2|4|just string
2.3|89.5|just string

If you use BigQuery's schema generator, the fields weight and total will be defined as INT64, but inserting the second row then errors or fails. So you just need to edit the first row like this:

weight|total|summary
'2'|'4'|just string
2.3|89.5|just string

You must set the fields weight and total as STRING, and if you want to aggregate them, just convert the data type in BigQuery.

cheers

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Soroush Sotoudeh
Solution 2 Guillermo Cacheda
Solution 3 Raunak Jhawar
Solution 4 Raunak Jhawar
Solution 5 Awal
Solution 6 randi dwi putra