BigQuery insert values AS, assume nulls for missing columns

Imagine there is a table with 1000 columns. I want to add a row with values for 20 columns and assume NULLs for the rest.

INSERT VALUES syntax can be used for that:

INSERT INTO `tbl` (
  date, 
  p, 
  ... # 18 more names
)
VALUES(
  DATE('2020-02-01'), 
  'p3',
  ... # 18 more values
)

The problem is that it is hard to tell which value corresponds to which column, and if you need to change or comment out a value, you have to make edits in two places.

INSERT SELECT syntax can also be used:

INSERT INTO `tbl`
SELECT 
  DATE('2020-02-01') AS date, 
  'p3' AS p,
  ... # 18 more value AS column
  ... # 980 more NULL AS column

Then if I need to comment out a column, only one line has to be commented out. But obviously having to write 980 NULLs is an inconvenience.

What is the way to combine both approaches? To achieve something like:

INSERT INTO `tbl`
SELECT 
  DATE('2020-02-01') AS date, 
  'p3' AS p,
  ... # 18 more value AS column

The query above doesn't work; the error is: Inserted row has wrong column count; Has 20, expected 1000.



Solution 1:[1]

Your first version is really the only one you should ever be using for SQL inserts. It ensures that every target column is explicitly mentioned, and it is unambiguous about where the literals in the VALUES clause should go.

You can use the version which does not explicitly mention column names, and at first it might seem that you are saving yourself some code. But realize that a column list is still used: it is the list of all the table's columns, in the positions given by the table definition. Your code might work today, but any addition or removal of a column, or any change in column order, can totally break your insert script. For this reason, most will strongly advocate for the first version.
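The positional fragility described above can be sketched with a small demo. This uses Python's sqlite3 (a stand-in, not BigQuery; SQLite's INSERT semantics match on this point): without a column list, values bind to columns purely by position, while an explicit column list survives schema changes and leaves unnamed columns NULL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (date TEXT, p TEXT, qty INTEGER)")

# Without a column list, values map to columns by table position only;
# reordering the table definition would silently scramble this insert.
con.execute("INSERT INTO tbl VALUES ('2020-02-01', 'p3', 7)")

# With an explicit column list, the mapping is unambiguous and any
# omitted (nullable) column defaults to NULL.
con.execute("INSERT INTO tbl (date, p) VALUES ('2020-02-02', 'p4')")

rows = con.execute("SELECT * FROM tbl ORDER BY date").fetchall()
print(rows)
# [('2020-02-01', 'p3', 7), ('2020-02-02', 'p4', None)]
```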

Solution 2:[2]

You can try the following solution; it is a combination of the two approaches you highlighted above:

INSERT INTO `tbl` (date, p, ... # 18 more column names)
SELECT 
  DATE('2020-02-01') AS date, 
  'p3' AS p,
  ... # 18 more value AS column 

A couple of things to consider here:

  1. The other 980 columns must be nullable, i.e. able to hold NULL values.
  2. The column list in the INSERT line and the expressions in the SELECT must be in the same order, so that the data lands in the correct columns.
  3. To avoid any confusion, use aliases in the SELECT that match the target table's column names. This removes any ambiguity.

Hopefully this will work for you.
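The combined form above can be verified end to end. A minimal sketch using sqlite3 as a stand-in for a wide BigQuery table (the 10 extra columns stand in for the 980 nullable ones): only the columns named in the INSERT list receive values, and every omitted column comes back NULL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Stand-in for a wide table: 2 populated columns plus 10 "other" nullable ones.
extra = ", ".join(f"c{i} INTEGER" for i in range(10))
con.execute(f"CREATE TABLE tbl (date TEXT, p TEXT, {extra})")

# INSERT ... SELECT with an explicit column list; the SELECT aliases
# match the target column names for readability.
con.execute("""
    INSERT INTO tbl (date, p)
    SELECT '2020-02-01' AS date, 'p3' AS p
""")

row = con.execute("SELECT * FROM tbl").fetchone()
print(row[:2], all(v is None for v in row[2:]))
# ('2020-02-01', 'p3') True
```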

Solution 3:[3]

In BigQuery, the best way to do what you're describing is to first load to a staging table. I'll assume you can get the values you want to insert into JSON format with keys that correspond to the target column names.

values.json

{"date": "2020-01-01", "p": "p3", "column": "value", ... }

Then generate a schema file for the target table and save it locally

bq show --schema project:dataset.tbl > schema.json

Load the new data to the staging table using the target schema. This gives you "named" null values for each column present in the target schema but missing from your json, bypassing the need to write them out.

bq load --replace --source_format=NEWLINE_DELIMITED_JSON \
project:dataset.stg_tbl values.json schema.json

Now the INSERT SELECT statement works every time:

insert into `project.dataset.tbl`
select * from `project.dataset.stg_tbl`
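What the staging-table load effectively buys you can be illustrated in a few lines of Python. This is a sketch of the padding behavior, not the bq CLI itself; the schema list and the record are stand-ins for schema.json and one line of values.json. Loading against the full target schema amounts to filling every missing key with NULL.

```python
import json

# Stand-in for the target schema's column names (schema.json).
schema = ["date", "p"] + [f"col{i}" for i in range(5)]

# Stand-in for one line of values.json: only 2 of the 7 columns present.
record = json.loads('{"date": "2020-01-01", "p": "p3"}')

# The load pads each record out to the full schema, so every column
# missing from the JSON becomes a "named" NULL (None here).
full_row = {col: record.get(col) for col in schema}
print(full_row)
```

After this step, a plain `select *` from the staging table has exactly the target table's column count, which is why the final INSERT SELECT no longer fails.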

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Tim Biegeleisen
Solution 2 Vibhor Gupta
Solution 3 Colin Lethem