'BigQuery - DELETE statement to remove duplicates
There are plenty of great posts on SQL that selects unique rows and write (truncates) a table so the dus are removed. e.g
WITH ev AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
FROM `duplicates`
)
SELECT
* EXCEPT(rowNum)
FROM
ev
WHERE rowNum = 1
I was trying to explore this slightly differently using DML and DELETE
(e.g if you don't want to use a BQ savedQuery, just execute SQL). What I want to do is roughly:
WITH dup_events AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
FROM `duplicates`
)
DELETE FROM
dup_events
WHERE rowNum > 1
but got this error in the console:
Syntax error: Expected "(" or keyword SELECT but got keyword DELETE at [10:1]
Can this be achieved (standardSQL) using DELETE
?
Solution 1:[1]
From the syntax documentation, the argument to DELETE
needs to be a table, and there is no provision for using a WITH
clause. This makes sense given that you can't delete from what is essentially a logical view (a CTE). You can express what you want by putting the logic inside the filter, e.g.
DELETE
FROM duplicates AS d
WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC)
FROM `duplicates` AS d2
WHERE d.id = d2.id AND d.loadTime = d2.loadTime) > 1;
Solution 2:[2]
Below actually :o) works
#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
SELECT AS STRUCT id, MAX(loadTime) loadTime
FROM `yourproject.yourdataset.duplicates`
GROUP BY id)
Note: it assumes that loadTime is also unique - meaning if for given id there are more than one record with latest loadTime - they all will be preserved
Solution 3:[3]
This has to be the easiest way:
create or replace table `myproject.mydataset.duplicates` as (
select distinct *
from `myproject.mydataset.duplicates`)
If you have an array data type, try this:
-- build a test table with a duplicate and an array datatype column --
create or replace table DW.pmoTest as (
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 2, 3] as int_array, current_date as createdate
union all
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 7, 3] as int_array, current_date as createdate
union all
select 2 as ID, 'chamri' as firstname,ARRAY<INT64>[1, 2, 39, 4] as int_array, current_date as createdate
);
-- recreate table without duplicate row
create or replace table DW.pmoTest as (
SELECT col.* FROM (
SELECT ARRAY_AGG(tbl ORDER BY createdate LIMIT 1)[OFFSET(0)] col
FROM DW.pmoTest tbl
GROUP BY ID
)
);
Solution 4:[4]
Those answers above works only for small size table. If you have a large size partition table, and only want to remove duplicates in a given range, use the SQL below:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partition table, using surrage_key as unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles");
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
SELECT k.*
FROM (
SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
FROM `gcp_project`.`data_set`.`the_table` AS original_data
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
)
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partiion range
THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Daniel |
Solution 2 | Mikhail Berlyant |
Solution 3 | Daniel |
Solution 4 | Daniel |