'How to delete Duplicate records in snowflake database table

how to delete the duplicate records from snowflake table. Thanks

ID Name
1  Apple
1  Apple
2  Apple
3  Orange
3  Orange

Result should be:

ID Name
1  Apple
2  Apple
3  Orange


Solution 1:[1]

Adding here a solution that doesn't recreate the table. This because recreating a table can break a lot of existing configurations and history.

Instead we are going to delete only the duplicate rows and insert a single copy of each, within a transaction:


-- find all duplicates
create or replace transient table duplicate_holder as (
    select $1, $2, $3
    from some_table
    group by 1,2,3
    having count(*)>1
);

-- time to use a transaction to insert and delete
begin transaction;

-- delete duplicates
delete from some_table a
using duplicate_holder b
where (a.$1,a.$2,a.$3)=(b.$1,b.$2,b.$3);

-- insert single copy
insert into some_table
select * 
from duplicate_holder;

-- we are done
commit;

Advantages:

  • Doesn't recreate the table
  • Doesn't modify the original table
  • Only deletes and inserts duplicated rows (good for time travel storage costs, avoids unnecessary reclustering)
  • All in a transaction

Solution 2:[2]

Snowflake does not have effective primary keys, their use is primarily with ERD tools. Snowflake does not have something like a ROWID either, so there is no way to identify duplicates for deletion.

It is possible to temporarily add a "is_duplicate" column, eg. numbering all the duplicates with the ROW_NUMBER() function, and then delete all records with "is_duplicate" > 1 and finally delete the utility column.

Another way is to create a duplicate table and swap, as others have suggested. However, constraints and grants must be kept. One way to do this is:

CREATE TABLE new_table LIKE old_table COPY GRANTS;
INSERT INTO new_table SELECT DISTINCT * FROM old_table;
ALTER TABLE old_table SWAP WITH new_table;

The code above removes exact duplicates. If you want to end up with a row for each "PK" you need to include logic to select which copy you want to keep.

This illustrates the importance to add update timestamp columns in a Snowflake Data Warehouse.

Solution 3:[3]

If you have some primary key as such:

CREATE TABLE fruit (key number, id number, name text);

insert into fruit values (1,1, 'Apple'), (2,1,'Apple'),
      (3,2, 'Apple'), (4,3, 'Orange'), (5,3, 'Orange');

as then

DELETE FROM fruit
WHERE key in (
  SELECT key 
  FROM (
      SELECT key
          ,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY key) AS rn
      FROM fruit
  )
  WHERE rn > 1
);

But if you do not have a unique key then you cannot delete that way. At which point a

CREATE TABLE new_table_name AS
SELECT id, name FROM (
    SELECT id
        ,name
        ,ROW_NUMBER() OVER (PARTITION BY id, name) AS rn
    FROM table_name
)
WHERE rn > 1

and then swap them

ALTER TABLE table_name SWAP WITH new_table_name

Solution 4:[4]

this has been bothering me for some time as well. As snowflake has added support for qualify you can now create a dedupped table with a single statement without subselects:

CREATE TABLE fruit (id number, nam text);
insert into fruit values (1, 'Apple'), (1,'Apple'),
      (2, 'Apple'), (3, 'Orange'), (3, 'Orange');


CREATE OR REPLACE TABLE fruit AS 
SELECT * FROM 
fruit 
qualify row_number() OVER (PARTITION BY id, nam ORDER BY id, nam) = 1;
SELECT * FROM fruit;

Of course you are left with a new table and loose table history, primary keys, foreign keys and such.

Solution 5:[5]

Based on above ideas.....following query worked perfectly in my case.

CREATE OR REPLACE TABLE SCHEMA.table
 AS
SELECT
    DISTINCT * 
FROM
    SCHEMA.table
  ;

Solution 6:[6]

Here's a very simple approach that doesn't need any temporary tables. It will work very nicely for small tables, but might not be the best approach for large tables.

insert overwrite into some_table
select distinct * from some_table
;

The OVERWRITE keyword means that the table will be truncated before the insert takes place.

Solution 7:[7]

Your question boils down to: How can I delete one of two perfectly identical rows? . You can't. You can only do a DELETE FROM fruit where ID = 1 and Name = 'Apple';, then both rows will go away. Or you don't, and keep both.

For some databases, there are workarounds using internal rows, but there isn't any in snowflake, see https://support.snowflake.net/s/question/0D50Z00008FQyGqSAL/is-there-an-internalmetadata-unique-rowid-in-snowflake-that-i-can-reference . You cannot limit deletes, either, so your only option is to create a new table and swap.


Additional Note on Hans Henrik Eriksen's remark on the importance of update timestamps: This is a real help when the duplicates where added later. If, for example, you want to keep the newer values, you can then do this:

-- setup
create table fruit (ID Integer, Name VARCHAR(16777216), "UPDATED_AT" TIMESTAMP_NTZ);
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (2, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- wait > 1 nanosecond
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);

-- delete older duplicates (DESC)
DELETE FROM fruit
  WHERE (ID
  , UPDATED_AT) IN (
     SELECT ID
     , UPDATED_AT
     FROM (
         SELECT ID
         , UPDATED_AT
         , ROW_NUMBER() OVER (PARTITION BY ID ORDER BY UPDATED_AT DESC) AS rn
         FROM fruit
     )
     WHERE rn > 1
  );

Solution 8:[8]

The following solution is effective if you are looking at one or few columns as primary key references for the table.

-- Create a temp table to hold our duplicates (only second occurrence)
CREATE OR REPLACE TRANSIENT TABLE temp_table AS (
  SELECT [col1], [col2], .. [coln]
  FROM (
    SELECT *, ROW_NUMBER () OVER(
      PARTITION BY [pk]1, [pk]2, .. [pk]m
      ORDER BY [pk]1, [pk]2, .. [pk]m) AS duplicate_count
      FROM [schema].[table]
  ) WHERE duplicate_count = 2
);


-- Delete all the duplicate records from the table
DELETE FROM [schema].[table] t1
USING temp_table t2
WHERE 
  t1.[pk]1 = t2.[pk]1 AND 
  t1.[pk]2 = t2.[pk]2 AND
  ..
  t1.[pk]n = t2.[pk]m;

-- Insert single copy using the temp_table in the original table
INSERT INTO [schema].[table]
SELECT * 
FROM temp_table;

Solution 9:[9]

simple UNION eliminate duplicates on use case of just all columns/no pks.

anyway problem should he solved as early on ingestion pipeline, and/or use scd etc.

Just a raw magic best way how to delete is wrong in principle, use scd with high resolution timestamp, solves any problem.

you want fix massive dups load ? then add column like batch id and remove all batch loaded records

Its like being healthy, you have 2 approaches:

  1. eat a lot > get far > go-to a gym to burn it
  2. eat well > have healthy life style and no need for gym.

So before discussing best gym, try change life style.

hope this helps, learn to do pressure upstream on data producers instead of living like jesus christ trying to clean up the mess of everyone.

Solution 10:[10]

Not sure if people are still interested in this but I've used the below query which is more elegant and seems to have worked

create or replace table {{your_table}} as 
select * from {{your_table}}
qualify row_number() over (partition by {{criteria_columns}} order by 1) = 1

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3
Solution 4 Lutz K.
Solution 5 Ali Bigdeli
Solution 6 TobyLL
Solution 7 ynux
Solution 8 Amit Pathak
Solution 9 Emanuel Oliveira
Solution 10 Andrei Budaes