What is the best way to clean up and recreate a Databricks Delta table?

I am trying to clean up and recreate a Databricks Delta table for integration tests.

I want to run the tests on a DevOps agent, so I am using JDBC (Simba driver), but it says statement type "DELETE" is not supported.

When I clean up the underlying DBFS location using the DBFS API ("rm -r"), it removes the table's files, but the next read after recreating the table fails with: "A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table DELETE statement."

Also, if I simply run DELETE on the Delta table, the underlying DBFS directory and its files remain intact. How can I clean up the Delta table as well as the underlying files gracefully?
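For illustration, this is roughly the pattern described above, as a Databricks notebook sketch (the table name and DBFS path are placeholders; spark, dbutils and display are the usual notebook globals):

    table_name = "test_db.my_delta_table"          # placeholder table name
    table_path = "dbfs:/mnt/test/my_delta_table"   # placeholder DBFS location

    # DELETE removes the rows logically (it writes a new transaction-log entry),
    # but the underlying Parquet files stay on DBFS until VACUUM removes them.
    spark.sql(f"DELETE FROM {table_name}")

    print(spark.table(table_name).count())         # 0 rows visible through the table
    display(dbutils.fs.ls(table_path))             # data files and _delta_log still present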



Solution 1:[1]

You can use the VACUUM command to do the cleanup; I haven't used it myself yet, though.

If you are using Spark, you can use the overwriteSchema option to reload the data.
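A minimal sketch of what that could look like in PySpark (df and the table name are hypothetical):

    # Overwrite the existing Delta table in place, allowing the schema to change.
    (df.write
       .format("delta")
       .mode("overwrite")
       .option("overwriteSchema", "true")
       .saveAsTable("test_db.my_delta_table"))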

If you can provide more details on how you are using it, that would help.

Solution 2:[2]

The steps are as follows. When you do a DROP TABLE or a DELETE FROM on a Delta table, the following things happen:

  • DROP TABLE: drops your table, but the data still resides. (Also, you can't create a new table definition with a changed schema at the same location.)
  • DELETE FROM <table name>: deletes the data from the table, but the transaction log still resides.

So, Step 1 - DROP TABLE schema.Tablename

Step 2 - %fs rm -r /mnt/path/where/your/table/definition/is/pointed/fileNames.parquet

Step 3 - %fs ls to make sure there is no data and no transaction log left at that location

Step 4 - Now re-run your CREATE TABLE statement with any changes you desire, using the Delta location /mnt/path/where/your/table/definition/is/pointed/fileNames.parquet

Step 5 - Start using the table and verify it with %sql DESC FORMATTED schema.Tablename (a sketch of the whole sequence follows below).
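Put together, the whole sequence could look roughly like this in a PySpark notebook (the table name, path, and column list are placeholders for your own schema and location):

    table_name = "schema.Tablename"                                  # placeholder
    table_path = "/mnt/path/where/your/table/definition/is/pointed"  # placeholder

    # Step 1: drop the table definition from the metastore.
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")

    # Step 2: remove the data files AND the _delta_log directory.
    dbutils.fs.rm(table_path, True)

    # Step 3: confirm nothing is left at that location
    # (dbutils.fs.ls raises if the path no longer exists, which is also fine).
    # display(dbutils.fs.ls(table_path))

    # Step 4: recreate the table, pointing at the same (now empty) location.
    spark.sql(f"""
        CREATE TABLE {table_name} (id INT, name STRING)
        USING DELTA
        LOCATION '{table_path}'
    """)

    # Step 5: verify the new definition.
    display(spark.sql(f"DESC FORMATTED {table_name}"))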

Solution 3:[3]

Make sure that you are not creating an external table. There are two types of tables:

1) Managed Tables

2) External Tables (Location for dataset is specified)

When you delete a managed table, Spark is responsible for cleaning up both the table's metadata stored in the metastore and the data (files) belonging to that table.

But for an external table, Spark does not own the data, so when you drop an external table, only the metadata in the metastore is deleted; the data (files) that belonged to the table are not deleted.
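A short sketch of the difference (hypothetical names and path, run from a PySpark notebook):

    # Managed table: Spark owns the files, so DROP TABLE removes metadata AND data.
    spark.sql("CREATE TABLE test_db.managed_tbl (id INT) USING DELTA")

    # External table: a LOCATION is supplied, so DROP TABLE removes only the
    # metastore entry and leaves the files at that location untouched.
    spark.sql("""
        CREATE TABLE test_db.external_tbl (id INT)
        USING DELTA
        LOCATION '/mnt/some/external/path'
    """)

    # The 'Type' row in the output shows MANAGED vs EXTERNAL.
    spark.sql("DESC FORMATTED test_db.external_tbl").show(truncate=False)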

After this, if you confirm that your tables are managed tables and dropping the table still does not delete the files, you can use the VACUUM command:

VACUUM <databaseName>.<TableName> [RETAIN NUM HOURS]

This cleans up uncommitted files and files that are no longer referenced by the table from the table's folder. I hope this helps.
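For example (hypothetical table name; note that by default Delta refuses retention periods shorter than 168 hours, so for throwaway integration-test data you have to disable that safety check first, at the cost of losing time travel to older versions):

    # Allow a retention period below the default 7-day safety threshold.
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

    # Physically remove files no longer referenced by the current table version.
    spark.sql("VACUUM test_db.my_delta_table RETAIN 0 HOURS")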

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Piyush Patel
Solution 2:
Solution 3: Bilal Shafqat