'Databricks: Z-order vs partitionBy

I am learning Databricks and I have some questions about z-order and partitionBy. When I am reading about both functions it sounds pretty similar. Both functions are grouping data in some way that accelerate reading operations. Also it s looks like partitionBy is good with join operations but I don t really understand what function should I use when I want only to read data. Can you tell how should I think about both functions to use it correctly?



Solution 1:[1]

Partitioning physically splits the data into different files/directories having only one specific value, while ZOrder provides clustering of related data inside the files that may contain multiple possible values for given column.

Partitioning is useful when you have a low cardinality column - when there are not so many different possible values - for example, you can easily partition by year & month (maybe by day), but if you partition in addition by hour, then you'll have too many partitions with too many files, and it will lead to big performance problems.

ZOrder allows to create bigger files that are more efficient to read compared to many small files.

But you can combine both partitioning with ZOrder - for example partition by year/month, and ZOrder by day - that will allow to collocate data of the same day close to each other, and you can access them faster (because you read fewer files).

Besides ZOrder, you can also use data skipping to efficiently filter out files that doesn't contain data you need for your query.

You can read about data skipping & ZOrder in the following blog post.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1