'How "stable" is monotonically_increasing_id() in Spark?
I'm looking for an inexpensive way to distinguish duplicates and/or uniquely identify rows. I've been looking at the Spark built-ins monotonically_increasing_id()
and uuid()
.
The problem with uuid() is that it does not retain its value and seems to be evaluated on the spot. For example
with uuids as (select uuid() as uuid)
select * from uuids join uuids
produces two different UUIDs.
If I use monotonically_increasing_id()
, I get two identical values, but can I trust that to always work? In other words, if I have a CTE, where I have an id
column generated by monotonically_increasing_id()
, will any later rows derived from a row from that CTE have a consistent value of the id
column within the same query?
In pseudo-SQL:
with /* ... */
with_ids as (select monotonically_increasing_id() as id, * from /* ... */),
/* ... */
derived_a as (/* Somehow derived from with_ids */),
derived_b as (/* Somehow derived from with_ids */)
select
(a.id == b.id) as are_same,
(a.id != b.id) as are_different
from derived_a as a
join derived_b as b
Will rows derived from the exact same rows of with_ids
have are_same == true
? Is it guaranteed that if the original rows were different, then are_different == true
? The former is definitely false for uuid()
.
[Updated] Another example, involving a join and group by:
with
with_ids as (
select
monotonically_increasing_id() as id
,*
from table_a)
joined as (
select struct(a.*) as packed_a, a.id
from with_ids as a
left join table_b as b
on /* whatever */
)
select collect_set(packed_a) as should_be_singular
from joined
group by id
Is the row count in the above equal to the number of rows in table_a
and is should_be_singular
a single element array?
The documentation for both functions state that they are non-deterministic, but don't really offer any details on when the functions are evaluated or how they should be used.
The issue seems to be mentioned in SPARK-14241 and this question, but it's not clear if and under what conditions monotonically_increasing_id() is consistent within a single SQL statement.
Solution 1:[1]
from my past experience when working with row identifiers (uuid, row_number or monotonically_increasing_id) I cache
the dataframe.
Then every subsequent calculation using the dataframe will have static row identifiers.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | walking |