How to merge rows by a similar column via Levenshtein distance
I'm using AWS Athena and I'm trying to merge all the rows whose text column values have a levenshtein_distance lower than 5 from each other, and sum their normalised percentages.
The table has the following structure:
CREATE EXTERNAL TABLE `actions`(
`id` string COMMENT 'from deserializer',
`text` string COMMENT 'from deserializer',
`normalizedpercentage` float COMMENT 'from deserializer',
`timestamp` timestamp COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://xxxxxx/db/actions'
TBLPROPERTIES (
'has_encrypted_data'='false',
'transient_lastDdlTime'='1566991410')
This is what I would like to do:
WITH t AS
(SELECT id,
text,
normalizedPercentage
FROM actions
WHERE actions.timestamp
BETWEEN timestamp '2019-08-01 00:00:01'
AND timestamp '2019-08-31 23:59:59' )
SELECT *,
SUM(normalizedPercentage)
OVER (PARTITION BY levenshtein_distance(text, EVERY_OTHER_TEXT_COLUMN) < 5) AS cumulative
FROM t
Unfortunately the PARTITION BY clause only accepts a column name.
I was thinking about defining a function and using it to loop through all the rows; however, this doesn't seem to be possible in Presto.
Solution 1:[1]
You can calculate a new column in the temporary table (the CTE) based on your function, and then use that column for partitioning in the main query:
WITH t AS
(SELECT id,
text,
normalizedPercentage,
CASE WHEN levenshtein_distance(text, EVERY_OTHER_TEXT_COLUMN) < 5 THEN 'groupA' ELSE 'groupB' END AS classification
FROM actions
WHERE actions.timestamp
BETWEEN timestamp '2019-08-01 00:00:01'
AND timestamp '2019-08-31 23:59:59' )
SELECT *,
SUM(normalizedPercentage)
OVER (PARTITION BY classification ) AS cumulative
FROM t
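To make the behaviour of this approach concrete, here is a minimal Python sketch (not Athena code) of the same two-step idea: derive a label per row, then sum within each label. EVERY_OTHER_TEXT_COLUMN in the query is a placeholder, so the sketch assumes a single fixed reference_text to compare against. That name, the sample rows and the helper functions are all hypothetical.

```python
from collections import defaultdict


def levenshtein_distance(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def classify_and_sum(rows, reference_text, threshold=5):
    """Label each row by its distance to reference_text (an assumed fixed
    string), then attach the per-label sum, like SUM(...) OVER (PARTITION BY label)."""
    labelled = [
        (row_id, text, pct,
         "groupA" if levenshtein_distance(text, reference_text) < threshold else "groupB")
        for row_id, text, pct in rows
    ]
    totals = defaultdict(float)
    for _, _, pct, label in labelled:
        totals[label] += pct
    return [(row_id, text, label, totals[label]) for row_id, text, _, label in labelled]


# Hypothetical sample data: (id, text, normalizedPercentage)
rows = [
    ("1", "click button", 0.25),
    ("2", "click buton", 0.125),
    ("3", "open settings page", 0.5),
]

result = classify_and_sum(rows, reference_text="click button")
for row in result:
    print(row)
```

Note that a single CASE expression like this only ever produces two groups (close to the reference or not), so it does not by itself cluster rows that are similar to each other.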
Solution 2:[2]
Apply a cross join of the table to itself first, so that every record is paired with every other record, then use a WHERE clause to filter out the irrelevant pairs; in your case, keep only the pairs where distance < 5.
WITH t AS
(SELECT a.id,
a.text,
a.normalizedPercentage,
b.id b_id,
b.text b_text,
b.normalizedPercentage b_normalizedPercentage,
levenshtein_distance(a.text, b.text) distance
FROM actions a
CROSS JOIN actions b
WHERE
(a.timestamp BETWEEN timestamp '2019-08-01 00:00:01'
AND timestamp '2019-08-31 23:59:59')
AND
(b.timestamp BETWEEN timestamp '2019-08-01 00:00:01'
AND timestamp '2019-08-31 23:59:59')
AND
(levenshtein_distance(a.text, b.text) < 5)
)
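SELECT *,
-- sum the matched rows' percentages for each original row (partitioning by distance would mix unrelated pairs)
SUM(b_normalizedPercentage)
OVER (PARTITION BY id) AS cumulative
FROM t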
P.S. I haven't tested this, but this should work
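Since the query is untested, here is a rough Python sketch of what the self-join is meant to compute, under one reading of the goal: for each row, sum the percentages of every row (itself included) whose text is fewer than 5 edits away. The sample rows and function names are hypothetical, and this models the logic only; it is not Athena code.

```python
def levenshtein_distance(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def sum_similar(rows, threshold=5):
    """Pair every row with every row (the cross join), keep pairs below the
    threshold (the WHERE clause), and sum the matched percentages per row."""
    result = []
    for row_id, text, _ in rows:
        total = sum(other_pct
                    for _, other_text, other_pct in rows
                    if levenshtein_distance(text, other_text) < threshold)
        result.append((row_id, text, total))
    return result


# Hypothetical sample data: (id, text, normalizedPercentage)
rows = [
    ("1", "click button", 0.25),
    ("2", "click buton", 0.125),
    ("3", "open settings page", 0.5),
]

similar_totals = sum_similar(rows)
for row in similar_totals:
    print(row)
```

Like the cross join itself, this does one distance computation per pair of rows, so the cost grows quadratically with the number of rows in the time window.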
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | zlidime |
Solution 2 | Tomas |