How to merge rows by a similar column via Levenshtein distance

I'm using AWS Athena and I'm trying to merge all the rows whose text column is within a levenshtein_distance of less than 5 of each other, and to sum their normalised percentages.

The table has the following structure:

CREATE EXTERNAL TABLE `actions`(
  `id` string COMMENT 'from deserializer', 
  `text` string COMMENT 'from deserializer',
  `normalizedpercentage` float COMMENT 'from deserializer', 
  `timestamp` timestamp COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
  's3://xxxxxx/db/actions'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'transient_lastDdlTime'='1566991410')

This is what I would like to do:

WITH t AS 
    (SELECT id,
         text,
         normalizedPercentage
    FROM actions
    WHERE actions.timestamp
        BETWEEN timestamp '2019-08-01 00:00:01'
            AND timestamp '2019-08-31 23:59:59' )
SELECT *,
         SUM(normalizedPercentage)
    OVER (PARTITION BY levenshtein_distance(text, EVERY_OTHER_TEXT_COLUMN) < 5) AS cumulative
FROM t

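Unfortunately this doesn't work: EVERY_OTHER_TEXT_COLUMN is only a placeholder, and PARTITION BY evaluates its expression one row at a time, so it cannot compare a row's text against every other row.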

I was thinking about defining a function and using it to loop through all the rows, but this doesn't seem to be possible in Presto.



Solution 1:[1]

You can compute a new column in the CTE based on your function, and then partition by that column in the main query:

WITH t AS 
    (SELECT id,
         text,
         normalizedPercentage,
         -- EVERY_OTHER_TEXT_COLUMN is the placeholder from the question;
         -- replace it with a real column or value to compare against
         CASE
             WHEN levenshtein_distance(text, EVERY_OTHER_TEXT_COLUMN) < 5 THEN 'groupA'
             ELSE 'groupB'
         END AS classification
    FROM actions
    WHERE actions.timestamp
        BETWEEN timestamp '2019-08-01 00:00:01'
            AND timestamp '2019-08-31 23:59:59' )
SELECT *,
         SUM(normalizedPercentage)
    OVER (PARTITION BY classification) AS cumulative
FROM t
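
Because the CASE expression is evaluated once per row, it needs a concrete second argument. A minimal sketch of the same idea, assuming you want to classify every row against one fixed reference string (the literal 'open the door' is hypothetical and not from the question):

```sql
-- Sketch only: 'open the door' is a made-up reference string.
WITH t AS (
    SELECT id,
           text,
           normalizedpercentage,
           CASE
               WHEN levenshtein_distance(text, 'open the door') < 5 THEN 'groupA'
               ELSE 'groupB'
           END AS classification
    FROM actions
    WHERE actions.timestamp BETWEEN timestamp '2019-08-01 00:00:01'
                                AND timestamp '2019-08-31 23:59:59'
)
-- Each row carries the total of its own classification group
SELECT *,
       SUM(normalizedpercentage) OVER (PARTITION BY classification) AS cumulative
FROM t
```

This splits the rows into only two groups (close to the reference or not), so it does not by itself compare every row with every other row.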

Solution 2:[2]

First cross join the table to itself so that every record is paired with every other record, then use a WHERE clause to filter out the irrelevant pairs, in your case keeping only those where the distance is < 5.

WITH t AS 
    (SELECT a.id,
         a.text,
         a.normalizedPercentage,
         b.id b_id,
         b.text b_text,
         b.normalizedPercentage b_normalizedPercentage,
         levenshtein_distance(a.text, b.text) distance
    FROM actions a
    CROSS JOIN actions b
    WHERE
       (a.timestamp BETWEEN timestamp '2019-08-01 00:00:01'
                        AND timestamp '2019-08-31 23:59:59')
       AND
       (b.timestamp BETWEEN timestamp '2019-08-01 00:00:01'
                        AND timestamp '2019-08-31 23:59:59')
       AND 
       (levenshtein_distance(a.text, b.text) < 5) 
)
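-- Partition by the left-hand row so that each row sums the percentages of its
-- near matches (a row also pairs with itself at distance 0, so its own value is included)
SELECT *,
       SUM(b_normalizedPercentage)
           OVER (PARTITION BY id) AS cumulative
FROM t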

P.S. I haven't tested this, but it should work.
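
To collapse the matches into one row per record, the pairs from the self-join can also be aggregated with GROUP BY instead of a window function. The following is a rough sketch of that variant on made-up data; the sample_actions CTE and its three rows are assumptions for illustration only, and in practice you would join actions to itself with the timestamp filters from the query above.

```sql
-- Sketch only: the sample rows below are invented for illustration.
WITH sample_actions AS (
    SELECT *
    FROM (VALUES
            ('1', 'open door',          0.2),
            ('2', 'open doors',         0.3),
            ('3', 'restart the server', 0.5)
         ) AS v (id, text, normalizedpercentage)
)
SELECT a.id,
       a.text,
       -- sums every row within distance < 5 of a, including a itself (distance 0)
       SUM(b.normalizedpercentage) AS cumulative
FROM sample_actions a
CROSS JOIN sample_actions b
WHERE levenshtein_distance(a.text, b.text) < 5
GROUP BY a.id, a.text
ORDER BY a.id
```

Here rows 1 and 2 are one edit apart, so each reports the sum of both percentages (0.2 + 0.3), while row 3 matches only itself. Note that this yields one total per row rather than disjoint clusters, and that the cross join compares every pair of rows, so the cost grows quadratically with the number of rows in the time window.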

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 zlidime
Solution 2 Tomas