Redshift: split single dynamic column into multiple rows in new table
With a table like:
uid | segmentids
-------------------------+----------------------------------------
f9b6d54b-c646-4bbb-b0ec | 4454918|4455158|4455638|4455878|4455998
asd7a0s9-c646-asd7-b0ec | 1265899|1265923|1265935|1266826|1266596
gd3355ff-cjr8-assa-fke0 | 2237557|2237581|2237593
laksnfo3-kgi5-fke0-b0ec | 4454918|4455158|4455638|4455878
How to create a new table with:
uid | segmentids
-------------------------+---------------------------
f9b6d54b-c646-4bbb-b0ec | 4454918
f9b6d54b-c646-4bbb-b0ec | 4455158
f9b6d54b-c646-4bbb-b0ec | 4455638
f9b6d54b-c646-4bbb-b0ec | 4455878
f9b6d54b-c646-4bbb-b0ec | 4455998
asd7a0s9-c646-asd7-b0ec | 1265899
asd7a0s9-c646-asd7-b0ec | 1265923
asd7a0s9-c646-asd7-b0ec | 1265935
asd7a0s9-c646-asd7-b0ec | 1266826
asd7a0s9-c646-asd7-b0ec | 1266596
The number of segments is dynamic and can vary with each record. I tried the split_part function with the delimiter, but it requires a position index within the string, which is dynamic here.
Any suggestions?
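(For illustration, this is the limitation: split_part extracts only one fixed position per call. A sketch, using test_split as a stand-in name for the table above:

select uid, split_part(segmentids, '|', 2) as second_segment
from test_split;
-- returns only the 2nd segment of every row; no single fixed
-- index can return all segments at once
)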
Solution 1:[1]
Here is the Redshift answer; it will work with up to 10,000 segment id values per row.
test data
create table test_split (uid varchar(50),segmentids varchar(max));
insert into test_split
values
('f9b6d54b-c646-4bbb-b0ec','4454918|4455158|4455638|4455878|4455998'),
('asd7a0s9-c646-asd7-b0ec','1265899|1265923|1265935|1266826|1266596'),
('asd7345s9-c646-asd7-b0ec','1235935|1263456|1265675696'),
('as345a0s9-c646-asd7-b0ec','12765899|12658883|12777935|144466826|1266226|12345')
;
code
with ten_numbers as (
    select 1 as num union select 2 union select 3 union select 4 union select 5
    union select 6 union select 7 union select 8 union select 9 union select 0
)
, generated_numbers AS
(
    -- cross join four copies of ten_numbers to generate 0..9999
    SELECT (1000 * t1.num) + (100 * t2.num) + (10 * t3.num) + t4.num AS gen_num
    FROM ten_numbers AS t1
    JOIN ten_numbers AS t2 ON 1 = 1
    JOIN ten_numbers AS t3 ON 1 = 1
    JOIN ten_numbers AS t4 ON 1 = 1
)
, splitter AS
(
    -- keep only as many numbers as the widest row has segments
    SELECT *
    FROM generated_numbers
    WHERE gen_num BETWEEN 1 AND (SELECT max(REGEXP_COUNT(segmentids, '\\|') + 1)
                                 FROM test_split)
)
--select * from splitter;
, expanded_input AS
(
    -- cross join each row with the position numbers and pick out each segment
    SELECT
        uid,
        split_part(segmentids, '|', s.gen_num) AS segment
    FROM test_split AS ts
    JOIN splitter AS s ON 1 = 1
    WHERE split_part(segmentids, '|', s.gen_num) <> ''
)
SELECT * FROM expanded_input;
The first two CTE steps (ten_numbers and generated_numbers) are used to generate 10,000 rows of numbers; this is needed because generate_series is not supported in Redshift for this kind of query.
The next step (splitter) keeps only the numbers from 1 up to the maximum number of delimiters + 1 (which is the maximum number of segments in any row).
Finally, we cross join splitter with the input data, take the segment at each position using split_part, and then exclude the blank parts (which occur where a row has fewer than the maximum number of segments).
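As an aside, Redshift has since gained support for recursive CTEs, so the number-generation step could also be written as a recursive counter; a minimal sketch, assuming the same 10,000-row cap as above:

with recursive numbers (gen_num) as (
    select 1 as gen_num
    union all
    select gen_num + 1
    from numbers
    where gen_num < 10000
)
select gen_num from numbers;
-- this could stand in for the ten_numbers/generated_numbers CTEs,
-- with splitter reading from numbers instead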
Solution 2:[2]
Redshift now has the SUPER data type and the split_to_array
function, which is similar to PostgreSQL's string_to_array.
Redshift now also supports unnesting arrays through a syntax similar to a LATERAL JOIN
in PostgreSQL.
Using these techniques, we may write the same transformation in 2022 as
WITH split_up AS (
    SELECT
        uid
        -- the default delimiter is ',', so pass '|' explicitly
        , split_to_array(segmentids, '|') AS segment_array
    FROM test_split
)
SELECT
    su.uid
    , CAST(sid AS VARCHAR) AS segmentid
FROM split_up AS su
JOIN su.segment_array AS sid ON TRUE
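Since split_to_array defaults to a comma delimiter, the pipe must be passed explicitly; a quick sanity check of the behavior:

select split_to_array('4454918|4455158|4455638', '|');
-- ["4454918","4455158","4455638"]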
Solution 3:[3]
You can iterate over the SUPER array returned by split_to_array -- see the "Unnesting and flattening" section of this post. Using the same test_split
table as the previous answer:
WITH seg_array AS
(SELECT uid,
split_to_array(segmentids, '|') segs
FROM test_split)
SELECT uid,
segmentid::int
FROM seg_array a,
a.segs AS segmentid;
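Since the goal is a new table, any of these queries can be wrapped in CREATE TABLE AS; a sketch using this solution's query, with test_split_expanded as an illustrative target name:

create table test_split_expanded as
with seg_array as
  (select uid,
          split_to_array(segmentids, '|') segs
   from test_split)
select uid,
       segmentid::int as segmentid
from seg_array a,
     a.segs as segmentid;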
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 |
Solution 3 | amn34