'R how to group part of overlapped values among rows?

I have a data frame that some rows that need to be further grouped by some of the overlapped values among rows

col1, col2
a1,   2;3 
a2,   2
a3,   3;4
a4,   4
a5,   2;4
a6,   5
a7,   5;6
a8,   6;7

The value 2 in col2 occurred in rows 1,2 and 5 can be grouped firstly since value 3 is with 2 in row 1 and value 4 with 2 in row 5 are defined as one group, then value 3 and 4 all can be counted as 1 group as 2;

Same for group 2 that value 5 is overlapped in rows 6 and 7, and value 6 and value 7 are the same case bonded, so they can be grouped into one group as well.

So, the output can be summarized into a new column called group like the following:

col1, col2, group
a1,   2;3,  1
a2,   2,    1
a3,   3;4,  1
a4,   4,    1
a5,   2;4,  1
a6,   5,    2    
a7,   5;6,  2
a8,   6;7,  2

Any idea to use tidyverse/dplyr to that? Great thanks!



Solution 1:[1]

This is basically a graph problem where you need to find connected subgraphs. This is pretty straigrforard with the igraph library, the only tricky part is reshapeing your data properly. So with the sample data

dd <- structure(list(col1 = c("a1", "a2", "a3", "a4", "a5"), col2 = c("2;3", 
"2", "3;4", "4", "2;4")), class = "data.frame", row.names = c(NA, 
-5L))

You can use

library(dplyr)
library(tidyr)
library(igraph)

dd %>% 
  separate_rows(col2) %>% 
  select(col1, col2) %>% 
  graph_from_data_frame(directed = FALSE) %>% 
  clusters() %>% 
  membership() %>% 
  tibble::enframe() %>% 
  right_join(dd, by=c("name"="col1"))

We expand the data into different rows then connect all the edges in an undirected graph. We run a simple community detection and coerce those results back into a data frame that we join back to the original data. This does make the assumption that the values in col1 don’t overlap with the values stored in col2.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1