Join each term with list of keywords

This is probably a simple problem that you can help me with quickly.

I have a vector with all the terms contained in a list of keywords. Now I want to join each term with all keywords that contain this term. Here's an example:

vec <- c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat", …)
keywords <- c("small boat tour", "a house on the river", "a houseboat", …)

The expected result looks like:

              keywords     terms
       small boat tour     small
       small boat tour      boat
       small boat tour      tour
  a house on the river         a
  a house on the river     house
  a house on the river        on
  a house on the river       the
  a house on the river     river
           a houseboat         a
           a houseboat  houseboat


Solution 1:[1]

You can use expand.grid to get all combinations, wrap the words of vec in word boundaries (\\b), and then filter the combinations with grepl, i.e.

df1 <- expand.grid(vec, keywords)
df1[mapply(grepl, paste0('\\b', df1$Var1, '\\b'), df1$Var2), ]

        Var1                 Var2
1      small      small boat tour
2       boat      small boat tour
5       tour      small boat tour
12     river a house on the river
13     house a house on the river
15         a a house on the river
16        on a house on the river
17       the a house on the river
24         a          a houseboat
27 houseboat          a houseboat
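
To see why the word boundaries matter, here is a minimal sketch comparing a plain pattern with one wrapped in \\b. The variable names (kw, plain, bounded) are made up for this illustration and are not part of the answer above.

```r
# Hypothetical names, for illustration only.
kw <- c("small boat tour", "a house on the river", "a houseboat")

# Without word boundaries, "house" also matches inside "houseboat".
plain <- grepl("house", kw)
plain
# [1] FALSE  TRUE  TRUE

# With \\b on both sides, only the standalone word matches.
bounded <- grepl("\\bhouse\\b", kw)
bounded
# [1] FALSE  TRUE FALSE
```

The same applies to very short terms such as "a", which would otherwise match every keyword that contains the letter.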

Solution 2:[2]

You can do a fuzzy join with fuzzyjoin::fuzzy_inner_join, using stringr::str_detect as the matching function and adding \\b word boundaries to each word in vec.

vec <- data.frame(terms = c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat"))
keywords <- data.frame(keywords = c("small boat tour", "a house on the river", "a houseboat"))

fuzzyjoin::fuzzy_inner_join(keywords, vec, by = c("keywords" = "terms"), 
                            match_fun = \(x, y) stringr::str_detect(x, paste0("\\b", y, "\\b")))

output

               keywords     terms
1       small boat tour     small
2       small boat tour      boat
3       small boat tour      tour
4  a house on the river     river
5  a house on the river     house
6  a house on the river         a
7  a house on the river        on
8  a house on the river       the
9           a houseboat         a
10          a houseboat houseboat

Solution 3:[3]

One way is to use strsplit and intersect.

. <- lapply(strsplit(keywords, " ", TRUE), intersect, vec)
data.frame(keywords = rep(keywords, lengths(.)), terms = unlist(.))
#               keywords     terms
#1       small boat tour     small
#2       small boat tour      boat
#3       small boat tour      tour
#4  a house on the river         a
#5  a house on the river     house
#6  a house on the river        on
#7  a house on the river       the
#8  a house on the river     river
#9           a houseboat         a
#10          a houseboat houseboat

In case vec contains all the terms that occur in keywords, there is no need for a join.

. <- lapply(strsplit(keywords, " ", TRUE), unique)
data.frame(keywords = rep(keywords, lengths(.)), terms = unlist(.))
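
One detail worth keeping in mind with both variants: intersect and unique drop repeated words, so a term that occurs twice in a keyword is listed only once. A small sketch, using a made-up keyword (kw_rep) and term vector (vec_rep) that are not part of the question:

```r
# Hypothetical inputs, for illustration only.
vec_rep <- c("the", "boat", "on", "river")
kw_rep  <- "the boat on the river"

# "the" occurs twice in the keyword but appears once in the result.
terms_rep <- intersect(strsplit(kw_rep, " ", TRUE)[[1]], vec_rep)
terms_rep
# [1] "the"   "boat"  "on"    "river"
```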

Solution 4:[4]

The expected result can be reproduced by

library(data.table)
data.table(keywords)[, .(terms = tstrsplit(keywords, "\\W+")), by = keywords]
                keywords     terms
 1:      small boat tour     small
 2:      small boat tour      boat
 3:      small boat tour      tour
 4: a house on the river         a
 5: a house on the river     house
 6: a house on the river        on
 7: a house on the river       the
 8: a house on the river     river
 9:          a houseboat         a
10:          a houseboat houseboat

This is a rather simple answer which is an interpretation of OP's sentence

I have a vector with all the terms contained in a list of keywords.

The important point is "all the terms": it is assumed that we can simply split the keywords into separate terms.

Note that the regex \\W+ is used to separate the terms in case there is more than one non-word character between the terms, e.g., ", ".
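
The difference between splitting on a single space and splitting on \\W+ can be sketched as follows. The input string x is made up for this illustration and contains a comma and a double space:

```r
# Hypothetical input, for illustration only.
x <- "small, boat  tour"

# Splitting on a literal space keeps the comma and yields an empty string.
by_space <- strsplit(x, " ", fixed = TRUE)[[1]]
by_space
# [1] "small," "boat"   ""       "tour"

# Splitting on runs of non-word characters yields clean terms.
by_nonword <- strsplit(x, "\\W+")[[1]]
by_nonword
# [1] "small" "boat"  "tour"
```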


However, if the vector intentionally does not contain all terms, we need to subset the result, e.g.

vec <- c("small", "boat", "river", "house", "tour", "houseboat")
data.table(keywords)[, .(terms = tstrsplit(keywords, "\\W+")), by = keywords][
  terms %in% vec]
               keywords     terms
1:      small boat tour     small
2:      small boat tour      boat
3:      small boat tour      tour
4: a house on the river     house
5: a house on the river     river
6:          a houseboat houseboat

Solution 5:[5]

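Here is a base R one-liner with strsplit + stack (keywords is assumed to be a data frame with a keywords column, as in Solution 2, because with() needs a list-like object)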

> with(keywords, rev(stack(setNames(strsplit(keywords, " "), keywords))))
                    ind    values
1       small boat tour     small
2       small boat tour      boat
3       small boat tour      tour
4  a house on the river         a
5  a house on the river     house
6  a house on the river        on
7  a house on the river       the
8  a house on the river     river
9           a houseboat         a
10          a houseboat houseboat
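
If keywords is a plain character vector, as in the question, a variant of the same idea without with() could look like the sketch below. The result name res and the renaming of the columns to keywords and terms are additions for this illustration, not part of the answer above.

```r
keywords <- c("small boat tour", "a house on the river", "a houseboat")

# Split each keyword into words, name each piece by its keyword,
# stack into a two-column data frame, and put the keyword column first.
res <- rev(stack(setNames(strsplit(keywords, " ", fixed = TRUE), keywords)))

# stack() names its columns "values" and "ind"; rename them to match the question.
names(res) <- c("keywords", "terms")
res
```

Note that stack returns the keyword column as a factor; wrap it in as.character if a character column is needed.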

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Sotos
Solution 2
Solution 3
Solution 4
Solution 5 ThomasIsCoding