Join each term with list of keywords
Probably a simple problem and you can help me quickly.
I have a vector with all the terms contained in a list of keywords. Now I want to join each term with all keywords that contain this term. Here's an example
vec <- c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat", …)
keywords <- c("small boat tour", "a house on the river", "a houseboat", …)
The expected result looks like:
keywords terms
small boat tour small
small boat tour boat
small boat tour tour
a house on the river a
a house on the river house
a house on the river on
a house on the river the
a house on the river river
a houseboat a
a houseboat houseboat
Solution 1:[1]
You can use expand.grid to get all combinations, wrap the words of vec in word boundaries, match them with grepl, and filter, i.e.
df1 <- expand.grid(vec, keywords)
df1[mapply(grepl, paste0('\\b', df1$Var1, '\\b'), df1$Var2), ]
Var1 Var2
1 small small boat tour
2 boat small boat tour
5 tour small boat tour
12 river a house on the river
13 house a house on the river
15 a a house on the river
16 on a house on the river
17 the a house on the river
24 a a houseboat
27 houseboat a houseboat
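The word boundaries are what keep "house" from matching inside "houseboat". A minimal sketch of the difference, using two keywords from the question (the variable names plain and bounded are only illustrative):

```r
# Two keywords from the question
keywords <- c("a house on the river", "a houseboat")

# Without word boundaries, "house" also matches inside "houseboat"
plain <- grepl("house", keywords)

# With \\b on both sides, only the standalone word matches
bounded <- grepl(paste0("\\b", "house", "\\b"), keywords)

print(plain)    # TRUE TRUE
print(bounded)  # TRUE FALSE
```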
Solution 2:[2]
You can do a fuzzyjoin::fuzzy_join using stringr::str_detect as the matching function, adding \\b word boundaries to each word in vec.
vec <- data.frame(terms = c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat"))
keywords <- data.frame(keywords = c("small boat tour", "a house on the river", "a houseboat"))
fuzzyjoin::fuzzy_inner_join(keywords, vec, by = c("keywords" = "terms"),
match_fun = \(x, y) stringr::str_detect(x, paste0("\\b", y, "\\b")))
output
keywords terms
1 small boat tour small
2 small boat tour boat
3 small boat tour tour
4 a house on the river river
5 a house on the river house
6 a house on the river a
7 a house on the river on
8 a house on the river the
9 a houseboat a
10 a houseboat houseboat
Solution 3:[3]
One way is to use strsplit and intersect.
. <- lapply(strsplit(keywords, " ", TRUE), intersect, vec)
data.frame(keywords = rep(keywords, lengths(.)), terms = unlist(.))
# keywords terms
#1 small boat tour small
#2 small boat tour boat
#3 small boat tour tour
#4 a house on the river a
#5 a house on the river house
#6 a house on the river on
#7 a house on the river the
#8 a house on the river river
#9 a houseboat a
#10 a houseboat houseboat
If vec contains every term that occurs in the keywords, there is no need for a join.
. <- lapply(strsplit(keywords, " ", TRUE), unique)
data.frame(keywords = rep(keywords, lengths(.)), terms = unlist(.))
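The difference between the two variants only shows when vec leaves out some terms. A small sketch, assuming (for illustration only) a vec without "a", "on" and "the"; the names parts, matched and res are not from the answer:

```r
keywords <- c("small boat tour", "a house on the river", "a houseboat")
# Assumption for illustration: vec deliberately omits "a", "on", "the"
vec <- c("small", "boat", "river", "house", "tour", "houseboat")

# Split each keyword on single spaces
parts <- strsplit(keywords, " ", fixed = TRUE)

# Keep only the terms that also occur in vec, in keyword order
matched <- lapply(parts, intersect, vec)

# Repeat each keyword once per matched term
res <- data.frame(keywords = rep(keywords, lengths(matched)),
                  terms = unlist(matched))
print(res)
```

With this vec, the intersect variant returns 6 rows, whereas the unique variant would still return all 10.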
Solution 4:[4]
The expected result can be reproduced by
library(data.table)
data.table(keywords)[, .(terms = tstrsplit(keywords, "\\W+")), by = keywords]
keywords terms
1: small boat tour small
2: small boat tour boat
3: small boat tour tour
4: a house on the river a
5: a house on the river house
6: a house on the river on
7: a house on the river the
8: a house on the river river
9: a houseboat a
10: a houseboat houseboat
This is a rather simple answer, based on one interpretation of the OP's sentence
"I have a vector with all the terms contained in a list of keywords."
The important point is "all the terms". So it is assumed that we can simply split the keywords into separate terms.
Note that the regex \\W+ is used to separate the terms in case there is more than one non-word character between them, e.g., ", ".
However, if the vector intentionally does not contain all terms, we need to subset the result, e.g.
vec <- c("small", "boat", "river", "house", "tour", "houseboat")
data.table(keywords)[, .(terms = tstrsplit(keywords, "\\W+")), by = keywords][
terms %in% vec]
keywords terms
1: small boat tour small
2: small boat tour boat
3: small boat tour tour
4: a house on the river house
5: a house on the river river
6: a houseboat houseboat
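To illustrate why \\W+ is used rather than a single space, here is a small base R sketch. The input string with a comma is made up for the example and does not appear in the question:

```r
# Hypothetical keyword containing punctuation between terms
x <- "a house, on the river"

# Splitting on a literal space leaves the comma attached to "house"
by_space <- strsplit(x, " ", fixed = TRUE)[[1]]

# Splitting on runs of non-word characters removes comma and space together
by_nonword <- strsplit(x, "\\W+")[[1]]

print(by_space)
print(by_nonword)
```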
Solution 5:[5]
Here is a base R one-liner with strsplit + stack
# assumes keywords is a data frame with a column named keywords, as in Solution 2
with(keywords, rev(stack(setNames(strsplit(keywords, " "), keywords))))
ind values
1 small boat tour small
2 small boat tour boat
3 small boat tour tour
4 a house on the river a
5 a house on the river house
6 a house on the river on
7 a house on the river the
8 a house on the river river
9 a houseboat a
10 a houseboat houseboat
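If keywords is a plain character vector, as in the question, the same idea should work without with(). A sketch under that assumption; renaming the columns is an optional extra step, not part of the answer:

```r
keywords <- c("small boat tour", "a house on the river", "a houseboat")

# Name each split result by its keyword, stack into (values, ind),
# then reverse the column order so the keyword comes first
res <- rev(stack(setNames(strsplit(keywords, " "), keywords)))

# Optional: rename the columns to match the expected result
names(res) <- c("keywords", "terms")
print(res)
```

Note that stack returns the keyword column as a factor; as.character() converts it back if needed.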
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Sotos |
Solution 2 | |
Solution 3 | |
Solution 4 | |
Solution 5 | ThomasIsCoding |