'Extracting rows based on more than two partial strings that must all be part of the string
I want to extract rows that must contain two or more partial strings. For example, suppose I have the following data.table
df <- data.table(player = c('A', 'B', 'C', 'D', 'E', 'F', 'G'),
position = c('S Guard', 'P Guaaard', 'P Guard', 'S Forward',
'P Forward', 'Center', 'Center'),
points = c(28, 17, 19, 14, 23, 26, 5))
I know I can extract rows with either of the partial strings Gua
or rd
by using
df[grep("Gua|rd", df$position), ]
What if I want the the string to contain both Gua
and rd
? Tried
df[grep("Gua&rd", df$position), ]
both with and without space but doesn't work.
EDIT WITH MORE INFO
I'm actually dealing with relatively large datasets and I am using strings in one to extract data from another, something like this:
df1 <- data.frame(player = c('A', 'B', 'C', 'D', 'E', 'F', 'G'),
position1 = c('S Guard', 'P Guard', 'P Guard', 'S Forward',
'P Forward', 'Center', 'Center'),
position2 = c('S Grd', 'P Guard', 'P rdGua', 'S Forward',
'P Forward', 'Center', 'Center'))
df2 <- data.table(player = c('A', 'B', 'C', 'D', 'E', 'F', 'G'),
points = c(28, 17, 19, 14, 23, 26, 5))
df2[df1[,3] %like% "Gua|rd",c(1,2)]
The order therefore shouldn't matter i.e Guard and rdGua should return true if using &
in place of |
.
Solution 1:[1]
If you replace |
with .*
you get the desired result. In stead of grep
(or grepl
), you can also use like
from data.table.
.*
indicates that every character is allowed in between "Gua" and "rd".
Two options:
# option 1
df[grep("Gua.*rd", position)]
# option 2
df[like(position, "Gua.*rd")]
both give:
player position points 1: A S Guard 28 2: B P Guaaard 17 3: C P Guard 19
Because you are using the data.table-package, you don't need the df$
part in between the square brackets.
In response to the extra information, you could achieve that with:
df2[df1[[3]] %like% "(Gua.*rd)|(rd.*Gua)"]
which gives:
player points 1: B 17 2: C 19
The grep
/grepl
alternatieve:
df2[grepl("(Gua.*rd)|(rd.*Gua)", df1[[3]])]
Solution 2:[2]
This is probably an overkill but if you don't know the order and want to avoid having multiple grep()
calls you could do:
# Helper function
grepAnd <- function(vect, strings) {
grep(
do.call(sprintf, c(strrep('(?=.*%s)', length(strings)), as.list(strings))),
vect,
perl = TRUE
)
}
inp <- c('Gua', 'rd')
df[grepAnd(position, inp)]
# player position points
# 1: A S Guard 28
# 2: B P Guaaard 17
# 3: C P Guard 19
# Works even though the arguments are reversed
inp <- rev(inp)
df[grepAnd(position, inp)]
# player position points
# 1: A S Guard 28
# 2: B P Guaaard 17
# 3: C P Guard 19
# Can search for more than two patterns
inp <- c(inp, 'P')
df[grepAnd(position, inp)]
# player position points
# 1: B P Guaaard 17
# 2: C P Guard 19
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | sindri_baldur |