Using dplyr to conditionally replace values in a column
I have an example data set with a column that reads somewhat like this:
Candy
Sanitizer
Candy
Water
Cake
Candy
Ice Cream
Gum
Candy
Coffee
What I'd like to do is recode it into a factor with just two levels, "Candy" and "Non-Candy". I can do this with Python/Pandas, but can't seem to figure out a dplyr-based solution. Thank you!
Solution 1:[1]
Assuming your data frame is dat and your column is var:
dat = dat %>% mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))
Solution 2:[2]
With dplyr (replace() itself is base R):
dat %>%
  mutate(var = replace(var, var != "Candy", "Non-Candy"))
This is significantly faster than the ifelse approaches.
Code to create the initial data frame (tibble() is used here because as_data_frame() is deprecated):
library(dplyr)
dat <- tibble(var = c("Candy","Sanitizer","Candy","Water","Cake","Candy","Ice Cream","Gum","Candy","Coffee"))
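One caveat worth checking, as a minimal sketch: replace() assigns the new label directly, so it works on a character column, but on a factor column that lacks the new level R inserts NA with a warning. The data frame name dat_fct below is made up for illustration, and converting with as.character() first is just one possible workaround, not necessarily what the answer's author had in mind.

```r
library(dplyr)

# Hypothetical example data: values from the question, stored as a factor.
dat_fct <- data.frame(
  var = factor(c("Candy", "Sanitizer", "Candy", "Water", "Cake"))
)

# Convert to character first so that "Non-Candy" can be assigned,
# then turn the result back into a two-level factor.
res <- dat_fct %>%
  mutate(
    var = as.character(var),
    var = replace(var, var != "Candy", "Non-Candy"),
    var = factor(var, levels = c("Candy", "Non-Candy"))
  )

levels(res$var)
```

If the column is already character, the first conversion step is harmless and can be dropped.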
Solution 3:[3]
Another solution with dplyr, using case_when:
dat %>%
  mutate(var = case_when(var == 'Candy' ~ 'Candy',
                         TRUE ~ 'Non-Candy'))
The syntax for case_when is condition ~ replacement value. See ?case_when for the documentation.
This is probably less efficient than the solution using replace, but it has the advantage that multiple replacements can be performed in a single command while remaining readable, e.g. to produce three levels:
dat %>%
  mutate(var = case_when(var == 'Candy' ~ 'Candy',
                         var == 'Water' ~ 'Water',
                         TRUE ~ 'Neither-Water-Nor-Candy'))
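Since the question asks for a factor, note that case_when with string outputs returns a character vector. A minimal sketch of converting the result explicitly, assuming a small made-up data frame and that the level order "Candy", "Non-Candy" is the one wanted:

```r
library(dplyr)

# Hypothetical example data based on the question.
dat <- data.frame(
  var = c("Candy", "Sanitizer", "Candy", "Water", "Cake"),
  stringsAsFactors = FALSE
)

res <- dat %>%
  mutate(
    var = case_when(var == "Candy" ~ "Candy",
                    TRUE ~ "Non-Candy"),
    # case_when() produced character values; convert them to a factor
    var = factor(var, levels = c("Candy", "Non-Candy"))
  )

table(res$var)
```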
Solution 4:[4]
No need for dplyr. Assuming var is already stored as a factor:
non_c <- setdiff(levels(dat$var), "Candy")
levels(dat$var) <- list(Candy = "Candy", "Non-Candy" = non_c)
See ?levels.
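To make the effect of the list assignment concrete, here is a small worked sketch in base R. The six-row data frame is made up for illustration; only the two levels() lines come from the answer.

```r
# Hypothetical example data, stored as a factor as the answer assumes.
dat <- data.frame(
  var = factor(c("Candy", "Sanitizer", "Candy", "Water", "Cake", "Gum"))
)

# All existing levels except "Candy".
non_c <- setdiff(levels(dat$var), "Candy")

# Assigning a named list to levels() merges the listed old levels
# under each new name, in the order the names appear in the list.
levels(dat$var) <- list(Candy = "Candy", "Non-Candy" = non_c)

levels(dat$var)
as.character(dat$var)
```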
This is much more efficient than the ifelse approach, which is bound to be slow:
library(dplyr)
library(microbenchmark)
set.seed(01239)
# one million values resampled from the (factor) column
smp <- data.frame(sample(dat$var, 1e6, TRUE))
names(smp) <- "var"
times <-
  replicate(50, {
    # levels<- approach
    cop <- smp
    s <- get_nanotime()
    levs <- setdiff(levels(cop$var), "Candy")
    levels(cop$var) <- list(Candy = "Candy", "Non-Candy" = levs)
    d1 <- get_nanotime() - s
    # dplyr + ifelse approach
    cop <- smp
    s <- get_nanotime()
    cop = cop %>%
      mutate(candy.flag = factor(ifelse(var == "Candy",
                                        "Candy", "Non-Candy")))
    d2 <- get_nanotime() - s
    # direct factor() approach
    cop <- smp
    s <- get_nanotime()
    cop$var <-
      factor(cop$var == "Candy", labels = c("Non-Candy", "Candy"))
    d3 <- get_nanotime() - s
    c(levels = d1, dplyr = d2, direct = d3)
  })
# median time of each approach relative to the levels approach
(x <- apply(times, 1, median))[2:3]/x[1]
#    dplyr   direct
# 8.894303 4.962791
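That is, the levels approach is roughly 9 times faster than the dplyr ifelse approach, and about 5 times faster than calling factor() directly.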
Solution 5:[5]
I didn't benchmark this, but at least in some cases with more than one condition, a combination of mutate and a list seems to provide an easy solution:
# assuming that all sweet things fall in one category
dat <- data.frame(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake", "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))
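conditions <- list("Candy" = TRUE, "Sanitizer" = FALSE, "Water" = FALSE,
                   "Cake" = TRUE, "Ice Cream" = TRUE, "Gum" = TRUE, "Coffee" = FALSE)
# index the list by name (as.character guards against factor columns) and
# flatten the result so that sweet is a plain logical column, not a list column
dat %>% mutate(sweet = unlist(conditions[as.character(var)], use.names = FALSE))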
Solution 6:[6]
When you only need two values, a simple ifelse() is prettiest, I think.
Furthermore, nested ifelse() calls can reproduce the case_when solution proposed by PhJ (I do like its readability, though)!
dat %>%
  mutate(
    var = ifelse(var == "Candy", "Candy", "Non-Candy")
  )
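As a sketch of the nested form mentioned above, the following reproduces the three-level recoding from the case_when answer with two ifelse() calls. The example data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical example data based on the question.
dat <- data.frame(
  var = c("Candy", "Sanitizer", "Water", "Cake", "Candy"),
  stringsAsFactors = FALSE
)

res <- dat %>%
  mutate(
    # the inner ifelse() handles every value that is not "Candy"
    var = ifelse(var == "Candy", "Candy",
                 ifelse(var == "Water", "Water", "Neither-Water-Nor-Candy"))
  )

res$var
```

Each additional category adds another level of nesting, which is where case_when tends to stay easier to read.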
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | PhJ |
Solution 4 | Community |
Solution 5 | ulrich |
Solution 6 | ZKA |