Trying to benchmark dplyr vs data.table
Why does this code not work? How can I benchmark these two expressions?
library(data.table)
library(dplyr)
dt <- as.data.table(mtcars)
(lb <- bench::mark(
dt[, .N, by = .(am, gear) ],
count(dt, am, gear)
))
Error in all.equal.data.table(results$result[[1]], results$result[[i]]) : 'target' and 'current' must both be data.tables
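By default, bench::mark() verifies that every expression returns the same result before reporting timings, which is what fails here. If only the timings matter, the documented check = FALSE argument switches that comparison off (a minimal sketch; note the results are then no longer verified to be equivalent):

```r
library(data.table)
library(dplyr)

dt <- as.data.table(mtcars)

# check = FALSE skips the all.equal() comparison of the results,
# so bench::mark() only measures time and memory.
lb <- bench::mark(
  dt[, .N, by = .(am, gear)],
  count(dt, am, gear),
  check = FALSE
)
```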
Solution 1:[1]
The microbenchmark package would work very well in this situation.
library(data.table)
library(dplyr)
library(microbenchmark)
dt <- as.data.table(mtcars)
microbenchmark::microbenchmark(
dt = dt[, .N, by = .(am, gear) ],
dplyr = count(dt, am, gear)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# dt 366.895 441.917 666.3117 471.690 545.9255 8154.319 100
# dplyr 934.658 1049.023 1649.7788 1144.242 1255.5120 29170.144 100
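Note that microbenchmark() only times the expressions and never compares their results, so equivalence is worth verifying separately. A sketch, using keyby so both results share the same row order (as in Solution 2 below):

```r
library(data.table)
library(dplyr)

dt <- as.data.table(mtcars)

# keyby sorts the groups, matching count()'s default ordering;
# check.attributes = FALSE ignores attribute-level differences
# such as the key and the count column name ("N" vs "n").
res_dt    <- dt[, .N, keyby = .(am, gear)]
res_dplyr <- count(dt, am, gear)
all.equal(res_dt, res_dplyr, check.attributes = FALSE)
```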
Solution 2:[2]
I prefer to understand why bench::mark()'s default check is failing.
In this case, the differences are caused by:
- a different row order (data.table's by = returns groups in order of appearance, while count() seems to order the rows by default)
- different attributes behind the scenes.
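The row-order difference can be seen directly (a small sketch on the same data):

```r
library(data.table)
library(dplyr)

dt <- as.data.table(mtcars)

dt[, .N, by = .(am, gear)]     # groups in order of first appearance
dt[, .N, keyby = .(am, gear)]  # keyby additionally sorts (and keys) the result
count(dt, am, gear)            # sorted by the grouping columns by default
```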
The code below fixes both issues and still checks the results:
library(data.table)
library(dplyr)
dt <- as.data.table(mtcars)
(lb <- bench::mark(
dt[, .N, keyby = .(am, gear)],
count(dt, am, gear),
check = function(x, y) all.equal(x, y, check.attributes = FALSE)
))
# A tibble: 2 × 13
#   expression                         min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
#   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
# 1 dt[, .N, keyby = .(am, gear)] 617.3µs  688.1µs      1333.    33.5KB     4.17   640     2      480ms <data.table [4 × 3]>
# 2 count(dt, am, gear)            9.04ms   10.7ms       93.8    10.7KB     2.09    45     1      480ms <data.table [4 × 3]>
# … with 3 more variables: memory <list>, time <list>, gc <list>
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Zach Schuster |
| Solution 2 | Uwe |