Using setDT inside a function

I'm writing a function that, among other things, coerces the input into a data.table.

library(data.table)
df <- data.frame(id = 1:10)
f <- function(df){setDT(df)}
f(df)
df[, temp := 1]

However, the last command outputs the following warning:

Warning message: In [.data.table(df, , :=(temp, 1)) : Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.

I'm using v1.9.3 of data.table and R 3.1.1. Does this mean df is copied at some point? How can I avoid this warning?

Edit: setDT actually uses non-standard evaluation (NSE), so this seems to work:

df1 <- data.frame(id = 1:10)
f <- function(df){eval(substitute(setDT(df)),parent.frame())}
f(df1)
df1[, temp := 1]

It seems I can also do other things with df inside the function f, such as:

df1 <- data.frame(id = 1:10)
f <- function(df){
      eval(substitute(setDT(df)),parent.frame())
      df[, temp := 1]
      }
f(df1)

Is this the right way to do it?



Solution 1:[1]

Great question! The warning message should say: ... and fixed by taking a shallow copy of the whole table .... Will fix this.

setDT does two things:

  • set the class to data.table from data.frame/list
  • use alloc.col to over-allocate columns (so that := can be used directly)

The second step requires a shallow copy if the input is not already a data.table. This is why setDT assigns the result back to the symbol in its environment (setDT's parent frame). But here the parent frame of setDT is your function f(). So the setDT(df) call inside your function went through smoothly, but the df that resides in the global environment only has its class changed, not the over-allocation (the shallow copy severed the link).

And in the next step, := detects that and shallow copies once again to over-allocate.
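To see the over-allocation described above, here is a minimal sketch that compares length() with truelength() after a top-level setDT call. The one-line wrapper f and the second data frame df2 are only illustrative, and the exact numbers reported may differ between data.table versions, so no outputs are shown.

```r
library(data.table)

# Top-level call: setDT can rebind `df` in the frame that owns it,
# so the symbol ends up pointing at the over-allocated table.
df <- data.frame(id = 1:10)
setDT(df)

is.data.table(df)   # class has been set to data.table
length(df)          # number of columns in use
truelength(df)      # number of column slots allocated

# With spare column slots available, := can add a column by reference.
df[, temp := 1]
names(df)

# Illustrative wrapper (hypothetical): here setDT rebinds the symbol
# inside f's frame, not the caller's. Per the explanation above, the
# caller's object is expected to keep only the class change.
f <- function(x) setDT(x)
df2 <- data.frame(id = 1:10)
f(df2)
truelength(df2)     # compare with truelength(df) above
```

Comparing the two truelength() values is a quick way to check whether a given object actually received the extra column slots.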

The idea so far has been to use setDT to convert objects to data.tables before passing them to a function. But I'd like these cases to be resolved (will take a look).
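A minimal sketch of that workflow, assuming the function is free to require a data.table as input. The helper add_temp and the column temp2 are hypothetical names used only for illustration.

```r
library(data.table)

# Hypothetical helper: expects a data.table and modifies it by reference.
add_temp <- function(dt) {
  stopifnot(is.data.table(dt))
  dt[, temp := 1]
  invisible(dt)
}

df <- data.frame(id = 1:10)
setDT(df)         # convert in the frame that owns `df`
add_temp(df)      # the function only ever receives a data.table
df[, temp2 := 2]  # further by-reference updates on the same table
names(df)
```

If the conversion has to happen inside the function, another option is to return the converted table and reassign it in the caller, e.g. df <- f(df), so the caller's symbol points at the over-allocated object.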

Thanks a bunch!

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Axeman