Error: cannot allocate vector of size X Gb in RStudio
I never had this problem before, but now it comes up constantly for any piece of code I write.
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
I have a dataset of 1482236 observations and 52 variables; 35 of those are factors and 19 are numeric. Some of my factors are huge, with lots of levels. These are the largest (see the sketch right after this list for how I check them):
forename: 114942 levels
surname: 201988 levels
postcode: 793876 levels
partnername: 9164 levels
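This is roughly how I check which columns are eating the memory, using plain base R (object.size() and nlevels()):

# Sketch: memory footprint of every column, and level count of each factor, in DFOld
sapply(DFOld, function(col) format(object.size(col), units = "Mb"))
sapply(Filter(is.factor, DFOld), nlevels)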
I have tried different functions and they all fail for the same reason: Error: cannot allocate vector of size X.
1. Correlation for categorical variables:
library(dplyr)
library(purrr)
library(lsr)   # cramersV() comes from the lsr package (my assumption)

df <- DFOld %>% dplyr::select_if(is.factor)  # create a data frame of just the factors

## function to get the chi-square p-value and Cramer's V for a pair of columns
f <- function(x, y) {
  tbl <- df %>% dplyr::select(all_of(c(x, y))) %>% table()
  chisq_pval <- round(chisq.test(tbl)$p.value, 4)
  cramV <- round(cramersV(tbl), 4)
  data.frame(x, y, chisq_pval, cramV)
}

## create unique combinations of column names
df_comb <- data.frame(t(combn(sort(names(df)), 2)), stringsAsFactors = FALSE)
df_res <- map2_df(df_comb$X1, df_comb$X2, f)
Error: cannot allocate vector of size 7.9 Gb
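I think the contingency tables themselves are the culprit: table() on forename and surname alone needs 114942 * 201988, about 2.3e10 cells, which is roughly 86 Gb even as integer counts, before chisq.test() allocates its expected-counts matrix in doubles. A back-of-the-envelope check I can run first, reusing df and df_comb from above, to spot pairs that cannot possibly fit in RAM:

# Sketch: estimated contingency-table size in Gb for each column pair
# (integer counts, 4 bytes per cell), so oversized pairs can be skipped
pair_gb <- function(x, y) nlevels(df[[x]]) * nlevels(df[[y]]) * 4 / 1024^3
df_comb$est_gb <- mapply(pair_gb, df_comb$X1, df_comb$X2)
head(df_comb[order(-df_comb$est_gb), ])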
2. Imputation with mice:
library(mice)
impute <- mice(df, m = 3, nnet.MaxNWts = 3000, seed = 123, meth = 'cart')
Error: cannot allocate vector of size 100.1 Gb
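One idea I have been trying is to keep the huge factors in the data but tell mice not to use them as predictors, via its predictorMatrix argument. A sketch (big_vars is just my own name for the four problem columns):

library(mice)
big_vars <- c("forename", "surname", "postcode", "partnername")

# Sketch: take the default predictor matrix and zero out the
# high-cardinality columns so no imputation model uses them as predictors
pred <- make.predictorMatrix(df)
pred[, big_vars] <- 0

impute <- mice(df, m = 3, predictorMatrix = pred, seed = 123, meth = 'cart')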
3. XGBoost model (via caret):
library(caret)
library(xgboost)
set.seed(123)
caret.cv.Test <- train(
  HasProduct ~ .,
  data = roseTest[, !(colnames(roseTest) %in% c("id"))],
  method = "xgbTree",
  tuneGrid = tune.grid,
  trControl = train.control
)
Error: cannot allocate vector of size 100.1 Gb
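As far as I can tell, caret's formula interface expands every factor into a dense model.matrix, and with roughly 1.1 million dummy columns coming from those four factors alone that is hopeless. Calling xgboost directly on a sparse one-hot matrix might avoid the blow-up. A sketch, reusing roseTest and HasProduct from above; the booster parameters are placeholders, not tuned values:

library(xgboost)
library(Matrix)
set.seed(123)

# Sketch: one-hot encode the factors into a sparse matrix instead of the
# dense matrix caret would build, then train xgboost on it directly
X <- sparse.model.matrix(HasProduct ~ . - id, data = roseTest)[, -1]
y <- as.numeric(roseTest$HasProduct) - 1   # assumes HasProduct is a two-level factor

dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(params = list(objective = "binary:logistic",
                               max_depth = 6, eta = 0.1),
                 data = dtrain, nrounds = 100)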
I have tried using only half of the dataset at a time to reduce the number of observations, but it makes no difference. I have also tried to raise the memory limit:
> memory.size()
[1] 29882.43
> memory.limit()
[1] 32447
> memory.limit(size=500000)
[1] 5e+05
It did not have any effect.
I have also run garbage collection:
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4570479 244.1 12169265 650.0 13298034 710.2
Vcells 144700496 1104.0 5750893687 43875.9 5988830217 45691.2
Nothing has made any difference; the only thing that has worked has been excluding these variables:
forename: 114942 levels
surname: 201988 levels
postcode: 793876 levels
partnername: 9164 levels
Do you know if there is a way to keep those variables while not killing the RAM?
Please help if you can,
Cheers