Using lmFuncs (Linear Regression) in caret for Recursive Feature Elimination: How do I fix the "same number of samples in x and y" error?

I'm new to R and trying to isolate the best-performing features from a data set of 247 columns (246 variables + 1 outcome) and about 800 rows (each row is one person's data) to create a predictive model. I'm using caret to do RFE with lmFuncs; I need linear regression since the target variable is continuous.

I use the following to split the data into training and test sets (this runs without errors):

inTrain <- createDataPartition(data$targetVar, p = .8, list = F)
train <- data[inTrain, ]
test <- data[-inTrain, ]

The resulting train and test sets look consistent to me: x and y contain the same number of samples, and all columns are the same length.

My control parameters are as follows (this also runs without error):

control = rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 5, verbose = F, returnResamp = "all")

But when I run RFE I get this error message:

Error in rfe.default(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control) : there should be the same number of samples in x and y
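A hedged note on where this message can come from: as far as I can tell, rfe() compares nrow(x) with length(y), and the dput further down shows that the data is a tibble (class tbl_df). Selecting a single column of a tibble with [, 1] returns a one-column tibble rather than a vector, so its length is the number of columns (1), not the number of rows. The sketch below mimics that behaviour in base R with drop = FALSE so it runs without extra packages; the small train data frame and the names y_tibble_like and y_vector are made up for illustration.

```r
# Base-R sketch: a one-column data frame stands in for what train[, 1]
# returns when train is a tibble (tibbles do not drop to a vector).
train <- data.frame(
  TargetVar = c(243, 243, 243, 355, 355),
  QIDS_1 = c(1, 1, 3, 1, 1),
  QIDS_2 = c(3, 3, 2, 3, 3)
)

x <- train[, -1]
y_tibble_like <- train[, 1, drop = FALSE]  # still a data frame, as with a tibble
y_vector <- train[[1]]                     # plain numeric vector

nrow(x)                # 5: number of samples in x
length(y_tibble_like)  # 1: number of columns, not rows
length(y_vector)       # 5: number of samples in y
```

If this is the cause, passing the outcome as a vector (for example train[[1]] or train$TargetVar) should make the two counts agree.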

My code for RFE is as follows, with the target variable in the first column:

rfe_lm_profile <- rfe(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control)

I've looked through various forums, but nothing seems to work. This Google Groups thread suggests using an older version of caret, which I tried, but I got the same x/y error: https://groups.google.com/g/rregrs/c/qwcP0VGn4ag?pli=1

Others suggest converting the target variable to a factor or matrix. This hasn't helped. It triggers the following warning when partitioning the data into test/train, and the same x/y sample error when I try to carry out RFE:

Warning message: In createDataPartition(data$EBI_SUM, p = 0.8, list = F) : Some classes have a single record

Mega thanks in advance :)

P.S. Here's the dput output for the target variable (EBI_SUM, shown here as TargetVar) and a few of the other variables:

data <- structure(list(TargetVar = c(243, 243, 243, 243, 355, 355), Dosing = c(2, 
2, 2, 2, 2, 2), `QIDS_1 ` = c(1, 1, 3, 1, 1, 1), `QIDS_2 ` = c(3, 
3, 2, 3, 3, 3), `QIDS_3 ` = c(1, 2, 1, 1, 1, 2)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))


Solution 1:[1]

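The column names in your data object should not contain spaces (the dput shows "QIDS_1 ", "QIDS_2 " and "QIDS_3 " with a trailing space). Rebuilding it as a plain data.frame with clean names: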

library(caret)

data <- data.frame(
  TargetVar = c(243, 243, 243, 243, 355, 355),
  Dosing = c(2, 2, 2, 2, 2, 2),
  QIDS_1 = c(1, 1, 3, 1, 1, 1),
  QIDS_2 = c(3, 3, 2, 3, 3, 3),
  QIDS_3 = c(1, 2, 1, 1, 1, 2)
)

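# Split into training and test sets (80/20)
inTrain <- createDataPartition(data$TargetVar, p = .8, list = FALSE)
train <- data[inTrain, ]
test <- data[-inTrain, ]

# Repeated cross-validation with linear-model helper functions
control <- rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 5, verbose = FALSE, returnResamp = "all")

# Predictors are every column except the first; the outcome is the first column
rfe_lm_profile <- rfe(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control)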
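If you would rather clean the object you already have than retype it, here is a minimal sketch, assuming your data looks like the dput in the question. It trims the column names with trimws() and converts the tibble to a base data frame with as.data.frame(), so that data[, 1] comes back as a plain vector. I have not run this against the full 247-column data set, so treat it as a starting point.

```r
# The dput() output from the question, including trailing spaces in the names.
data <- structure(
  list(
    TargetVar = c(243, 243, 243, 243, 355, 355),
    Dosing = c(2, 2, 2, 2, 2, 2),
    `QIDS_1 ` = c(1, 1, 3, 1, 1, 1),
    `QIDS_2 ` = c(3, 3, 2, 3, 3, 3),
    `QIDS_3 ` = c(1, 2, 1, 1, 1, 2)
  ),
  row.names = c(NA, -6L),
  class = c("tbl_df", "tbl", "data.frame")
)

names(data) <- trimws(names(data))  # "QIDS_1 " becomes "QIDS_1"
data <- as.data.frame(data)         # drop the tibble classes

x <- data[, -1]  # predictors
y <- data[, 1]   # outcome, now a plain numeric vector

# The two counts that rfe() needs to agree
stopifnot(nrow(x) == length(y))
```

The same two lines (trimws() on the names, then as.data.frame()) can go right after reading the data in, before createDataPartition().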

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
