How to convert a list of lists into a data frame when the list structures differ

I applied machine learning algorithms with the caret package (caretList, from caretEnsemble) to predict death in a cohort of patients from multiple variables (e.g. age, gender, smoking status):

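library(caret)
library(caretEnsemble)  # provides caretList()

algorithmList <- c('rf', 'pls', 'parRF', 'nnet', 'xgbTree', 'avNNet',
                   'gbm', 'monmlp', 'nb', 'glm', 'pcaNNet', 'lda', 'C5.0',
                   'svmLinear2', 'knn')

set.seed(100)
# `control` is a trainControl() object defined elsewhere (not shown here)
list_models <- caretList(Death_event ~ ., data = na.exclude(dataset),
                         methodList = algorithmList, metric = "ROC",
                         trControl = control)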

Then I used varImp to extract the variable importance from each model in that list, which yields a list of lists:

importance <- lapply(list_models, varImp)

Output:

Importance structure

> str(importance)
List of 15
 $ rf        :List of 3
  ..$ importance:'data.frame':  11 obs. of  1 variable:
  .. ..$ Overall: num [1:11] 53.8 4.1 100 7.44 0 ...
  ..$ model     : chr "rf"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ pls       :List of 3
  ..$ importance:'data.frame':  11 obs. of  1 variable:
  .. ..$ Overall: num [1:11] 15.91 4.88 100 18.95 0 ...
  ..$ model     : chr "pls"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ parRF     :List of 3
  ..$ importance:'data.frame':  11 obs. of  1 variable:
  .. ..$ Overall: num [1:11] 51.26 3.74 100 7.66 0 ...
  ..$ model     : chr "parRF"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ nnet      :List of 3
  ..$ importance:'data.frame':  11 obs. of  1 variable:
  .. ..$ Overall: num [1:11] 14 41.9 56.4 62.1 31.2 ...
  ..$ model     : chr "nnet"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ xgbTree   :List of 3
  ..$ importance:'data.frame':  11 obs. of  1 variable:
  .. ..$ Overall: num [1:11] 100 48.1 40.2 21.5 21.1 ...
  ..$ model     : chr "xgbTree"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ avNNet    :List of 3
  ..$ importance:'data.frame':  11 obs. of  2 variables:
  .. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
  .. ..$ Death   : num [1:11] 14.37 14.36 100 45.4 9.04 ...
  ..$ model     : chr "ROC curve"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ gbm       :List of 3
  ..$ importance:'data.frame':  11 obs. of  1 variable:
  .. ..$ Overall: num [1:11] 13.543 0.749 100 6.743 0 ...
  ..$ model     : chr "gbm"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ monmlp    :List of 3
  ..$ importance:'data.frame':  11 obs. of  2 variables:
  .. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
  .. ..$ Death   : num [1:11] 14.37 14.36 100 45.4 9.04 ...
  ..$ model     : chr "ROC curve"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ nb        :List of 3
  ..$ importance:'data.frame':  11 obs. of  2 variables:
  .. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
  .. ..$ Death   : num [1:11] 14.37 14.36 100 45.4 9.04 ...
  ..$ model     : chr "ROC curve"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ glm       :List of 3
  ..$ importance:'data.frame':  11 obs. of  1 variable:
  .. ..$ Overall: num [1:11] 13 27.3 100 50.5 11.6 ...
  ..$ model     : chr "glm"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ pcaNNet   :List of 3
  ..$ importance:'data.frame':  11 obs. of  2 variables:
  .. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
  .. ..$ Death   : num [1:11] 14.37 14.36 100 45.4 9.04 ...
  ..$ model     : chr "ROC curve"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ lda       :List of 3
  ..$ importance:'data.frame':  11 obs. of  2 variables:
  .. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
  .. ..$ Death   : num [1:11] 14.37 14.36 100 45.4 9.04 ...
  ..$ model     : chr "ROC curve"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ C5.0      :List of 3
  ..$ importance:'data.frame':  11 obs. of  1 variable:
  .. ..$ Overall: num [1:11] 100 100 100 100 100 ...
  ..$ model     : chr "C5.0"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ svmLinear2:List of 3
  ..$ importance:'data.frame':  11 obs. of  2 variables:
  .. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
  .. ..$ Death   : num [1:11] 14.37 14.36 100 45.4 9.04 ...
  ..$ model     : chr "ROC curve"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"
 $ knn       :List of 3
  ..$ importance:'data.frame':  11 obs. of  2 variables:
  .. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
  .. ..$ Death   : num [1:11] 14.37 14.36 100 45.4 9.04 ...
  ..$ model     : chr "ROC curve"
  ..$ calledFrom: chr "varImp"
  ..- attr(*, "class")= chr "varImp.train"

This is where I hit the first problem.

For about half of the algorithms, the importance is extracted with a different method (the ROC curve method). That does not change the interpretation, but for some algorithms the printed column title is "Importance" whereas for others it is "Overall", even though it is exactly the same information:

$gbm
gbm variable importance

                                        Overall
Age_at_CT                              100.0000
Muscle_HU                               48.6376
history_of_CV_yes_noat_leasT_1CV_event  38.1153
VAT_Area_cm2                            19.3376
Liver_HU_Median                         17.7983
SAT_Area_cm2                            17.3343
L3_SMI_cm2m2                            15.5910
BMI                                     13.5431
Tobacco_yes_noSmoker                     6.7431
SexMale                                  0.7494
T2D_at_CTDiabetes                        0.0000

$monmlp
ROC curve variable importance

                     Importance
Age_at_CT               100.000
Muscle_HU                87.085
history_of_CV_yes_no     61.254
VAT_Area_cm2             49.174
Liver_HU_Median          47.712
Tobacco_yes_no           45.404
BMI                      14.372
Sex                      14.363
T2D_at_CT                 9.035
L3_SMI_cm2m2              7.453
SAT_Area_cm2              0.000

You have probably noticed in the structure that, for the algorithms whose importance was extracted with the ROC method, there are two sub-columns (No_death and Death), but the values are exactly the same in both.
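
Both observations can be checked programmatically rather than by reading the printed output. The following is a minimal sketch that assumes the structure shown by `str(importance)` above; `mock_importance` and its values are made up for illustration so that the snippet runs without caret. With the real data, substitute `importance` for `mock_importance`.

```r
# Mock of the structure shown by str(importance); the numbers are invented.
mock_importance <- list(
  gbm = list(
    importance = data.frame(
      Overall = c(100, 48.6, 0),
      row.names = c("Age_at_CT", "Muscle_HU", "T2D_at_CT")
    ),
    model = "gbm", calledFrom = "varImp"
  ),
  monmlp = list(
    importance = data.frame(
      No_death = c(100, 87.1, 9.0),
      Death    = c(100, 87.1, 9.0),
      row.names = c("Age_at_CT", "Muscle_HU", "T2D_at_CT")
    ),
    model = "ROC curve", calledFrom = "varImp"
  )
)

# Column names of each importance data frame, per model
column_names <- lapply(mock_importance, function(x) names(x$importance))

# TRUE when a model has a single column, or when all of its columns
# hold the same values as the first one
columns_identical <- vapply(mock_importance, function(x) {
  imp <- x$importance
  all(vapply(imp, function(col) isTRUE(all.equal(col, imp[[1]])), logical(1)))
}, logical(1))

print(column_names)
print(columns_identical)
```

If `columns_identical` is `TRUE` for every model, taking either class column (or simply the first column) loses no information.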

What I am trying to create is a simple tibble/data frame in which:

1st column = the name of the algorithm (here the name of the list element, e.g. gbm or monmlp)
2nd column = the name of the variable (e.g. Age_at_CT, Muscle_HU, etc.)
3rd column = the importance value (shown as "Importance" for some algorithms and "Overall" for others)

The only workaround I found was to print the list and copy/paste it into an Excel sheet, algorithm by algorithm (yeah... that sucks).



Solution 1:[1]

You can do the following:

algoNames <- names(importance)
# extract the importance elements (data.frames) of the lists
importanceDfList <- lapply(importance, "[[", "importance")
# variable names are the row names of those data.frames
variableNameList <- lapply(importanceDfList, rownames)
# get the importance values out of the data.frames, respecting the different column names
# if no column matches, we discard the element
# (here you have to think about how to deal with importance data.frames with two columns)
possibleImportanceDataframeNames <- c("Overall", "Importance")
importanceValueList <- lapply(importanceDfList, function(importanceDf) {
  # select the column by name: a position in the candidate vector is not a column position
  matchingImportanceName <- intersect(possibleImportanceDataframeNames, names(importanceDf))
  if (!length(matchingImportanceName)) return(NULL)
  importanceDf[[matchingImportanceName[1]]]
})

# number of importance values per algorithm (0 for discarded elements)
replicationTimes <- sapply(importanceValueList, length)

resultDf <- data.frame(
  Algorithm = rep(algoNames, times = replicationTimes),
  Variable = unlist(variableNameList[replicationTimes > 0], use.names = FALSE),
  Importance = unlist(importanceValueList[replicationTimes > 0], use.names = FALSE),
  stringsAsFactors = FALSE
)
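
As an alternative sketch (not part of the original answer), the same long data frame can be built by always taking the first column of each importance data frame. This relies on the assumption stated in the question that, for ROC-based entries, the class columns hold identical values. `mock_importance` is a made-up stand-in for `importance` so that the snippet runs without caret.

```r
# Mock of the structure shown by str(importance); the numbers are invented.
mock_importance <- list(
  gbm = list(
    importance = data.frame(
      Overall = c(100, 48.6, 0),
      row.names = c("Age_at_CT", "Muscle_HU", "T2D_at_CT")
    ),
    model = "gbm", calledFrom = "varImp"
  ),
  monmlp = list(
    importance = data.frame(
      No_death = c(100, 87.1, 9.0),
      Death    = c(100, 87.1, 9.0),
      row.names = c("Age_at_CT", "Muscle_HU", "T2D_at_CT")
    ),
    model = "ROC curve", calledFrom = "varImp"
  )
)

# One small data frame per algorithm, always using the first column
# (assumption: any further columns duplicate the first one).
long_list <- Map(function(imp, algo) {
  df <- imp$importance
  data.frame(
    Algorithm = algo,
    Variable = rownames(df),
    Importance = df[[1]],
    stringsAsFactors = FALSE
  )
}, mock_importance, names(mock_importance))

# Stack the per-algorithm pieces into a single data frame
result <- do.call(rbind, unname(long_list))
print(result)
```

Selecting by position avoids maintaining a list of candidate column names, at the cost of silently ignoring any extra columns, so it is only appropriate when those columns really are duplicates.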

Solution 2:[2]

I found the solution based on your code!

I just replaced one name in the vector of candidate names with one of the two column names:

algoNames <- names(importance)
# extract the importance elements (data.frames) of the lists
importanceDfList <- lapply(importance, "[[", "importance")
# variable names are the row names of those data.frames
variableNameList <- lapply(importanceDfList, rownames)
# get the importance values out of the data.frames, respecting the different column names
# if no column matches, we discard the element
## HERE: I replaced "Importance" with one of the two column names
## ("Death" is used here; "No_death" holds the same values)
possibleImportanceDataframeNames <- c("Overall", "Death")

importanceValueList <- lapply(importanceDfList, function(importanceDf) {
  # select the column by name: a position in the candidate vector is not a column position
  matchingImportanceName <- intersect(possibleImportanceDataframeNames, names(importanceDf))
  if (!length(matchingImportanceName)) return(NULL)
  importanceDf[[matchingImportanceName[1]]]
})

replicationTimes <- sapply(importanceValueList, length)

resultDf <- data.frame(
  Algorithm = rep(algoNames, times = replicationTimes),
  Variable = unlist(variableNameList[replicationTimes > 0], use.names = FALSE),
  Importance = unlist(importanceValueList[replicationTimes > 0], use.names = FALSE),
  stringsAsFactors = FALSE
)

Again, thanks for your input.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jonas
Solution 2 Max92