How to convert a list of lists into a data frame when the sub-lists have different structures
I applied several machine learning algorithms with the caret package (caretList, from caretEnsemble) to predict death in a cohort of patients from multiple variables (e.g. age, gender, smoking status, etc.):
algorithmList <- c('rf', 'pls','parRF','nnet', 'xgbTree','avNNet',
'gbm','monmlp','nb','glm','pcaNNet','lda','C5.0',
'svmLinear2','knn')
set.seed(100)
list_models <- caretList(Death_event~., data=na.exclude(dataset), methodList = algorithmList, metric="ROC", trControl=control)
Then I used varImp to extract the variable importance from each model in that list, which yields a list of lists:
importance <- lapply(list_models, varImp)
Output:
> str(importance)
List of 15
$ rf :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 53.8 4.1 100 7.44 0 ...
..$ model : chr "rf"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ pls :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 15.91 4.88 100 18.95 0 ...
..$ model : chr "pls"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ parRF :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 51.26 3.74 100 7.66 0 ...
..$ model : chr "parRF"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ nnet :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 14 41.9 56.4 62.1 31.2 ...
..$ model : chr "nnet"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ xgbTree :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 100 48.1 40.2 21.5 21.1 ...
..$ model : chr "xgbTree"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ avNNet :List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ gbm :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 13.543 0.749 100 6.743 0 ...
..$ model : chr "gbm"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ monmlp :List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ nb :List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ glm :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 13 27.3 100 50.5 11.6 ...
..$ model : chr "glm"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ pcaNNet :List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ lda :List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ C5.0 :List of 3
..$ importance:'data.frame': 11 obs. of 1 variable:
.. ..$ Overall: num [1:11] 100 100 100 100 100 ...
..$ model : chr "C5.0"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ svmLinear2:List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
$ knn :List of 3
..$ importance:'data.frame': 11 obs. of 2 variables:
.. ..$ No_death: num [1:11] 14.37 14.36 100 45.4 9.04 ...
.. ..$ Death : num [1:11] 14.37 14.36 100 45.4 9.04 ...
..$ model : chr "ROC curve"
..$ calledFrom: chr "varImp"
..- attr(*, "class")= chr "varImp.train"
This is where I run into the first problem.
For about half of the algorithms, the importance is extracted with a different method (the ROC method). That does not change the interpretation, but when printed, the column is titled "Importance" for some algorithms and "Overall" for others, although it is exactly the same information:
$gbm
gbm variable importance
Overall
Age_at_CT 100.0000
Muscle_HU 48.6376
history_of_CV_yes_noat_leasT_1CV_event 38.1153
VAT_Area_cm2 19.3376
Liver_HU_Median 17.7983
SAT_Area_cm2 17.3343
L3_SMI_cm2m2 15.5910
BMI 13.5431
Tobacco_yes_noSmoker 6.7431
SexMale 0.7494
T2D_at_CTDiabetes 0.0000
$monmlp
ROC curve variable importance
Importance
Age_at_CT 100.000
Muscle_HU 87.085
history_of_CV_yes_no 61.254
VAT_Area_cm2 49.174
Liver_HU_Median 47.712
Tobacco_yes_no 45.404
BMI 14.372
Sex 14.363
T2D_at_CT 9.035
L3_SMI_cm2m2 7.453
SAT_Area_cm2 0.000
You have probably noticed in the structure that for the algorithms whose importance was extracted with the ROC method, there are two columns (No_death and Death), but the values are exactly the same in both.
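One way to confirm this is to look at the stored column names and compare the two columns directly. In the str() output above, the stored columns are named No_death and Death; the "Importance" heading only appears in the printed output. The sketch below uses a small hand-built list that mimics that structure. The object `importance_demo` and its values are made up for illustration and are not real varImp output.

```r
# Hypothetical stand-in for two elements of the `importance` list
importance_demo <- list(
  gbm = list(
    importance = data.frame(Overall = c(100, 48.6, 0),
                            row.names = c("Age_at_CT", "Muscle_HU", "T2D_at_CT")),
    model = "gbm", calledFrom = "varImp"
  ),
  monmlp = list(
    importance = data.frame(No_death = c(100, 87.1, 9.0),
                            Death = c(100, 87.1, 9.0),
                            row.names = c("Age_at_CT", "Muscle_HU", "T2D_at_CT")),
    model = "ROC curve", calledFrom = "varImp"
  )
)

# Column names actually stored in each importance data.frame
column_names <- lapply(importance_demo, function(x) names(x$importance))

# For two-column data.frames, check that both columns hold the same values
# (NA means there is only one column, so there is nothing to compare)
same_columns <- sapply(importance_demo, function(x) {
  imp <- x$importance
  if (ncol(imp) < 2) return(NA)
  isTRUE(all.equal(imp[[1]], imp[[2]]))
})

print(column_names)
print(same_columns)
```

With the real object, the same two calls can be run on `importance` instead of `importance_demo`.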
What I am trying to create is a simple tibble/data frame in which:
1st column = the name of the algorithm (here the name of the list element, e.g. gbm or monmlp), 2nd column = the name of the variable (e.g. Age_at_CT, Muscle_HU, etc.) and 3rd column = the importance value (titled "Importance" for some algorithms and "Overall" for others)
The only workaround I found was to print the list and copy and paste it into an Excel sheet, algorithm by algorithm (yeah... that sucks).
Solution 1:[1]
You can do the following:
algoNames <- names(importance)
# extract the importance elements (data.frames) of the lists
importanceDfList <- lapply(importance, "[[", "importance")
# variable names are the rownames of those data.frames
variableNameList <- lapply(importanceDfList, rownames)
# get the importance values out of the data.frames, respecting the different column names
# if no column matches, we will discard the element
# (here you have to think about how to deal with importance data.frames with two columns)
possibleImportanceDataframeNames <- c("Overall", "Importance")
importanceValueList <- lapply(importanceDfList, function(importanceDf) {
  # candidate column names that actually exist in this data.frame
  matchingImportanceName <- intersect(possibleImportanceDataframeNames, names(importanceDf))
  if (!length(matchingImportanceName)) return(NULL)
  # select by column name (not by position) and use the first match
  importanceDf[[matchingImportanceName[1]]]
})
replicationTimes <- sapply(importanceValueList, length)
resultDf <- data.frame(
  Algorithm = rep(algoNames, times = replicationTimes),
  Variable = unlist(variableNameList[replicationTimes > 0]),
  Importance = unlist(importanceValueList[replicationTimes > 0]),
  stringsAsFactors = FALSE
)
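As a more compact variant of the same idea, one long data.frame can be built per algorithm and the pieces stacked with rbind. This is only a sketch under one assumption: the first column of each importance data.frame is taken, which is reasonable only if the columns of the two-column data.frames are identical, as the question reports. The object `importance_demo` is a made-up stand-in for the real `importance` list.

```r
# Hypothetical stand-in for two elements of the `importance` list
importance_demo <- list(
  gbm = list(
    importance = data.frame(Overall = c(100, 48.6, 0),
                            row.names = c("Age_at_CT", "Muscle_HU", "T2D_at_CT"))
  ),
  monmlp = list(
    importance = data.frame(No_death = c(100, 87.1, 9.0),
                            Death = c(100, 87.1, 9.0),
                            row.names = c("Age_at_CT", "Muscle_HU", "T2D_at_CT"))
  )
)

# Build one long data.frame per algorithm
long_list <- Map(function(algo, x) {
  imp <- x$importance
  data.frame(
    Algorithm = algo,
    Variable = rownames(imp),
    Importance = imp[[1]],  # assumption: the first column is representative
    stringsAsFactors = FALSE
  )
}, names(importance_demo), importance_demo)

# Stack the pieces and drop the automatically generated row names
result_demo <- do.call(rbind, long_list)
rownames(result_demo) <- NULL
print(result_demo)
```

Taking the first column avoids having to know the column names in advance, at the cost of relying on the assumption stated above.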
Solution 2:[2]
I found the solution based on your code!
I just replaced one entry of the vector of possible names with one of the two column names:
algoNames <- names(importance)
# extract the importance elements (data.frames) of the lists
importanceDfList <- lapply(importance, "[[", "importance")
# variable names are the rownames of those data.frames
variableNameList <- lapply(importanceDfList, rownames)
# get the importance values out of the data.frames, respecting the different column names
# if no column matches, we will discard the element
possibleImportanceDataframeNames <- c("Overall", "Death") ## HERE: I replaced "Importance" with one of the two column names
importanceValueList <- lapply(importanceDfList, function(importanceDf) {
  # candidate column names that actually exist in this data.frame
  matchingImportanceName <- intersect(possibleImportanceDataframeNames, names(importanceDf))
  if (!length(matchingImportanceName)) return(NULL)
  # select by column name (not by position) and use the first match
  importanceDf[[matchingImportanceName[1]]]
})
replicationTimes <- sapply(importanceValueList, length)
resultDf <- data.frame(
  Algorithm = rep(algoNames, times = replicationTimes),
  Variable = unlist(variableNameList[replicationTimes > 0]),
  Importance = unlist(importanceValueList[replicationTimes > 0]),
  stringsAsFactors = FALSE
)
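Because this code discards any element whose column names do not match, it can be worth checking afterwards that no algorithm was silently dropped. The sketch below shows such a check on made-up inputs: `algo_names_demo` and `result_demo` are hypothetical stand-ins for `names(importance)` and `resultDf`.

```r
# Hypothetical inputs: the algorithm names from the list, and a long result
# in which one algorithm ("monmlp") is missing
algo_names_demo <- c("gbm", "monmlp", "glm")
result_demo <- data.frame(
  Algorithm = rep(c("gbm", "glm"), each = 3),
  Variable = rep(c("Age_at_CT", "Muscle_HU", "T2D_at_CT"), times = 2),
  Importance = c(100, 48.6, 0, 100, 50.5, 11.6),
  stringsAsFactors = FALSE
)

# Algorithms present in the list but absent from the result
missing_algos <- setdiff(algo_names_demo, unique(result_demo$Algorithm))

# Number of rows contributed by each algorithm
rows_per_algo <- table(result_demo$Algorithm)

print(missing_algos)
print(rows_per_algo)
```

On the real objects, the equivalent call would be `setdiff(names(importance), unique(resultDf$Algorithm))`, which should return an empty character vector when every algorithm made it into the result.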
Again thanks for your input
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Jonas |
| Solution 2 | Max92 |