Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.1k views
in Technique[技术] by (71.8m points)

r - Two-level stacked learner (enseble model) combining elastic net and logistic regression using mlr3

I try to solve a common problem in medicine: the combination of a prediction model with other sources, eg, an expert opinion [sometimes heavily emphysised in medicine], called superdoc predictor in this post.

This could be solved by stack a model with a logistic regression (that enters the expert opinion) as described on page 26 in this paper:

Afshar P, Mohammadi A, Plataniotis KN, Oikonomou A, Benali H. From Handcrafted to Deep-Learning-Based Cancer Radiomics: Challenges and Opportunities. IEEE Signal Process Mag 2019; 36: 132–60. Available here

I've tried this here without considering overfitting (I did not apply out of fold predictions of the lower learner):

Example data

# library
library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)

# get example data
data(PimaIndiansDiabetes, package="mlbench")
data <- PimaIndiansDiabetes

# add the super doctors opinion to the data
set.seed(2323)
data %>% 
  rowwise() %>% 
  mutate(superdoc=case_when(diabetes=="pos" ~ as.numeric(sample(0:2,1)), TRUE~ 0)) -> data

# separate the data in a training set and test set
train.data <- data[1:550,]
test.data <- data[551:768,]

Stacked models without considering out of fold predictions:

# elastic net regression (without the superdoc's opinion)
set.seed(2323)
model <- train(
  diabetes ~., data = train.data %>% select(-superdoc), method = "glmnet",
  trControl = trainControl("repeatedcv",
                           number = 10,
                           repeats=10,
                           classProbs = TRUE,
                           savePredictions = TRUE,
                           summaryFunction = twoClassSummary),
  tuneLength = 10,
  metric="ROC" #ROC metric is in twoClassSummary
)


# extract the coefficients for the best alpha and lambda  
coef(model$finalModel, model$finalModel$lambdaOpt) -> coeffs
tidy(coeffs) %>% tibble() -> coeffs

coef.interc = coeffs %>% filter(row=="(Intercept)") %>% pull(value)
coef.pregnant = coeffs %>% filter(row=="pregnant") %>% pull(value)
coef.glucose = coeffs %>% filter(row=="glucose") %>% pull(value)
coef.pressure = coeffs %>% filter(row=="pressure") %>% pull(value)
coef.mass = coeffs %>% filter(row=="mass") %>% pull(value)
coef.pedigree = coeffs %>% filter(row=="pedigree") %>% pull(value)
coef.age = coeffs %>% filter(row=="age") %>% pull(value)


# combine the model with the superdoc's opinion in a logistic regression model
finalmodel = glm(diabetes ~ superdoc + I(coef.interc + coef.pregnant*pregnant + coef.glucose*glucose + coef.pressure*pressure + coef.mass*mass + coef.pedigree*pedigree + coef.age*age),family=binomial, data=train.data)


# make predictions on the test data
predict(finalmodel,test.data, type="response") -> predictions


# check the AUC of the model in the test data
roc(test.data$diabetes,predictions, ci=TRUE) 
#> Setting levels: control = neg, case = pos
#> Setting direction: controls < cases
#> 
#> Call:
#> roc.default(response = test.data$diabetes, predictor = predictions,     ci = TRUE)
#> 
#> Data: predictions in 145 controls (test.data$diabetes neg) < 73 cases (test.data$diabetes pos).
#> Area under the curve: 0.9345
#> 95% CI: 0.8969-0.9721 (DeLong)

Now I would like to consider out of fold predictions using the mlr3 package family according to this very helpful post: Tuning a stacked learner

#library
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(mlr3filters)
library(mlr3tuning)
library(paradox)
library(glmnet)

# creat elastic net regression
glmnet_lrn =  lrn("classif.cv_glmnet", predict_type = "prob")

# create the learner out-of-bag predictions
glmnet_cv1 = po("learner_cv", glmnet_lrn, id = "glmnet") #I could not find a setting to filter the predictors (ie, not send the superdoc predictor here)

# summarize steps 
level0 = gunion(list(
  glmnet_cv1,
  po("nop", id = "only_superdoc_predictor")))  %>>% #I could not find a setting to send only the superdoc predictor to "union1"
  po("featureunion", id = "union1")


# final logistic regression
log_reg_lrn = lrn("classif.log_reg", predict_type = "prob")

# combine ensemble model
ensemble = level0 %>>% log_reg_lrn
ensemble$plot(html = FALSE)

Created on 2021-03-15 by the reprex package (v1.0.0)

My question (I am rather new to the mlr3 package family)

  1. is the mlr3 package family well suited for the ensemble model I try to build?
  2. if yes, how cold I finalize the ensemle model and make predictions on the test.data
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I think mlr3 / mlr3pipelines is well suited for your task. It appears that what you are missing is mainly the PipeOpSelect / po("select"), which lets you extract features based on their name or other properties and makes use of Selector objects. Your code should probably look something like

library("mlr3")
library("mlr3pipelines")
library("mlr3learners")

# creat elastic net regression
glmnet_lrn = lrn("classif.cv_glmnet", predict_type = "prob")

# create the learner out-of-bag predictions
glmnet_cv1 = po("learner_cv", glmnet_lrn, id = "glmnet")

# PipeOp that drops 'superdoc', i.e. selects all except 'superdoc'
# (ID given to avoid ID clash with other selector)
drop_superdoc = po("select", id = "drop.superdoc",
  selector = selector_invert(selector_name("superdoc")))

# PipeOp that selects 'superdoc' (and drops all other columns)
select_superdoc = po("select", id = "select.superdoc",
  selector = selector_name("superdoc"))

# superdoc along one path, the fitted model along the other
stacking_layer = gunion(list(
  select_superdoc,
  drop_superdoc %>>% glmnet_cv1
)) %>>% po("featureunion", id = "union1")

# final logistic regression
log_reg_lrn = lrn("classif.log_reg", predict_type = "prob")

# combine ensemble model
ensemble = stacking_layer %>>% log_reg_lrn

This is what it looks like:

ensemble$plot(html = FALSE)

The stacking graph.

To train and evaluate the model, we need to create Task objects:

train.task <- TaskClassif$new("train.data", train.data, target = "diabetes")
test.task <- TaskClassif$new("test.data", test.data, target = "diabetes")

The model can now be trained, can then be used for prediction, and the quality of the prediction can be evaluated. This works best if we turn the ensemble into a Learner:

elearner = as_learner(ensemble)
# Train the Learner:
elearner$train(train.task)
# (The training may give a warning because the glm gets the colinear features:
# The positive and the negative probabilities)

Get the prediction on the test set:

prediction = elearner$predict(test.task)
print(prediction)
#> <PredictionClassif> for 218 observations:
#>     row_ids truth response  prob.neg   prob.pos
#>           1   neg      neg 0.9417067 0.05829330
#>           2   neg      neg 0.9546343 0.04536566
#>           3   neg      neg 0.9152019 0.08479810
#> ---                                            
#>         216   neg      neg 0.9147406 0.08525943
#>         217   pos      neg 0.9078216 0.09217836
#>         218   neg      neg 0.9578515 0.04214854

The prediction was made on a Task, so it can be used directly measure performance against ground truth, e.g. using the "classif.auc" Measure:

msr("classif.auc")$score(prediction)
#> [1] 0.9308455

Two notes here:

  1. You have split up your data into training and test set manually. mlr3 gives you the possibility to do resampling automatically, based on a single Task object. This can then go beyond simple train-test splits. Using the data from the question, and doing a 10-fold cross-validation would look like this:
    all.task <- TaskClassif$new("all.data", data, target = "diabetes")
    rr = resample(all.task, elearner, rsmp("cv"))  # will take some time
    rr$aggregate(msr("classif.auc"))
    #> classif.auc 
    #>   0.9366438
    
  2. I have shown how to construct the graph using the po("select") PipeOps, because it is fully general: You can choose to have some feature both in the glmnet_lrn Learner, as well as in the log_reg_lrn directly, by playing around with the selector values. If all you want to do is really to "divert" a feature from a single operation, you can also use the affect_columns to a Selector that selects the column you want. The following creates a (linear) graph that does exactly the same, but is less flexible:
    glmnet_cv1_nosuperdoc = po("learner_cv", glmnet_lrn, id = "glmnet",
      affect_columns = selector_invert(selector_name("superdoc")))
    ensemble2 = glmnet_cv1_nosuperdoc %>>% log_reg_lrn
    e2learner = as_learner(ensemble2)
    # etc.
    

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...