Introduction

Background: The data set is simulated. It represents data on COVID+ patients (i.e., patients diagnosed with COVID) who were hospitalized at the New York Presbyterian hospital (Weill Cornell and Lower Manhattan campuses). Some COVID+ patients suffer from acute respiratory distress and have acute difficulty breathing. Such critically ill patients then require artificial respiratory support through an invasive mechanical ventilator, a process known as intubation. Because there was a surge in COVID+ cases in New York City during March 2020, the hospital authorities were concerned that there might be a shortage of mechanical ventilators. To better manage their resources, the authorities wanted to predict, for each patient, the need for intubation within 5 days of hospitalization. The goal is to build a predictive model for this outcome.

Data: There are two data files, “baselines.csv” and “lab and vitals.csv”.

library(janitor)  # provides clean_names()

baselines <- read.csv("baselines.csv")
baselines <- clean_names(baselines)  # standardize column names to snake_case
vitals <- read.csv("lab and vitals.csv")
  • The baselines.csv file contains patient-level information (one row per patient) on baseline clinical measures determined at the time of admission to the hospital. The measures include demographic variables, relevant clinical history, and certain diagnostic tests.

  • The lab and vitals file contains data on patients’ vital signs (e.g., blood pressure, heart rate, SpO2 or blood oxygen level). The vital signs are measured multiple times for each patient during the course of hospitalization, and the number of measurements varies from patient to patient. The vitals data are in long format, i.e., there are multiple rows per patient.

  • A variable dictionary is also available to explain the variables.

This report addresses the following two goals:

  1. Feature engineering: Extract features from the longitudinal vital-sign measurements so that each patient has one value per feature.

  2. Predictive modeling: Use all baseline variables and the features extracted from the labs and vitals data to build a model predicting the binary event of intubation within 5 days of hospitalization.

Feature Engineering

Check for missing values:

apply(vitals, 2, function(x) sum(is.na(x)))
##       name    subject      value time_stamp 
##          0          0      85815          0
apply(vitals, 2, function(x) paste0(round(sum(is.na(x))/nrow(vitals), 3)*100, "%")) 
##       name    subject      value time_stamp 
##       "0%"       "0%"    "12.1%"       "0%"

There are 85815 missing values in the vitals data, all in the value column, making up 12.1% of the observations; they are dropped within each per-patient summary via na.rm = TRUE below.
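It can also be worth checking whether the missingness is concentrated in particular vital signs. A minimal sketch, using the same dplyr verbs as the summaries below:

# Missing values in `value`, broken down by vital sign
vitals %>%
  group_by(name) %>%
  summarise(n.missing = sum(is.na(value)),
            pct.missing = round(100 * mean(is.na(value)), 1),
            .groups = "drop")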

To create features, we summarize several aspects of the distribution of each vital sign (mean, standard deviation, max, min, median). This yields multiple features per vital sign, with one value per patient for each feature; the head of the result is shown below:

vitals %>%
  group_by(subject, name) %>%
  summarise(mean.val = mean(value, na.rm = T), sd.val = sd(value, na.rm = T),
            max.val = max(value, na.rm = T), min.val = min(value, na.rm = T),
            median.val = median(value, na.rm = T),
            .groups = "drop") -> vitals_feature

head(vitals_feature)%>%
  kable() %>%
  kable_styling()
subject   name                     mean.val    sd.val    max.val    min.val   median.val
655528    s_bp_noninvasive (d)     61.16246  1.629412   65.35887   55.86459     61.03558
655528    vs_bp_noninvasive (s)   129.90207  2.689424  134.88518  124.90907    129.49549
655528    vs_hr_hr                 72.80166  2.611181   77.40824   66.68079     73.43356
655528    xp_resp_rate_pt          30.40044  1.675897   33.97693   27.39469     29.94419
655528    xp_resp_spo2             91.44550  3.070444   99.26152   85.85129     90.91147
729545    s_bp_noninvasive (d)     58.77790  1.945223   62.79768   54.33671     58.92815
# Keep only the per-patient means (columns subject, name, mean.val) and
# spread to wide format: one row per patient, one column per vital sign
vital_summary <- spread(vitals_feature[, 1:3], name, mean.val)
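Only the mean features are carried into the wide summary used for modeling below. If the other statistics were also wanted as predictors, a minimal sketch with tidyr::pivot_wider (the newer equivalent of spread(); assumes tidyr >= 1.0) could widen all five at once:

# Sketch: one row per subject, one column per (statistic, vital sign) pair
vitals_wide <- vitals_feature %>%
  pivot_wider(id_cols = subject, names_from = name,
              values_from = c(mean.val, sd.val, max.val, min.val, median.val))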

Now let’s visualize the distribution of the mean and standard deviation values for each vital sign:

ggplot(data = vitals_feature, aes(x = name, y = mean.val)) + 
  geom_boxplot() + 
  coord_flip() + 
  ggtitle("Boxplot of the mean values for vital signs") -> p1

ggplot(data = vitals_feature, aes(x = name, y = sd.val)) + 
  geom_boxplot() + 
  coord_flip() + 
  ggtitle("Boxplot of the standard deviation values for vital signs ") ->p2
grid.arrange(p1, p2, ncol = 1)

par(mfrow = c(2, 3))
for (i in unique(vitals_feature$name)){
  hist(vitals_feature$mean.val[vitals_feature$name == i], 
        main = paste0("Distribution of mean value for\n", i), xlab = i)
}

par(mfrow = c(2, 3))

for (i in unique(vitals_feature$name)){
  hist(vitals_feature$sd.val[vitals_feature$name == i], 
        main = paste0("Distribution of standard deviation for\n", i), xlab = i)
}

The distributions of per-patient mean values are approximately normal for each vital sign, while the distributions of per-patient standard deviations are all somewhat right-skewed. The mean therefore seems the better summary of a patient’s vital signs, given its wide spread and roughly normal distribution.

Predictive modeling

First, merge the baseline data and the vitals summary into one data frame by patient medical record number. After dropping missing values and merging the two datasets, the resulting dataset has 1345 observations with 30 features per subject. Table 1, summarizing each feature, is displayed in the Appendix.

merge.data.frame(baselines, vital_summary, by.x = "mrn", 
                 by.y = "subject")[, -1] -> dat
dat <- clean_names(dat)
# sum(is.na(dat))
dat[, c(2, 4:23, 25, 26)] <- lapply(dat[, c(2, 4:23, 25, 26)], as.factor)
# str(dat)

The goal is to predict the binary event of intubation within 5 days of hospitalization. I randomly divided the data into 10 parts and calculated 10-fold cv errors with the following methods: k-NN, LDA, QDA, GLM, elastic net regression, tree-based methods (a single tree with pruning, random forest, bagged trees, gradient boosted trees and AdaBoost trees) and SVM (both linear and non-linear).

n <- nrow(dat)
p <- ncol(dat) - 1
folds <- 10
id <- rep(1:folds, each = n/folds)
id <- c(id, rep(folds, n - length(id)))
set.seed(135)
id <- sample(id)
np <- np.forward <- np.backward <- numeric(folds)
s12 <- s13 <- numeric(folds)
best.features <- vector("list", folds)

for (i in 1:folds){
  training <- dat[id != i, ]
  
  # Exhaustive best subset selection
  regfit.full <- regsubsets(event ~., training, nvmax = 31)
  np[i] <- which.min(summary(regfit.full)$cp)
  
  # Forward best subset selection
  regfit.forward <- regsubsets(event ~., training, nvmax = 31, 
                               method = "forward")
  np.forward[i] <- which.min(summary(regfit.forward)$cp)

  # Backward best subset selection
  regfit.backward <- regsubsets(event ~., training, nvmax = 31, 
                                method = "backward")
  np.backward[i] <- which.min(summary(regfit.backward)$cp)
 # best.features[i] <- paste0(names(coef(regfit.full, np[i])[-1]), collapse = ", ")
  s12[i] <- sum(names(coef(regfit.full, np[i])[-1])!= 
               names(coef(regfit.forward, np[i])[-1]))
  s13[i] <- sum(names(coef(regfit.full, np[i])[-1])!= 
               names(coef(regfit.backward, np[i])[-1]))
}
ans <- data.frame(np, np.forward, np.backward, s12+s13)
colnames(ans)[4] <- c("different selected features")

Best subset selection was conducted within each training fold, with Cp used as the measure of in-sample error.
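For reference, the Cp reported by regsubsets is Mallows’ Cp; for a model with \(d\) predictors,

\[ C_p = \frac{\mathrm{RSS}_d}{\hat{\sigma}^2} - n + 2d, \]

where \(\hat{\sigma}^2\) is the error variance estimated from the full model (conventions differ slightly in how parameters are counted, and some texts use the rescaled form \(\frac{1}{n}(\mathrm{RSS}_d + 2d\hat{\sigma}^2)\); the minimizer is the same). Smaller Cp indicates a better fit-complexity trade-off, hence the which.min() above. It appears the features selected within each training fold are the same no matter which selection method (exhaustive, forward or backward) was used: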

ans %>%
  kable() %>%
  kable_styling()
 np   np.forward   np.backward   different selected features
 12           12            12                             0
 12           12            12                             0
 14           14            14                             0
 12           12            12                             0
 12           12            12                             0
 12           12            12                             0
 13           13            13                             0
 12           12            12                             0
 13           13            13                             0
 12           12            12                             0
# Store the formula selected in each fold
form <- vector("list", folds)
form[1] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + first_cxr_results_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[4] <- form[6] <- form[10] <- form[1]
form[2] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + any_immunosuppression + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[5] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + first_cxr_results_3_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[8] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + symptoms_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[3] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + any_immunosuppression + symptoms_8_factor + first_cxr_results_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[7] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + any_immunosuppression + first_cxr_results_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[9] <- "event ~ age + bmi + hypoxia_ed_factor + smoke_vape + dm_factor + cancer + first_cxr_results_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
## Create a function for the k-NN misclassification rate; default k = 1.
## Factors are converted to integer codes and all columns standardized
## (note: train and test are each scaled with their own means/SDs).
KNN.error <- function(Train, Test, k = 1, form1 = "") {
  # Use the features in form1 if supplied, otherwise all features but the outcome
  if (form1 != "") ex <- labels(terms(formula(form1))) else 
    ex <- names(Train[, which(names(Train) != "event")])
  
  Train.KNN <- Train[, which(names(Train) %in% ex)]
  Train.KNN[, ] <- lapply(Train.KNN, function(x) {
    if (is.factor(x)) as.integer(x) else x
    })
  Train.KNN[, ] <- apply(Train.KNN, 2, scale)
  
  Test.KNN <- Test[, which(names(Test) %in% ex)]
  Test.KNN[, ] <- lapply(Test.KNN, function(x) {
    if (is.factor(x)) as.integer(x) else x
    })
  Test.KNN[, ] <- apply(Test.KNN, 2, scale)
  
  Test.class <- Test$event
  Train.class <- Train$event
  
  pred.class <- knn(Train.KNN, Test.KNN, Train.class, k = k)
  mean(Test.class != pred.class)
}

k-NN

10-fold cross validation was conducted for k-NN. Within each fold, two parallel k-NN models were fit: one included all features, while the other considered only the features chosen by best subset selection. k was selected from 1 to 15 via cv.

err.knn1 <- err.knn2 <- numeric(folds)
for (i in 1:10){
  training <- dat[id != i, ]
  test <- dat[id == i, ]
  k <- which.min(sapply(1:15, KNN.error, Train = training, Test = test))
  err.knn1[i] <- KNN.error(training, test, k)

  k <- which.min(sapply(1:15, KNN.error, Train = training, Test = test,
                        form1 = form[[i]]))
  err.knn2[i] <- KNN.error(training, test, k, form[[i]])
}

err.knn <- data.frame(folds = 1:folds, 
                      All = err.knn1, Best = err.knn2)
ggplot(err.knn, aes(x = folds)) +
  geom_line(aes(y = All, col = "All features")) + 
  geom_line(aes(y = Best, col = "Best subset selected features"))+
  labs(col = "Festures")+
  ylab("Error rate")  

apply(err.knn[, -1], 2, mean)
##       All      Best 
## 0.3076935 0.2565231

It appears the k-NN model based on best subset selection has a lower error rate (25.7%) than the one using all features (30.8%).

GLM

10-fold cross validation was conducted for the GLM (logistic regression). Within each fold, a GLM was fit with all 30 features, and stepwise selection was carried out to select features.
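By default, step() compares models with AIC,

\[ \mathrm{AIC} = -2\log\hat{L} + 2d, \]

where \(\hat{L}\) is the maximized likelihood and \(d\) the number of fitted parameters; at each step the single addition or deletion that most reduces AIC is accepted, until no move improves it.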

err.glm <- numeric(folds)
for (i in 1:10){
  training <- dat[id != i, ]
  test <- dat[id == i, ]

  # GLM
  fit.glm <- glm(event ~ ., data = training, family = "binomial")
  fit.glm <- step(fit.glm, direction = "both", trace = 0)
  pred.glm <- predict(fit.glm, newdata = test, type = "response")
  pred.glm <- ifelse(pred.glm > .5, "Yes", "No")
  err.glm[i] <- mean(pred.glm!=test$event)
}
mean(err.glm)
## [1] 0.1909857

The 10-fold cv risk for GLM is 19.1%.

LDA & QDA

10-fold cross validation was conducted for LDA and QDA. Within each fold, both LDA and QDA were fit on two parallel feature sets: one with all features, the other with only the features chosen by best subset selection.
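As a reminder of why the two methods can differ: LDA assumes Gaussian classes with a common covariance matrix \(\Sigma\), giving linear discriminant functions

\[ \delta_k(x) = x^\top \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^\top \Sigma^{-1}\mu_k + \log \pi_k, \]

whereas QDA allows a class-specific \(\Sigma_k\), making the discriminant (and hence the boundary) quadratic in \(x\):

\[ \delta_k(x) = -\tfrac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k) - \tfrac{1}{2}\log|\Sigma_k| + \log \pi_k. \]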

err.lda1 <- err.qda1 <-  err.lda2 <- err.qda2 <-numeric(folds)
for (i in 1:10){
  training <- dat[id != i, ]
  test <- dat[id == i, ]

  # LDA - all features
  fit.lda <- lda(event ~ ., data = training)
  pred.lda <- predict(fit.lda, newdata = test)$class
  err.lda1[i] <- mean(pred.lda!=test$event)

  # QDA - all features
  fit.qda <- qda(event ~ ., data = training)
  pred.qda <- predict(fit.qda, newdata = test)$class
  err.qda1[i] <- mean(pred.qda!=test$event)
  
  # LDA - best subset features
  fit.lda <- lda(formula(form[[i]]), data = training)
  pred.lda <- predict(fit.lda, newdata = test)$class
  err.lda2[i] <- mean(pred.lda!=test$event)

  # QDA - best subset features
  fit.qda <- qda(formula(form[[i]]), data = training)
  pred.qda <- predict(fit.qda, newdata = test)$class
  err.qda2[i] <- mean(pred.qda!=test$event)
}

err.lda.qda <- data.frame(folds = 1:folds, 
                          lda.all = err.lda1, lda.best = err.lda2, 
                          qda.all = err.qda1, qda.best = err.qda2)
#a <- cbind(err.lda.qda, err.lda2, err.qda2)
# apply(err.lda.qda[, -1], 2, mean)
ggplot(err.lda.qda, aes(x = folds)) +
  geom_line(aes(y = lda.all, col = "LDA - All features")) + 
  geom_line(aes(y = lda.best, col = "LDA - Best subset selection")) + 
  ylab("Error rate") +
  labs(col = "Festures")+
  ggtitle("LDA") -> p1
ggplot(err.lda.qda, aes(x = folds)) +
  geom_line(aes(y = qda.all, col = "QDA - All features")) + 
  geom_line(aes(y = qda.best, col = "QDA - Best subset selection")) + 
  ylab("Error rate") +
  labs(col = "Festures")+
  ggtitle("QDA") -> p2
grid.arrange(p1, p2, ncol = 1)

It appears the two LDA models perform similarly on the two feature sets. In contrast, QDA has a lower error rate when fit on the best-subset features. Now take a look at the 10-fold cv error rates for each of the four models:

apply(err.lda.qda[, -1], 2, mean)
##   lda.all  lda.best   qda.all  qda.best 
## 0.1924246 0.1880006 0.2483947 0.2139858

It appears both LDA and QDA have lower cv error rates with the reduced feature set. LDA with best subset selection has the smallest cv risk, 18.8%.

Elastic Net Regression

10-fold cross validation was conducted for elastic net regression. Within each fold, alpha was set to 0.5 to combine the benefits of the lasso and ridge penalties, and lambda was selected by an inner 10-fold cv.
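With family = "binomial", glmnet minimizes the penalized negative log-likelihood

\[ -\frac{1}{n}\,\ell(\beta_0, \beta) + \lambda\left[\frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1\right], \]

so alpha = 0.5 gives equal weight to the ridge and lasso penalties, and cv.glmnet picks \(\lambda\).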

err.net <- numeric(folds)
for(i in 1:folds){
  training <- dat[id != i, ]
  test <- dat[id == i, ]
  
  x <- model.matrix(event ~ ., training)
  y <- training$event
  # Inner 10-fold cv to pick lambda at alpha = 0.5
  enet.cv <- cv.glmnet(x, y, family = "binomial", alpha = 0.5, nfolds = 10)
  lambda.min <- enet.cv$lambda.min
  fit.net <- glmnet(x, y, family = "binomial", lambda = lambda.min, alpha = 0.5)
  
  newX <- model.matrix(event ~ ., test)
  # Predict on the probability scale before thresholding at 0.5
  pred.net <- predict(fit.net, newx = newX, s = lambda.min, type = "response")
  pred.net <- ifelse(pred.net > 0.5, "Yes", "No")
  err.net[i] <- mean(test$event != pred.net)
}
plot(enet.cv)

mean(err.net)
## [1] 0.2036454

Looking at the deviance vs. log(lambda) plot from one cv fold: log(lambda) was chosen over a range of roughly -8 to -1, with the corresponding binomial deviance ranging from 0.8 to 1.4.
The cv risk for elastic net regression is 20.3%.

Tree-based Methods

10-fold cross validation was conducted for five tree-based methods. Within each fold, a single tree, a random forest, bagged trees, a gradient boosted tree and an AdaBoost tree were fit with all 30 features.

  • Single tree: cross-validation was used to prune the tree by choosing the number of terminal nodes.

  • Random forest: at most \(m = \sqrt{p}\) predictors were considered at each split; 100 trees were grown.

  • Bagged trees: at most \(m = p\) predictors were considered at each split; 100 trees were grown.

  • Gradient boosted model: the interaction depth was set to 1 (stumps), the most common choice; 100 trees were grown with a shrinkage parameter of 0.01 (see the update rule below).

  • Adaptive boosted model: likewise depth 1, 100 trees, and a shrinkage parameter of 0.01.
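For the two boosted models, each of the \(B = 100\) trees \(\hat{f}^b\) is fit to the current residuals (on the deviance or exponential-loss gradient, respectively) and added with shrinkage \(\lambda = 0.01\):

\[ \hat{f}(x) \leftarrow \hat{f}(x) + \lambda\,\hat{f}^b(x). \]

With interaction depth 1 each tree is a stump, so the boosted fit is additive in the predictors.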

err.prune.tree <- err.rf <- err.bagged <- err.gbm <- err.adaboost <- numeric(folds)

for (i in 1:10){
  training <- dat[id != i, ]
  test <- dat[id == i, ]
  
  # Prune Tree
  fit.tree <- tree(event ~ ., data = training)
  tree.cv <- cv.tree(fit.tree, FUN = prune.tree, K = 10)
  fit.tree <- prune.tree(fit.tree, best = tree.cv$size[which.min(tree.cv$dev)])
  pred.tree <- predict(fit.tree, newdata = test, type = "class")
  err.prune.tree[i] <- mean(pred.tree != test$event)

  # Random forest with m = sqrt(p) predictors per split (argument is ntree)
  fit.RandomForest <- randomForest(event ~ ., training, 
                                   mtry = sqrt(p), importance = T, 
                                   ntree = 100)
  pred.rf <- predict(fit.RandomForest, newdata = test, type = "class")
  err.rf[i] <- mean(pred.rf != test$event)
  
  # Bagged trees: random forest with m = p (all predictors at each split)
  fit.bagged <- randomForest(event ~ ., training, 
                             mtry = p, importance = T, 
                             ntree = 100)
  pred.bagged <- predict(fit.bagged, newdata = test, type = "class")
  err.bagged[i] <- mean(pred.bagged != test$event)
  
  # Gradient boosted model: gbm needs a 0/1 outcome, so recode event in a
  # copy of the data (recoding inside the formula would leave event on the
  # right-hand side of "~ ." as a predictor)
  training.gbm <- training
  training.gbm$event <- ifelse(training.gbm$event == "Yes", 1, 0)
  fit.boosting <- gbm(event ~ ., 
                      data = training.gbm,
                      n.trees = 100, 
                      distribution = "bernoulli",
                      interaction.depth = 1,
                      shrinkage = .01)
  pred.boosting <- predict(fit.boosting, test, n.trees = 100,
                           type = "response")
  pred.boosting <- ifelse(pred.boosting < .5, "No", "Yes")
  err.gbm[i] <- mean(pred.boosting != test$event)
  
  # AdaBoost (same 0/1 recoding as above)
  fit.adaboost <- gbm(event ~ ., 
                      data = training.gbm,
                      n.trees = 100, 
                      distribution = "adaboost",
                      interaction.depth = 1,
                      shrinkage = .01)
  pred.adaboost <- predict(fit.adaboost, test, n.trees = 100,
                           type = "response")
  pred.adaboost <- ifelse(pred.adaboost <.5, "No", "Yes")
  err.adaboost[i] <- mean(pred.adaboost != test$event)
}

err.tree <- data.frame(err.prune.tree, err.rf, err.bagged, err.gbm, err.adaboost)
apply(err.tree, 2, mean)
## err.prune.tree         err.rf     err.bagged        err.gbm   err.adaboost 
##      0.3010308      0.2013798      0.2223290      0.3165468      0.2906904

Comparing the 10-fold cv risks for the five tree-based methods, random forest with m = sqrt(p) and bagged trees with m = p have the lowest risks (both around 20%).
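Since the random forests were fit with importance = T, their variable importance measures can also be inspected; a quick sketch (shown here for the fit from the last cv fold, so the ranking is only indicative):

# Mean decrease in accuracy / Gini for the last fold's random forest
importance(fit.RandomForest)
varImpPlot(fit.RandomForest, main = "Random forest variable importance")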

SVM

10-fold cross validation was conducted for two SVM models. Within each fold, both a linear SVM and a non-linear (radial kernel) SVM were fit with all 30 features. Inner cross validation was carried out to select the cost and, for the radial kernel, gamma. The parameter cost was selected from the range (.01, .1, 1, 2, 4, 8, 16, 32) and gamma from the range (.001, .005, .01, .05, .1, 1).
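For reference, the radial kernel used by the non-linear SVM is

\[ K(x, x') = \exp\left(-\gamma \|x - x'\|^2\right), \]

so larger gamma values give more local, more flexible decision boundaries, while cost controls the penalty paid for margin violations.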

err.linear.svm <- err.nonlinear.svm <- numeric(folds)

for (i in 1:10){
  training <- dat[id != i, ]
  test <- dat[id == i, ]
  # Linear SVM (the linear kernel does not use gamma, so only cost is tuned)
  svm.tune.linear <- tune(svm, event ~ ., data = training, kernel = "linear",
                          ranges = list(cost = c(.01, .1, 1, 2, 4, 8, 16, 32)),
                          tunecontrol = tune.control(sampling = "fix"))
  fit.svm.linear <- svm(event ~ ., data = training,
                        kernel = "linear", probability = T,
                        cost = svm.tune.linear$best.parameters$cost)
  pred.linearsvm <- predict(fit.svm.linear, newdata = test, type = "class")
  err.linear.svm[i] <- mean(pred.linearsvm != test$event)

  # Non-linear SVM
  svm.tune.nonlinear <- tune(svm, event~., data = training, kernel = "radial",
                   ranges = list(cost = c(.01, .1, 1, 2, 4, 8, 16, 32),
                                 gamma = c(.001, .005, .01, .05, .1, 1)),
                   tunecontrol = tune.control(sampling = "fix"))

  fit.svm.nonlinear <- svm(event ~ ., data = training, kernel = "radial",
                           cost = svm.tune.nonlinear$best.parameters$cost,
                           gamma = svm.tune.nonlinear$best.parameters$gamma)
  pred.nonlinearsvm <- predict(fit.svm.nonlinear, newdata = test, type = "class")
  err.nonlinear.svm[i] <- mean(pred.nonlinearsvm != test$event)
} 
err.svm <- data.frame(err.linear.svm, err.nonlinear.svm)
apply(err.svm, 2, mean)
##    err.linear.svm err.nonlinear.svm 
##         0.1946902         0.1908783

It appears the linear and non-linear SVMs perform similarly, with risks both just under 20%.

# **Neural Network**
# err.nn <- numeric(folds)
# for (i in 1:folds){
#   dat1 <- dat[, which(names(dat) !="event")]
#   dat1[,] <- lapply(dat1, function (x) {
#     if(is.factor(x)) as.integer(x) else (x)
#     })
#   dat1$event <- dat$event
#   
#   test <- dat1[id == i, ]
#   training <- dat1[id != i, ]
#   
#   fit.nn <- neuralnet(event ~ ., data = training, hidden = 10,
#             act.fct = "logistic", linear.output = F, threshold = .4)
#   predict(fit.nn, newdata = test)
#   pred.nn <- ifelse(predict(fit.nn, newdata = test)<.5, 0, 1)
#   err.nn[i] <- mean(pred.nn != test$event)
# }

Conclusion

Comparing the 10-fold cv risks for all of the methods above:

ans <- data.frame(fold = 1:10, 
                  knn.all = err.knn1, knn.best = err.knn2, 
                  glm = err.glm, 
                  lda.all = err.lda1, lda.best = err.lda2, 
                  qda.all = err.qda1, qda.best = err.qda2, 
                  ElasticNet = err.net,
                  single.tree = err.prune.tree, random.forest = err.rf, 
                  bagged.tree = err.bagged, gbm = err.gbm, adaboost = err.adaboost,
                  linear.svm = err.linear.svm, nonlinear.svm = err.nonlinear.svm)

apply(ans[, -1], 2, mean) %>%
  as.data.frame() -> ans.tab
colnames(ans.tab) <- "cv.error"
ans.tab %>%
  kable() %>%
  kable_styling() 
                 cv.error
knn.all         0.3076935
knn.best        0.2565231
glm             0.1909857
lda.all         0.1924246
lda.best        0.1880006
qda.all         0.2483947
qda.best        0.2139858
ElasticNet      0.2036454
single.tree     0.3010308
random.forest   0.2013798
bagged.tree     0.2223290
gbm             0.3165468
adaboost        0.2906904
linear.svm      0.1946902
nonlinear.svm   0.1908783

Among all these methods, only the following 5 models have 10-fold cv risks lower than 20%:

ans.tab %>%
  tibble::rownames_to_column("model") %>%  # filter() drops row names; keep them as a column
  filter(cv.error < .2) %>%
  kable() %>%
  kable_styling()
model            cv.error
glm             0.1909857
lda.all         0.1924246
lda.best        0.1880006
linear.svm      0.1946902
nonlinear.svm   0.1908783

It appears GLM, LDA and SVM perform best, with 10-fold cv risks below 20%. It is worth noting that GLM, LDA and linear SVM all produce linear (or nearly linear) classification boundaries, so a linear classifier seems to be the most suitable model for these data.
Thus I would propose Linear Discriminant Analysis to the authorities: not only does it provide the smallest cross-validation error rate, it also performs best in settings where the classification boundary is close to linear.

# Exhaustive best subset selection
regfit.full <- regsubsets(event ~., dat, nvmax = 31)
reg.summary <- summary(regfit.full)
np <- which.min(reg.summary$cp)

# Forward best subset selection
regfit.forward <- regsubsets(event ~., dat, nvmax = 31, method = "forward")
np.forward <- which.min(summary(regfit.forward)$cp)

# Backward best subset selection
regfit.backward <- regsubsets(event ~., dat, nvmax = 31, method = "backward")
np.backward <- which.min(summary(regfit.backward)$cp)

# paste0(names(coef(regfit.full, np)[-1]), collapse = ", ")
plot(1:30, reg.summary$cp, xlab = "number of predictors", ylab = "Cp",
     type = "b", main = "Best subset selection")
points(np, reg.summary$cp[np], col = "red", cex = 2, pch = 20)

Best subset selection was also conducted on the whole dataset to choose the best number of predictors, with Cp used as the measure of in-sample error. Exhaustive, forward, and backward selection all gave the same result: the best subset contains 12 predictors: age, bmi, smoke_vapeYes, dm_factorYes, cancerYes, first_cxr_results_2_factorUnchecked, duration_symptoms, ed_before_order_setYes, s_bp_noninvasive_d, vs_bp_noninvasive_s, vs_hr_hr, xp_resp_rate_pt. In other words, the important baseline predictors are age, BMI, ever smoker/vaper, diabetes mellitus, any cancer, initial chest X-ray findings (choice = Clear), duration of symptoms and admission to the ED before the lab order set change, and the important vital signs are blood pressure, heart rate, and respiratory rate.

form.final <- formula("event ~ age + bmi +smoke_vape +dm_factor +cancer +first_cxr_results_2_factor +duration_symptoms +ed_before_order_set +s_bp_noninvasive_d +vs_bp_noninvasive_s +vs_hr_hr +xp_resp_rate_pt")
fit.final <- lda(form.final, data = dat)
# ROC curve
rocplot <- function(truth, pred, ...){
  predob = prediction(pred, truth)
  perf = performance(predob, "tpr", "fpr")
  plot(perf, colorize=F, ...) 
  title("ROC Plot")
  area = auc(truth, pred)
  area = format(round(area, 3), nsmall = 3)
  text(x = .8, y = .1, labels = paste("AUC = ", area))
  segments(x0 = 0, y0 = 0, x1 = 1, y1 = 1, col = "gray", lty = 2)
}
final.pred <- predict(fit.final, dat)  # predict.lda returns $class and $posterior
rocplot(dat$event, final.pred$posterior[,2])

The ROC plot for the final model is shown above, with an AUC of 0.901, which is quite good. The coefficients, as well as the exponentiated coefficients, for each feature in the final model are presented below:

ans <- data.frame(coef(fit.final))
colnames(ans) <- "coefficient"
rownames(ans) <- c("Age", "BMI", "Smoke Vape(Yes)", 
                   "Diabetes Mellitus(Yes)", "Any Cancer(Yes)",
                   "Initial Chest X-ray Findings (choice=Clear)(Unchecked)", 
                   "Duration of Symptoms",
                   "Admitted to ED before lab order set change(Yes)",
                   "Blood Pressure Noninvasive d",
                   "Blood Pressure Noninvasive s",
                   "Heart Rate",
                   "Respiratory Rate")
ans$exp.coefficient <- exp(ans$coefficient)
ans %>%
  kable() %>%
  kable_styling()
                                                         coefficient   exp.coefficient
Age                                                       -0.0413109         0.9595308
BMI                                                       -0.0884343         0.9153632
Smoke Vape(Yes)                                            0.4422673         1.5562317
Diabetes Mellitus(Yes)                                     0.2095303         1.2330987
Any Cancer(Yes)                                           -0.5455469         0.5795248
Initial Chest X-ray Findings (choice=Clear)(Unchecked)    -0.1609632         0.8513234
Duration of Symptoms                                      -0.0451409         0.9558628
Admitted to ED before lab order set change(Yes)           -0.2024069         0.8167625
Blood Pressure Noninvasive d                              -0.4361270         0.6465356
Blood Pressure Noninvasive s                              -0.3503263         0.7044582
Heart Rate                                                -0.7177642         0.4878417
Respiratory Rate                                          -0.0786959         0.9243210

The result shows that all coefficients except ever smoker/vaper and diabetes mellitus are negative. A negative coefficient (i.e., exp.coefficient < 1) means the factor is negatively associated with the need for intubation, while a positive coefficient (i.e., exp.coefficient > 1) means it is positively associated with the need for intubation. To be specific (a worked log-odds example follows the bullets):

  • For continuous predictors, take “Age” for example: the coefficient for age is -0.041 and its exponential is 0.96, meaning the odds of intubation decrease by a factor of 0.96 for each one-unit increase in age, controlling for the other covariates.

  • For categorical predictors, take “Smoke Vape” for example: the coefficient is 0.442 and its exponential is 1.56, meaning the odds of intubation are 1.56 times higher for a patient who has ever smoked or vaped than for one who has not, controlling for the other covariates.
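Both readings follow from interpreting the discriminant coefficients on a log-odds scale (under the LDA model, the log posterior odds are linear in the predictors):

\[ \log\frac{P(\text{intubation} \mid x)}{1 - P(\text{intubation} \mid x)} = \beta_0 + \sum_j \beta_j x_j, \]

so a one-unit increase in \(x_j\) multiplies the odds by \(e^{\beta_j}\), e.g. \(e^{-0.0413} \approx 0.96\) for age.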

Appendix

Table 1

library(table1)
# Custom renderer: in the empty "p-val" stratum, compute a t-test p-value
# for numeric variables and a chi-squared p-value for categorical ones
rndr <- function(x, name, ...){
  if (length(x) == 0){
    y <- dat1[[name]]
    s <- rep("", length(render.default(x = y, name = name)))
    if (is.numeric(y)){
      p <- t.test(y ~ dat1$event_p)$p.value
    } else{
      p <- chisq.test(table(y, droplevels(dat1$event_p)))$p.value
    }
    s[2] <- sub("<", "&lt;", format.pval(p, digits = 3, eps = 0.001))
    s
  } else{
    render.default(x = x, name = name, ...)
  }
}

rndr.strat <- function(label, n, ...){
  ifelse(n == 0, label, render.strat.default(label, n, ...))
}
dat1 <- dat
colnames(dat1) <- c("age","sex","BMI","hypoxia_ed","smoke_vape",
                    "diabetes","hypertension",
        "pulmonary_disease","CKD","ESRD","CAD","cancer",
        "immunosuppression","Fever","Cough",
        "diarrhea","nausea","myalgias","dyspnea",
        "csr_clear","csr_unilateral infiltrate",
        "csr_bilateral_infiltrates","csr_pleural_effusion","duration",
        "Ed_before_order","event", "blood_pressure_noninvasive_d", 
        "blood pressure noninvasive s","heart rate", 
        "Respiratory Rate", "SpO2")
dat1 <- clean_names(dat1)
formu <- paste0("~ ", paste0(colnames(dat1)[-26], collapse = " + "), "| event_p") 
dat1$event_p <- factor(dat$event, levels = c("No", "Yes", "p-val"), labels = c("No", "Yes", "p-val"))
table1(formula(formu), data = dat1, droplevels = F, render = rndr, render.strat = rndr.strat, overall = F)
                                No (N=700)           Yes (N=645)          p-val
age
  Mean (SD)                     71.7 (15.8)          58.6 (16.4)          <0.001
  Median [Min, Max]             73.5 [-2.08, 114]    59.3 [5.41, 95.1]
sex
  Female                        296 (42.3%)          261 (40.5%)          0.534
  Male                          404 (57.7%)          384 (59.5%)
bmi
  Mean (SD)                     29.5 (6.95)          26.0 (5.44)          <0.001
  Median [Min, Max]             28.5 [15.9, 58.9]    26.0 [9.86, 50.8]
hypoxia_ed
  No                            319 (45.6%)          295 (45.7%)          0.995
  Yes                           381 (54.4%)          350 (54.3%)
smoke_vape
  No                            523 (74.7%)          441 (68.4%)          0.0118
  Yes                           177 (25.3%)          204 (31.6%)
diabetes
  No                            506 (72.3%)          421 (65.3%)          0.00657
  Yes                           194 (27.7%)          224 (34.7%)
hypertension
  No                            310 (44.3%)          287 (44.5%)          0.982
  Yes                           390 (55.7%)          358 (55.5%)
pulmonary_disease
  Checked                       33 (4.7%)            34 (5.3%)            0.731
  Unchecked                     667 (95.3%)          611 (94.7%)
ckd
  Checked                       33 (4.7%)            24 (3.7%)            0.443
  Unchecked                     667 (95.3%)          621 (96.3%)
esrd
  Checked                       36 (5.1%)            44 (6.8%)            0.236
  Unchecked                     664 (94.9%)          601 (93.2%)
cad
  No                            576 (82.3%)          533 (82.6%)          0.923
  Yes                           124 (17.7%)          112 (17.4%)
cancer
  No                            631 (90.1%)          612 (94.9%)          0.00148
  Yes                           69 (9.9%)            33 (5.1%)
immunosuppression
  unknown/No                    684 (97.7%)          621 (96.3%)          0.165
  Yes                           16 (2.3%)            24 (3.7%)
fever
  Checked                       433 (61.9%)          431 (66.8%)          0.0656
  Unchecked                     267 (38.1%)          214 (33.2%)
cough
  Checked                       483 (69.0%)          449 (69.6%)          0.854
  Unchecked                     217 (31.0%)          196 (30.4%)
diarrhea
  Checked                       168 (24.0%)          173 (26.8%)          0.26
  Unchecked                     532 (76.0%)          472 (73.2%)
nausea
  Checked                       144 (20.6%)          126 (19.5%)          0.685
  Unchecked                     556 (79.4%)          519 (80.5%)
myalgias
  Checked                       147 (21.0%)          113 (17.5%)          0.122
  Unchecked                     553 (79.0%)          532 (82.5%)
dyspnea
  Checked                       422 (60.3%)          398 (61.7%)          0.633
  Unchecked                     278 (39.7%)          247 (38.3%)
csr_clear
  Checked                       85 (12.1%)           83 (12.9%)           0.749
  Unchecked                     615 (87.9%)          562 (87.1%)
csr_unilateral_infiltrate
  Checked                       88 (12.6%)           72 (11.2%)           0.476
  Unchecked                     612 (87.4%)          573 (88.8%)
csr_bilateral_infiltrates
  Checked                       497 (71.0%)          484 (75.0%)          0.109
  Unchecked                     203 (29.0%)          161 (25.0%)
csr_pleural_effusion
  Checked                       34 (4.9%)            40 (6.2%)            0.337
  Unchecked                     666 (95.1%)          605 (93.8%)
duration
  Mean (SD)                     9.53 (5.67)          8.21 (4.43)          <0.001
  Median [Min, Max]             9.00 [1.00, 35.0]    9.00 [1.00, 27.0]
ed_before_order
  No                            398 (56.9%)          405 (62.8%)          0.0307
  Yes                           302 (43.1%)          240 (37.2%)
blood_pressure_noninvasive_d
  Mean (SD)                     60.2 (0.977)         59.8 (0.985)         <0.001
  Median [Min, Max]             60.1 [55.9, 63.0]    59.8 [55.7, 63.3]
blood_pressure_noninvasive_s
  Mean (SD)                     130 (0.960)          130 (0.997)          <0.001
  Median [Min, Max]             130 [127, 133]       130 [126, 133]
heart_rate
  Mean (SD)                     75.3 (0.924)         74.6 (0.917)         <0.001
  Median [Min, Max]             75.3 [72.4, 78.1]    74.6 [71.0, 77.3]
respiratory_rate
  Mean (SD)                     30.0 (1.01)          30.0 (0.986)         0.721
  Median [Min, Max]             30.0 [26.9, 33.8]    29.9 [27.3, 33.3]
sp_o2
  Mean (SD)                     92.5 (0.982)         92.5 (0.981)         0.947
  Median [Min, Max]             92.5 [89.2, 95.3]    92.5 [89.2, 95.2]

It appears that at the 0.1 significance level, only 11 predictors are significant: age, bmi, smoke_vape, dm_factor, cancer, symptoms_1_factor (fever), duration_symptoms, ed_before_order_set, s_bp_noninvasive_d, vs_bp_noninvasive_s and vs_hr_hr.