Introduction

Background: The data set is simulated. It represents data on COVID+ patients (i.e., patients diagnosed with COVID) who were hospitalized at the New York Presbyterian hospital (Weill Cornell and Lower Manhattan campuses). Some COVID+ patients suffer from acute respiratory distress and have acute difficulty breathing. Such critically ill patients then require artificial respiratory support through an invasive mechanical ventilator, a process known as intubation. Because there was a surge in COVID+ cases in New York City during March 2020, the hospital authorities were concerned that there might be a shortage of mechanical ventilators. To better manage their resources, the authorities wanted to predict, for each patient, the need for intubation within 5 days of hospitalization. The goal is to build a predictive model for this outcome.

Data: There are two data files, “baselines.csv” and “lab and vitals.csv”.

library(janitor)  # provides clean_names()

baselines <- read.csv("baselines.csv")
baselines <- clean_names(baselines)  # standardize column names to snake_case
vitals <- read.csv("lab and vitals.csv")
  • The baselines.csv file contains patient-level information (one row per patient) on baseline clinical measures determined at the time of admission to the hospital. The measures include demographic variables, relevant clinical history, and certain diagnostic tests.

  • The lab and vitals file contains data on patients’ vital signs (e.g., blood pressure, heart rate, SpO2 or blood oxygen level). The vital signs are measured multiple times for each patient during the course of hospitalization, and the number of measurements varies from patient to patient. The vitals data are in long format, i.e., there are multiple rows per patient.

  • A variable dictionary is also available to explain the variables.

This report addresses the following two goals:

  1. Feature engineering: Extract features from the longitudinal vital-sign measurements so that each patient has one value per feature.

  2. Predictive modeling: Use all baseline variables and the features extracted from the labs and vitals data to build a model predicting the binary event of intubation within 5 days of hospitalization.

Feature Engineering

Check for missing values:

apply(vitals, 2, function(x) sum(is.na(x)))
##       name    subject      value time_stamp 
##          0          0      85815          0
apply(vitals, 2, function(x) paste0(round(sum(is.na(x))/nrow(vitals), 3)*100, "%")) 
##       name    subject      value time_stamp 
##       "0%"       "0%"    "12.1%"       "0%"

There are 85815 missing values in the vitals data, all in the value column, making up 12.1% of the observations; they are dropped within each per-patient summary via na.rm = TRUE below.
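It can also be worth checking whether the missingness is concentrated in particular vital signs. A minimal sketch, using the same dplyr verbs as the summaries below:

# Missing values in `value`, broken down by vital sign
vitals %>%
  group_by(name) %>%
  summarise(n.missing = sum(is.na(value)),
            pct.missing = round(100 * mean(is.na(value)), 1),
            .groups = "drop")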

To create features, we summarize several aspects of the distribution of each vital sign (mean, standard deviation, max, min, median). This yields multiple features per vital sign, with one value per patient for each feature; the head of the result is shown below:

vitals %>%
  group_by(subject, name) %>%
  summarise(mean.val = mean(value, na.rm = T), sd.val = sd(value, na.rm = T),
            max.val = max(value, na.rm = T), min.val = min(value, na.rm = T),
            median.val = median(value, na.rm = T),
            .groups = "drop") -> vitals_feature

head(vitals_feature)%>%
  kable() %>%
  kable_styling()
subject   name                     mean.val    sd.val    max.val    min.val   median.val
655528    s_bp_noninvasive (d)     61.16246  1.629412   65.35887   55.86459     61.03558
655528    vs_bp_noninvasive (s)   129.90207  2.689424  134.88518  124.90907    129.49549
655528    vs_hr_hr                 72.80166  2.611181   77.40824   66.68079     73.43356
655528    xp_resp_rate_pt          30.40044  1.675897   33.97693   27.39469     29.94419
655528    xp_resp_spo2             91.44550  3.070444   99.26152   85.85129     90.91147
729545    s_bp_noninvasive (d)     58.77790  1.945223   62.79768   54.33671     58.92815
# Keep only the per-patient means (columns subject, name, mean.val) and
# spread to wide format: one row per patient, one column per vital sign
vital_summary <- spread(vitals_feature[, 1:3], name, mean.val)
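Only the mean features are carried into the wide summary used for modeling below. If the other statistics were also wanted as predictors, a minimal sketch with tidyr::pivot_wider (the newer equivalent of spread(); assumes tidyr >= 1.0) could widen all five at once:

# Sketch: one row per subject, one column per (statistic, vital sign) pair
vitals_wide <- vitals_feature %>%
  pivot_wider(id_cols = subject, names_from = name,
              values_from = c(mean.val, sd.val, max.val, min.val, median.val))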

Now let’s visualize the distribution of the mean and standard deviation values for each vital sign:

ggplot(data = vitals_feature, aes(x = name, y = mean.val)) + 
  geom_boxplot() + 
  coord_flip() + 
  ggtitle("Boxplot of the mean values for vital signs") -> p1

ggplot(data = vitals_feature, aes(x = name, y = sd.val)) + 
  geom_boxplot() + 
  coord_flip() + 
  ggtitle("Boxplot of the standard deviation values for vital signs ") ->p2
grid.arrange(p1, p2, ncol = 1)

par(mfrow = c(2, 3))
for (i in unique(vitals_feature$name)){
  hist(vitals_feature$mean.val[vitals_feature$name == i], 
        main = paste0("Distribution of mean value for\n", i), xlab = i)
}

par(mfrow = c(2, 3))

for (i in unique(vitals_feature$name)){
  hist(vitals_feature$sd.val[vitals_feature$name == i], 
        main = paste0("Distribution of standard deviation for\n", i), xlab = i)
}

The distributions of per-patient mean values are approximately normal for each vital sign, while the distributions of per-patient standard deviations are all somewhat right-skewed. The mean therefore seems the better summary of a patient’s vital signs, given its wide spread and roughly normal distribution.

Predictive modeling

First, merge the baseline data and the vitals summary into one data frame by patient medical record number. After dropping missing values and merging the two datasets, the resulting dataset has 1345 observations with 30 features per subject. Table 1, summarizing each feature, is displayed in the Appendix.

merge.data.frame(baselines, vital_summary, by.x = "mrn", 
                 by.y = "subject")[, -1] -> dat
dat <- clean_names(dat)
# sum(is.na(dat))
dat[, c(2, 4:23, 25, 26)] <- lapply(dat[, c(2, 4:23, 25, 26)], as.factor)
# str(dat)

The goal is to predict the binary event of intubation within 5 days of hospitalization. I randomly divided the data into 10 parts and calculated 10-fold cv errors with the following methods: k-NN, LDA, QDA, GLM, elastic net regression, tree-based methods (a single tree with pruning, random forest, bagged trees, gradient boosted trees and AdaBoost trees) and SVM (both linear and non-linear).

n <- nrow(dat)
p <- ncol(dat) - 1
folds <- 10
id <- rep(1:folds, each = n/folds)
id <- c(id, rep(folds, n - length(id)))
set.seed(135)
id <- sample(id)
np <- np.forward <- np.backward <- numeric(folds)
s12 <- s13 <- numeric(folds)
best.features <- vector("list", folds)

for (i in 1:folds){
  training <- dat[id != i, ]
  
  # Exhaustive best subset selection
  regfit.full <- regsubsets(event ~., training, nvmax = 31)
  np[i] <- which.min(summary(regfit.full)$cp)
  
  # Forward best subset selection
  regfit.forward <- regsubsets(event ~., training, nvmax = 31, 
                               method = "forward")
  np.forward[i] <- which.min(summary(regfit.forward)$cp)

  # Backward best subset selection
  regfit.backward <- regsubsets(event ~., training, nvmax = 31, 
                                method = "backward")
  np.backward[i] <- which.min(summary(regfit.backward)$cp)
 # best.features[i] <- paste0(names(coef(regfit.full, np[i])[-1]), collapse = ", ")
  s12[i] <- sum(names(coef(regfit.full, np[i])[-1])!= 
               names(coef(regfit.forward, np[i])[-1]))
  s13[i] <- sum(names(coef(regfit.full, np[i])[-1])!= 
               names(coef(regfit.backward, np[i])[-1]))
}
ans <- data.frame(np, np.forward, np.backward, s12+s13)
colnames(ans)[4] <- c("different selected features")

Best subset selection was conducted within each training fold, with Cp used as the measure of in-sample error.
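For reference, the Cp reported by regsubsets is Mallows’ Cp; for a model with \(d\) predictors,

\[ C_p = \frac{\mathrm{RSS}_d}{\hat{\sigma}^2} - n + 2d, \]

where \(\hat{\sigma}^2\) is the error variance estimated from the full model (conventions differ slightly in how parameters are counted, and some texts use the rescaled form \(\frac{1}{n}(\mathrm{RSS}_d + 2d\hat{\sigma}^2)\); the minimizer is the same). Smaller Cp indicates a better fit-complexity trade-off, hence the which.min() above. It appears the features selected within each training fold are the same no matter which selection method (exhaustive, forward or backward) was used: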

ans %>%
  kable() %>%
  kable_styling()
 np   np.forward   np.backward   different selected features
 12           12            12                             0
 12           12            12                             0
 14           14            14                             0
 12           12            12                             0
 12           12            12                             0
 12           12            12                             0
 13           13            13                             0
 12           12            12                             0
 13           13            13                             0
 12           12            12                             0
# Store the formula selected in each fold
form <- vector("list", folds)
form[1] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + first_cxr_results_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[4] <- form[6] <- form[10] <- form[1]
form[2] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + any_immunosuppression + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[5] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + first_cxr_results_3_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[8] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + symptoms_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[3] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + any_immunosuppression + symptoms_8_factor + first_cxr_results_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[7] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + any_immunosuppression + first_cxr_results_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[9] <- "event ~ age + bmi + hypoxia_ed_factor + smoke_vape + dm_factor + cancer + first_cxr_results_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
## Create a function for the k-NN misclassification rate; default k = 1.
## Factors are converted to integer codes and all columns standardized
## (note: train and test are each scaled with their own means/SDs).
KNN.error <- function(Train, Test, k = 1, form1 = "") {
  # Use the features in form1 if supplied, otherwise all features but the outcome
  if (form1 != "") ex <- labels(terms(formula(form1))) else 
    ex <- names(Train[, which(names(Train) != "event")])
  
  Train.KNN <- Train[, which(names(Train) %in% ex)]
  Train.KNN[, ] <- lapply(Train.KNN, function(x) {
    if (is.factor(x)) as.integer(x) else x
    })
  Train.KNN[, ] <- apply(Train.KNN, 2, scale)
  
  Test.KNN <- Test[, which(names(Test) %in% ex)]
  Test.KNN[, ] <- lapply(Test.KNN, function(x) {
    if (is.factor(x)) as.integer(x) else x
    })
  Test.KNN[, ] <- apply(Test.KNN, 2, scale)
  
  Test.class <- Test$event
  Train.class <- Train$event
  
  pred.class <- knn(Train.KNN, Test.KNN, Train.class, k = k)
  mean(Test.class != pred.class)
}

k-NN

10-fold cross validation was conducted for k-NN. Within each fold, two parallel k-NN models were fit: one included all features, while the other considered only the features chosen by best subset selection. k was selected from 1 to 15 via cv.

err.knn1 <- err.knn2 <- numeric(folds)
for (i in 1:10){
  training <- dat[id != i, ]
  test <- dat[id == i, ]
  k <- which.min(sapply(1:15, KNN.error, Train = training, Test = test))
  err.knn1[i] <- KNN.error(training, test, k)

  k <- which.min(sapply(1:15, KNN.error, Train = training, Test = test,
                        form1 = form[[i]]))
  err.knn2[i] <- KNN.error(training, test, k, form[[i]])
}

err.knn <- data.frame(folds = 1:folds, 
                      All = err.knn1, Best = err.knn2)
ggplot(err.knn, aes(x = folds)) +
  geom_line(aes(y = All, col = "All features")) + 
  geom_line(aes(y = Best, col = "Best subset selected features"))+
  labs(col = "Festures")+
  ylab("Error rate")  

apply(err.knn[, -1], 2, mean)
##       All      Best 
## 0.3076935 0.2565231

It appears the k-NN model based on best subset selection has a lower error rate (25.7%) than the one using all features (30.8%).

GLM

10-fold cross validation was conducted for the GLM (logistic regression). Within each fold, a GLM was fit with all 30 features, and stepwise selection was carried out to select features.
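By default, step() compares models with AIC,

\[ \mathrm{AIC} = -2\log\hat{L} + 2d, \]

where \(\hat{L}\) is the maximized likelihood and \(d\) the number of fitted parameters; at each step the single addition or deletion that most reduces AIC is accepted, until no move improves it.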

err.glm <- numeric(folds)
for (i in 1:10){
  training <- dat[id != i, ]
  test <- dat[id == i, ]

  # GLM
  fit.glm <- glm(event ~ ., data = training, family = "binomial")
  fit.glm <- step(fit.glm, direction = "both", trace = 0)
  pred.glm <- predict(fit.glm, newdata = test, type = "response")
  pred.glm <- ifelse(pred.glm > .5, "Yes", "No")
  err.glm[i] <- mean(pred.glm!=test$event)
}
mean(err.glm)
## [1] 0.1909857

The 10-fold cv risk for GLM is 19.1%.

LDA & QDA

10-fold cross validation was conducted for LDA and QDA. Within each fold, both LDA and QDA were fit on two parallel feature sets: one with all features, the other with only the features chosen by best subset selection.
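As a reminder of why the two methods can differ: LDA assumes Gaussian classes with a common covariance matrix \(\Sigma\), giving linear discriminant functions

\[ \delta_k(x) = x^\top \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^\top \Sigma^{-1}\mu_k + \log \pi_k, \]

whereas QDA allows a class-specific \(\Sigma_k\), making the discriminant (and hence the boundary) quadratic in \(x\):

\[ \delta_k(x) = -\tfrac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1}(x-\mu_k) - \tfrac{1}{2}\log|\Sigma_k| + \log \pi_k. \]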

err.lda1 <- err.qda1 <-  err.lda2 <- err.qda2 <-numeric(folds)
for (i in 1:10){
  training <- dat[id != i, ]
  test <- dat[id == i, ]

  # LDA - all features
  fit.lda <- lda(event ~ ., data = training)
  pred.lda <- predict(fit.lda, newdata = test)$class
  err.lda1[i] <- mean(pred.lda!=test$event)

  # QDA - all features
  fit.qda <- qda(event ~ ., data = training)
  pred.qda <- predict(fit.qda, newdata = test)$class
  err.qda1[i] <- mean(pred.qda!=test$event)
  
  # LDA - best subset features
  fit.lda <- lda(formula(form[[i]]), data = training)
  pred.lda <- predict(fit.lda, newdata = test)$class
  err.lda2[i] <- mean(pred.lda!=test$event)

  # QDA - best subset features
  fit.qda <- qda(formula(form[[i]]), data = training)
  pred.qda <- predict(fit.qda, newdata = test)$class
  err.qda2[i] <- mean(pred.qda!=test$event)
}

err.lda.qda <- data.frame(folds = 1:folds, 
                          lda.all = err.lda1, lda.best = err.lda2, 
                          qda.all = err.qda1, qda.best = err.qda2)
#a <- cbind(err.lda.qda, err.lda2, err.qda2)
# apply(err.lda.qda[, -1], 2, mean)
ggplot(err.lda.qda, aes(x = folds)) +
  geom_line(aes(y = lda.all, col = "LDA - All features")) + 
  geom_line(aes(y = lda.best, col = "LDA - Best subset selection")) + 
  ylab("Error rate") +
  labs(col = "Festures")+
  ggtitle("LDA") -> p1
ggplot(err.lda.qda, aes(x = folds)) +
  geom_line(aes(y = qda.all, col = "QDA - All features")) + 
  geom_line(aes(y = qda.best, col = "QDA - Best subset selection")) + 
  ylab("Error rate") +
  labs(col = "Festures")+
  ggtitle("QDA") -> p2
grid.arrange(p1, p2, ncol = 1)

It appears the two LDA models perform similarly on the two feature sets. In contrast, QDA has a lower error rate when fit on the best-subset features. Now take a look at the 10-fold cv error rates for each of the four models:

apply(err.lda.qda[, -1], 2, mean)
##   lda.all  lda.best   qda.all  qda.best 
## 0.1924246 0.1880006 0.2483947 0.2139858

It appears both LDA and QDA have lower cv error rates with the reduced feature set. LDA with best subset selection has the smallest cv risk, 18.8%.

Elastic Net Regression

10-fold cross validation was conducted for elastic net regression. Within each fold, alpha was set to 0.5 to combine the benefits of the lasso and ridge penalties, and lambda was selected by an inner 10-fold cv.
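With family = "binomial", glmnet minimizes the penalized negative log-likelihood

\[ -\frac{1}{n}\,\ell(\beta_0, \beta) + \lambda\left[\frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1\right], \]

so alpha = 0.5 gives equal weight to the ridge and lasso penalties, and cv.glmnet picks \(\lambda\).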

err.net <- numeric(folds)
for(i in 1:folds){
  training <- dat[id != i, ]
  test <- dat[id == i, ]
  
  x <- model.matrix(event ~ ., training)
  y <- training$event
  # Inner 10-fold cv to pick lambda at alpha = 0.5
  enet.cv <- cv.glmnet(x, y, family = "binomial", alpha = 0.5, nfolds = 10)
  lambda.min <- enet.cv$lambda.min
  fit.net <- glmnet(x, y, family = "binomial", lambda = lambda.min, alpha = 0.5)
  
  newX <- model.matrix(event ~ ., test)
  # Predict on the probability scale before thresholding at 0.5
  pred.net <- predict(fit.net, newx = newX, s = lambda.min, type = "response")
  pred.net <- ifelse(pred.net > 0.5, "Yes", "No")
  err.net[i] <- mean(test$event != pred.net)
}
plot(enet.cv)

mean(err.net)
## [1] 0.2036454

Looking at the deviance vs. log(lambda) plot from one cv fold: log(lambda) was chosen over a range of roughly -8 to -1, with the corresponding binomial deviance ranging from 0.8 to 1.4.
The cv risk for elastic net regression is 20.3%.

Tree-based Methods

10-fold cross validation was conducted for five tree-based methods. Within each fold, a single tree, a random forest, bagged trees, a gradient boosted tree and an AdaBoost tree were fit with all 30 features.

  • Single tree: cross-validation was used to prune the tree by choosing the number of terminal nodes.

  • Random forest: at most \(m = \sqrt{p}\) predictors were considered at each split; 100 trees were grown.

  • Bagged trees: at most \(m = p\) predictors were considered at each split; 100 trees were grown.

  • Gradient boosted model: the interaction depth was set to 1 (stumps), the most common choice; 100 trees were grown with a shrinkage parameter of 0.01 (see the update rule below).

  • Adaptive boosted model: likewise depth 1, 100 trees, and a shrinkage parameter of 0.01.
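For the two boosted models, each of the \(B = 100\) trees \(\hat{f}^b\) is fit to the current residuals (on the deviance or exponential-loss gradient, respectively) and added with shrinkage \(\lambda = 0.01\):

\[ \hat{f}(x) \leftarrow \hat{f}(x) + \lambda\,\hat{f}^b(x). \]

With interaction depth 1 each tree is a stump, so the boosted fit is additive in the predictors.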

err.prune.tree <- err.rf <- err.bagged <- err.gbm <- err.adaboost <- numeric(folds)

for (i in 1:10){
  training <- dat[id != i, ]
  test <- dat[id == i, ]
  
  # Prune Tree
  fit.tree <- tree(event ~ ., data = training)
  tree.cv <- cv.tree(fit.tree, FUN = prune.tree, K = 10)
  fit.tree <- prune.tree(fit.tree, best = tree.cv$size[which.min(tree.cv$dev)])
  pred.tree <- predict(fit.tree, newdata = test, type = "class")
  err.prune.tree[i] <- mean(pred.tree != test$event)

  # Random forest with m = sqrt(p) predictors per split (argument is ntree)
  fit.RandomForest <- randomForest(event ~ ., training, 
                                   mtry = sqrt(p), importance = T, 
                                   ntree = 100)
  pred.rf <- predict(fit.RandomForest, newdata = test, type = "class")
  err.rf[i] <- mean(pred.rf != test$event)
  
  # Bagged trees: random forest with m = p (all predictors at each split)
  fit.bagged <- randomForest(event ~ ., training, 
                             mtry = p, importance = T, 
                             ntree = 100)
  pred.bagged <- predict(fit.bagged, newdata = test, type = "class")
  err.bagged[i] <- mean(pred.bagged != test$event)
  
  # Gradient boosted model: gbm needs a 0/1 outcome, so recode event in a
  # copy of the data (recoding inside the formula would leave event on the
  # right-hand side of "~ ." as a predictor)
  training.gbm <- training
  training.gbm$event <- ifelse(training.gbm$event == "Yes", 1, 0)
  fit.boosting <- gbm(event ~ ., 
                      data = training.gbm,
                      n.trees = 100, 
                      distribution = "bernoulli",
                      interaction.depth = 1,
                      shrinkage = .01)
  pred.boosting <- predict(fit.boosting, test, n.trees = 100,
                           type = "response")
  pred.boosting <- ifelse(pred.boosting < .5, "No", "Yes")
  err.gbm[i] <- mean(pred.boosting != test$event)
  
  # AdaBoost (same 0/1 recoding as above)
  fit.adaboost <- gbm(event ~ ., 
                      data = training.gbm,
                      n.trees = 100, 
                      distribution = "adaboost",
                      interaction.depth = 1,
                      shrinkage = .01)
  pred.adaboost <- predict(fit.adaboost, test, n.trees = 100,
                           type = "response")
  pred.adaboost <- ifelse(pred.adaboost <.5, "No", "Yes")
  err.adaboost[i] <- mean(pred.adaboost != test$event)
}

err.tree <- data.frame(err.prune.tree, err.rf, err.bagged, err.gbm, err.adaboost)
apply(err.tree, 2, mean)
## err.prune.tree         err.rf     err.bagged        err.gbm   err.adaboost 
##      0.3010308      0.2013798      0.2223290      0.3165468      0.2906904

Comparing the 10-fold cv risks for the five tree-based methods, random forest with m = sqrt(p) and bagged trees with m = p have the lowest risks (both around 20%).
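Since the random forests were fit with importance = T, their variable importance measures can also be inspected; a quick sketch (shown here for the fit from the last cv fold, so the ranking is only indicative):

# Mean decrease in accuracy / Gini for the last fold's random forest
importance(fit.RandomForest)
varImpPlot(fit.RandomForest, main = "Random forest variable importance")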

SVM

10-fold cross validation was conducted for two SVM models. Within each fold, both a linear SVM and a non-linear (radial kernel) SVM were fit with all 30 features. Inner cross validation was carried out to select the cost and, for the radial kernel, gamma. The parameter cost was selected from the range (.01, .1, 1, 2, 4, 8, 16, 32) and gamma from the range (.001, .005, .01, .05, .1, 1).
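For reference, the radial kernel used by the non-linear SVM is

\[ K(x, x') = \exp\left(-\gamma \|x - x'\|^2\right), \]

so larger gamma values give more local, more flexible decision boundaries, while cost controls the penalty paid for margin violations.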

err.linear.svm <- err.nonlinear.svm <- numeric(folds)

for (i in 1:10){
  training <- dat[id != i, ]
  test <- dat[id == i, ]
  # Linear SVM (the linear kernel does not use gamma, so only cost is tuned)
  svm.tune.linear <- tune(svm, event ~ ., data = training, kernel = "linear",
                          ranges = list(cost = c(.01, .1, 1, 2, 4, 8, 16, 32)),
                          tunecontrol = tune.control(sampling = "fix"))
  fit.svm.linear <- svm(event ~ ., data = training,
                        kernel = "linear", probability = T,
                        cost = svm.tune.linear$best.parameters$cost)
  pred.linearsvm <- predict(fit.svm.linear, newdata = test, type = "class")
  err.linear.svm[i] <- mean(pred.linearsvm != test$event)

  # Non-linear SVM
  svm.tune.nonlinear <- tune(svm, event~., data = training, kernel = "radial",
                   ranges = list(cost = c(.01, .1, 1, 2, 4, 8, 16, 32),
                                 gamma = c(.001, .005, .01, .05, .1, 1)),
                   tunecontrol = tune.control(sampling = "fix"))

  fit.svm.nonlinear <- svm(event ~ ., data = training, kernel = "radial",
                           cost = svm.tune.nonlinear$best.parameters$cost,
                           gamma = svm.tune.nonlinear$best.parameters$gamma)
  pred.nonlinearsvm <- predict(fit.svm.nonlinear, newdata = test, type = "class")
  err.nonlinear.svm[i] <- mean(pred.nonlinearsvm != test$event)
} 
err.svm <- data.frame(err.linear.svm, err.nonlinear.svm)
apply(err.svm, 2, mean)
##    err.linear.svm err.nonlinear.svm 
##         0.1946902         0.1908783

It appears the linear and non-linear SVMs perform similarly, with risks both just under 20%.

# **Neural Network**
# err.nn <- numeric(folds)
# for (i in 1:folds){
#   dat1 <- dat[, which(names(dat) !="event")]
#   dat1[,] <- lapply(dat1, function (x) {
#     if(is.factor(x)) as.integer(x) else (x)
#     })
#   dat1$event <- dat$event
#   
#   test <- dat1[id == i, ]
#   training <- dat1[id != i, ]
#   
#   fit.nn <- neuralnet(event ~ ., data = training, hidden = 10,
#             act.fct = "logistic", linear.output = F, threshold = .4)
#   predict(fit.nn, newdata = test)
#   pred.nn <- ifelse(predict(fit.nn, newdata = test)<.5, 0, 1)
#   err.nn[i] <- mean(pred.nn != test$event)
# }

Conclusion

Comparing the 10-fold cv risks for all of the methods above:

ans <- data.frame(fold = 1:10, 
                  knn.all = err.knn1, knn.best = err.knn2, 
                  glm = err.glm, 
                  lda.all = err.lda1, lda.best = err.lda2, 
                  qda.all = err.qda1, qda.best = err.qda2, 
                  ElasticNet = err.net,
                  single.tree = err.prune.tree, random.forest = err.rf, 
                  bagged.tree = err.bagged, gbm = err.gbm, adaboost = err.adaboost,
                  linear.svm = err.linear.svm, nonlinear.svm = err.nonlinear.svm)

apply(ans[, -1], 2, mean) %>%
  as.data.frame() -> ans.tab
colnames(ans.tab) <- "cv.error"
ans.tab %>%
  kable() %>%
  kable_styling() 
                 cv.error
knn.all         0.3076935
knn.best        0.2565231
glm             0.1909857
lda.all         0.1924246
lda.best        0.1880006
qda.all         0.2483947
qda.best        0.2139858
ElasticNet      0.2036454
single.tree     0.3010308
random.forest   0.2013798
bagged.tree     0.2223290
gbm             0.3165468
adaboost        0.2906904
linear.svm      0.1946902
nonlinear.svm   0.1908783

Among all these methods, only the following 5 models have 10-fold cv risks lower than 20%:

ans.tab %>%
  tibble::rownames_to_column("model") %>%  # filter() drops row names; keep them as a column
  filter(cv.error < .2) %>%
  kable() %>%
  kable_styling()
model            cv.error
glm             0.1909857
lda.all         0.1924246
lda.best        0.1880006
linear.svm      0.1946902
nonlinear.svm   0.1908783

It appears GLM, LDA and SVM perform best, with 10-fold cv risks below 20%. It is worth noting that GLM, LDA and linear SVM all produce linear (or nearly linear) classification boundaries, so a linear classifier seems to be the most suitable model for these data.
Thus I would propose Linear Discriminant Analysis to the authorities: not only does it provide the smallest cross-validation error rate, it also performs best in settings where the classification boundary is close to linear.

# Exhaustive best subset selection
regfit.full <- regsubsets(event ~., dat, nvmax = 31)
reg.summary <- summary(regfit.full)
np <- which.min(reg.summary$cp)

# Forward best subset selection
regfit.forward <- regsubsets(event ~., dat, nvmax = 31, method = "forward")
np.forward <- which.min(summary(regfit.forward)$cp)

# Backward best subset selection
regfit.backward <- regsubsets(event ~., dat, nvmax = 31, method = "backward")
np.backward <- which.min(summary(regfit.backward)$cp)

# paste0(names(coef(regfit.full, np)[-1]), collapse = ", ")
plot(1:30, reg.summary$cp, xlab = "number of predictors", ylab = "Cp",
     type = "b", main = "Best subset selection")
points(np, reg.summary$cp[np], col = "red", cex = 2, pch = 20)

Best subset selection was also conducted on the whole dataset to choose the best number of predictors, with Cp used as the measure of in-sample error. Exhaustive, forward, and backward selection all gave the same result: the best subset contains 12 predictors: age, bmi, smoke_vapeYes, dm_factorYes, cancerYes, first_cxr_results_2_factorUnchecked, duration_symptoms, ed_before_order_setYes, s_bp_noninvasive_d, vs_bp_noninvasive_s, vs_hr_hr, xp_resp_rate_pt. In other words, the important baseline predictors are age, BMI, ever smoker/vaper, diabetes mellitus, any cancer, initial chest X-ray findings (choice = Clear), duration of symptoms and admission to the ED before the lab order set change, and the important vital signs are blood pressure, heart rate, and respiratory rate.

form.final <- formula("event ~ age + bmi +smoke_vape +dm_factor +cancer +first_cxr_results_2_factor +duration_symptoms +ed_before_order_set +s_bp_noninvasive_d +vs_bp_noninvasive_s +vs_hr_hr +xp_resp_rate_pt")
fit.final <- lda(form.final, data = dat)
# ROC curve
rocplot <- function(truth, pred, ...){
  predob = prediction(pred, truth)
  perf = performance(predob, "tpr", "fpr")
  plot(perf, colorize=F, ...) 
  title("ROC Plot")
  area = auc(truth, pred)
  area = format(round(area, 3), nsmall = 3)
  text(x = .8, y = .1, labels = paste("AUC = ", area))
  segments(x0 = 0, y0 = 0, x1 = 1, y1 = 1, col = "gray", lty = 2)
}
final.pred <- predict(fit.final, dat)  # predict.lda returns $class and $posterior
rocplot(dat$event, final.pred$posterior[,2])

The ROC plot for the final model is shown above, with an AUC of 0.901, which is quite good. The coefficients, as well as the exponentiated coefficients, for each feature in the final model are presented below:

ans <- data.frame(coef(fit.final))
colnames(ans) <- "coefficient"
rownames(ans) <- c("Age", "BMI", "Smoke Vape(Yes)", 
                   "Diabetes Mellitus(Yes)", "Any Cancer(Yes)",
                   "Initial Chest X-ray Findings (choice=Clear)(Unchecked)", 
                   "Duration of Symptoms",
                   "Admitted to ED before lab order set change(Yes)",
                   "Blood Pressure Noninvasive d",
                   "Blood Pressure Noninvasive s",
                   "Heart Rate",
                   "Respiratory Rate")
ans$exp.coefficient <- exp(ans$coefficient)
ans %>%
  kable() %>%
  kable_styling()
                                                         coefficient   exp.coefficient
Age                                                       -0.0413109         0.9595308
BMI                                                       -0.0884343         0.9153632
Smoke Vape(Yes)                                            0.4422673         1.5562317
Diabetes Mellitus(Yes)                                     0.2095303         1.2330987
Any Cancer(Yes)                                           -0.5455469         0.5795248
Initial Chest X-ray Findings (choice=Clear)(Unchecked)    -0.1609632         0.8513234
Duration of Symptoms                                      -0.0451409         0.9558628
Admitted to ED before lab order set change(Yes)           -0.2024069         0.8167625
Blood Pressure Noninvasive d                              -0.4361270         0.6465356
Blood Pressure Noninvasive s                              -0.3503263         0.7044582
Heart Rate                                                -0.7177642         0.4878417
Respiratory Rate                                          -0.0786959         0.9243210

The result shows that all coefficients except ever smoker/vaper and diabetes mellitus are negative. A negative coefficient (i.e., exp.coefficient < 1) means the factor is negatively associated with the need for intubation, while a positive coefficient (i.e., exp.coefficient > 1) means it is positively associated with the need for intubation. To be specific (a worked log-odds example follows the bullets):

  • For continuous predictors, take “Age” for example: the coefficient for age is -0.041 and its exponential is 0.96, meaning the odds of intubation decrease by a factor of 0.96 for each one-unit increase in age, controlling for the other covariates.

  • For categorical predictors, take “Smoke Vape” for example: the coefficient is 0.442 and its exponential is 1.56, meaning the odds of intubation are 1.56 times higher for a patient who has ever smoked or vaped than for one who has not, controlling for the other covariates.
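Both readings follow from interpreting the discriminant coefficients on a log-odds scale (under the LDA model, the log posterior odds are linear in the predictors):

\[ \log\frac{P(\text{intubation} \mid x)}{1 - P(\text{intubation} \mid x)} = \beta_0 + \sum_j \beta_j x_j, \]

so a one-unit increase in \(x_j\) multiplies the odds by \(e^{\beta_j}\), e.g. \(e^{-0.0413} \approx 0.96\) for age.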

Appendix

Table 1

library(table1)
# Custom renderer: in the empty "p-val" stratum, compute a t-test p-value
# for numeric variables and a chi-squared p-value for categorical ones
rndr <- function(x, name, ...){
  if (length(x) == 0){
    y <- dat1[[name]]
    s <- rep("", length(render.default(x = y, name = name)))
    if (is.numeric(y)){
      p <- t.test(y ~ dat1$event_p)$p.value
    } else{
      p <- chisq.test(table(y, droplevels(dat1$event_p)))$p.value
    }
    s[2] <- sub("<", "&lt;", format.pval(p, digits = 3, eps = 0.001))
    s
  } else{
    render.default(x = x, name = name, ...)
  }
}

rndr.strat <- function(label, n, ...){
  ifelse(n == 0, label, render.strat.default(label, n, ...))
}
dat1 <- dat
colnames(dat1) <- c("age","sex","BMI","hypoxia_ed","smoke_vape",
                    "diabetes","hypertension",
        "pulmonary_disease","CKD","ESRD","CAD","cancer",
        "immunosuppression","Fever","Cough",
        "diarrhea","nausea","myalgias","dyspnea",
        "csr_clear","csr_unilateral infiltrate",
        "csr_bilateral_infiltrates","csr_pleural_effusion","duration",
        "Ed_before_order","event", "blood_pressure_noninvasive_d", 
        "blood pressure noninvasive s","heart rate", 
        "Respiratory Rate", "SpO2")
dat1 <- clean_names(dat1)
formu <- paste0("~ ", paste0(colnames(dat1)[-26], collapse = " + "), "| event_p") 
dat1$event_p <- factor(dat$event, levels = c("No", "Yes", "p-val"), labels = c("No", "Yes", "p-val"))
table1(formula(formu), data = dat1, droplevels = F, render = rndr, render.strat = rndr.strat, overall = F)
                                No (N=700)           Yes (N=645)          p-val
age
  Mean (SD)                     71.7 (15.8)          58.6 (16.4)          <0.001
  Median [Min, Max]             73.5 [-2.08, 114]    59.3 [5.41, 95.1]
sex
  Female                        296 (42.3%)          261 (40.5%)          0.534
  Male                          404 (57.7%)          384 (59.5%)
bmi
  Mean (SD)                     29.5 (6.95)          26.0 (5.44)          <0.001
  Median [Min, Max]             28.5 [15.9, 58.9]    26.0 [9.86, 50.8]
hypoxia_ed
  No                            319 (45.6%)          295 (45.7%)          0.995
  Yes                           381 (54.4%)          350 (54.3%)
smoke_vape
  No                            523 (74.7%)          441 (68.4%)          0.0118
  Yes                           177 (25.3%)          204 (31.6%)
diabetes
  No                            506 (72.3%)          421 (65.3%)          0.00657
  Yes                           194 (27.7%)          224 (34.7%)
hypertension
  No                            310 (44.3%)          287 (44.5%)          0.982
  Yes                           390 (55.7%)          358 (55.5%)
pulmonary_disease
  Checked                       33 (4.7%)            34 (5.3%)            0.731
  Unchecked                     667 (95.3%)          611 (94.7%)
ckd
  Checked                       33 (4.7%)            24 (3.7%)            0.443
  Unchecked                     667 (95.3%)          621 (96.3%)
esrd
  Checked                       36 (5.1%)            44 (6.8%)            0.236
  Unchecked                     664 (94.9%)          601 (93.2%)
cad
  No                            576 (82.3%)          533 (82.6%)          0.923
  Yes                           124 (17.7%)          112 (17.4%)
cancer
  No                            631 (90.1%)          612 (94.9%)          0.00148
  Yes                           69 (9.9%)            33 (5.1%)
immunosuppression
  unknown/No                    684 (97.7%)          621 (96.3%)          0.165
  Yes                           16 (2.3%)            24 (3.7%)
fever
  Checked                       433 (61.9%)          431 (66.8%)          0.0656
  Unchecked                     267 (38.1%)          214 (33.2%)
cough
  Checked                       483 (69.0%)          449 (69.6%)          0.854
  Unchecked                     217 (31.0%)          196 (30.4%)
diarrhea
  Checked                       168 (24.0%)          173 (26.8%)          0.26
  Unchecked                     532 (76.0%)          472 (73.2%)
nausea
  Checked                       144 (20.6%)          126 (19.5%)          0.685
  Unchecked                     556 (79.4%)          519 (80.5%)
myalgias
  Checked                       147 (21.0%)          113 (17.5%)          0.122
  Unchecked                     553 (79.0%)          532 (82.5%)
dyspnea
  Checked                       422 (60.3%)          398 (61.7%)          0.633
  Unchecked                     278 (39.7%)          247 (38.3%)
csr_clear
  Checked                       85 (12.1%)           83 (12.9%)           0.749
  Unchecked                     615 (87.9%)          562 (87.1%)
csr_unilateral_infiltrate
  Checked                       88 (12.6%)           72 (11.2%)           0.476
  Unchecked                     612 (87.4%)          573 (88.8%)
csr_bilateral_infiltrates
  Checked                       497 (71.0%)          484 (75.0%)          0.109
  Unchecked                     203 (29.0%)          161 (25.0%)
csr_pleural_effusion
  Checked                       34 (4.9%)            40 (6.2%)            0.337
  Unchecked                     666 (95.1%)          605 (93.8%)
duration
  Mean (SD)                     9.53 (5.67)          8.21 (4.43)          <0.001
  Median [Min, Max]             9.00 [1.00, 35.0]    9.00 [1.00, 27.0]
ed_before_order
  No                            398 (56.9%)          405 (62.8%)          0.0307
  Yes                           302 (43.1%)          240 (37.2%)
blood_pressure_noninvasive_d
  Mean (SD)                     60.2 (0.977)         59.8 (0.985)         <0.001
  Median [Min, Max]             60.1 [55.9, 63.0]    59.8 [55.7, 63.3]
blood_pressure_noninvasive_s
  Mean (SD)                     130 (0.960)          130 (0.997)          <0.001
  Median [Min, Max]             130 [127, 133]       130 [126, 133]
heart_rate
  Mean (SD)                     75.3 (0.924)         74.6 (0.917)         <0.001
  Median [Min, Max]             75.3 [72.4, 78.1]    74.6 [71.0, 77.3]
respiratory_rate
  Mean (SD)                     30.0 (1.01)          30.0 (0.986)         0.721
  Median [Min, Max]             30.0 [26.9, 33.8]    29.9 [27.3, 33.3]
sp_o2
  Mean (SD)                     92.5 (0.982)         92.5 (0.981)         0.947
  Median [Min, Max]             92.5 [89.2, 95.3]    92.5 [89.2, 95.2]

It appears that at the 0.1 significance level, only 11 predictors are significant: age, bmi, smoke_vape, dm_factor, cancer, symptoms_1_factor (fever), duration_symptoms, ed_before_order_set, s_bp_noninvasive_d, vs_bp_noninvasive_s and vs_hr_hr.