Background: This is a simulated data set representing COVID+ patients (i.e. patients diagnosed with COVID) who were hospitalized at the New York Presbyterian hospital (Weill Cornell and Lower Manhattan campuses). Some COVID+ patients suffer from acute respiratory distress and have acute difficulty breathing. Such critically ill patients require artificial respiratory support through an invasive mechanical ventilator, a process known as intubation. Because of the surge in COVID+ cases in New York City during March, the hospital authorities were concerned that there might be a shortage of mechanical ventilators. The authorities wanted to predict, for each patient, the need for intubation within 5 days of hospitalization in order to better manage their resources. The goal is to build a predictive model that predicts the outcome of intubation within 5 days of hospitalization.
Data: There are two data files “baseline.csv” and “labs and vitals.csv”.
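The analyses below rely on several R packages; the original report presumably loads them elsewhere, so the following is only a minimal sketch of the library calls the code appears to need:
library(janitor) # clean_names()
library(dplyr) # %>%, group_by(), summarise(), filter()
library(tidyr) # spread()
library(ggplot2) # plots
library(gridExtra) # grid.arrange()
library(knitr) # kable()
library(kableExtra) # kable_styling()
library(leaps) # regsubsets()
library(class) # knn()
library(MASS) # lda(), qda()
library(glmnet) # cv.glmnet(), glmnet()
library(tree) # tree(), cv.tree(), prune.tree()
library(randomForest) # randomForest()
library(gbm) # gbm()
library(e1071) # svm(), tune()
library(ROCR) # prediction(), performance()
library(pROC) # assumed source of the auc() call used in the ROC plot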
baselines <- read.csv("baselines.csv")
baselines <- clean_names(baselines)
vitals <- read.csv("lab and vitals.csv")
The baseline.csv file contains patient-level information (one row per patient) on baseline clinical measures determined at the time of admission to the hospital. The measures include demographic variables, relevant clinical history and certain diagnostic tests.
The labs and vitals file contains data on the vital signs of patients (e.g. blood pressure, heart rate, SpO2 or blood oxygen level). The vital signs are measured multiple times for each patient during the course of hospitalization, and the number of measurements varies from patient to patient. The vitals data is in long format, i.e. there are multiple rows per patient.
A variable dictionary is also available to explain the variables.
This report addresses the following two goals:
Feature engineering: Extract features from the longitudinal vital sign measurements so that there is one value per patient for each feature.
Predictive modeling: Use all baseline variables and the features extracted from the labs and vitals data to build a model predicting the binary event of intubation within 5 days of hospitalization.
Check for missing values:
apply(vitals, 2, function(x) sum(is.na(x)))
## name subject value time_stamp
## 0 0 85815 0
apply(vitals, 2, function(x) paste0(round(sum(is.na(x))/nrow(vitals), 3)*100, "%"))
## name subject value time_stamp
## "0%" "0%" "12.1%" "0%"
There are 85815 missing values in the vitals data, which make up 12.1% of the observations.
Explore various aspects of the distribution of each vital sign (mean, standard deviation, max, min, median) to create features. This yields multiple features per vital sign, with one value per patient for each feature; the head of the resulting data is shown below:
vitals %>%
group_by(subject, name) %>%
summarise(mean.val = mean(value, na.rm = T), sd.val = sd(value, na.rm = T),
max.val = max(value, na.rm = T), min.val = min(value, na.rm = T),
median.val = median(value, na.rm = T),
.groups = "drop") -> vitals_feature
head(vitals_feature)%>%
kable() %>%
kable_styling()
subject | name | mean.val | sd.val | max.val | min.val | median.val |
---|---|---|---|---|---|---|
655528 | s_bp_noninvasive (d) | 61.16246 | 1.629412 | 65.35887 | 55.86459 | 61.03558 |
655528 | vs_bp_noninvasive (s) | 129.90207 | 2.689424 | 134.88518 | 124.90907 | 129.49549 |
655528 | vs_hr_hr | 72.80166 | 2.611181 | 77.40824 | 66.68079 | 73.43356 |
655528 | xp_resp_rate_pt | 30.40044 | 1.675897 | 33.97693 | 27.39469 | 29.94419 |
655528 | xp_resp_spo2 | 91.44550 | 3.070444 | 99.26152 | 85.85129 | 90.91147 |
729545 | s_bp_noninvasive (d) | 58.77790 | 1.945223 | 62.79768 | 54.33671 | 58.92815 |
vital_summary <- spread(vitals_feature[, 1:3], name, mean.val)
Now let’s visualize the distribution of the mean and standard deviation values for each vital sign:
ggplot(data = vitals_feature, aes(x = name, y = mean.val)) +
geom_boxplot() +
coord_flip() +
ggtitle("Boxplot of the mean values for vital signs") -> p1
ggplot(data = vitals_feature, aes(x = name, y = sd.val)) +
geom_boxplot() +
coord_flip() +
ggtitle("Boxplot of the standard deviation values for vital signs ") ->p2
grid.arrange(p1, p2, ncol = 1)
par(mfrow = c(2, 3))
for (i in unique(vitals_feature$name)){
hist(vitals_feature$mean.val[vitals_feature$name == i],
main = paste0("Distribution of mean value for\n", i), xlab = i)
}
par(mfrow = c(2, 3))
for (i in unique(vitals_feature$name)){
hist(vitals_feature$sd.val[vitals_feature$name == i],
main = paste0("Distribution of standard deviation for\n", i), xlab = i)
}
The distributions of the per-patient mean values for each vital sign are approximately normal, while the distributions of the per-patient standard deviations are all somewhat right-skewed. The mean therefore seems to be a better summary of a patient’s vital signs, since it is widely spread and approximately normally distributed.
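The right-skew of the per-patient standard deviations could be checked numerically with a simple moment-based skewness estimate; the skew() helper below is not part of the original analysis and is only a sketch:
# Sample skewness (third standardized moment); positive values indicate right skew
skew <- function(x) mean((x - mean(x, na.rm = TRUE))^3, na.rm = TRUE) / sd(x, na.rm = TRUE)^3
vitals_feature %>%
group_by(name) %>%
summarise(skew.mean = skew(mean.val), skew.sd = skew(sd.val), .groups = "drop")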
First, merge the baseline data and the vitals summary into one dataframe by patient medical record number. After dropping the missing values and merging the two datasets, the resulting dataset has 1345 observations with 30 features per subject. Table 1 for each feature is displayed in the appendix.
merge.data.frame(baselines, vital_summary, by.x = "mrn",
by.y = "subject")[, -1] -> dat
dat <- clean_names(dat)
# sum(is.na(dat))
dat[, c(2, 4:23, 25, 26)] <- lapply(dat[, c(2, 4:23, 25, 26)], as.factor)
# str(dat)
The goal is to predict the binary event of intubation within 5 days of hospitalization. I randomly divided the data into 10 parts and computed 10-fold CV errors for the following methods: k-NN, LDA, QDA, GLM, Elastic Net regression, tree-based methods (a single pruned tree, random forest, bagged trees, gradient boosted trees and AdaBoost trees) and SVM (both linear and non-linear SVM).
n <- nrow(dat)
p <- ncol(dat) - 1
folds <- 10
id <- rep(1:folds, each = n/folds)
id <- c(id, rep(folds, n - length(id)))
set.seed(135)
id <- sample(id)
np <- np.forward <- np.backward <- numeric(folds)
s12 <- s13 <-numeric(folds)
best.features <- list(folds)
for (i in 1:folds){
training <- dat[id != i, ]
# Exhaustive best subset selection
regfit.full <- regsubsets(event ~., training, nvmax = 31)
np[i] <- which.min(summary(regfit.full)$cp)
# Forward best subset selection
regfit.forward <- regsubsets(event ~., training, nvmax = 31,
method = "forward")
np.forward[i] <- which.min(summary(regfit.forward)$cp)
# Backward best subset selection
regfit.backward <- regsubsets(event ~., training, nvmax = 31,
method = "backward")
np.backward[i] <- which.min(summary(regfit.backward)$cp)
# best.features[i] <- paste0(names(coef(regfit.full, np[i])[-1]), collapse = ", ")
s12[i] <- sum(names(coef(regfit.full, np[i])[-1])!=
names(coef(regfit.forward, np[i])[-1]))
s13[i] <- sum(names(coef(regfit.full, np[i])[-1])!=
names(coef(regfit.backward, np[i])[-1]))
}
ans <- data.frame(np, np.forward, np.backward, s12+s13)
colnames(ans)[4] <- c("different selected features")
Best subset selection was conducted within each training fold, and Mallows' Cp was used as the measure of in-sample error (see the formula sketched after the table below). The features selected on each training set are the same regardless of which selection method (exhaustive, forward or backward) was used:
ans %>%
kable() %>%
kable_styling()
np | np.forward | np.backward | different selected features |
---|---|---|---|
12 | 12 | 12 | 0 |
12 | 12 | 12 | 0 |
14 | 14 | 14 | 0 |
12 | 12 | 12 | 0 |
12 | 12 | 12 | 0 |
12 | 12 | 12 | 0 |
13 | 13 | 13 | 0 |
12 | 12 | 12 | 0 |
13 | 13 | 13 | 0 |
12 | 12 | 12 | 0 |
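For reference, one common form of Mallows’ \(C_p\) for a model with \(d\) predictors (the version used in ISLR; regsubsets reports an equivalent quantity) is
\[ C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\,d\,\hat{\sigma}^2\right), \]
where \(\hat{\sigma}^2\) is an estimate of the error variance from the full model; smaller values indicate a better trade-off between fit and model size.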
form <- vector("list", folds)
form[1] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + first_cxr_results_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[4] <- form[6] <- form[10] <- form[1]
form[2] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + any_immunosuppression + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[5] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + first_cxr_results_3_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[8] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + symptoms_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[3] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + any_immunosuppression + symptoms_8_factor + first_cxr_results_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[7] <- "event ~ age + bmi + smoke_vape + dm_factor + cancer + any_immunosuppression + first_cxr_results_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
form[9] <- "event ~ age + bmi + hypoxia_ed_factor + smoke_vape + dm_factor + cancer + first_cxr_results_2_factor + duration_symptoms + ed_before_order_set + s_bp_noninvasive_d + vs_bp_noninvasive_s + vs_hr_hr + xp_resp_rate_pt"
##Create function for misclassification rate from KNN; use default k=1
KNN.error <- function(Train, Test, k=1, form1 = "") {
if (form1 != "") ex <- labels(terms(formula(form1))) else
ex <- names(Train[, which(names(Train) !="event")])
Train.KNN <- Train[, which(names(Train) %in% ex)]
Train.KNN[,] <- lapply(Train.KNN, function (x) {
if(is.factor(x)) as.integer(x) else (x)
})
Train.KNN[,] <- apply(Train.KNN, 2, scale)
Test.KNN <- Test[,which(names(Test) %in% ex)]
Test.KNN[,] <- lapply(Test.KNN, function (x) {
if(is.factor(x))as.integer(x) else (x)
})
Test.KNN[,] <- apply(Test.KNN, 2, scale)
Test.class <- Test$event
Train.class <- Train$event
pred.class=knn(Train.KNN, Test.KNN, Train.class, k=k)
mean(Test.class != pred.class)
}
10-fold cross-validation was conducted for k-NN. Within each fold, two parallel k-NN models were fitted: one included all features, while the other considered only the features derived from best subset selection. K was selected from 1 to 15 via CV.
err.knn1 <- err.knn2 <- numeric(folds)
for (i in 1:10){
training <- dat[id != i, ]
test <- dat[id == i, ]
k <- which.min(sapply(1:15, KNN.error, Train = training, Test = test))
err.knn1[i] <- KNN.error(training, test, k)
k <- which.min(sapply(1:15, KNN.error, Train = training, Test = test, form1 = form[[i]]))
err.knn2[i] <- KNN.error(training, test, k, form1 = form[[i]])
}
err.knn <- data.frame(folds = 1:folds,
All = err.knn1, Best = err.knn2)
ggplot(err.knn, aes(x = folds)) +
geom_line(aes(y = All, col = "All features")) +
geom_line(aes(y = Best, col = "Best subset selected features"))+
labs(col = "Features")+
ylab("Error rate")
apply(err.knn[, -1], 2, mean)
## All Best
## 0.3076935 0.2565231
The k-NN model based on the best subset selection has a lower error rate (25.7%) than the one using all features (30.8%).
10-fold cross-validation was conducted for the GLM. Within each fold, a logistic regression was fitted with all 30 features, and stepwise selection (by AIC, the default criterion of step()) was carried out to select features.
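For reference, step() compares candidate models by AIC, which for a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\) is
\[ \mathrm{AIC} = 2k - 2\log\hat{L}, \]
so the stepwise search trades off goodness of fit against model size.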
err.glm <- numeric(folds)
for (i in 1:10){
training <- dat[id != i, ]
test <- dat[id == i, ]
# GLM
fit.glm <- glm(event ~ ., data = training, family = "binomial")
fit.glm <- step(fit.glm, direction = "both", trace = 0)
pred.glm <- predict(fit.glm, newdata = test, type = "response")
pred.glm <- ifelse(pred.glm > .5, "Yes", "No")
err.glm[i] <- mean(pred.glm!=test$event)
}
mean(err.glm)
## [1] 0.1909857
The 10-fold cv risk for GLM is 19.1%.
10-fold cross-validation was conducted for LDA and QDA. Within each fold, two parallel feature sets were used for both LDA and QDA: one included all features, while the other considered only the features derived from best subset selection.
err.lda1 <- err.qda1 <- err.lda2 <- err.qda2 <-numeric(folds)
for (i in 1:10){
training <- dat[id != i, ]
test <- dat[id == i, ]
# LDA
fit.lda <- lda(event ~ ., data = training)
pred.lda <- predict(fit.lda, newdata = test)$class
err.lda1[i] <- mean(pred.lda!=test$event)
# QDA
fit.qda <- qda(event ~ ., data = training)
pred.qda <- predict(fit.qda, newdata = test)$class
err.qda1[i] <- mean(pred.qda!=test$event)
# LDA
fit.lda <- lda(formula(form[[i]]), data = training)
pred.lda <- predict(fit.lda, newdata = test)$class
err.lda2[i] <- mean(pred.lda!=test$event)
# QDA
fit.qda <- qda(formula(form[[i]]), data = training)
pred.qda <- predict(fit.qda, newdata = test)$class
err.qda2[i] <- mean(pred.qda!=test$event)
}
err.lda.qda <- data.frame(folds = 1:folds,
lda.all = err.lda1, lda.best = err.lda2,
qda.all = err.qda1, qda.best = err.qda2)
#a <- cbind(err.lda.qda, err.lda2, err.qda2)
# apply(err.lda.qda[, -1], 2, mean)
ggplot(err.lda.qda, aes(x = folds)) +
geom_line(aes(y = lda.all, col = "LDA - All features")) +
geom_line(aes(y = lda.best, col = "LDA - Best subset selection")) +
ylab("Error rate") +
labs(col = "Features")+
ggtitle("LDA") -> p1
ggplot(err.lda.qda, aes(x = folds)) +
geom_line(aes(y = qda.all, col = "QDA - All features")) +
geom_line(aes(y = qda.best, col = "QDA - Best subset selection")) +
ylab("Error rate") +
labs(col = "Features")+
ggtitle("QDA") -> p2
grid.arrange(p1, p2, ncol = 1)
The performance of LDA hardly differs between the two feature sets. In contrast, QDA has a lower error rate when fitted on the best-subset features. Now take a look at the 10-fold CV error rates for each of the four models:
apply(err.lda.qda[, -1], 2, mean)
## lda.all lda.best qda.all qda.best
## 0.1924246 0.1880006 0.2483947 0.2139858
Both LDA and QDA have lower CV error rates with the reduced feature set. LDA with best subset selection has the smallest CV risk, 18.8%.
10-fold cross-validation was conducted for Elastic Net regression. Alpha was set to 0.5 to combine the lasso and ridge penalties, and lambda was selected by 10-fold CV within each training fold.
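For reference, the elastic net penalty used by glmnet for a coefficient vector \(\beta\) is
\[ \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right], \]
so \(\alpha = 0.5\) gives roughly equal weight to the ridge (\(\ell_2\)) and lasso (\(\ell_1\)) terms.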
err.net <- numeric(folds)
for(i in 1:folds){
training <- dat[id != i, ]
test <- dat[id == i, ]
x <- model.matrix(event~.,training)
y <- training$event
enet.cv <- cv.glmnet(x,y,family = "binomial",alpha=0.5,nfolds=10)
lambda.min<-enet.cv$lambda.min
fit.net <- glmnet(x,y,family="binomial",lambda = lambda.min,alpha=0.5)
#lm(event~.,cv_tst)
newX <- model.matrix(event~., test)
# predict on the probability scale so the 0.5 threshold applies to probabilities
pred.net = predict(fit.net, newx = newX, s = lambda.min, type = "response")
pred.net = ifelse(pred.net > 0.5, "Yes", "No")
err.net[i]<-mean(test$event != pred.net)
}
plot(enet.cv)
mean(err.net)
## [1] 0.2036454
The deviance vs. log(lambda) plot above comes from one CV fold (the last one fitted); log(lambda) ranges from roughly -8 to -1 and the corresponding binomial deviance ranges from about 0.8 to 1.4.
The 10-fold CV risk for Elastic Net regression is 20.4%.
10-fold cross-validation was conducted for five tree-based methods. Within each fold, a single pruned tree, a random forest, bagged trees, a gradient boosted model and an AdaBoost model were fitted with all 30 features.
Single tree: cross-validation was used to prune the tree, i.e. to choose the number of terminal nodes.
Random forest: at most \(m = \sqrt{p}\) predictors were considered at each split; 100 trees were grown.
Bagged trees: all \(m = p\) predictors were considered at each split; 100 trees were grown.
Gradient boosted model: the interaction depth was set to 1 (stumps), the most common choice; 100 trees were grown with a shrinkage parameter of 0.01.
Adaptive boosted (AdaBoost) model: the interaction depth was again set to 1; 100 trees were grown with a shrinkage parameter of 0.01 (see the boosting update sketched after this list).
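For reference, each boosting iteration adds a shrunken depth-1 tree (a stump) to the current fit; with shrinkage \(\lambda = 0.01\) the update at iteration \(m\) is
\[ \hat{f}_m(x) = \hat{f}_{m-1}(x) + \lambda\,\hat{T}_m(x), \]
where \(\hat{T}_m\) is the stump fitted at iteration \(m\) to the gradient of the loss (Bernoulli deviance for the gradient boosted model, exponential loss for AdaBoost).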
err.prune.tree <- err.rf <- err.bagged <- err.gbm <- err.adaboost <- numeric(folds)
for (i in 1:10){
training <- dat[id != i, ]
test <- dat[id == i, ]
# Prune Tree
fit.tree <- tree(event ~ ., data = training)
tree.cv <- cv.tree(fit.tree, FUN = prune.tree, K = 10)
fit.tree <- prune.tree(fit.tree, best = tree.cv$size[which.min(tree.cv$dev)])
pred.tree <- predict(fit.tree, newdata = test, type = "class")
err.prune.tree[i] <- mean(pred.tree != test$event)
# Random Forest with m = sqrt(p)
fit.RandomForest <- randomForest(event ~ ., training,
mtry = sqrt(p), importance = T,
ntree = 100)
pred.rf <- predict(fit.RandomForest, newdata = test, type = "class")
err.rf[i] <- mean(pred.rf != test$event)
# Bagged Tree with m = p
fit.bagged <- randomForest(event ~ ., training,
mtry = p, importance = T,
ntree = 100)
pred.bagged <- predict(fit.bagged, newdata = test, type = "class")
err.bagged[i] <- mean(pred.bagged != test$event)
# gradient boosted model
fit.boosting <- gbm(ifelse(event == "Yes", 1, 0) ~ .,
data = training,
n.trees = 100,
distribution = "bernoulli",
interaction.depth = 1,
shrinkage = .01)
pred.boosting <- predict(fit.boosting, test, n.trees = 100,
type = "response")
pred.boosting <- ifelse(pred.boosting <.5, "No", "Yes")
err.gbm[i] <- mean(pred.boosting != test$event)
# adaboost
fit.adaboost <- gbm(ifelse(event == "Yes", 1, 0) ~ .,
data = training,
n.trees = 100,
distribution = "adaboost",
interaction.depth = 1,
shrinkage = .01)
pred.adaboost <- predict(fit.adaboost, test, n.trees = 100,
type = "response")
pred.adaboost <- ifelse(pred.adaboost <.5, "No", "Yes")
err.adaboost[i] <- mean(pred.adaboost != test$event)
}
err.tree <- data.frame(err.prune.tree, err.rf, err.bagged, err.gbm, err.adaboost)
apply(err.tree, 2, mean)
## err.prune.tree err.rf err.bagged err.gbm err.adaboost
## 0.3010308 0.2013798 0.2223290 0.3165468 0.2906904
Comparing the 10-fold CV risks of the five tree-based methods, random forest with m = sqrt(p) (20.1%) and bagged trees with m = p (22.2%) have relatively lower risks than the other methods.
10-fold cross-validation was conducted for two SVM models. Within each fold, both a linear and a non-linear (radial kernel) SVM were fitted with all 30 features. Cross-validation was carried out to select suitable cost and gamma values. Cost was selected from (.01, .1, 1, 2, 4, 8, 16, 32); gamma (relevant for the radial kernel) was selected from (.001, .005, .01, .05, .1, 1).
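For reference, the non-linear SVM uses the radial (RBF) kernel
\[ K(x, x') = \exp\left(-\gamma\,\lVert x - x'\rVert^2\right), \]
so larger values of gamma yield more local, more flexible decision boundaries, while cost controls the penalty for margin violations.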
err.linear.svm <- err.nonlinear.svm <- numeric(folds)
for (i in 1:10){
training <- dat[id != i, ]
test <- dat[id == i, ]
# Linear SVM
svm.tune.linear <- tune(svm, event ~ ., data = training, kernel = "linear",
ranges = list(cost = c(.01, .1, 1, 2, 4, 8, 16, 32)),
tunecontrol = tune.control(sampling = "fix"))
# gamma is not used by the linear kernel, so only cost is tuned and passed on
fit.svm.linear <- svm(event ~ ., data = training,
kernel = "linear", probability = T,
cost = svm.tune.linear$best.parameters$cost)
pred.linearsvm <- predict(fit.svm.linear, newdata = test, type = "class")
err.linear.svm[i] <- mean(pred.linearsvm != test$event)
# Non-linear SVM
svm.tune.nonlinear <- tune(svm, event~., data = training, kernel = "radial",
ranges = list(cost = c(.01, .1, 1, 2, 4, 8, 16, 32),
gamma = c(.001, .005, .01, .05, .1, 1)),
tunecontrol = tune.control(sampling = "fix"))
fit.svm.nonlinear <- svm(event ~ ., data = training, kernel = "radial",
cost = svm.tune.nonlinear$best.parameters$cost,
gamma = svm.tune.nonlinear$best.parameters$gamma)
pred.nonlinearsvm <- predict(fit.svm.nonlinear, newdata = test, type = "class")
err.nonlinear.svm[i] <- mean(pred.nonlinearsvm != test$event)
}
err.svm <- data.frame(err.linear.svm, err.nonlinear.svm)
apply(err.svm, 2, mean)
## err.linear.svm err.nonlinear.svm
## 0.1946902 0.1908783
The linear and non-linear SVMs perform similarly, with CV risks of 19.5% and 19.1%, respectively.
# **Neural Network**
# err.nn <- numeric(folds)
# for (i in 1:folds){
# dat1 <- dat[, which(names(dat) !="event")]
# dat1[,] <- lapply(dat1, function (x) {
# if(is.factor(x)) as.integer(x) else (x)
# })
# dat1$event <- dat$event
#
# test <- dat1[id == i, ]
# training <- dat1[id != i, ]
#
# fit.nn <- neuralnet(event ~ ., data = training, hidden = 10,
# act.fct = "logistic", linear.output = F, threshold = .4)
# predict(fit.nn, newdata = test)
# pred.nn <- ifelse(predict(fit.nn, newdata = test)<.5, 0, 1)
# err.nn[i] <- mean(pred.nn != test$event)
# }
Compare the 10-fold cv risks for all above methods:
ans <- data.frame(fold = 1:10,
knn.all = err.knn1, knn.best = err.knn2,
glm = err.glm,
lda.all = err.lda1, lda.best = err.lda2,
qda.all = err.qda1, qda.best = err.qda2,
ElasticNet = err.net,
single.tree = err.prune.tree, random.forest = err.rf,
bagged.tree = err.bagged, gbm = err.gbm, adaboost = err.adaboost,
linear.svm = err.linear.svm, nonlinear.svm = err.nonlinear.svm)
apply(ans[, -1], 2, mean) %>%
as.data.frame() -> ans.tab
colnames(ans.tab) <- "cv.error"
ans.tab %>%
kable() %>%
kable_styling()
model | cv.error |
---|---|
knn.all | 0.3076935 |
knn.best | 0.2565231 |
glm | 0.1909857 |
lda.all | 0.1924246 |
lda.best | 0.1880006 |
qda.all | 0.2483947 |
qda.best | 0.2139858 |
ElasticNet | 0.2036454 |
single.tree | 0.3010308 |
random.forest | 0.2013798 |
bagged.tree | 0.2223290 |
gbm | 0.3165468 |
adaboost | 0.2906904 |
linear.svm | 0.1946902 |
nonlinear.svm | 0.1908783 |
Among all these methods, only the following 5 models have 10-fold cv risks lower than 20%:
ans.tab %>%
filter(cv.error < .2) %>%
kable() %>%
kable_styling()
model | cv.error |
---|---|
glm | 0.1909857 |
lda.all | 0.1924246 |
lda.best | 0.1880006 |
linear.svm | 0.1946902 |
nonlinear.svm | 0.1908783 |
GLM, LDA and SVM have the best performance, with 10-fold CV risks below 20%. It is worth noting that GLM, LDA and the linear SVM all produce linear (or nearly linear) classification boundaries, so a linear classifier appears to be the most suitable model for these data.
Thus I would propose Linear Discriminant Analysis to the authorities: it provides the smallest cross-validation error rate, and it performs well precisely because the classification boundary appears to be almost linear.
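For reference, LDA assigns an observation \(x\) to the class \(k\) with the largest discriminant score
\[ \delta_k(x) = x^\top \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^\top \Sigma^{-1}\mu_k + \log\pi_k, \]
where \(\mu_k\) and \(\pi_k\) are the class mean and prior and \(\Sigma\) is the pooled covariance matrix; because \(\delta_k\) is linear in \(x\), the resulting decision boundary is linear.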
# Exhaustive best subset selection
regfit.full <- regsubsets(event ~., dat, nvmax = 31)
reg.summary <- summary(regfit.full)
np <- which.min(reg.summary$cp)
# Forward best subset selection
regfit.forward <- regsubsets(event ~., dat, nvmax = 31, method = "forward")
np.forward <- which.min(summary(regfit.forward)$cp)
# Backward best subset selection
regfit.backward <- regsubsets(event ~., dat, nvmax = 31, method = "backward")
np.backward <- which.min(summary(regfit.backward)$cp)
# paste0(names(coef(regfit.full, np)[-1]), collapse = ", ")
plot(1:30, reg.summary$cp, xlab = "number of predictors", ylab = "Cp",
type = "b", main = "Best subset selection")
points(np, reg.summary$cp[np], col = "red", cex = 2, pch = 20)
Best subset selection was also conducted on the whole dataset to choose the best number of predictors, using Cp as the measure of in-sample error. Exhaustive, forward and backward selection all gave the same result: the best subset contains 12 predictors: age, bmi, smoke_vapeYes, dm_factorYes, cancerYes, first_cxr_results_2_factorUnchecked, duration_symptoms, ed_before_order_setYes, s_bp_noninvasive_d, vs_bp_noninvasive_s, vs_hr_hr and xp_resp_rate_pt. In other words, the important baseline predictors are age, BMI, ever smoker/vaper, diabetes mellitus, any cancer, initial chest X-ray findings (choice = Clear), duration of symptoms and admission to the ED before the lab order set change, and the important vital-sign features are blood pressure, heart rate and respiratory rate.
form.final <- formula("event ~ age + bmi +smoke_vape +dm_factor +cancer +first_cxr_results_2_factor +duration_symptoms +ed_before_order_set +s_bp_noninvasive_d +vs_bp_noninvasive_s +vs_hr_hr +xp_resp_rate_pt")
fit.final <- lda(form.final, data = dat)
# ROC curve
rocplot <- function(truth, pred, ...){
predob = prediction(pred, truth)
perf = performance(predob, "tpr", "fpr")
plot(perf, colorize=F, ...)
title("ROC Plot")
area = auc(truth, pred)
area = format(round(area, 3), nsmall = 4)
text(x = .8, y = .1, labels = paste("AUC = ", area))
segments(x0 = 0, y0 = 0, x1 = 1, y1 = 1, col = "gray", lty = 2)
}
final.pred <- predict(fit.final, dat, type = "response")
rocplot(dat$event, final.pred$posterior[,2])
The ROC curve for the final model is shown above, with an AUC of 0.901, which looks quite good. The coefficients and the exponentiated coefficients for each feature in the final model are presented below:
ans <- data.frame(coef(fit.final))
colnames(ans) <- "coefficient"
rownames(ans) <- c("Age", "BMI", "Smoke Vape(Yes)",
"Diabetes Mellitus(Yes)", "Any Cancer(Yes)",
"Initial Chest X-ray Findings (choice=Clear)(Unchecked)",
"Duration of Symptoms",
"Admitted to ED before lab order set change(Yes)",
"Blood Pressure Noninvasive d",
"Blood Pressure Noninvasive s",
"Heart Rate",
"Respiratory Rate")
ans$exp.coefficient <- exp(ans$coefficient)
ans %>%
kable() %>%
kable_styling()
feature | coefficient | exp.coefficient |
---|---|---|
Age | -0.0413109 | 0.9595308 |
BMI | -0.0884343 | 0.9153632 |
Smoke Vape(Yes) | 0.4422673 | 1.5562317 |
Diabetes Mellitus(Yes) | 0.2095303 | 1.2330987 |
Any Cancer(Yes) | -0.5455469 | 0.5795248 |
Initial Chest X-ray Findings (choice=Clear)(Unchecked) | -0.1609632 | 0.8513234 |
Duration of Symptoms | -0.0451409 | 0.9558628 |
Admitted to ED before lab order set change(Yes) | -0.2024069 | 0.8167625 |
Blood Pressure Noninvasive d | -0.4361270 | 0.6465356 |
Blood Pressure Noninvasive s | -0.3503263 | 0.7044582 |
Heart Rate | -0.7177642 | 0.4878417 |
Respiratory Rate | -0.0786959 | 0.9243210 |
The results show that all coefficients except those for ever smoker/vaper and diabetes mellitus are negative. A negative coefficient (i.e., exp.coefficient < 1) means that the factor is negatively associated with the need for intubation, while a positive coefficient (i.e., exp.coefficient > 1) means the factor is positively associated with the need for intubation. To be specific:
For continuous predictors, take age as an example: the coefficient for age is -0.041 and its exponentiated value is 0.96, meaning the odds of intubation decrease by a factor of 0.96 for each one-unit increase in age, controlling for the other covariates.
For categorical predictors, take smoke_vape as an example: the coefficient is 0.442 and its exponentiated value is 1.56, meaning the odds of intubation for a patient who has ever smoked or vaped are 1.56 times those of a patient who has not, controlling for the other covariates.
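Beyond the ROC curve and the coefficients, the in-sample confusion matrix of the final LDA model could also be inspected. This is only a sketch using the objects defined above and was not part of the original output:
# Cross-tabulate the LDA class predictions against the observed outcome
table(Predicted = final.pred$class, Observed = dat$event)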
Table 1
library(table1)
rndr <- function(x, name, ...){
if (length(x) == 0){
y <- dat1[[name]]
s <- rep("", length(render.default(x = y, name = name)))
if (is.numeric(y)){
p <- t.test(y ~ dat1$event_p)$p.value
} else{
p <- chisq.test(table(y, droplevels(dat1$event_p)))$p.value
}
s[2] <- sub("<", "&lt;", format.pval(p, digits = 3, eps = 0.001))
s
} else{
render.default(x = x, name = name, ...)
}
}
rndr.strat <- function(label, n, ...){
ifelse(n == 0, label, render.strat.default(label, n, ...))
}
dat1 <- dat
colnames(dat1) <- c("age","sex","BMI","hypoxia_ed","smoke_vape",
"diabetes","hypertension",
"pulmonary_disease","CKD","ESRD","CAD","cancer",
"immunosuppression","Fever","Cough",
"diarrhea","nausea","myalgias","dyspnea",
"csr_clear","csr_unilateral infiltrate",
"csr_bilateral_infiltrates","csr_pleural_effusion","duration",
"Ed_before_order","event", "blood_pressure_noninvasive_d",
"blood pressure noninvasive s","heart rate",
"Respiratory Rate", "SpO2")
dat1 <- clean_names(dat1)
formu <- paste0("~ ", paste0(colnames(dat1)[-26], collapse = " + "), "| event_p")
dat1$event_p <- factor(dat$event, levels = c("No", "Yes", "p-val"), labels = c("No", "Yes", "p-val"))
table1(formula(formu), data = dat1, droplevels = F, render = rndr, render.strat = rndr.strat, overall = F)
variable | No (N=700) | Yes (N=645) | p-val |
---|---|---|---|
age | |||
Mean (SD) | 71.7 (15.8) | 58.6 (16.4) | <0.001 |
Median [Min, Max] | 73.5 [-2.08, 114] | 59.3 [5.41, 95.1] | |
sex | |||
Female | 296 (42.3%) | 261 (40.5%) | 0.534 |
Male | 404 (57.7%) | 384 (59.5%) | |
bmi | |||
Mean (SD) | 29.5 (6.95) | 26.0 (5.44) | <0.001 |
Median [Min, Max] | 28.5 [15.9, 58.9] | 26.0 [9.86, 50.8] | |
hypoxia_ed | |||
No | 319 (45.6%) | 295 (45.7%) | 0.995 |
Yes | 381 (54.4%) | 350 (54.3%) | |
smoke_vape | |||
No | 523 (74.7%) | 441 (68.4%) | 0.0118 |
Yes | 177 (25.3%) | 204 (31.6%) | |
diabetes | |||
No | 506 (72.3%) | 421 (65.3%) | 0.00657 |
Yes | 194 (27.7%) | 224 (34.7%) | |
hypertension | |||
No | 310 (44.3%) | 287 (44.5%) | 0.982 |
Yes | 390 (55.7%) | 358 (55.5%) | |
pulmonary_disease | |||
Checked | 33 (4.7%) | 34 (5.3%) | 0.731 |
Unchecked | 667 (95.3%) | 611 (94.7%) | |
ckd | |||
Checked | 33 (4.7%) | 24 (3.7%) | 0.443 |
Unchecked | 667 (95.3%) | 621 (96.3%) | |
esrd | |||
Checked | 36 (5.1%) | 44 (6.8%) | 0.236 |
Unchecked | 664 (94.9%) | 601 (93.2%) | |
cad | |||
No | 576 (82.3%) | 533 (82.6%) | 0.923 |
Yes | 124 (17.7%) | 112 (17.4%) | |
cancer | |||
No | 631 (90.1%) | 612 (94.9%) | 0.00148 |
Yes | 69 (9.9%) | 33 (5.1%) | |
immunosuppression | |||
unknown/No | 684 (97.7%) | 621 (96.3%) | 0.165 |
Yes | 16 (2.3%) | 24 (3.7%) | |
fever | |||
Checked | 433 (61.9%) | 431 (66.8%) | 0.0656 |
Unchecked | 267 (38.1%) | 214 (33.2%) | |
cough | |||
Checked | 483 (69.0%) | 449 (69.6%) | 0.854 |
Unchecked | 217 (31.0%) | 196 (30.4%) | |
diarrhea | |||
Checked | 168 (24.0%) | 173 (26.8%) | 0.26 |
Unchecked | 532 (76.0%) | 472 (73.2%) | |
nausea | |||
Checked | 144 (20.6%) | 126 (19.5%) | 0.685 |
Unchecked | 556 (79.4%) | 519 (80.5%) | |
myalgias | |||
Checked | 147 (21.0%) | 113 (17.5%) | 0.122 |
Unchecked | 553 (79.0%) | 532 (82.5%) | |
dyspnea | |||
Checked | 422 (60.3%) | 398 (61.7%) | 0.633 |
Unchecked | 278 (39.7%) | 247 (38.3%) | |
csr_clear | |||
Checked | 85 (12.1%) | 83 (12.9%) | 0.749 |
Unchecked | 615 (87.9%) | 562 (87.1%) | |
csr_unilateral_infiltrate | |||
Checked | 88 (12.6%) | 72 (11.2%) | 0.476 |
Unchecked | 612 (87.4%) | 573 (88.8%) | |
csr_bilateral_infiltrates | |||
Checked | 497 (71.0%) | 484 (75.0%) | 0.109 |
Unchecked | 203 (29.0%) | 161 (25.0%) | |
csr_pleural_effusion | |||
Checked | 34 (4.9%) | 40 (6.2%) | 0.337 |
Unchecked | 666 (95.1%) | 605 (93.8%) | |
duration | |||
Mean (SD) | 9.53 (5.67) | 8.21 (4.43) | <0.001 |
Median [Min, Max] | 9.00 [1.00, 35.0] | 9.00 [1.00, 27.0] | |
ed_before_order | |||
No | 398 (56.9%) | 405 (62.8%) | 0.0307 |
Yes | 302 (43.1%) | 240 (37.2%) | |
blood_pressure_noninvasive_d | |||
Mean (SD) | 60.2 (0.977) | 59.8 (0.985) | <0.001 |
Median [Min, Max] | 60.1 [55.9, 63.0] | 59.8 [55.7, 63.3] | |
blood_pressure_noninvasive_s | |||
Mean (SD) | 130 (0.960) | 130 (0.997) | <0.001 |
Median [Min, Max] | 130 [127, 133] | 130 [126, 133] | |
heart_rate | |||
Mean (SD) | 75.3 (0.924) | 74.6 (0.917) | <0.001 |
Median [Min, Max] | 75.3 [72.4, 78.1] | 74.6 [71.0, 77.3] | |
respiratory_rate | |||
Mean (SD) | 30.0 (1.01) | 30.0 (0.986) | 0.721 |
Median [Min, Max] | 30.0 [26.9, 33.8] | 29.9 [27.3, 33.3] | |
sp_o2 | |||
Mean (SD) | 92.5 (0.982) | 92.5 (0.981) | 0.947 |
Median [Min, Max] | 92.5 [89.2, 95.3] | 92.5 [89.2, 95.2] |
At the 0.1 significance level, only 11 predictors are significant: age, bmi, smoke_vape, dm_factor, cancer, symptoms_1_factor, duration_symptoms, ed_before_order_set, s_bp_noninvasive_d, vs_bp_noninvasive_s and vs_hr_hr.