Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In OSIC Pulmonary Fibrosis Progression competition on Kaggle, participants are tasked to determine the likelihood of recovery (prognosis) of several patients affected by a lung disease. For each patient, the maximum volume of air they can exhale after a maximum inhalation (FVC, Forced Vital Capacity) is measured over the weeks, for approximately 1-2 years of time.
In addition, we have the following information about these people:
- A chest computer scan obtained at time
Week=0
- Their age
- Their sex
- Their smoking status: currently smokes, ex-smoker, never smoked
The challenge is to assess the lung function’s health by forecasting the FVC (I’m not asking myself here, if it’s the good or bad way to do that). What I like about this competition, is that there are many ways to approach it. Here’s a non-exhaustive list:
1.One way could be to construct a Statistical/Machine Learning (ML) model on the whole dataset, and study the (conditional) distribution of the FVC, knowing the scan, age, sex, and smoking status. In this first approach we consider that disease evolution can be generalized among categories of patients sharing the same patterns. A Bayesian ML model could capture the uncertainty around predictions, or we could use a more or less sophisticated bootstrapping procedure for the same purpose. Or, even, consider that ML model residuals are irregularly spaced time series.
2.Another way, the quick and dirty one I’ll present here, considers each patient’s case individually. Age, sex, smoking status and the chest scan are not used, but the measurement week is. If we are only interested in forecasting the FVC, the approach will be fine. But if we want to understand how each one of the factors we previously described influence the FVC, either individually or in conjunction, then the first approach is better.
0 – Functions
These are the functions that I use in the analysis. The first one extracts a patient’s information from the whole database, based on his/her identifier. The second one fits a smoothing spline to a patient’s data, and forecasts his/her FVC.
get patient data
suppressPackageStartupMessages(library(dplyr)) # 0 - 1 get patient data ----- get_patient_data <- function(id, train) { df <- dplyr::select(dplyr::filter(train, Patient == id), c(Weeks, FVC)) df$log_Weeks <- log(13 + df$Weeks) # the relative timing of FVC measurements (varies widely) df$log_FVC <- log(df$FVC) # transformed response variable df$Patient <- id return(df) }
fit and forecast FVC
# 0 - 2 fit, predict and plot ----- fit_predict <- function(df, plot_=TRUE) { min_week <- 13 n <- nrow(df) test_seq_week <- seq(-12, 133) log_test_seq_week <- log(min_week + test_seq_week) # Fit a smoothing spline, using Leave-one-out cross-validation for regularization fit_obj <- stats::smooth.spline(x = df$log_Weeks, y = df$log_FVC, cv = TRUE) resids <- residuals(fit_obj) mean_resids <- mean(resids) conf <- max(exp(sd(resids)), 70) # https://www.kaggle.com/c/osic-pulmonary-fibrosis-progression/overview/evaluation preds <- predict(fit_obj, x=log_test_seq_week) res <- list(Weeks_pred = test_seq_week, FVC_pred = exp(preds$y)) conf_sqrt_n <- conf/sqrt(n) ubound <- res$FVC_pred + mean_resids + 1.96*conf_sqrt_n # strong hypothesis lbound <- res$FVC_pred + mean_resids - 1.96*conf_sqrt_n if (plot_) { leg.txt <- c("Measured FVC", "Interpolated/Extrapolated FVC", "95% Confidence interval bound") plot(df$Weeks, df$FVC, col="blue", type="l", lwd=3, xlim = c(-12, 133), ylim = c(min(min(lbound), min(df$FVC)), max(max(ubound), max(df$FVC)) ), xlab = "Week", ylab = "FVC", main = paste0("Patient: ", df$Patient[1])) lines(res$Weeks_pred, res$FVC_pred) lines(res$Weeks_pred, ubound, lty=2, col="red") lines(res$Weeks_pred, lbound, lty=2, col="red") abline(v = max(df$Weeks), lty=2) legend("bottomright", legend = leg.txt, lwd=c(3, 1, 1), lty=c(1, 1, 2), col = c("blue", "black", "red")) } return(invisible(list(res = res, conf = rep(conf, length(res$FVC_pred)), mean = res$FVC_pred, ubound = ubound, lbound = lbound, resids = resids))) }
1 – Import the whole dataset
# Training set data train <- read.csv("~/Documents/Kaggle/OSIC_August2020/train.csv") # Training set snippet print(head(train)) print(tail(train))
2 – Predict FVC for a few patients (4)
# Four patient ids are selected ids <- c("ID00421637202311550012437", "ID00422637202311677017371", "ID00426637202313170790466", "ID00248637202266698862378") #par(mfrow=c(2, 2)) for(i in 1:length(ids)) { # Extract patient's data based on his/her ID (df <- get_patient_data(id=ids[i], train)) # Obtain FVC forecasts, with 95% confidence interval # warnings when repeated measures in the same week suppressWarnings(fit_predict(df, plot_=TRUE)) }
For a quick and dirty baseline model, this one seems to produce quite coherent forecasts, which could be used for decision making. Of course, validation data (unseen by the model) could reveal a whole different truth.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.