
Boundary Conditions and Anatomy – Exploring Correlated Data Simulation in R

[This article was first published on [R]eliability, and kindly contributed to R-bloggers.]
  • Today we have been assigned the task of identifying boundary conditions for a benchtop durability test of an implantable, artificial heart valve. In other words, we need to identify credible parameters for a physical test such that our test engineers can challenge the device under severe but realistic geometries and loads. To facilitate this task our clinical team has analyzed images and extracted pressure measurements and geometric features for a sample of n=300 patients. The rest of this post explores what we should do with these data.

    For the sake of simplicity, assume that the three parameters we care about are the ellipticity of the vessel cross section, the curvature of the vessel in the region of interest, and the blood pressure. Features such as these are important because they influence both the equilibrium geometry and the magnitude of forces acting on the implantable valve (in other words: the boundary conditions). The image below shows a schematic/example of ellipticity and vessel curvature in the LVOT and aortic valve annulus as observed in CT imaging.1

    There are two main challenges when working with these data:

    • How do we use our sample to simulate the full population?
    • How do we use the simulated, full population to identify groups of interest and recommend boundary conditions for the test?

    Here is our dataset.2 It turns out that each of these features is well described by a lognormal distribution, and we will assume that this is confirmed via prior domain knowledge. Large pressures, large ellipticities, and small radii of curvature are all bad because they represent more extreme geometries and/or loads that the implant must resist without migrating or fracturing.

    As usual, we’re working in R and will lean on the tidyverse packages to accelerate things. I wasn’t intending for this post to go as long as it did, so I’m also offering this table of contents to show I bear no malice.

    library(readxl)
    library(knitr)
    library(DiagrammeR)
    library(fitdistrplus)
    library(MASS)
    library(ggrepel)
    library(ks)
    library(broom)
    library(ggExtra)
    library(GGally)
    library(car)
    library(rgl)
    library(anySim)
    library(tidyverse)

    Start by reading in the data and taking a look at the format.

    sample_data <- readRDS(file = "sim_anatomy_data.rds")
    sample_data
    ## # A tibble: 300 x 3
    ##    ellip  curv pressure
    ##    <dbl> <dbl>    <dbl>
    ##  1  1.26  4.51     92.7
    ##  2  1.28  5.02    183. 
    ##  3  1.29  4.03    154. 
    ##  4  1.23  2.14    109. 
    ##  5  1.13  3.67    124. 
    ##  6  1.22  2.37    114. 
    ##  7  1.10  3.06    113. 
    ##  8  1.04  2.31    105. 
    ##  9  1.11  5.31    115. 
    ## 10  1.09  2.04    109. 
    ## # ... with 290 more rows

    As expected, 300 rows with our 3 features of interest.
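    The lognormal assumption from earlier can also be sanity-checked at this point. Below is a minimal sketch using fitdistrplus (already loaded above), shown for the ellipticity variable only; the same check applies to curvature and pressure.

    # optional check of the lognormal assumption for one variable
    ellip_check <- fitdistrplus::fitdist(sample_data$ellip, "lnorm")
    
    # density, Q-Q, CDF, and P-P diagnostic panels
    plot(ellip_check)
    
    # goodness-of-fit statistics (AIC, BIC, KS, AD, etc.)
    gofstat(ellip_check)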

    Key Point:

    Since we are asked to find a worst-case set of values, it might seem reasonable or tempting to extract the maximum value from each variable (or maybe something like the 95th percentile) and report those values together as a conservative worst case for ellipticity, curvature, and pressure. Our test engineers would then set up a benchtop test to challenge our prototype devices under those conditions to see if they survive. The problem with this approach is that each row of data comes from a specific patient, so the variables may be correlated. It could be that those severe, 95th-percentile values never occur together in the same patient. If we chose them together, we would over-test and over-design the device, potentially setting the program way behind. We must instead look at the data as a joint distribution and investigate the correlations and covariance among the variables.
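    As a quick illustration of that point (a sketch using just ellipticity and pressure), we can count how many of the 300 patients sit at or beyond both marginal 95th percentiles at the same time, and compare that to what independence would predict:

    # marginal 95th percentiles, taken separately
    ellip_95 <- quantile(sample_data$ellip, 0.95)
    pressure_95 <- quantile(sample_data$pressure, 0.95)
    
    # how often do both extremes occur in the same patient?
    sample_data %>%
      summarise(
        n_at_or_beyond_both = sum(ellip >= ellip_95 & pressure >= pressure_95),
        n_expected_if_independent = 0.05 * 0.05 * n()
      )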

    Let’s verify the shape of the distributions for each feature and see if there is any correlation between the variables:

    ellip_curv_plt <- sample_data %>%
      ggplot(aes(x = ellip, y = curv)) +
      geom_point(alpha = .5) +
      labs(
        title = "Patient Data From n=300 Scans",
        subtitle = "Vessel Ellipticity and Vessel Curvature Joint Distribution",
        x = "Ellipticity",
        y = "Curvature (mm)"
      )
    
    ellip_pressure_plt <- sample_data %>%
      ggplot(aes(x = ellip, y = pressure)) +
      geom_point(alpha = .5, color = "firebrick") +
      labs(
        title = "Patient Data From n=300 Scans",
        subtitle = "Vessel Ellipticity and Blood Pressure Joint Distribution",
        x = "Ellipticity",
        y = "Pressure (mm Hg)"
      )
    
    curv_pressure_plt <- sample_data %>%
      ggplot(aes(x = curv, y = pressure)) +
      geom_point(alpha = .5, color = "limegreen") +
      labs(
        title = "Patient Data From n=300 Scans",
        subtitle = "Vessel Curvature and Blood Pressure Joint Distribution",
        x = "Curvature (mm)",
        y = "Pressure (mm Hg"
      )
    
    ellip_curv_mplt <- ggExtra::ggMarginal(ellip_curv_plt, type = "density", fill = "#2c3e50", alpha = .5)
    ellip_pressure_mplt <- ggExtra::ggMarginal(ellip_pressure_plt, type = "density", fill = "firebrick", alpha = .5)
    curv_pressure_mplt <- ggExtra::ggMarginal(curv_pressure_plt, type = "density", fill = "limegreen", alpha = .5)

    Correlations in the Original Dataset

    ggcorr() from the GGally package is very convenient for visualizing correlations.

    sample_data %>% ggcorr(
      high = "#20a486ff",
      low = "#fde725ff",
      label = TRUE,
      hjust = .75,
      size = 3,
      label_size = 3,
      label_round = 3,
      nbreaks = 3
    ) +
      labs(
        title = "Correlation Matrix - n=300 Patient Set",
        subtitle = "Pearson Method Using Pairwise Observations"
      )

    We see that there are some positive correlations in this dataset.

    To build out the sample into a simulated population we will fit distributions via maximum likelihood and use those models to push out a large number of draws. If each variable were independent the job would be easy – just execute a few rlnorm() calls and bind them together, as sketched below. The job is more challenging when the variables are correlated because they must be simulated jointly. I will show 2 approaches in the sections below. Note that the 2nd approach is more efficient, but it helped me to walk through the first one to understand the workflow. If you are impatient I would skip to the section on Approach 2.
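    For contrast, here is a minimal sketch of that independent version (parameter values rounded from the fits in Step 1 below). The marginal distributions come out fine, but the off-diagonal correlations land near zero instead of the values we saw in the sample:

    # naive simulation: correct marginals, but no correlation structure
    set.seed(2015)
    naive_sim <- tibble(
      ellip = rlnorm(10000, meanlog = 0.19, sdlog = 0.06),
      curv = rlnorm(10000, meanlog = 1.16, sdlog = 0.31),
      pressure = rlnorm(10000, meanlog = 4.78, sdlog = 0.19)
    )
    
    # off-diagonal terms are approximately 0, unlike the observed data
    cor(naive_sim)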

    Approach 1 – Manually Transform Everything to Normal

    For cases where each variable is normal, or can easily be transformed to normal, there is a straightforward workflow to generate a simulated joint distribution using the mvrnorm() function from the MASS package:

    In our case we can easily convert our data from lognormal to normal, use mvrnorm() to draw from a multivariate normal distribution, and then undo the transformation to recover a simulated population with the desired correlations (as shown above). From there we can identify patients of interest, whether they be extreme, challenging cases or a central, common group.

    Per the workflow above, start by fitting the native data to lognormal distributions using fitdist() and extract the parameters. Storing all the parameters as objects is a bit tedious and I only do it here so we can make a nice summary table of everything at the end.

    Step 1 – Fit Distributions to Each Variable

    ellip_fit <- fitdist(sample_data$ellip, "lnorm")
    curv_fit <- fitdist(sample_data$curv, "lnorm")
    pressure_fit <- fitdist(sample_data$pressure, "lnorm")
    
    # store lognormal parameters of original data
    ellip_meanlog <- ellip_fit$estimate[["meanlog"]]
    ellip_sdlog <- ellip_fit$estimate[["sdlog"]]
    curv_meanlog <- curv_fit$estimate[["meanlog"]]
    curv_sdlog <- curv_fit$estimate[["sdlog"]]
    pressure_meanlog <- pressure_fit$estimate[["meanlog"]]
    pressure_sdlog <- pressure_fit$estimate[["sdlog"]]
    
    # store correlations in original data
    cor_ec <- cor(x = sample_data$ellip, y = sample_data$curv)
    cor_ep <- cor(x = sample_data$ellip, y = sample_data$pressure)
    cor_cp <- cor(x = sample_data$curv, y = sample_data$pressure)
    
    # store covariances in original data
    cov_ellip_curv <- cov(x = sample_data$ellip, y = sample_data$curv)
    cov_ellip_ellip <- cov(x = sample_data$ellip, y = sample_data$ellip)
    cov_curv_curv <- cov(x = sample_data$curv, y = sample_data$curv)
    cov_ellip_pressure <- cov(x = sample_data$ellip, y = sample_data$pressure)
    cov_pressure_pressure <- cov(x = sample_data$pressure, y = sample_data$pressure)
    cov_curv_pressure <- cov(x = sample_data$curv, y = sample_data$pressure)
    
    # summarize the parameters and reshape a bit
    original_data_param_tbl <- tibble(
      ellip_meanlog = ellip_meanlog,
      ellip_sdlog = ellip_sdlog,
      curv_meanlog = curv_meanlog,
      curv_sdlog = curv_sdlog,
      pressure_meanlog = pressure_meanlog,
      pressure_sdlog = pressure_sdlog,
      ellip_curv_correlation = cor_ec,
      ellip_pressure_correlation = cor_ep,
      curv_pressure_correlation = cor_cp,
      ellip_ellip_covariance = cov_ellip_ellip,
      ellip_curv_covariance = cov_ellip_curv,
      curv_curv_covariance = cov_curv_curv,
      ellip_pressure_covariance = cov_ellip_pressure,
      pressure_pressure_covariance = cov_pressure_pressure,
      curv_pressure_covariance = cov_curv_pressure
    ) %>%
      pivot_longer(cols = everything(), names_to = "feature", values_to = "value") %>%
      mutate(dataset = "original_data") %>%
      mutate_if(is.character, as_factor)
    
    # View summary table of original data
    original_data_param_tbl %>%
      kable(align = "c", digits = 3)
    feature value dataset
    ellip_meanlog 0.193 original_data
    ellip_sdlog 0.064 original_data
    curv_meanlog 1.158 original_data
    curv_sdlog 0.309 original_data
    pressure_meanlog 4.783 original_data
    pressure_sdlog 0.191 original_data
    ellip_curv_correlation 0.268 original_data
    ellip_pressure_correlation 0.369 original_data
    curv_pressure_correlation 0.213 original_data
    ellip_ellip_covariance 0.006 original_data
    ellip_curv_covariance 0.022 original_data
    curv_curv_covariance 1.157 original_data
    ellip_pressure_covariance 0.659 original_data
    pressure_pressure_covariance 530.683 original_data
    curv_pressure_covariance 5.285 original_data

    Step 2 – Transform all variables to normal

    A simple log operation brings the lognormal variable to normal.

    # transform original, lognormal data to normal
    normal_sample_data <- sample_data %>%
      mutate(
        n_ellip = log(ellip),
        n_curv = log(curv),
        n_pressure = log(pressure)
      )
    
    normal_sample_data %>%
      head() %>%
      kable(align = "c", digits = 3)
    ellip curv pressure n_ellip n_curv n_pressure
    1.255 4.506 92.739 0.228 1.505 4.530
    1.285 5.019 182.970 0.251 1.613 5.209
    1.289 4.027 153.858 0.254 1.393 5.036
    1.234 2.139 108.669 0.210 0.760 4.688
    1.133 3.673 123.633 0.125 1.301 4.817
    1.219 2.373 113.944 0.198 0.864 4.736

    Step 3 – Fit normal distributions to each transformed variable

    We don’t actually have to formally fit normal distributions, since the mean and standard deviation of the transformed data are available at any time from mean() and sd(). But we will extract and store the correlations and covariances for the simulation to come.
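    As a quick sanity check of that claim, the mean and standard deviation of the logged ellipticity data land essentially on the Step 1 estimates (the sdlog MLE divides by n while sd() divides by n - 1, so they differ only slightly):

    # compare the Step 1 lognormal fit to summary statistics of the logged data
    c(
      meanlog_fit = ellip_meanlog,
      meanlog_from_logs = mean(normal_sample_data$n_ellip),
      sdlog_fit = ellip_sdlog,
      sdlog_from_logs = sd(normal_sample_data$n_ellip)
    )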

    # get correlations of transformed, normal data
    ncor_ec <- cor(
      x = normal_sample_data$n_ellip,
      normal_sample_data$n_curv
    )
    ncor_ep <- cor(
      x = normal_sample_data$n_ellip,
      normal_sample_data$n_pressure
    )
    ncor_cp <- cor(
      x = normal_sample_data$n_curv,
      normal_sample_data$n_pressure
    )
    
    # get covariance of transformed, normal data
    n_cov_ellip_curv <- cov(
      x = normal_sample_data$n_ellip,
      y = normal_sample_data$n_curv
    )
    n_cov_ellip_ellip <- cov(
      x = normal_sample_data$n_ellip,
      y = normal_sample_data$n_ellip
    )
    n_cov_curv_curv <- cov(
      x = normal_sample_data$n_curv,
      y = normal_sample_data$n_curv
    )
    
    n_cov_ellip_pressure <- cov(
      x = normal_sample_data$n_ellip,
      y = normal_sample_data$n_pressure
    )
    n_cov_pressure_pressure <- cov(
      x = normal_sample_data$n_pressure,
      y = normal_sample_data$n_pressure
    )
    n_cov_curv_pressure <- cov(
      x = normal_sample_data$n_curv,
      y = normal_sample_data$n_pressure
    )

    Step 4 – Draw joint distribution using mvrnorm() or equivalent function

    Time to actually draw the correlated values. I store them here in an object called mult_norm.

    # draw from multivariate normal with parameters from transformed normal distributions and correlation
    set.seed(0118)
    
    mult_norm <- as_tibble(MASS::mvrnorm(
      10000, c(
        mean(normal_sample_data$n_ellip),
        mean(normal_sample_data$n_curv),
        mean(normal_sample_data$n_pressure)
      ),
      matrix(c(
        n_cov_ellip_ellip,
        n_cov_ellip_curv,
        n_cov_ellip_pressure,
        n_cov_ellip_curv,
        n_cov_curv_curv,
        n_cov_curv_pressure,
        n_cov_ellip_pressure,
        n_cov_curv_pressure,
        n_cov_pressure_pressure
      ), 3, 3)
    )) %>%
      rename(
        n_ellip_sim = V1,
        n_curv_sim = V2,
        n_pressure_sim = V3
      )

    Step 5 – Back-transform simulated data to original distribution

    Exponentiating the data brings it back to lognormal.

    # convert back to lognormal
    log_norm <- mult_norm %>%
      mutate(
        ellip_sim = exp(n_ellip_sim),
        curv_sim = exp(n_curv_sim),
        pressure_sim = exp(n_pressure_sim)
      )
    
    log_norm %>%
      head() %>%
      kable(align = "c", digits = 3)
    n_ellip_sim n_curv_sim n_pressure_sim ellip_sim curv_sim pressure_sim
    0.254 1.600 5.248 1.290 4.952 190.266
    0.233 1.038 5.107 1.262 2.823 165.178
    0.236 1.152 4.812 1.266 3.165 123.018
    0.313 1.003 5.048 1.368 2.727 155.636
    0.224 1.622 5.192 1.251 5.066 179.912
    0.197 1.486 4.822 1.218 4.422 124.185

    Step 6 – Evaluate parameters and marginal distributions of the back-transformed data

    # evaluate the marginal distributions of the simulated data
    ellip_sim_fit <- fitdistrplus::fitdist(log_norm$ellip_sim, "lnorm")
    curv_sim_fit <- fitdistrplus::fitdist(log_norm$curv_sim, "lnorm")
    pressure_sim_fit <- fitdistrplus::fitdist(log_norm$pressure_sim, "lnorm")

    Obtain and store the correlations, covariances, and parameters of the simulated set:

    # get correlation and covariances of simulated data
    sim_cor_ec <- cor(x = log_norm$ellip_sim, log_norm$curv_sim)
    sim_cor_ep <- cor(x = log_norm$ellip_sim, log_norm$pressure_sim)
    sim_cor_cp <- cor(x = log_norm$curv_sim, log_norm$pressure_sim)
    
    sim_cov_ellip_curv <- cov(x = log_norm$ellip_sim, y = log_norm$curv_sim)
    sim_cov_ellip_ellip <- cov(x = log_norm$ellip_sim, y = log_norm$ellip_sim)
    sim_cov_curv_curv <- cov(x = log_norm$curv_sim, y = log_norm$curv_sim)
    
    sim_cov_ellip_pressure <- cov(x = log_norm$ellip_sim, y = log_norm$pressure_sim)
    sim_cov_pressure_pressure <- cov(x = log_norm$pressure_sim, y = log_norm$pressure_sim)
    sim_cov_curv_pressure <- cov(x = log_norm$curv_sim, y = log_norm$pressure_sim)
    
    # store parameters of simulated data
    ellip_sim_meanlog <- ellip_sim_fit$estimate[["meanlog"]]
    ellip_sim_sdlog <- ellip_sim_fit$estimate[["sdlog"]]
    curv_sim_meanlog <- curv_sim_fit$estimate[["meanlog"]]
    curv_sim_sdlog <- curv_sim_fit$estimate[["sdlog"]]
    pressure_sim_meanlog <- pressure_sim_fit$estimate[["meanlog"]]
    pressure_sim_sdlog <- pressure_sim_fit$estimate[["sdlog"]]
    
    # collect parameters from simulated data
    sim_data_param_tbl <- tibble(
      ellip_meanlog = ellip_sim_meanlog,
      ellip_sdlog = ellip_sim_sdlog,
      curv_meanlog = curv_sim_meanlog,
      curv_sdlog = curv_sim_sdlog,
      pressure_meanlog = pressure_sim_meanlog,
      pressure_sdlog = pressure_sim_sdlog,
    
      ellip_curv_correlation = sim_cor_ec,
      ellip_pressure_correlation = sim_cor_ep,
      curv_pressure_correlation = sim_cor_cp,
    
      ellip_curv_covariance = sim_cov_ellip_curv,
      ellip_ellip_covariance = sim_cov_ellip_ellip,
      curv_curv_covariance = sim_cov_curv_curv,
    
      ellip_pressure_covariance = sim_cov_ellip_pressure,
      pressure_pressure_covariance = sim_cov_pressure_pressure,
      curv_pressure_covariance = sim_cov_curv_pressure
    ) %>%
      pivot_longer(cols = everything(), names_to = "feature", values_to = "value") %>%
      mutate(dataset = "simulated_data") %>%
      mutate_if(is.character, as_factor)
    
    sim_data_param_tbl %>%
      kable(align = "c")
    feature value dataset
    ellip_meanlog 0.1932042 simulated_data
    ellip_sdlog 0.0630117 simulated_data
    curv_meanlog 1.1626798 simulated_data
    curv_sdlog 0.3092643 simulated_data
    pressure_meanlog 4.7878497 simulated_data
    pressure_sdlog 0.1900026 simulated_data
    ellip_curv_correlation 0.2505145 simulated_data
    ellip_pressure_correlation 0.3644292 simulated_data
    curv_pressure_correlation 0.1956149 simulated_data
    ellip_curv_covariance 0.0203344 simulated_data
    ellip_ellip_covariance 0.0058779 simulated_data
    curv_curv_covariance 1.1209300 simulated_data
    ellip_pressure_covariance 0.6534943 simulated_data
    pressure_pressure_covariance 547.0647415 simulated_data
    curv_pressure_covariance 4.8440727 simulated_data

    Compare Original Data to Simulated Data

    A bit more wrangling lets us compare the features of the original dataset to the new, simulated population to see if they agree.

    compare_tbl <- bind_rows(original_data_param_tbl, sim_data_param_tbl) %>%
      pivot_wider(id_cols = everything(), names_from = dataset)
    
    compare_tbl %>%
      kable(align = "c", digits = 3)
    feature original_data simulated_data
    ellip_meanlog 0.193 0.193
    ellip_sdlog 0.064 0.063
    curv_meanlog 1.158 1.163
    curv_sdlog 0.309 0.309
    pressure_meanlog 4.783 4.788
    pressure_sdlog 0.191 0.190
    ellip_curv_correlation 0.268 0.251
    ellip_pressure_correlation 0.369 0.364
    curv_pressure_correlation 0.213 0.196
    ellip_ellip_covariance 0.006 0.006
    ellip_curv_covariance 0.022 0.020
    curv_curv_covariance 1.157 1.121
    ellip_pressure_covariance 0.659 0.653
    pressure_pressure_covariance 530.683 547.065
    curv_pressure_covariance 5.285 4.844

    Everything appears to align well, but this sure took a while. Wouldn’t it be nice if there were a faster way than manually fitting and extracting values for mvrnorm()? Fortunately, we’re in the R ecosystem, where somebody smart has usually tackled the problem and provided the tools to the community. The tool that I had success with is the AnySim package.3

    Here’s how to perform the simulation in a much more efficient way. Note: this is how I created the original dataset of 300 that we’ve been working with.

    Approach 2 – AnySim

    The AnySim workflow:

    It may look similar in the flowchart, but in practice it’s way easier than Approach 1. Here’s how it works in only a few lines of code:

    set.seed(13)
    # Define the target distribution functions (ICDFs) of each random variable.
    
    ellip_dist <- "qlnorm"
    curv_dist <- "qlnorm"
    pressure_dist <- "qlnorm"
    
    # store the 3 ICDFs in a vector
    dist_vec <- c(ellip_dist, curv_dist, pressure_dist)
    
    # Define the parameters of the target distribution functions - store them in a list
    ellip_params <- list(meanlog = 0.20, sdlog = .067)
    curv_params <- list(meanlog = 1.15, sdlog = 0.3)
    pressure_params <- list(meanlog = 4.80, sdlog = 0.2)
    
    # this is a weird way to do it but I'm following along with an example from AnySim vignette :)
    params_list <- list(NULL)
    params_list[[1]] <- ellip_params
    params_list[[2]] <- curv_params
    params_list[[3]] <- pressure_params
    
    # Define the target correlation matrix.
    corr_matrix <- matrix(c(
      1, 0.21, 0.4,
      0.21, 1, .21,
      0.4, 0.21, 1
    ),
    ncol = 3,
    nrow = 3,
    byrow = T
    )
    # Estimate the parameters of the auxiliary Gaussian model.
    aux_gaussion_param_tbl <- EstCorrRVs(
      R = corr_matrix, dist = dist_vec, params = params_list,
      NatafIntMethod = "GH", NoEval = 9, polydeg = 8
    )
    
    
    # Generate 10000 synthetic realizations of the 3 correlated RVs.
    correlated_ln_draws_tbl <- as_tibble(SimCorrRVs(n = 10000, paramsRVs = aux_gaussion_param_tbl)) %>%
      rename(
        ellip = V1,
        curv = V2,
        pressure = V3
      )
    
    correlated_ln_draws_tbl %>%
      head(10) %>%
      kable(align = "c")
    ellip curv pressure
    1.267618 2.158739 133.3040
    1.198681 4.437176 127.1982
    1.375663 2.649938 160.2052
    1.236829 3.621202 141.5427
    1.318572 2.541340 129.3427
    1.255885 3.236722 146.9361
    1.326279 3.387357 167.8579
    1.240926 4.254830 145.7207
    1.191865 1.853877 87.7240
    1.315273 4.144667 90.5485

    Let’s see if we were able to reproduce the target distributions. First, the marginal parameters:

    # Fit synthetic data
    ellip_ln_fit <- tidy(fitdistr(correlated_ln_draws_tbl$ellip, "log-normal")) %>% mutate(
      var = "ellipticity",
      dataset = "sim_draws"
    )
    curv_ln_fit <- tidy(fitdistr(correlated_ln_draws_tbl$curv, "log-normal")) %>% mutate(
      var = "curvature",
      dataset = "sim_draws"
    )
    pressure_ln_fit <- tidy(fitdistr(correlated_ln_draws_tbl$pressure, "log-normal")) %>% mutate(
      var = "pressure",
      dataset = "sim_draws"
    )
    recovered_params_tbl <- bind_rows(ellip_ln_fit, curv_ln_fit, pressure_ln_fit)
    
    # Show target values
    target_tbl <- tibble(
      term = rep(c("meanlog", "sdlog"), 3),
      estimate = c(.20, .067, 1.15, .3, 4.80, .20),
      var = c("ellipticity", "ellipticity", "curvature", "curvature", "pressure", "pressure"),
      dataset = "target values"
    )
    
    # Compare simulated values to target values
    bind_rows(recovered_params_tbl, target_tbl) %>%
      select(-std.error) %>%
      pivot_wider(id_cols = everything(), names_from = dataset, values_from = estimate) %>%
      mutate_if(is.numeric, round, 2) %>%
      kable(align = rep("c"))
    term var sim_draws target values
    meanlog ellipticity 0.20 0.20
    sdlog ellipticity 0.07 0.07
    meanlog curvature 1.15 1.15
    sdlog curvature 0.30 0.30
    meanlog pressure 4.80 4.80
    sdlog pressure 0.20 0.20

    Excellent agreement. Now check to see if the correlations were preserved:

    # Check correlations
    correlated_ln_draws_tbl %>% ggcorr(
      high = "#20a486ff",
      low = "#fde725ff",
      label = TRUE,
      hjust = .75,
      size = 3,
      label_size = 3,
      label_round = 3,
      nbreaks = 3
    ) +
      labs(
        title = "Correlation Matrix",
        subtitle = "Pearson Method Using Pairwise Observations"
      )

    Again, perfect agreement to 2 decimal places. Thank you AnySim!

    Now that I’ve shown 2 ways to preserve the correlation structure in a simulation, we can return to the original question: what are the worst-case (or most common) values of this (simulated) population?4

    Using the Simulated Population to Define Desired Test Conditions and Groups of Interest

    The simulated population is useful because its density properties provide a means to determine how extreme any values of interest are. The trick is that we have to define the boundary of interest based on the question we are trying to answer. Two common questions we have pertain to finding the most common set of patients, or the most extreme patients relative to some point or region of interest. I discuss both below.

    Identify a percentage of worst-case patients relative to some value of interest

    Consider a 2d test case where we are interested in the joint distribution of 2 variables: curvature and pressure. These would be appropriate for something like a migration test, where the pressure wants to pull the implant out of place and a curved configuration is worse than a straight one (like a pipe elbow).

    Suppose we know that the worst-case, physiologically relevant conditions occur when the curvature is 1 mm (radius of curvature) and the pressure is 300 mm Hg (extremely high). How could we identify the worst-case 5% of patients using our simulated population data from above? In this case we don’t need an algorithm, just a bit of geometry.

    We can leverage the standard geometric formula for distance between two points:

    \[ d(p,q) = \sqrt{(p_{1}-q_{1})^{2} + (p_{2}-q_{2})^{2}} \]

    Here’s how to calculate this number for each value in the joint distribution for curvature and pressure:

    # specify theoretical worst-case point in space
    theoretical_worst_curv <- 1
    theoretical_worst_pressure <- 300
    
    # calculate each point's distance from the theoretical worst-case
    d_data <- log_norm %>%
      rowwise() %>%
      mutate(d = ((theoretical_worst_curv - curv_sim)^2 + (theoretical_worst_pressure - pressure_sim)^2)^.5) %>%
      arrange(desc(d)) %>%
      ungroup()
    
    # make ecdf to map points to percentiles of empirical distribution
    d_ECDF_fcn <- ecdf(d_data$d)
    
    # map the ecdf over all the calculated distances to convert them to percentiles
    d_pct_data <- d_data %>%
      mutate(d_pct = map_dbl(d, d_ECDF_fcn)) %>%
      mutate(in_out = case_when(
        d_pct <= .05 ~ "5% Worst-Case Points",
        TRUE ~ "95% Less Severe Population"
      )) %>%
      mutate(in_out = as_factor(in_out))
    
    wc_tbl <- tibble(x = 1, y = 300)

    Now visualize.

    # visualize
    d_plt <- d_pct_data %>%
      ggplot(aes(x = curv_sim, y = pressure_sim, color = in_out)) +
      geom_point(alpha = .5, size = 1) +
      geom_point(aes(x = 1, y = 300), color = "firebrick") +
      geom_label_repel(
        data = wc_tbl, aes(x, y),
        label = "Worst-Case Combination of \n vessel curvature and pressure",
        fill = "firebrick",
        color = "white",
        segment.color = "black",
        segment.size = 1,
        #                   min.segment.length = unit(1, "lines"),
        nudge_y = 50,
        nudge_x = 2
      ) +
      labs(
        title = "Joint Distribution of Curvature and Pressure",
        subtitle = "5% of points nearest to worst-case are identified",
        x = "Radius of Curvature (mm)",
        y = "Blood Presure (mm Hg)"
      ) +
      theme_classic() +
      ylim(c(0, 300)) +
      theme(
        legend.position = "bottom",
        legend.title = element_blank()
      ) +
      scale_color_viridis_d(option = "D", end = .7)
    
    d_plt

    It’s a bit of an illusion since the axes aren’t scaled 1:1. Here’s a look at a scaled version showing the teal points closest to the point of interest.
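    One way to produce such a scaled view (a rough sketch, not necessarily how the original figure was made) is to force a 1:1 aspect ratio with coord_equal(), which also makes it obvious that the raw-unit distance is dominated by the pressure axis:

    # scaled view: 1 mm of curvature spans the same screen distance as 1 mm Hg
    d_pct_data %>%
      ggplot(aes(x = curv_sim, y = pressure_sim, color = in_out)) +
      geom_point(alpha = .5, size = 1) +
      geom_point(aes(x = 1, y = 300), color = "firebrick") +
      coord_equal() +
      scale_color_viridis_d(option = "D", end = .7) +
      theme_classic() +
      theme(legend.position = "bottom", legend.title = element_blank()) +
      labs(x = "Radius of Curvature (mm)", y = "Blood Pressure (mm Hg)")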

    Neat! We just split the data into the most severe 5% and least severe 95% based on a known point we want to avoid. You could imagine a similar calculation if we had a line that defined a worst-case boundary like we use for the failure threshold in a Goodman Diagram.5

    In other cases we just want to know the most common values (values in the region of highest probability) based on the multi-dimensional density calculation. In other words, we want to identify contours based on how close the points are to each other, not some arbitrary point in space or line. A kernel density estimate is going to be our tool of choice in all but the most basic cases. The KDE is non-parametric and can accommodate very complex joint distributions and density shapes.6

    Typically, the default contours on a density plot show an output called “level” but they can be converted to denote upper percentages of highest density regions. This is the method I show below.7

    Identify probability contours using ks package and kernel density estimation

    This first chunk converts the data and generates the KDE. The bandwidth parameter controls the “smoothness” or granularity of the estimate and can be hard to specify in multiple dimensions. Hscv() provides a method of determining a reasonable bandwidth through cross-validation; see the documentation linked in the footnotes for more information if interested.

    # convert simulated data tibble to matrix
    d3m <- correlated_ln_draws_tbl %>%
      as.matrix()
    
    # cross-validated bandwidth for kd (takes a while to calculate)
    # hscv1 <- Hscv(correlated_ln_draws_tbl)
    # hscv1 %>% write_rds(here::here("hscv1.rds"))
    
    hscv1 <- read_rds(here::here("hscv1.rds"))
    
    # generate kernel density estimate from simulated population
    kd_d3m <- ks::kde(d3m, H = hscv1, compute.cont = TRUE)

    Density percentiles from kde estimate

    The KDE estimate provides density percentiles that can be used to generate the contours that define the density regions in multiple dimensions.

    # see the kde's calculated density thresholds for specified proportions
    cont_vals_tbl <- tidy(kd_d3m$cont) %>%
      mutate(n_row = row_number()) %>%
      mutate(probs = 100 - n_row) %>%
      select(probs, x)
    
    reference_grid_probs_tbl <- cont_vals_tbl %>%
      rename(estimate = x)
    
    reference_grid_probs_tbl %>%
      head(10) %>%
      kable(align = rep("c"))
    probs estimate
    99 0.0332949
    98 0.0321061
    97 0.0310773
    96 0.0303446
    95 0.0295937
    94 0.0288550
    93 0.0282270
    92 0.0276995
    91 0.0271418
    90 0.0266409
    reference_grid_probs_tbl %>%
      tail(10) %>%
      kable(align = rep("c"))
    probs estimate
    10 0.0018141
    9 0.0016195
    8 0.0014430
    7 0.0012654
    6 0.0010786
    5 0.0008795
    4 0.0007005
    3 0.0005288
    2 0.0003976
    1 0.0002657

    KDE estimates in the range of the variables

    By default the KDE provides density estimates for a grid of points that covers the space of the variables.

    kd_grid_estimates <- kd_d3m
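    A quick peek at the structure of that object shows what “grid” means here: eval.points holds one vector of grid values per variable, and estimate is a 3-d array of density values over that grid.

    # inspect the default evaluation grid
    length(kd_grid_estimates$eval.points)         # one set of grid points per variable
    sapply(kd_grid_estimates$eval.points, length) # grid resolution along each axis
    dim(kd_grid_estimates$estimate)               # 3-d array of density estimates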

    If we want to know the value at each point in the simulated population we use the eval.points argument.

    mc_estimates <- ks::kde(
      x = d3m, H = hscv1,
      compute.cont = TRUE,
      eval.points = correlated_ln_draws_tbl %>% as.matrix()
    )

    Here are a couple different ways to convert the kde object features into a tibble:

    mc_est_tbl_10000 <- tibble(estimate = mc_estimates$estimate) %>%
      bind_cols(correlated_ln_draws_tbl)
    kd_grid_est_tbl_29k <- broom:::tidy.kde(kd_grid_estimates) %>%
      pivot_wider(names_from = variable, values_from = value) %>%
      rename(ellip = x1, curv = x2, pressure = x3) %>%
      select(-obs)
    mc_est_tbl_10000 %>%
      head(10) %>%
      kable(align = "c")
    estimate ellip curv pressure
    0.0140082 1.267618 2.158739 133.3040
    0.0118702 1.198681 4.437176 127.1982
    0.0026530 1.375663 2.649938 160.2052
    0.0194232 1.236829 3.621202 141.5427
    0.0122134 1.318572 2.541340 129.3427
    0.0167723 1.255885 3.236722 146.9361
    0.0055484 1.326279 3.387357 167.8579
    0.0112270 1.240926 4.254830 145.7207
    0.0082627 1.191865 1.853877 87.7240
    0.0019651 1.315273 4.144667 90.5485
    kd_grid_est_tbl_29k %>%
      head(10) %>%
      kable(align = "c")
    estimate ellip curv pressure
    0 0.8880910 0.1331512 31.02793
    0 0.9131427 0.1331512 31.02793
    0 0.9381944 0.1331512 31.02793
    0 0.9632461 0.1331512 31.02793
    0 0.9882978 0.1331512 31.02793
    0 1.0133495 0.1331512 31.02793
    0 1.0384012 0.1331512 31.02793
    0 1.0634528 0.1331512 31.02793
    0 1.0885045 0.1331512 31.02793
    0 1.1135562 0.1331512 31.02793

    Identify the 5% threshold value:

    # 5% contour line from kd grid based on 10k MC data
    percentile_5 <- kd_d3m[["cont"]]["5%"]

    Verify that 5% (500/10,000) of the values fall below the threshold:

    mc_est_tbl_10000 %>% filter(estimate <= percentile_5)
    ## # A tibble: 500 x 4
    ##     estimate ellip  curv pressure
    ##        <dbl> <dbl> <dbl>    <dbl>
    ##  1 0.000243   1.19  6.67    166. 
    ##  2 0.000371   1.37  2.29     81.7
    ##  3 0.000356   1.22  2.73    200. 
    ##  4 0.000134   1.27  7.91    114. 
    ##  5 0.000560   1.47  3.89    163. 
    ##  6 0.0000732  1.47  4.67    282. 
    ##  7 0.000527   1.29  6.13    185. 
    ##  8 0.000254   1.48  2.44    136. 
    ##  9 0.000644   1.19  6.63    135. 
    ## 10 0.000856   1.43  4.98    156. 
    ## # ... with 490 more rows

    If we wanted to know the nearest probability contour line for every point we could make a function to do so.

    get_probs_fcn <- function(value) {
      t <- reference_grid_probs_tbl %>%
        mutate(value = value) %>%
        mutate(dif = abs(estimate - value)) %>%
        filter(dif == min(dif))
    
      t[[1, 1]]
    }

    Map the function over each value in the dataset.

    # mc_1_to_99_tbl <- mc_est_tbl_10000 %>%
    #   mutate(nearest_prob = map_dbl(estimate, get_probs_fcn))
    
    # mc_1_to_99_tbl %>% write_rds(here::here("mc_1_to_99_tbl.rds"))
    mc_1_to_99_tbl <- read_rds(here::here("mc_1_to_99_tbl.rds"))
    
    mc_1_to_99_tbl
    ## # A tibble: 10,000 x 5
    ##    estimate ellip  curv pressure nearest_prob
    ##       <dbl> <dbl> <dbl>    <dbl>        <dbl>
    ##  1  0.0140   1.27  2.16    133.            59
    ##  2  0.0119   1.20  4.44    127.            52
    ##  3  0.00265  1.38  2.65    160.            14
    ##  4  0.0194   1.24  3.62    142.            75
    ##  5  0.0122   1.32  2.54    129.            53
    ##  6  0.0168   1.26  3.24    147.            68
    ##  7  0.00555  1.33  3.39    168.            28
    ##  8  0.0112   1.24  4.25    146.            50
    ##  9  0.00826  1.19  1.85     87.7           39
    ## 10  0.00197  1.32  4.14     90.5           11
    ## # ... with 9,990 more rows

    Let’s see what this actually did by comparing the probability contours estimated from the function above vs. the reference contours produced during generation of the kde object:

    n <- mc_1_to_99_tbl %>%
      select(nearest_prob, estimate) %>%
      mutate(set = as_factor("manual_fit_10k"))
    
    p <- cont_vals_tbl %>%
      mutate(set = as_factor("kde_output")) %>%
      rename(estimate = x, nearest_prob = probs)
    
    bind_rows(n, p) %>%
      filter(estimate < .005 & estimate > 0) %>%
      ggplot(aes(x = estimate, y = nearest_prob)) +
      geom_point(aes(color = set)) +
      geom_vline(xintercept = percentile_5)

    So it appropriately rounded each value in the simulated population to the nearest whole percentile, as desired.

    Density Plot with Probability Contours in 2d

    Now the fun part – visualization. Here’s how I’d go about creating 2d density plots. If you don’t care about specific probability contours, you can use the built-in ggplot method. All you need is the raw data points, not the kde object or estimates; ggplot can do that for you.

    Here’s ellipticity vs. curvature:

    cp_plt <- correlated_ln_draws_tbl %>%
      ggplot(aes(x = ellip, y = curv)) +
      geom_point(alpha = .3, size = .5) +
      geom_density2d(size = 1.3) +
      theme_classic() +
      xlim(c(.9, 1.6)) +
      ylim(c(1, 7.5)) +
      labs(
        title = "Joint Distribution of Vessel Ellipticity and Curvature",
        subtitle = "Density Contours at Default Settings",
        x = "Ellipticity (unitless)",
        y = "Radius of Curvature (mm)"
      )
    
    cp_plt

    This is pretty good. But to specify particular probability contours we’ll need a bit more. Here’s a solution that I adapted from one I found on Cross Validated.

    First, select the 2 variables of interest.

    d <- correlated_ln_draws_tbl %>% select(ellip, curv)
    
    ## density function
    kd <- ks::kde(d, compute.cont = TRUE, h = 0.05)

    Now a function to extract the points of a contour line from the kde:

    get_contour <- function(kd_out = kd, prob = "5%") {
      contour_95 <- with(kd_out, contourLines(
        x = eval.points[[1]], y = eval.points[[2]],
        z = estimate, levels = cont[prob]
      )[[1]])
      as_tibble(contour_95) %>%
        mutate(prob = prob)
    }

    Map it over the kd object.

    dat_out <- map_dfr(c("5%", "20%", "40%", "60%", "80%", "95%"), ~ get_contour(kd, .)) %>%
      group_by(prob) %>%
      mutate(n_val = 1:n()) %>%
      ungroup()
    
    dat_out %>%
      head(10) %>%
      kable(align = "c")
    level x y prob n_val
    0.1050379 1.024919 2.053556 5% 1
    0.1050379 1.023034 2.112579 5% 2
    0.1050379 1.022015 2.176684 5% 3
    0.1050379 1.021665 2.240789 5% 4
    0.1050379 1.021762 2.304894 5% 5
    0.1050379 1.022205 2.368999 5% 6
    0.1050379 1.022956 2.433104 5% 7
    0.1050379 1.024009 2.497208 5% 8
    0.1050379 1.024919 2.540507 5% 9
    0.1050379 1.025311 2.561313 5% 10

    Clean kde output

    kd_df <- expand_grid(x = kd$eval.points[[1]], y = kd$eval.points[[2]]) %>%
      mutate(z = c(kd$estimate %>% t()))

    Now visualize again, this time with probability contours at specified values and the 5% curve labeled with geom_label_repel().

    label_tbl <- dat_out %>%
      filter(
        prob == "5%",
        n_val == 100
      )
    
    # visualize
    ellip_curv_2plt <- ggplot(data = kd_df, aes(x, y)) +
      geom_tile(aes(fill = z)) +
      geom_point(data = d, aes(x = ellip, y = curv), alpha = .4, size = .4, colour = "white") +
      geom_path(aes(x, y, group = prob),
        data = dat_out %>% filter(prob %in% c("5%", "20%", "40%", "60%", "80%", "95%")), colour = "white", size = 1.2, alpha = .8
      ) +
      #  geom_text(aes(label = prob), data =
      #              filter(dat_out, (prob %in% c("5%") & n_val==1)), # | (prob %in% c("90%") & n_val==20)),
      #            colour = "yellow", size = 5)+
      geom_label_repel(
        data = label_tbl, aes(x, y),
        label = label_tbl$prob[1],
        fill = "yellow",
        color = "black",
        segment.color = "yellow",
        #    segment.size = 1,
        min.segment.length = unit(1, "lines"),
        nudge_y = .5,
        nudge_x = -.025
      ) +
      xlim(c(.95, 1.5)) +
      ylim(c(0, 7.5)) +
      labs(
        title = "Joint Distribution [Ellipticity and Radius of Curvature]",
        subtitle = "Simulated Data",
        caption = "Density Contours shown at 5%, 20%, 40%, 60%, 80%, 95%"
      ) +
      scale_fill_viridis_c(end = .9) +
      theme_bw() +
      theme(legend.position = "none") +
      labs(x = "Ellipticity (unitless)", y = "Radius of Curvature (mm)")
    ggExtra::ggMarginal(ellip_curv_2plt, type = "density", fill = "#403891ff", alpha = .7)

    Nice! Now we have a clear delineation of the most common patients and most extreme patients and we can group them as desired. Almost done – the last thing to do is to see how to extend into 3d.

    Density Plot with Probability Contours in 3d

    Honestly, this part is pretty easy thanks to the built-in plot.kde method. Just use the cont argument to specify which probability contours you want.

    plot(x = kd_d3m, cont = c(45, 70, 95), drawpoints = FALSE, col.pt = 1)

    Add points using the points3d() function from rgl. In this case I add 2 sets: 1 for the 5% most extreme and 1 for the 95% most common.

    # plot(x = kd_d3m, cont = c(95) ,drawpoints = FALSE, col.pt = 1)
    mc_lowest_5_tbl <- mc_1_to_99_tbl %>% filter(estimate < percentile_5)
    mc_6_to_100_tbl <- mc_1_to_99_tbl %>% filter(estimate >= percentile_5)
    
    # points3d(x = mc_lowest_5_tbl$ellip, y = mc_lowest_5_tbl$curv, z = mc_lowest_5_tbl$pressure, color = "dodgerblue",  size = 3, alpha = 1)
    
    # points3d(x = mc_6_to_100_tbl$ellip, y = mc_6_to_100_tbl$curv, z = mc_6_to_100_tbl$pressure, color = "black",  size = 3, alpha = 1)

    And there we have it! A 3d joint distribution with a 95% probability density contour and groups separated by color, confirmed to cover 95% / 5% of the population.

    If you’ve made it this far, I thank you. Here is a brief summary of what I tried to capture in this post:

    • Manual method to simulate a population from a sample with known correlation structure using mvrnorm()
    • Simplified / Efficient way to do the same using AnySim package
    • Method for identifying worst-case 5% of patients when there is a known target point to stay away from
    • Method for identifying the most common 95% of patients with respect to their density
    • Simple way to plot 2d density with ggplot, showing “levels”
    • Generate 2d density plots with specified probability density contours and labels
    • Method for creating 3d plots with specified probability density contours and in/out groups

    1. Hamdan et al., Journal of the American College of Cardiology, Volume 59, Issue 2, 2012, Pages 119-127.

    2. The units of measure aren’t really important for the analysis, but for calibration: ellipticity is generally unitless (it is a ratio of the long axis over the short axis); curvature is usually in mm and represents the radius of curvature on the inner curve of the vessel; pressure is usually in mm Hg and can be assumed to be the systolic blood pressure.

    3. See Water 2020, 12, 1645; doi:10.3390/w12061645.

    4. Simulated population conditional on the model – no sampling error is considered here. For population inference that takes sampling error into account, see tolerance intervals and the tolerance package in R.

    5. https://en.wikipedia.org/wiki/Goodman_relation

    6. https://bookdown.org/egarpor/NP-UC3M/#welcome

    7. Adapted from this Stack Overflow response: https://stackoverflow.com/questions/23437000/how-to-plot-a-contour-line-showing-where-95-of-values-fall-within-in-r-and-in
