How I Rebuilt a Lost ECG Data Script in R

Vladyslav Shuliar

3 months ago

[This article was first published on pharmaverse blog, and kindly contributed to R-bloggers].

As a Data Science placement student at Roche UK, I was given an exciting opportunity to enhance my R programming skills while contributing to the open-source community. Under the guidance of my manager, Edoardo Mancini, I undertook a unique and challenging task within the {pharmaversesdtm} project that tested both my technical expertise and problem-solving abilities.

The project involved recreating the eg domain (Electrocardiogram data) from the SDTM datasets used within {pharmaversesdtm}. The original dataset had been sourced from the CDISC pilot project, but since that source was no longer available, we had no direct reference. Fortunately, a saved copy of the dataset still existed, allowing me to analyze it and attempt to reproduce it as closely as possible.

How I Solved the Problem

Explored and Analyzed the Data

The first step was to thoroughly explore the existing ECG dataset of over 25,000 entries. I needed to understand the structure and key variables that defined the dataset, such as the “one row for each patient’s test during each visit” format. By analyzing these elements, I was able to gain a clear picture of how the dataset was organized. I also examined the range of values, variance, and other characteristics of the tests to ensure that my recreated version would align with the original dataset’s structure and statistical properties.

To provide a clearer understanding of how the data is structured, let’s take a quick look at the information collected during a patient’s visit. Below is an example of data for patient 01-701-1015 during their WEEK 2 visit:

# A tibble: 10 × 7
   USUBJID     EGTEST             VISIT  EGDTC      EGTPT EGSTRESN EGSTRESC
   <chr>       <chr>              <chr>  <chr>      <chr>    <dbl> <chr>   
 1 01-701-1015 QT Duration        WEEK 2 2014-01-16 "1"        449 449     
 2 01-701-1015 QT Duration        WEEK 2 2014-01-16 "2"        511 511     
 3 01-701-1015 QT Duration        WEEK 2 2014-01-16 "3"        534 534     
 4 01-701-1015 Heart Rate         WEEK 2 2014-01-16 "1"         63 63      
 5 01-701-1015 Heart Rate         WEEK 2 2014-01-16 "2"         83 83      
 6 01-701-1015 Heart Rate         WEEK 2 2014-01-16 "3"         66 66      
 7 01-701-1015 RR Duration        WEEK 2 2014-01-16 "1"        316 316     
 8 01-701-1015 RR Duration        WEEK 2 2014-01-16 "2"        581 581     
 9 01-701-1015 RR Duration        WEEK 2 2014-01-16 "3"        570 570     
10 01-701-1015 ECG Interpretation WEEK 2 2014-01-16 ""          NA ABNORMAL

In this example, USUBJID identifies the subject, while EGTEST names the ECG test performed. VISIT is the visit during which the test occurred, and EGDTC records the date of the test. EGTPT is the planned time point of the measurement within the visit; it is empty for the overall ECG interpretation. EGSTRESN provides the numeric result, and EGSTRESC holds the same result as a character string, which is the only result available for non-numeric tests such as ECG Interpretation (ABNORMAL in this example).
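The "one row per test per visit" structure can be checked mechanically. The sketch below is one way to do that on a small toy tibble whose values are copied from the WEEK 2 example above; `toy_eg`, `key_counts` and `duplicated_keys` are names made up for illustration. Note that in this example EGTPT has to be part of the key as well, since each numeric test is recorded at three time points.

```r
library(dplyr)

# Toy stand-in for the ECG data; values copied from the WEEK 2 example above
toy_eg <- tibble(
  USUBJID  = "01-701-1015",
  EGTEST   = c(
    rep(c("QT Duration", "Heart Rate", "RR Duration"), each = 3),
    "ECG Interpretation"
  ),
  VISIT    = "WEEK 2",
  EGTPT    = c(rep(c("1", "2", "3"), times = 3), ""),
  EGSTRESN = c(449, 511, 534, 63, 83, 66, 316, 581, 570, NA)
)

# Count rows per candidate key; a count above 1 means the key is not unique
key_counts <- toy_eg %>%
  count(USUBJID, EGTEST, VISIT, EGTPT, name = "n_rows")

duplicated_keys <- key_counts %>%
  filter(n_rows > 1)

nrow(duplicated_keys)
```

If `duplicated_keys` has no rows, the four columns identify each record uniquely in the data being checked.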

Wrote the New R Script

Armed with insights from my analysis, I set about writing a new R script to replicate the lost one. This involved a lot of trial and error, as I kept refining the code until it generated a dataset that closely resembled the original ECG data in both structure and content. To give you a better understanding of the solution, I'll walk you through the main parts of the script.

Loading Libraries and Data

To begin, I loaded the necessary libraries and read in the vital signs (vs) dataset. This dataset is useful here because it has the same structure and schedule as the eg data, so I can recreate the eg visit schedule for each patient from it. By setting a seed for the random data generation, I ensured that the process was reproducible, allowing others to verify my results and keep future analyses consistent. Additionally, I loaded the metatools package so that I could add labels to the variables later, which makes the dataset easier to read.

library(dplyr)
library(metatools)
library(pharmaversesdtm)

data("vs")
set.seed(123)

Extracting Unique Date/Time of Measurements

Next, I extracted the unique combination of subject IDs, visit names, and visit dates from the vs dataset.

egdtc <- vs %>%
  select(USUBJID, VISIT, VSDTC) %>%
  distinct() %>%
  rename(EGDTC = VSDTC)

egdtc

# A tibble: 2,741 × 3
   USUBJID     VISIT               EGDTC     
   <chr>       <chr>               <chr>     
 1 01-701-1015 SCREENING 1         2013-12-26
 2 01-701-1015 SCREENING 2         2013-12-31
 3 01-701-1015 BASELINE            2014-01-02
 4 01-701-1015 AMBUL ECG PLACEMENT 2014-01-14
 5 01-701-1015 WEEK 2              2014-01-16
 6 01-701-1015 WEEK 4              2014-01-30
 7 01-701-1015 AMBUL ECG REMOVAL   2014-02-01
 8 01-701-1015 WEEK 6              2014-02-12
 9 01-701-1015 WEEK 8              2014-03-05
10 01-701-1015 WEEK 12             2014-03-26
# ℹ 2,731 more rows

This data was used later to match the generated ECG data to the correct visit and time points.
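Before relying on such a lookup, it can be worth confirming that each subject-visit pair maps to exactly one date, since a pair with two dates would duplicate rows in any later join. The following is a minimal sketch on made-up data; `toy_vs`, `toy_egdtc` and `ambiguous` are hypothetical names, and this check is not part of the script described in this post.

```r
library(dplyr)

# Toy stand-in for the vs data (hypothetical subjects and dates)
toy_vs <- tibble(
  USUBJID = c("A", "A", "A", "B"),
  VISIT   = c("BASELINE", "BASELINE", "WEEK 2", "BASELINE"),
  VSDTC   = c("2014-01-02", "2014-01-02", "2014-01-16", "2014-01-05")
)

# Same extraction as above, on the toy data
toy_egdtc <- toy_vs %>%
  distinct(USUBJID, VISIT, VSDTC) %>%
  rename(EGDTC = VSDTC)

# Subject-visit pairs that appear with more than one date
ambiguous <- toy_egdtc %>%
  count(USUBJID, VISIT) %>%
  filter(n > 1)

nrow(ambiguous)
```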

Generating a Grid of Patient Data

Subsequently, I created a grid of all possible combinations of subject IDs, test codes (e.g., QT, HR, RR, ECGINT), time points (e.g., after lying down, after standing), and visits. These combinations represented different test results collected across multiple visits.

eg <- expand.grid(
  USUBJID = unique(vs$USUBJID),
  EGTESTCD = c("QT", "HR", "RR", "ECGINT"),
  EGTPT = c(
    "AFTER LYING DOWN FOR 5 MINUTES",
    "AFTER STANDING FOR 1 MINUTE",
    "AFTER STANDING FOR 3 MINUTES"
  ),
  VISIT = c(
    "SCREENING 1",
    "SCREENING 2",
    "BASELINE",
    "AMBUL ECG PLACEMENT",
    "WEEK 2",
    "WEEK 4",
    "AMBUL ECG REMOVAL",
    "WEEK 6",
    "WEEK 8",
    "WEEK 12",
    "WEEK 16",
    "WEEK 20",
    "WEEK 24",
    "WEEK 26",
    "RETRIEVAL"
  ), stringsAsFactors = FALSE
)

# Filter the dataset for one subject and one visit
filtered_eg <- eg %>%
  filter(USUBJID == "01-701-1015" & VISIT == "WEEK 2")

# Display the result
filtered_eg

       USUBJID EGTESTCD                          EGTPT  VISIT
1  01-701-1015       QT AFTER LYING DOWN FOR 5 MINUTES WEEK 2
2  01-701-1015       HR AFTER LYING DOWN FOR 5 MINUTES WEEK 2
3  01-701-1015       RR AFTER LYING DOWN FOR 5 MINUTES WEEK 2
4  01-701-1015   ECGINT AFTER LYING DOWN FOR 5 MINUTES WEEK 2
5  01-701-1015       QT    AFTER STANDING FOR 1 MINUTE WEEK 2
6  01-701-1015       HR    AFTER STANDING FOR 1 MINUTE WEEK 2
7  01-701-1015       RR    AFTER STANDING FOR 1 MINUTE WEEK 2
8  01-701-1015   ECGINT    AFTER STANDING FOR 1 MINUTE WEEK 2
9  01-701-1015       QT   AFTER STANDING FOR 3 MINUTES WEEK 2
10 01-701-1015       HR   AFTER STANDING FOR 3 MINUTES WEEK 2
11 01-701-1015       RR   AFTER STANDING FOR 3 MINUTES WEEK 2
12 01-701-1015   ECGINT   AFTER STANDING FOR 3 MINUTES WEEK 2

To keep the output readable, I have displayed the combinations for only one subject and one visit, as the full table is very large. Each test code corresponds to a specific ECG measurement: QT is the QT interval (a measurement made on an electrocardiogram used to assess some of the electrical properties of the heart), HR is the heart rate, RR is the interval between successive R waves, and ECGINT is the ECG interpretation.

As I analyzed the original ECG dataset, I learned more about these test codes and their relevance to the clinical data.
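Because the grid contains every combination, its size is simply the product of the input lengths: with 4 test codes, 3 time points and 15 visits, each subject contributes 4 × 3 × 15 = 180 rows. The post does not show how the grid is then matched to the visit dates extracted earlier; the sketch below assumes an inner join on USUBJID and VISIT, which is one plausible way to do it. All names prefixed with `toy_`, the subject IDs and the shortened time point labels are made up for illustration.

```r
library(dplyr)

subjects <- c("A", "B")                       # hypothetical subject IDs
tests    <- c("QT", "HR", "RR", "ECGINT")
tpts     <- c("LYING 5 MIN", "STANDING 1 MIN", "STANDING 3 MIN")  # shortened labels
visits   <- c("BASELINE", "WEEK 2")

toy_grid <- expand.grid(
  USUBJID = subjects, EGTESTCD = tests, EGTPT = tpts, VISIT = visits,
  stringsAsFactors = FALSE
)

# One row per combination, so the size is the product of the lengths
nrow(toy_grid) == length(subjects) * length(tests) * length(tpts) * length(visits)

# Hypothetical schedule: subject B has no WEEK 2 visit
toy_schedule <- tibble(
  USUBJID = c("A", "A", "B"),
  VISIT   = c("BASELINE", "WEEK 2", "BASELINE"),
  EGDTC   = c("2014-01-02", "2014-01-16", "2014-01-05")
)

# inner_join() drops grid rows for visits that never happened
# and attaches the visit date to the rows that remain
toy_eg <- inner_join(toy_grid, toy_schedule, by = c("USUBJID", "VISIT"))
```

An inner join is used here rather than a left join so that visits a subject never attended do not appear in the result at all.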

Generating Random Test Results

For each combination in the grid, I generated random test results using a normal distribution to simulate realistic values for each test code. To determine the means and standard deviations, I used the original EG dataset as a reference. By analyzing the range and distribution of values in the original dataset, I was able to extract a realistic mean for each numerical ECG test (QT, HR, RR) at each time point, together with a standard deviation for each test.

# Fragment from inside a mutate() call. EGELTM holds the planned elapsed time
# as an ISO 8601 duration (PT5M = 5 minutes). ECGINT rows match no condition
# and therefore get NA.
EGSTRESN = case_when(
  EGTESTCD == "RR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 543.9985, 80)),
  EGTESTCD == "RR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 536.0161, 80)),
  EGTESTCD == "RR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 532.3233, 80)),
  EGTESTCD == "HR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 70.04389, 8)),
  EGTESTCD == "HR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 74.27798, 8)),
  EGTESTCD == "HR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 74.77461, 8)),
  EGTESTCD == "QT" & EGELTM == "PT5M" ~ floor(rnorm(n(), 450.9781, 60)),
  EGTESTCD == "QT" & EGELTM == "PT3M" ~ floor(rnorm(n(), 457.7265, 60)),
  EGTESTCD == "QT" & EGELTM == "PT1M" ~ floor(rnorm(n(), 455.3394, 60))
)

This approach ensured that the synthetic data aligned closely with the patterns and variability observed in the original clinical data.
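The snippet above is a fragment, so here is a self-contained sketch of the same idea that can be run on its own. It is not the script's actual code: it stores the means and standard deviations from the snippet in a lookup table, joins them onto a toy grid, and draws all values in one vectorised `rnorm()` call. The names `params`, `sim`, `mu`, `sigma` and `draw` are made up for illustration, and how EGELTM is derived in the real script is not shown in this post.

```r
library(dplyr)

set.seed(123)

# Means and standard deviations copied from the snippet above
params <- tibble(
  EGTESTCD = rep(c("RR", "HR", "QT"), each = 3),
  EGELTM   = rep(c("PT5M", "PT3M", "PT1M"), times = 3),
  mu       = c(543.9985, 536.0161, 532.3233,
               70.04389, 74.27798, 74.77461,
               450.9781, 457.7265, 455.3394),
  sigma    = rep(c(80, 8, 60), each = 3)
)

# Toy grid: 200 draws for every numeric test and elapsed time
sim <- expand.grid(
  EGTESTCD = c("QT", "HR", "RR"),
  EGELTM   = c("PT5M", "PT1M", "PT3M"),
  draw     = seq_len(200),
  stringsAsFactors = FALSE
) %>%
  left_join(params, by = c("EGTESTCD", "EGELTM")) %>%
  # rnorm() is vectorised over mean and sd, so each row uses its own parameters
  mutate(EGSTRESN = floor(rnorm(n(), mu, sigma)))

# Compare the simulated means and standard deviations with the targets
sim %>%
  group_by(EGTESTCD, EGELTM) %>%
  summarise(
    sim_mean = mean(EGSTRESN),
    sim_sd   = sd(EGSTRESN),
    .groups  = "drop"
  )
```

Because `case_when()` evaluates every right-hand side for the full vector, the original fragment makes nine separate `rnorm()` calls, so this sketch will not reproduce the same numbers even with the same seed. It only illustrates the technique.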

Finalizing the Dataset

Finally, I added labels to the data frame with the metatools::add_labels() function, for easier analysis and future use.

# Fragment: the data frame is supplied as the first argument (for example via a pipe)
add_labels(
  STUDYID = "Study Identifier",
  USUBJID = "Unique Subject Identifier",
  EGTEST = "ECG Test Name",
  VISIT = "Visit Name",
  EGSTRESC = "Character Result/Finding in Std Format",
  EGSTRESN = "Numeric Result/Finding in Standard Units"
  # <etc>
)

This provided descriptive names for each column in the dataset, making it more intuitive to understand the data during analysis and ensuring clarity in its subsequent use.
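For readers unfamiliar with variable labels: by convention a label is stored as a `label` attribute on the column itself. Assuming `add_labels()` follows that convention, the base R sketch below shows roughly what attaching labels amounts to. The data frame `toy` and the vector `labels` are made up for illustration, and this is not the metatools implementation.

```r
# Toy data frame (hypothetical values)
toy <- data.frame(USUBJID = c("A", "B"), EGSTRESN = c(449, 63))

# Named vector: names are the variables, values are their labels
labels <- c(
  USUBJID  = "Unique Subject Identifier",
  EGSTRESN = "Numeric Result/Finding in Standard Units"
)

# Attach each label as the "label" attribute of its column
for (var in names(labels)) {
  attr(toy[[var]], "label") <- labels[[var]]
}

# Read a label back
attr(toy$EGSTRESN, "label")
```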

Limitations

However, this approach has certain limitations. One key issue is that the simulations do not account for the time structure: each observation is drawn independently, so a patient's values at one visit carry no information about their values at the next, which does not reflect real-world dynamics. Additionally, sampling from a normal distribution may not always be appropriate and can sometimes yield unrealistic results, such as negative heart rate (HR) values. To mitigate this, I manually reviewed the generated data to ensure that only plausible values were included. Below are the valid ranges I established for this purpose, taken as the minimum and maximum of each test code in pharmaversesdtm::eg:

# Filter the data for the relevant test codes (QT, RR, HR)
eg_filtered <- pharmaversesdtm::eg %>%
  filter(EGTESTCD %in% c("QT", "HR", "RR"))

# Compute the minimum and maximum values for each test code
value_ranges <- eg_filtered %>%
  group_by(EGTESTCD) %>%
  summarize(
    min_value = min(EGSTRESN, na.rm = TRUE),
    max_value = max(EGSTRESN, na.rm = TRUE)
  )

# Show the result
value_ranges

# A tibble: 3 × 3
  EGTESTCD min_value max_value
  <fct>        <dbl>     <dbl>
1 QT             242       671
2 HR              40       107
3 RR             236       889
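The review described above was manual. As an alternative, the range check could be built into the sampling step by redrawing any value that falls outside the valid range (simple rejection sampling). The sketch below assumes that approach; `rnorm_in_range()` is a hypothetical helper, not a function from the script or from any package.

```r
set.seed(123)

# Draw n floored normal values, redrawing any that fall outside [lower, upper]
rnorm_in_range <- function(n, mean, sd, lower, upper) {
  x <- floor(rnorm(n, mean, sd))
  out_of_range <- x < lower | x > upper
  while (any(out_of_range)) {
    x[out_of_range] <- floor(rnorm(sum(out_of_range), mean, sd))
    out_of_range <- x < lower | x > upper
  }
  x
}

# Heart rate at PT5M: mean and sd from the script, range from the table above
hr <- rnorm_in_range(1000, mean = 70.04389, sd = 8, lower = 40, upper = 107)
range(hr)
```

Redrawing turns the distribution into a truncated normal, so the simulated mean and standard deviation shift slightly from the targets. With ranges this wide relative to the standard deviation, the effect is small.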

Conclusion

This project not only sharpened my R programming skills but also provided invaluable experience in reverse-engineering data, analyzing large healthcare datasets, and tackling real-world challenges in the open-source domain. By following a structured approach, I was able to successfully recreate the EG dataset synthetically, ensuring it mirrors realistic clinical data. This achievement not only enhances my technical capabilities but also contributes to the broader open-source community, as the synthetic dataset will be featured in the next release of {pharmaversesdtm}, offering a valuable resource for future research and development.

Last updated

2025-01-21 14:39:35.919743

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{shuliar2024,
  author = {Shuliar, Vladyslav},
  title = {How {I} {Rebuilt} a {Lost} {ECG} {Data} {Script} in {R}},
  date = {2024-10-31},
  url = {https://pharmaverse.github.io/blog/posts/2024-10-31_how__i__reb.../how__i__rebuilt_a__lost__ec_g__data__script_in__r.html},
  langid = {en}
}

For attribution, please cite this work as:

Shuliar, Vladyslav. 2024. “How I Rebuilt a Lost ECG Data Script in R.” October 31, 2024. https://pharmaverse.github.io/blog/posts/2024-10-31_how__i__reb…/how__i__rebuilt_a__lost__ec_g__data__script_in__r.html.
