As a Data Science placement student at Roche UK, I was given an exciting opportunity to enhance my R programming skills while contributing to the open-source community. Under the guidance of my manager, Edoardo Mancini, I undertook a unique and challenging task within the {pharmaversesdtm} project that tested both my technical expertise and problem-solving abilities.
The project involved recreating the eg domain (Electrocardiogram data) from the SDTM datasets used within {pharmaversesdtm}. The original dataset had been sourced from the CDISC pilot project, but since that source was no longer available, we had no direct reference. Fortunately, a saved copy of the dataset still existed, allowing me to analyze it and attempt to reproduce it as closely as possible.
How I Solved the Problem
Explored and Analyzed the Data
The first step was to thoroughly explore the existing ECG dataset of over 25,000 entries. I needed to understand the structure and key variables that defined the dataset, such as the “one row for each patient’s test during each visit” format. By analyzing these elements, I was able to gain a clear picture of how the dataset was organized. I also examined the range of values, variance, and other characteristics of the tests to ensure that my recreated version would align with the original dataset’s structure and statistical properties.
To provide a clearer understanding of how the data is structured, let’s take a quick look at the information collected during a patient’s visit. Below is an example of data for patient 01-701-1015 during their WEEK 2 visit:
    # A tibble: 10 × 7
       USUBJID     EGTEST             VISIT  EGDTC      EGTPT EGSTRESN EGSTRESC
       <chr>       <chr>              <chr>  <chr>      <chr>    <dbl> <chr>
     1 01-701-1015 QT Duration        WEEK 2 2014-01-16 "1"        449 449
     2 01-701-1015 QT Duration        WEEK 2 2014-01-16 "2"        511 511
     3 01-701-1015 QT Duration        WEEK 2 2014-01-16 "3"        534 534
     4 01-701-1015 Heart Rate         WEEK 2 2014-01-16 "1"         63 63
     5 01-701-1015 Heart Rate         WEEK 2 2014-01-16 "2"         83 83
     6 01-701-1015 Heart Rate         WEEK 2 2014-01-16 "3"         66 66
     7 01-701-1015 RR Duration        WEEK 2 2014-01-16 "1"        316 316
     8 01-701-1015 RR Duration        WEEK 2 2014-01-16 "2"        581 581
     9 01-701-1015 RR Duration        WEEK 2 2014-01-16 "3"        570 570
    10 01-701-1015 ECG Interpretation WEEK 2 2014-01-16 ""          NA ABNORMAL
In this example, USUBJID identifies the subject, while EGTEST specifies the type of ECG test performed. VISIT refers to the visit during which the test occurred, and EGDTC records the date of the test. EGTPT indicates the condition under which the ECG test was conducted. EGSTRESN provides the numeric result, and EGSTRESC gives the corresponding result in character form (for example, ABNORMAL for an ECG interpretation, which has no numeric value).
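The "one row per subject, test, visit and time point" structure can be checked with a short sketch. The toy tibble below copies six rows from the example above rather than loading the full dataset, and the choice of key columns is an assumption about what makes a row unique.

```r
library(dplyr)

# Toy extract: three QT rows and three HR rows copied from the example above
eg_example <- tibble(
  USUBJID  = "01-701-1015",
  EGTEST   = rep(c("QT Duration", "Heart Rate"), each = 3),
  VISIT    = "WEEK 2",
  EGTPT    = rep(c("1", "2", "3"), times = 2),
  EGSTRESN = c(449, 511, 534, 63, 83, 66)
)

# Count rows per assumed key; any count above 1 would signal a duplicate
key_counts <- eg_example %>%
  count(USUBJID, EGTEST, VISIT, EGTPT, name = "n_rows")

all_unique <- all(key_counts$n_rows == 1)
all_unique
```

The same count() call could be pointed at the full eg dataset to look for duplicated keys.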
Wrote the New R Script
Armed with insights from my analysis, I set about writing a new R script to replicate the lost one. This involved a lot of trial and error, as I kept refining the code to ensure it generated a dataset that closely resembled the original ECG data in both structure and content. To give you a better understanding of the solution, I’ll walk you through the main parts of the script.
Loading Libraries and Data
To begin, I loaded the necessary libraries and read in the vital signs (vs) dataset. This dataset is useful here because it has the same structure and visit schedule as the eg data, so I can recreate the eg visit schedule for each patient from it. By setting a seed for the random data generation, I ensured that the process was reproducible, allowing others to verify my results and maintain consistency in future analyses. Additionally, the metatools package was loaded to facilitate adding labels to the variables later, which enhanced the readability of the dataset.
    library(dplyr)
    library(metatools)
    library(pharmaversesdtm)

    data("vs")
    set.seed(123)
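As a small illustration of why the seed matters, the base R sketch below draws from a normal distribution twice with the same seed. It is not part of the original script.

```r
# Draw three values, reset the seed, and draw again
set.seed(123)
first_draw <- rnorm(3)

set.seed(123)
second_draw <- rnorm(3)

# TRUE: the same seed reproduces the same sequence
identical(first_draw, second_draw)
```

Without the second set.seed() call, the two draws would differ, and so would every regenerated copy of the dataset.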
Extracting Unique Date/Time of Measurements
Next, I extracted the unique combination of subject IDs, visit names, and visit dates from the vs dataset.
    egdtc <- vs %>%
      select(USUBJID, VISIT, VSDTC) %>%
      distinct() %>%
      rename(EGDTC = VSDTC)

    egdtc
    # A tibble: 2,741 × 3
       USUBJID     VISIT               EGDTC
       <chr>       <chr>               <chr>
     1 01-701-1015 SCREENING 1         2013-12-26
     2 01-701-1015 SCREENING 2         2013-12-31
     3 01-701-1015 BASELINE            2014-01-02
     4 01-701-1015 AMBUL ECG PLACEMENT 2014-01-14
     5 01-701-1015 WEEK 2              2014-01-16
     6 01-701-1015 WEEK 4              2014-01-30
     7 01-701-1015 AMBUL ECG REMOVAL   2014-02-01
     8 01-701-1015 WEEK 6              2014-02-12
     9 01-701-1015 WEEK 8              2014-03-05
    10 01-701-1015 WEEK 12             2014-03-26
    # ℹ 2,731 more rows
This data was used later to match the generated ECG data to the correct visit and time points.
Generating a Grid of Patient Data
Subsequently, I created a grid of all possible combinations of subject IDs, test codes (e.g., QT, HR, RR, ECGINT), time points (e.g., after lying down, after standing), and visits. These combinations represented different test results collected across multiple visits.
    eg <- expand.grid(
      USUBJID = unique(vs$USUBJID),
      EGTESTCD = c("QT", "HR", "RR", "ECGINT"),
      EGTPT = c(
        "AFTER LYING DOWN FOR 5 MINUTES",
        "AFTER STANDING FOR 1 MINUTE",
        "AFTER STANDING FOR 3 MINUTES"
      ),
      VISIT = c(
        "SCREENING 1",
        "SCREENING 2",
        "BASELINE",
        "AMBUL ECG PLACEMENT",
        "WEEK 2",
        "WEEK 4",
        "AMBUL ECG REMOVAL",
        "WEEK 6",
        "WEEK 8",
        "WEEK 12",
        "WEEK 16",
        "WEEK 20",
        "WEEK 24",
        "WEEK 26",
        "RETRIEVAL"
      ),
      stringsAsFactors = FALSE
    )

    # Filter the dataset for one subject and one visit
    filtered_eg <- eg %>%
      filter(USUBJID == "01-701-1015" & VISIT == "WEEK 2")

    # Display the result
    filtered_eg
           USUBJID EGTESTCD                          EGTPT  VISIT
    1  01-701-1015       QT AFTER LYING DOWN FOR 5 MINUTES WEEK 2
    2  01-701-1015       HR AFTER LYING DOWN FOR 5 MINUTES WEEK 2
    3  01-701-1015       RR AFTER LYING DOWN FOR 5 MINUTES WEEK 2
    4  01-701-1015   ECGINT AFTER LYING DOWN FOR 5 MINUTES WEEK 2
    5  01-701-1015       QT    AFTER STANDING FOR 1 MINUTE WEEK 2
    6  01-701-1015       HR    AFTER STANDING FOR 1 MINUTE WEEK 2
    7  01-701-1015       RR    AFTER STANDING FOR 1 MINUTE WEEK 2
    8  01-701-1015   ECGINT    AFTER STANDING FOR 1 MINUTE WEEK 2
    9  01-701-1015       QT   AFTER STANDING FOR 3 MINUTES WEEK 2
    10 01-701-1015       HR   AFTER STANDING FOR 3 MINUTES WEEK 2
    11 01-701-1015       RR   AFTER STANDING FOR 3 MINUTES WEEK 2
    12 01-701-1015   ECGINT   AFTER STANDING FOR 3 MINUTES WEEK 2
To keep the output readable, I have shown the combinations for only one subject and one visit, as the full table is very large. Each of these test codes corresponds to a specific ECG measurement: QT refers to the QT interval (a measurement made on an electrocardiogram used to assess some of the electrical properties of the heart), HR represents heart rate, RR is the interval between R waves, and ECGINT refers to the ECG interpretation.
As I analyzed the original ECG dataset, I learned more about these test codes and their relevance to the clinical data.
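To make the grid and the date matching more concrete, here is a minimal sketch on toy data. The subject IDs and dates are hypothetical, and the left join followed by a filter on missing dates is an assumption about how the matching can be done, not code taken from the original script.

```r
library(dplyr)

# Toy visit dates: subject B has no WEEK 2 visit (hypothetical values)
toy_dates <- tibble(
  USUBJID = c("A", "A", "B"),
  VISIT   = c("BASELINE", "WEEK 2", "BASELINE"),
  EGDTC   = c("2014-01-02", "2014-01-16", "2014-01-05")
)

# Every combination of subject, test code, and visit
toy_grid <- expand.grid(
  USUBJID  = c("A", "B"),
  EGTESTCD = c("QT", "HR"),
  VISIT    = c("BASELINE", "WEEK 2"),
  stringsAsFactors = FALSE
)

# The grid size is the product of the number of levels: 2 * 2 * 2 = 8
nrow(toy_grid)

# Attach dates, then drop combinations for visits that never took place
toy_eg <- toy_grid %>%
  left_join(toy_dates, by = c("USUBJID", "VISIT")) %>%
  filter(!is.na(EGDTC))

toy_eg
```

Because expand.grid() creates every combination, it also creates rows for visits a subject never attended. In this sketch those rows get a missing date from the join and are removed by the filter.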
Generating Random Test Results
For each combination in the grid, I generated random test results using a normal distribution to simulate realistic values for each test code. To determine the means and standard deviations, I used the original EG dataset as a reference. By analyzing the range and distribution of values in the original dataset, I was able to extract realistic means and standard deviations for each numerical ECG test (QT, HR, RR).
    # Excerpt: this argument sits inside a dplyr verb such as mutate(),
    # where n() returns the number of rows being processed
    EGSTRESN = case_when(
      EGTESTCD == "RR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 543.9985, 80)),
      EGTESTCD == "RR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 536.0161, 80)),
      EGTESTCD == "RR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 532.3233, 80)),
      EGTESTCD == "HR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 70.04389, 8)),
      EGTESTCD == "HR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 74.27798, 8)),
      EGTESTCD == "HR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 74.77461, 8)),
      EGTESTCD == "QT" & EGELTM == "PT5M" ~ floor(rnorm(n(), 450.9781, 60)),
      EGTESTCD == "QT" & EGELTM == "PT3M" ~ floor(rnorm(n(), 457.7265, 60)),
      EGTESTCD == "QT" & EGELTM == "PT1M" ~ floor(rnorm(n(), 455.3394, 60))
    )
This approach ensured that the synthetic data aligned closely with the patterns and variability observed in the original clinical data.
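As a rough sketch of how such reference statistics can be derived, the snippet below summarises the nine numeric results from the WEEK 2 example shown earlier. Three values per test are far too few for a real estimate, and grouping by test code alone is a simplification: the script above uses one mean per test code and elapsed time.

```r
library(dplyr)

# Toy reference data: the nine numeric results from the WEEK 2 example
reference <- tibble(
  EGTESTCD = rep(c("QT", "HR", "RR"), each = 3),
  EGSTRESN = c(449, 511, 534, 63, 83, 66, 316, 581, 570)
)

# Mean and standard deviation per test code
reference_stats <- reference %>%
  group_by(EGTESTCD) %>%
  summarise(
    mean_value = mean(EGSTRESN),
    sd_value   = sd(EGSTRESN),
    .groups    = "drop"
  )

reference_stats
```

Run against the full original dataset, a summary of this kind would supply the mean and standard deviation arguments passed to rnorm().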
Finalizing the Dataset
Finally, I added labels to the dataframe for easier analysis and future use by utilizing the metatools::add_labels() function.
    add_labels(
      STUDYID = "Study Identifier",
      USUBJID = "Unique Subject Identifier",
      EGTEST = "ECG Test Name",
      VISIT = "Visit Name",
      EGSTRESC = "Character Result/Finding in Std Format",
      EGSTRESN = "Numeric Result/Finding in Standard Units",
      <etc>
    )
This provided descriptive names for each column in the dataset, making it more intuitive to understand the data during analysis and ensuring clarity in its subsequent use.
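To illustrate what a variable label is, here is a base R sketch that stores a label as a "label" attribute on a column. This is a common convention for labelled data in R; that add_labels() stores its labels the same way is an assumption here, so check the metatools documentation before relying on it.

```r
# Toy data frame with two of the variables named above
toy <- data.frame(
  USUBJID  = "01-701-1015",
  EGSTRESN = 449
)

# Attach descriptive labels as column attributes
attr(toy$USUBJID, "label") <- "Unique Subject Identifier"
attr(toy$EGSTRESN, "label") <- "Numeric Result/Finding in Standard Units"

# Read the labels back for every column
sapply(toy, attr, "label")
```

Labels stored this way travel with the data frame, so tools that understand the attribute can show them alongside the short variable names.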
Limitations
However, this approach has certain limitations. One key issue is that the simulations do not account for the time structure, as each observation is generated independently (i.i.d.), which does not reflect real-world dynamics. Additionally, sampling from a normal distribution may not always be appropriate and can sometimes yield unrealistic results, such as negative heart rate (HR) values. To mitigate this, I manually reviewed the generated data to ensure that only plausible values were included. Below are the valid ranges I established for this purpose:
    # Filter the data for the relevant test codes (QT, RR, HR)
    eg_filtered <- pharmaversesdtm::eg %>%
      filter(EGTESTCD %in% c("QT", "HR", "RR"))

    # Display the minimum and maximum values for each test code
    value_ranges <- eg_filtered %>%
      group_by(EGTESTCD) %>%
      summarize(
        min_value = min(EGSTRESN, na.rm = TRUE),
        max_value = max(EGSTRESN, na.rm = TRUE)
      )

    # Show the result
    value_ranges
    # A tibble: 3 × 3
      EGTESTCD min_value max_value
      <fct>        <dbl>     <dbl>
    1 QT             242       671
    2 HR              40       107
    3 RR             236       889
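A manual review like this could in principle be backed by a programmatic guard. The sketch below shows one possible alternative, not the method used in this post: a hypothetical helper, draw_within(), redraws any value that falls outside a given range. The HR bounds come from the table above, and the mean and standard deviation come from one of the HR lines in the generation code.

```r
# Hypothetical helper: sample from a normal distribution, redrawing
# any values that fall outside [lower, upper]
draw_within <- function(n, mean, sd, lower, upper) {
  values <- floor(rnorm(n, mean, sd))
  out_of_range <- values < lower | values > upper
  while (any(out_of_range)) {
    values[out_of_range] <- floor(rnorm(sum(out_of_range), mean, sd))
    out_of_range <- values < lower | values > upper
  }
  values
}

set.seed(123)
hr <- draw_within(1000, mean = 70.04389, sd = 8, lower = 40, upper = 107)
range(hr)
```

Redrawing in this way samples from a truncated normal distribution, so the results stay plausible at the cost of slightly less weight in the tails. It does nothing about the missing time structure mentioned above.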
Conclusion
This project not only sharpened my R programming skills but also provided invaluable experience in reverse-engineering data, analyzing large healthcare datasets, and tackling real-world challenges in the open-source domain. By following a structured approach, I was able to successfully recreate the EG dataset synthetically, ensuring it mirrors realistic clinical data. This achievement not only enhances my technical capabilities but also contributes to the broader open-source community, as the synthetic dataset will be featured in the next release of {pharmaversesdtm}, offering a valuable resource for future research and development.
Last updated
2025-01-21 14:39:35.919743
Citation

    @online{shuliar2024,
      author = {Shuliar, Vladyslav},
      title = {How {I} {Rebuilt} a {Lost} {ECG} {Data} {Script} in {R}},
      date = {2024-10-31},
      url = {https://pharmaverse.github.io/blog/posts/2024-10-31_how__i__reb.../how__i__rebuilt_a__lost__ec_g__data__script_in__r.html},
      langid = {en}
    }