De-duplicating clinic visits
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Problem
We want to only keep an individual’s clinic visits which are for new infections.
Therefore, want to remove repeat visit to the clinic
where a repeat is defined as within 6 weeks of the previous visit.
Solutions
There are a number of way to tackle this problem which vary in terms of computation speed, simplicity and intuition.
To begin, load the packages we’re going to need:
library(reshape2) library(dplyr) library(readxl) library(lubridate) library(knitr)
Load in the data. This is in the form of individual patient data (IPD) clinic visit dates and ids.
load(file = here::here("cleaned_visit_date.RData")) dat0 <- dat kable(head(dat))
uniq_id | date | person_id | |
---|---|---|---|
67219 | F053090_2012-03-07 | 2012-03-07 | F053090 |
135647 | F105975_2014-05-21 | 2014-05-21 | F105975 |
211067 | 16M03749_2016-04-18 | 2016-04-18 | 16M03749 |
49805 | M091207_2012-03-27 | 2012-03-27 | M091207 |
39751 | F066543_2011-12-29 | 2011-12-29 | F066543 |
37228 | F108730_2011-12-02 | 2011-12-02 | F108730 |
Simple loops solution
We could simply step through the original data line by line.
library(dplyr) keep_record <- c() dat <- dat0 dat <- dat[1:5000, ] %>% arrange(person_id, date) previous_patient <- "" previous_date <- 0 for (i in seq_len(nrow(dat))){ person_id <- dat[i, "person_id"] date <- dat[i, "date"] # first visit if (person_id != previous_patient){ # print("new id") keep <- TRUE previous_patient <- person_id previous_date <- date } else if ((date - previous_date) > 42){ # print("over 42") keep <- TRUE previous_date <- date } else{ keep <- FALSE } keep_record <- c(keep_record, keep) } table(keep_record) ## keep_record ## FALSE TRUE ## 31 4969
Tidyverse solution
So within a while
loop we group by individuals and for the earliest date calculate a 6 week time window (end date window_end
) and a flag whether or not subsequent clinic visits are within this (within_window
).
We keep track of 2 output patient lists: repeat_visit_ids
and first_visit_ids
.
The later is just the id from the first date.
Once we have added to these list given the information from window_end
then we remove the first date and the repeat visit associated with this visit.
Finally, we repeat the whole thing until we reach the end of the dataframe.
time_window <- weeks(6) # duration(42, "days") #6 weeks in days repeat_visit_ids <- NULL #initialise first_visit_ids <- NULL dat <- dat0 while(nrow(dat) > 0) { dat <- dat %>% group_by(person_id) %>% dplyr::arrange(date, .by_group = TRUE) %>% mutate(window_end = min(date) + time_window, within_window = date <= window_end, first_date = date == min(date)) %>% ungroup() repeat_visit_ids <- dat %>% filter(within_window & !first_date) %>% transmute(uniq_id) %>% rbind(repeat_visit_ids, .) first_visit_ids <- dat %>% filter(first_date) %>% transmute(uniq_id) %>% rbind(first_visit_ids, .) dat <- dat %>% filter(!within_window) }
The output is a list of unique visit ids.
kable(head(first_visit_ids))
uniq_id |
---|
14F00004_2014-05-07 |
14F00006_2014-05-07 |
14F00016_2014-05-07 |
14F00019_2014-05-07 |
14F00021_2014-05-07 |
14F00027_2014-06-12 |
nrow(first_visit_ids) ## [1] 90389
Pre-processed data structure solution
We could create a data structure for easier computation.
First, lets index the repeat visits for each patient by creating a ‘visit’ column with a counter.
dat <- dat0 dat <- dat %>% group_by(person_id) %>% arrange(date) %>% mutate(visit = row_number()) %>% arrange(person_id, visit) kable(head(dat, n = 10))
uniq_id | date | person_id | visit |
---|---|---|---|
14F00004_2014-05-07 | 2014-05-07 | 14F00004 | 1 |
14F00004_2014-05-27 | 2014-05-27 | 14F00004 | 2 |
14F00004_2014-06-06 | 2014-06-06 | 14F00004 | 3 |
14F00006_2014-05-07 | 2014-05-07 | 14F00006 | 1 |
14F00016_2014-05-07 | 2014-05-07 | 14F00016 | 1 |
14F00019_2014-05-07 | 2014-05-07 | 14F00019 | 1 |
14F00021_2014-05-07 | 2014-05-07 | 14F00021 | 1 |
14F00021_2014-06-11 | 2014-06-11 | 14F00021 | 2 |
14F00027_2014-06-12 | 2014-06-12 | 14F00027 | 1 |
14F00034_2014-06-06 | 2014-06-06 | 14F00034 | 1 |
For any computation using this data set we can simply remove the single visit individuals because we already know about these.
This reduced the size of the dataframe by some way.
We can cast the data to a wide form to show a kind of patient trajectory such that the columns of visit counts and entries are dates (on a numeric scale).
yy <- dcast(person_id ~ visit, data = dat, value.var = "date") yy <- yy %>% filter(!is.na(`2`)) kable(head(yy[ ,1:11], n = 10))
person_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
14F00004 | 16197 | 16217 | 16227 | NA | NA | NA | NA | NA | NA | NA |
14F00021 | 16197 | 16232 | NA | NA | NA | NA | NA | NA | NA | NA |
14F00048 | 16198 | 16204 | 16484 | 16618 | 17001 | 17224 | 17488 | 17744 | 17831 | NA |
14F00054 | 16198 | 16209 | 17080 | NA | NA | NA | NA | NA | NA | NA |
14F00071 | 16226 | 16273 | 16380 | 16394 | 16408 | 16527 | 16555 | NA | NA | NA |
14F00073 | 16198 | 16359 | 17416 | NA | NA | NA | NA | NA | NA | NA |
14F00075 | 16261 | 17031 | NA | NA | NA | NA | NA | NA | NA | NA |
14F00106 | 16359 | 16378 | NA | NA | NA | NA | NA | NA | NA | NA |
14F00108 | 16202 | 16393 | NA | NA | NA | NA | NA | NA | NA | NA |
14F00109 | 16202 | 16230 | 16237 | NA | NA | NA | NA | NA | NA | NA |
Rather than this flat representation, we can split the data in to a list of visit frequencies.
That is each element in the list consist of trajectories for those individual with that number of visits.
The elements in the dataframes are the difference between the i
th visit and all of the subsequent.
We’ll use a subsample just to keep the computation time down.
Note that I remove the patient id so that I can do compuation on all of the visit times before including it back in.
yy <- yy[1:1000, ] visits_wide <- select(yy ,-person_id) zz <- list() for (i in seq_along(visits_wide)) { zz[[i]] <- visits_wide - visits_wide[, i] # difference between origin visit and subsequent dates zz[[i]][zz[[i]] <= 0] <- NA # replace negative times # are all visits/ whole row NAs? all_NA <- rowSums(is.na(zz[[i]])) != ncol(zz[[i]]) zz[[i]]$person_id <- yy$person_id # include patient id zz[[i]] <- zz[[i]][all_NA, ] # remove rows } lapply(zz[1:3], function(x) x[1:6, 1:11]) ## [[1]] ## 1 2 3 4 5 6 7 8 9 10 11 ## 1 NA 20 30 NA NA NA NA NA NA NA NA ## 2 NA 35 NA NA NA NA NA NA NA NA NA ## 3 NA 6 286 420 803 1026 1290 1546 1633 NA NA ## 4 NA 11 882 NA NA NA NA NA NA NA NA ## 5 NA 47 154 168 182 301 329 NA NA NA NA ## 6 NA 161 1218 NA NA NA NA NA NA NA NA ## ## [[2]] ## 1 2 3 4 5 6 7 8 9 10 11 ## 1 NA NA 10 NA NA NA NA NA NA NA NA ## 3 NA NA 280 414 797 1020 1284 1540 1627 NA NA ## 4 NA NA 871 NA NA NA NA NA NA NA NA ## 5 NA NA 107 121 135 254 282 NA NA NA NA ## 6 NA NA 1057 NA NA NA NA NA NA NA NA ## 10 NA NA 7 NA NA NA NA NA NA NA NA ## ## [[3]] ## 1 2 3 4 5 6 7 8 9 10 11 ## 3 NA NA NA 134 517 740 1004 1260 1347 NA NA ## 5 NA NA NA 14 28 147 175 NA NA NA NA ## 12 NA NA NA 151 785 NA NA NA NA NA NA ## 15 NA NA NA 185 NA NA NA NA NA NA NA ## 18 NA NA NA 69 182 NA NA NA NA NA NA ## 20 NA NA NA 161 NA NA NA NA NA NA NA
The reason for splitting the data in this way is that now doing operations on it to deduplicate are straightforward because its kind of already sorted.
The below was my first attempt at this approach.
It slow and pretty ugly.
out <- zz from_visit_seq <- head(seq_along(zz), -1) for (i in from_visit_seq){ # only consider later columns (to the right) keep_cols <- names(out[[i]])[!names(out[[i]]) %in% (1:i)] future_visits <- out[[i]][ ,keep_cols] for (j in out[[i]]$person_id){ if (nrow(out[[i]]) == 0) break future_id <- future_visits[future_visits$person_id == j, ] # which visit number (column name) is within time window (42 days) for each patient? times <- select(future_id, -person_id) visit_rm <- colnames(times)[times <= time_window & !is.na(times)] if (length(visit_rm) > 0) { # remove these repeat visits in list element for (k in as.numeric(visit_rm)) out[[k]] <- out[[k]][out[[k]]$person_id != j, ] } } } lapply(out[1:3], function(x) x[1:6, 1:11]) ## [[1]] ## 1 2 3 4 5 6 7 8 9 10 11 ## 1 NA 20 30 NA NA NA NA NA NA NA NA ## 2 NA 35 NA NA NA NA NA NA NA NA NA ## 3 NA 6 286 420 803 1026 1290 1546 1633 NA NA ## 4 NA 11 882 NA NA NA NA NA NA NA NA ## 5 NA 47 154 168 182 301 329 NA NA NA NA ## 6 NA 161 1218 NA NA NA NA NA NA NA NA ## ## [[2]] ## 1 2 3 4 5 6 7 8 9 10 11 ## 273 NA NA 2069 NA NA NA NA NA NA NA NA ## 398 NA NA 831 1317 1321 NA NA NA NA NA NA ## 494 NA NA 995 1038 NA NA NA NA NA NA NA ## 504 NA NA 258 NA NA NA NA NA NA NA NA ## 678 NA NA 578 672 1519 NA NA NA NA NA NA ## NA NA NA NA NA NA NA NA NA NA NA NA ## ## [[3]] ## 1 2 3 4 5 6 7 8 9 10 11 ## NA NA NA NA NA NA NA NA NA NA NA NA ## NA.1 NA NA NA NA NA NA NA NA NA NA NA ## NA.2 NA NA NA NA NA NA NA NA NA NA NA ## NA.3 NA NA NA NA NA NA NA NA NA NA NA ## NA.4 NA NA NA NA NA NA NA NA NA NA NA ## NA.5 NA NA NA NA NA NA NA NA NA NA NA
I realised that the break
could be replaced with a next
if I moved the if statement up a level outside of the j
for
loop.
Also, I don’t have to do the duplicate visit identification and removal separately.
I only need to consider the visits in the future of the current visit, hence the (i+1):ncol(tmp)
term.
This produces a much cleaner chunk of code.
from_visit_seq <- head(seq_along(zz), -1) for (i in from_visit_seq){ if (nrow(out[[i]]) == 0) next for (j in out[[i]]$person_id){ # stop at first time outside window t <- i + 1 out_person <- out[[i]][out[[i]]$person_id == j, ] # remove these repeat visits in list element while(out_person[, t] <= time_window & !is.na(out_person[, t])){ out[[t]] <- out[[t]][out[[t]]$person_id != j, ] t <- t + 1 } } } lapply(out[1:3], function(x) x[1:6, 1:11]) ## [[1]] ## 1 2 3 4 5 6 7 8 9 10 11 ## 1 NA 20 30 NA NA NA NA NA NA NA NA ## 2 NA 35 NA NA NA NA NA NA NA NA NA ## 3 NA 6 286 420 803 1026 1290 1546 1633 NA NA ## 4 NA 11 882 NA NA NA NA NA NA NA NA ## 5 NA 47 154 168 182 301 329 NA NA NA NA ## 6 NA 161 1218 NA NA NA NA NA NA NA NA ## ## [[2]] ## 1 2 3 4 5 6 7 8 9 10 11 ## 273 NA NA 2069 NA NA NA NA NA NA NA NA ## 398 NA NA 831 1317 1321 NA NA NA NA NA NA ## 494 NA NA 995 1038 NA NA NA NA NA NA NA ## 504 NA NA 258 NA NA NA NA NA NA NA NA ## 678 NA NA 578 672 1519 NA NA NA NA NA NA ## NA NA NA NA NA NA NA NA NA NA NA NA ## ## [[3]] ## 1 2 3 4 5 6 7 8 9 10 11 ## NA NA NA NA NA NA NA NA NA NA NA NA ## NA.1 NA NA NA NA NA NA NA NA NA NA NA ## NA.2 NA NA NA NA NA NA NA NA NA NA NA ## NA.3 NA NA NA NA NA NA NA NA NA NA NA ## NA.4 NA NA NA NA NA NA NA NA NA NA NA ## NA.5 NA NA NA NA NA NA NA NA NA NA NA
Finally, include a visit count column and remove empty dataframes.
names(out) <- as.character(seq_along(out) + 1) for (i in names(out)){ if (nrow(out[[i]]) == 0) out[[i]] <- NULL else out[[i]]$visit <- as.numeric(i) }
Stack the lists and include back in the initial visits to give all visits with duplicates removed.
res <- purrr::map_dfr(out, `[`, c("person_id", "visit")) %>% merge(dat) %>% rbind.data.frame(dat[dat$visit == 1, ]) %>% arrange(person_id, visit) kable(head(res))
person_id | visit | uniq_id | date |
---|---|---|---|
14F00004 | 1 | 14F00004_2014-05-07 | 2014-05-07 |
14F00004 | 2 | 14F00004_2014-05-27 | 2014-05-27 |
14F00006 | 1 | 14F00006_2014-05-07 | 2014-05-07 |
14F00016 | 1 | 14F00016_2014-05-07 | 2014-05-07 |
14F00019 | 1 | 14F00019_2014-05-07 | 2014-05-07 |
14F00021 | 1 | 14F00021_2014-05-07 | 2014-05-07 |
nrow(res) ## [1] 64799
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.