Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
- Intro
- 1
Other_
related category - 2
Pt_
Patient related category - 3
R_
Radiology related category - 4
SS_
Category related to signs and symptoms of CAP - 5
Hx_
medical history category - 6
Social_
social history category - 7
HCAP_
healthcare associated pneumonia category - 8
PE_
observations during physical examination category - To be continued….
This post is a supplementary material for an assignment. The assignment is part of the Augmented Machine Learning unit for a Specialised Diploma in Data Science for Business. The aim of the assignment is to use DataRobot
for predictive modelling. Exploratory data analysis and feature engineering will be done here in R
before the data is imported into DataRobot
.
Intro
The dataset is from a prospective population-based surveillance study. The observational study was conducted over 3 different South America cities across 3 different countries over a 3-year period to investigate the incidence rate of Community Acquired Pneumonia (CAP). The dataset has a wealth of variables which can be used for predictive modelling, there is no known predictive analysis published using this dataset. The aim of this project is to classify if patients with CAP became better after seeing a doctor or became worse despite seeing a doctor.
library(tidyverse) theme_set(theme_light()) raw<- readxl::read_excel("Incidence rate of community-acquired pneumonia in adults a population-based prospective active surveillance study in three cities in South America.xls") ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, : ## Expecting logical in EL1372 / R1372C142: got '2014-09-02' ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, : ## Expecting logical in EM1372 / R1372C143: got '2014-09-08'
The dataset consists of 2302 rows and 176 columns.
dim(raw) ## [1] 2302 176
The original column names were the description of the variables (e.g. Received flu shot in the last 12 months
). Based on these descriptions/column names, the columns can be classified into 13 categories (see table below). Most of the categories ae clinically related variables.
categories13<- readxl::read_excel("Incidence rate of community-acquired pneumonia in adults a population-based prospective active surveillance study in three cities in South America.xls", sheet=3) categories13 %>% DT::datatable(rownames = F, options = list(searchHighlight = TRUE, paging= T))
Data dictionary
The column names are renamed to shorter column names (e.g. Received flu shot in the last 12 months
-> flu
) with prefixes to identify which of the above 13 categories they belong to (e.g. flu
-> V_flu
. Prefix V_
stands for vaccine against the flu.).
metadata<- readxl::read_excel("Incidence rate of community-acquired pneumonia in adults a population-based prospective active surveillance study in three cities in South America.xls", sheet=2) # metadata metadata %>% DT::datatable(rownames = F, options = list(searchHighlight = TRUE, paging= T))
# rename col names colnames(raw)<- metadata$`New column name` %>% t()
EDA blueprint
EDA will be iterated for each of the 13 categories as there are too many columns to do the EDA at once. Also, there may be some association among the variables for each category. EDA includes exploring (i) the types of variables for each category (ii) the number of missing values (iii) the number of outliers (iv) and data cleaning if needed. Customized functions were created to facilitate EDA:
dtype
provides the number of columns beginning with the prefix (e.g.dtype(dataframe, “Pt”)
will list all the columns related to patient (pt
). The types of variables are also provided.eda_c
breaks down the labels for columns beginning with the prefix. Used mostly for categorical variables.eda_n_NAplt
plots the percentage ofNA
/missing values for each column beginning with the prefix. Used mostly for numeric variables.eda_n_NAcutoff
provides a vector of variable names with acceptableNA
values. Used mostly for numeric variables. The ball park maximum amount of missing values is 20% though higher proportion of missing values may be included after inspecting the plot generated byeda_n_NAplt
eda_n_outlier
plots boxplots for numeric variables beginning with the prefix. Variables with large number of outliers can be isolated for further investigation.
dtype<- function(datafr, x){ datafr %>% select(starts_with(x, ignore.case = F)) %>% str() } eda_c<- function(datafr,x){ datafr %>% select(starts_with(x, ignore.case = F)) %>% map(~ table(.x, useNA = "always")) } eda_n_NAplt<- function (datafr, x){ datafr %>% select(starts_with(x, ignore.case = F)) %>% summarise(across(starts_with(x), ~mean(is.na(.)))) %>% pivot_longer(cols = everything(), names_to= "Variables" , values_to="pct_na") %>% mutate(Variables= fct_reorder(Variables, pct_na)) %>% ggplot(aes(x=Variables, y=pct_na, fill= pct_na))+ geom_col() + coord_flip() + scale_y_continuous(labels=scales::percent_format()) + scale_fill_viridis_c(option = "plasma")} eda_n_NAcutoff<- function(datafr, x, low, high){ datafr%>% select(starts_with(x, ignore.case = F)) %>% summarise(across(starts_with(x), ~mean(is.na(.)))) %>% pivot_longer(cols = everything(), names_to="Variables", values_to="pct_na") %>% filter((pct_na>low & pct_na<high)) %>% pull(Variables)} eda_n_outlier<-function(datafr, x_selected){ # nested df with plots plt<-datafr %>% select(all_of(x_selected)) %>% pivot_longer(cols=everything(),names_to="Variables", values_to="values") %>% nest(-Variables) %>% mutate(plot= map2(.x= data, .y= Variables, ~ggplot(data=.x, aes(x= values)) + geom_boxplot() + labs(title = .y) )) # print the plots for (i in 1:length(x_selected)){ p<-plt[[3]][[i]] print(p)} }
Outcome
The outcome will be Other_Outcome
. As the prediction is whether the patient was better
or worse
after seeking medical treatment, a binary classification is warranted here. However the Other_Outcome
has 4 values, namely cure
, improvement
, unfavourable
and death
.
eda_c(raw, "Other_Outcome") ## $Other_Outcome ## .x ## Cure death Improvement unfavorable <NA> ## 799 277 1179 26 21
cure
and improvement
will be collapsed as better
and unfavourable
and death
will be collapsed as worse
. 6 times as many patients became better after seeking medical help. While encouraging from the doctor’s and patient’s perspective, it results in an imbalanced dataset for prediction. The imbalanced dataset will be addressed much later.
After removing 21 NA
outcomes, there are 2281 observations remaining.
# collapse 4 categories into 2 raw$Other_Outcome<-fct_collapse(raw$Other_Outcome, better=c("Cure", "Improvement")) raw$Other_Outcome<-fct_collapse(raw$Other_Outcome, worse=c("unfavorable", "death")) # remove na df<-raw %>% filter(!is.na(Other_Outcome)) eda_c(df, "Other_Outcome") ## $Other_Outcome ## .x ## better worse <NA> ## 1978 303 0
Discard the noise
There are column names with the prefix rm_
in front of the category prefix (e.g. rm_Other_
). These columns are removed for numerous reasons.
dtype(df,"rm") ## tibble [2,281 x 36] (S3: tbl_df/tbl/data.frame) ## $ rm_R_CXR : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ... ## $ rm_R_CT : chr [1:2281] "No" "No" "No" "No" ... ## $ rm_R_CT_date : chr [1:2281] NA NA NA NA ... ## $ rm_SS : logi [1:2281] NA NA NA NA NA NA ... ## $ rm_SS_infilterate : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ... ## $ rm_SS_WBC : chr [1:2281] "Yes" "Yes" "No" "No" ... ## $ rm_HCAP : chr [1:2281] "No" "No" "No" "No" ... ## $ rm_PE : logi [1:2281] NA NA NA NA NA NA ... ## $ rm_PE_O2 : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ... ## $ rm_Lab : logi [1:2281] NA NA NA NA NA NA ... ## $ rm_Lab_RBC : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ... ## $ rm_Lab_Hb : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ... ## $ rm_Lab_WBC : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ... ## $ rm_Lab_NeuImu : chr [1:2281] "No" "No" "No" "No" ... ## $ rm_Lab_NeuImuDate : chr [1:2281] NA NA NA NA ... ## $ rm_Lab_Neu : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ... ## $ rm_Lab_plt : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ... ## $ rm_Lab_Na : chr [1:2281] "No" "No" "No" "No" ... ## $ rm_Lab_NaDate : chr [1:2281] NA NA NA NA ... ## $ rm_Lab_urea : chr [1:2281] "Yes" "No" "Yes" "Yes" ... ## $ rm_Lab_Cr : chr [1:2281] "Yes" "No" "Yes" "Yes" ... ## $ rm_Lab_Bicarb : chr [1:2281] "No" "No" "No" "No" ... ## $ rm_Lab_BicarbDate : chr [1:2281] NA NA NA NA ... ## $ rm_Lab_Sugar : chr [1:2281] "Yes" "No" "Yes" "Yes" ... ## $ rm_Lab_Alb : chr [1:2281] "No" "No" "No" "No" ... ## $ rm_Lab_AlbDate : chr [1:2281] NA NA NA NA ... ## $ rm_Lab_lactate : chr [1:2281] "No" "No" "No" "No" ... ## $ rm_Lab_lactateDate : chr [1:2281] NA NA NA NA ... ## $ rm_Lab_CRP : chr [1:2281] "Yes" "Yes" "No" "Yes" ... ## $ rm_Lab_ABG : chr [1:2281] "No" "No" "No" "No" ... ## $ rm_Lab_ABGDate : chr [1:2281] NA NA NA NA ... ## $ rm_Abx : logi [1:2281] NA NA NA NA NA NA ... ## $ rm_Care_ICUdate : chr [1:2281] NA NA NA NA ... ## $ rm_Other_phone : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ... ## $ rm_Other_1yearstatus: chr [1:2281] "dead after 1 year" NA "dead after 1 year" "dead after 1 year" ... ## $ rm_Abx_AbxDuration : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
For instance, 1 year status
contains information about the patient’s status one year post CAP which is a data leakage as it reveals the patient’s outcome from the CAP.
df %>% select(rm_Other_1yearstatus) %>% tail() ## # A tibble: 6 x 1 ## rm_Other_1yearstatus ## <chr> ## 1 alive ## 2 dead after 1 year ## 3 alive ## 4 alive ## 5 alive ## 6 alive
Other columns contained duplicated information. For lab results, there is a column, which indicates if the specific biochemical was tested (eg rm_Lab_urea
), and another column of the result (eg Lab_urea
). If the biochemical was not tested, the column will indicate No
test being done and the result column will be blank. Keeping only the results column will suffice. The reasons for removing specific rm_
columns is described in the above data dictionary.
df %>% select(rm_Lab_urea, Lab_urea) %>% head() ## # A tibble: 6 x 2 ## rm_Lab_urea Lab_urea ## <chr> <dbl> ## 1 Yes 60 ## 2 No NA ## 3 Yes 99 ## 4 Yes 56 ## 5 Yes 143 ## 6 Yes 56.3
The dataframe of 176 columns ends up with 140 columns after discarding rm_
columns.
df<-df %>% select(-starts_with("rm")) ncol(df) ## [1] 140
1 Other_
related category
After removing redundant Other_
variables, only Other_Outcome
remains. Rename it to just Outcome
for readability.
dtype(df, "Other") ## tibble [2,281 x 1] (S3: tbl_df/tbl/data.frame) ## $ Other_Outcome: Factor w/ 2 levels "better","worse": 1 1 2 1 2 1 1 1 1 2 ... df<-df %>% rename(Outcome=Other_Outcome)
2 Pt_
Patient related category
dtype(df, "Pt") ## tibble [2,281 x 5] (S3: tbl_df/tbl/data.frame) ## $ Pt_Site : chr [1:2281] "Location A" "Location A" "Location A" "Location A" ... ## $ Pt_CaseNumber: num [1:2281] 1 2 3 4 5 6 7 9 10 11 ... ## $ Pt_Age : num [1:2281] 95 79 89 93 81 83 88 65 27 37 ... ## $ Pt_incorrect : logi [1:2281] NA NA NA NA NA NA ... ## $ Pt_correct : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ...
Appropriate patients
Pt_incorrect
and Pt_correct
are columns to indicate if the patients enrolled met the criteria for the study. All the subjects met the criteria for the study. Pt_incorrect
and Pt_correct
can be dropped.
(df %>% count(Pt_incorrect)) ## # A tibble: 1 x 2 ## Pt_incorrect n ## <lgl> <int> ## 1 NA 2281 (df %>% count(Pt_correct)) ## # A tibble: 1 x 2 ## Pt_correct n ## <chr> <int> ## 1 Yes 2281 df<-df %>% select(- c(Pt_correct, Pt_incorrect))
Case_number
Case numbers, Pt_CaseNumber
are not distinct to the entire dataset. they are only distinct to the research site. Assign new unique case number for entire dataset.
# are case number distinct (df %>% pull(Pt_CaseNumber) %>% n_distinct()) ## [1] 1231 # explore why case numbers are not distinct (df %>% filter(Pt_CaseNumber==1)) ## # A tibble: 3 x 138 ## Pt_Site Pt_CaseNumber Pt_Age R_CXR_infiltrate R_CXR_cavitation R_CXR_effusion ## <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 Locati~ 1 95 Yes No No ## 2 Locati~ 1 26 Yes No No ## 3 Locati~ 1 32 Yes No No ## # ... with 132 more variables: R_CXR_effusionSite <chr>, R_CT_inflitrate <chr>, ## # R_CT_cavitation <chr>, R_CT_effusion <chr>, R_CT_effusionSite <chr>, ## # SS_cough <chr>, SS_phlegm <chr>, SS_lungSounds <chr>, SS_temp <chr>, ## # SS_breathing <chr>, SS_daysOfRespSymp <dbl>, Hx_mass <chr>, Hx_heart <chr>, ## # Hx_stroke <chr>, Hx_kidney <chr>, Hx_liver <chr>, Hx_brainMental <chr>, ## # Hx_diabetes <chr>, Hx_pastCAP <chr>, Hx_asp <chr>, Hx_alcohol <chr>, ## # Hx_immune <chr>, Hx_COPD <chr>, Social_drugs <chr>, ## # Social_overcrowded <chr>, Hx_heart_type <chr>, Social_smoke <chr>, ## # Social_smoke_duration <chr>, Hx_HIV <chr>, Hx_HIV_CD4 <dbl>, ## # Hx_HIV_viralLoad <chr>, Hx_HIV_Medicine <chr>, HCAP_hospStay <chr>, ## # HCAP_IVAbx <chr>, HCAP_Chemo <chr>, HCAP_diaylsis <chr>, HCAP_injury <chr>, ## # PE_AMS <chr>, PE_HR <dbl>, PE_RR <dbl>, PE_BP_S <dbl>, PE_BP_D <dbl>, ## # PE_temp <dbl>, PE_O2 <chr>, Lab_RBC <dbl>, Lab_Hb <dbl>, Lab_WBC <dbl>, ## # Lab_NeuImu <dbl>, Lab_Neu <dbl>, Lab_plt <dbl>, Lab_Na <chr>, ## # Lab_urea <dbl>, Lab_Cr <dbl>, Lab_Bicarb <dbl>, Lab_Sugar <dbl>, ## # Lab_Alb <dbl>, Lab_lactate <dbl>, Lab_lactateHigh <chr>, Lab_CRP <dbl>, ## # Lab_CRPHigh <chr>, Lab_pH <dbl>, Lab_CO2 <dbl>, Lab_O2 <dbl>, ## # Lab_FiO2 <dbl>, CS_Resp <chr>, CS_Blood <chr>, CS_Urine <chr>, ## # CS_screen <chr>, CS_agent <chr>, CS_Organism1 <chr>, ## # CS_Organism1Blood <chr>, CS_Organism1Sputum <chr>, ## # CS_Organism1Tracheal <chr>, CS_Organism1BAL <chr>, CS_Organism1Urine <chr>, ## # CS_Organism1Sero <chr>, CS_Organism1Other <chr>, ## # CS_Organism1Comments <chr>, CS_Organism2 <chr>, CS_Orgainsim2Blood <chr>, ## # CS_Organism2Sputum <chr>, CS_Organism2Tracheal <chr>, ## # CS_Organism2BAL <chr>, CS_Organism2Urine <chr>, CS_OrganismSero <chr>, ## # CS_OrganismOther <chr>, CS_OrganismComments <chr>, ## # Abx_AmoxicillinSulbactam <chr>, Abx_AmoxicillinSulbactamOral <chr>, ## # Abx_AmoxicillinSulbactamNonoral <chr>, ## # Abx_AmoxicillinSulbactamNonoralStart <chr>, ## # Abx_AmoxicillinSulbactamNonoralEnd <chr>, Abx_Ampicillin <chr>, ## # Abx_AmpicillinStart <chr>, Abx_AmpicillinEnd <chr>, ## # Abx_AmpicillinSulbactam <chr>, Abx_Azithromycin <chr>, ## # Abx_Ceftriaxone <chr>, Abx_Cefotaxime <chr>, Abx_ClarithromycinOral <chr>, ## # ... (df %>% filter(Pt_CaseNumber==11)) ## # A tibble: 3 x 138 ## Pt_Site Pt_CaseNumber Pt_Age R_CXR_infiltrate R_CXR_cavitation R_CXR_effusion ## <chr> <dbl> <dbl> <chr> <chr> <chr> ## 1 Locati~ 11 37 Yes No Yes ## 2 Locati~ 11 37 Yes No No ## 3 Locati~ 11 76 Yes Unavailable Unavailable ## # ... with 132 more variables: R_CXR_effusionSite <chr>, R_CT_inflitrate <chr>, ## # R_CT_cavitation <chr>, R_CT_effusion <chr>, R_CT_effusionSite <chr>, ## # SS_cough <chr>, SS_phlegm <chr>, SS_lungSounds <chr>, SS_temp <chr>, ## # SS_breathing <chr>, SS_daysOfRespSymp <dbl>, Hx_mass <chr>, Hx_heart <chr>, ## # Hx_stroke <chr>, Hx_kidney <chr>, Hx_liver <chr>, Hx_brainMental <chr>, ## # Hx_diabetes <chr>, Hx_pastCAP <chr>, Hx_asp <chr>, Hx_alcohol <chr>, ## # Hx_immune <chr>, Hx_COPD <chr>, Social_drugs <chr>, ## # Social_overcrowded <chr>, Hx_heart_type <chr>, Social_smoke <chr>, ## # Social_smoke_duration <chr>, Hx_HIV <chr>, Hx_HIV_CD4 <dbl>, ## # Hx_HIV_viralLoad <chr>, Hx_HIV_Medicine <chr>, HCAP_hospStay <chr>, ## # HCAP_IVAbx <chr>, HCAP_Chemo <chr>, HCAP_diaylsis <chr>, HCAP_injury <chr>, ## # PE_AMS <chr>, PE_HR <dbl>, PE_RR <dbl>, PE_BP_S <dbl>, PE_BP_D <dbl>, ## # PE_temp <dbl>, PE_O2 <chr>, Lab_RBC <dbl>, Lab_Hb <dbl>, Lab_WBC <dbl>, ## # Lab_NeuImu <dbl>, Lab_Neu <dbl>, Lab_plt <dbl>, Lab_Na <chr>, ## # Lab_urea <dbl>, Lab_Cr <dbl>, Lab_Bicarb <dbl>, Lab_Sugar <dbl>, ## # Lab_Alb <dbl>, Lab_lactate <dbl>, Lab_lactateHigh <chr>, Lab_CRP <dbl>, ## # Lab_CRPHigh <chr>, Lab_pH <dbl>, Lab_CO2 <dbl>, Lab_O2 <dbl>, ## # Lab_FiO2 <dbl>, CS_Resp <chr>, CS_Blood <chr>, CS_Urine <chr>, ## # CS_screen <chr>, CS_agent <chr>, CS_Organism1 <chr>, ## # CS_Organism1Blood <chr>, CS_Organism1Sputum <chr>, ## # CS_Organism1Tracheal <chr>, CS_Organism1BAL <chr>, CS_Organism1Urine <chr>, ## # CS_Organism1Sero <chr>, CS_Organism1Other <chr>, ## # CS_Organism1Comments <chr>, CS_Organism2 <chr>, CS_Orgainsim2Blood <chr>, ## # CS_Organism2Sputum <chr>, CS_Organism2Tracheal <chr>, ## # CS_Organism2BAL <chr>, CS_Organism2Urine <chr>, CS_OrganismSero <chr>, ## # CS_OrganismOther <chr>, CS_OrganismComments <chr>, ## # Abx_AmoxicillinSulbactam <chr>, Abx_AmoxicillinSulbactamOral <chr>, ## # Abx_AmoxicillinSulbactamNonoral <chr>, ## # Abx_AmoxicillinSulbactamNonoralStart <chr>, ## # Abx_AmoxicillinSulbactamNonoralEnd <chr>, Abx_Ampicillin <chr>, ## # Abx_AmpicillinStart <chr>, Abx_AmpicillinEnd <chr>, ## # Abx_AmpicillinSulbactam <chr>, Abx_Azithromycin <chr>, ## # Abx_Ceftriaxone <chr>, Abx_Cefotaxime <chr>, Abx_ClarithromycinOral <chr>, ## # ... #assign unique case numbers df<-df %>% mutate(Pt_CaseNumber=1:nrow(df)) (df %>% pull(Pt_CaseNumber) %>% n_distinct()) ## [1] 2281
Age
There is no missing age and mostly elderly >60 across all three sites
# any m/s age (eda_n_NAplt(df, "Pt_Age"))
# distribution of age (ggplot(df, aes(Pt_Age)) + geom_histogram(aes(fill=Pt_Site),alpha=.5,bins=round(sqrt(nrow(df)))/3)) +labs(title = "Distribution of Age") + facet_wrap(.~Pt_Site) +theme(legend.position="none") + geom_vline(xintercept = 60, size=1)
3 R_
Radiology related category
dtype(df, "R") ## tibble [2,281 x 8] (S3: tbl_df/tbl/data.frame) ## $ R_CXR_infiltrate : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ... ## $ R_CXR_cavitation : chr [1:2281] "No" "No" "Yes" "No" ... ## $ R_CXR_effusion : chr [1:2281] "No" "No" "No" "No" ... ## $ R_CXR_effusionSite: chr [1:2281] NA NA NA NA ... ## $ R_CT_inflitrate : chr [1:2281] "Unavailable" "Unavailable" "Unavailable" "Unavailable" ... ## $ R_CT_cavitation : chr [1:2281] "Unavailable" "Unavailable" "Unavailable" "Unavailable" ... ## $ R_CT_effusion : chr [1:2281] "Unavailable" "Unavailable" "Unavailable" "Unavailable" ... ## $ R_CT_effusionSite : chr [1:2281] NA NA NA NA ...
Effusion and effusion site
R_CXR_effusion
and R_CT_effusion
indicates if effusion was seen on the radiological imaging while R_CXR_effusionSite
and R_CT_effusionSite
records the location of the effusion site. These columns contain different facets of the same information; the values of these columns can be integrated into single columns.
On chest x-ray ( R_CXR_effusion
, R_CXR_effusionSite
)
2080 effusion sites on chest x-rays , R_CXR_effusionSite
, were recorded as NA
not because they are truly missing but because these x-rays have no effusion sites in the begin with. They will be relabelled as nil
effusion sites. 43 effusions sites were recorded as NA
and whether effusion sites were found the x-rays are unknown. These observations will retain their NA
values. 9 effusion sites were recorded as NA
but effusions were detected on x-rays. They will be relabelled as effusion but ? site
.
After relabelling R_CXR_effusionSite
using information from R_CXR_effusion
, the latter column will be removed.
(table(df$R_CXR_effusion, df$R_CXR_effusionSite, useNA = "always")) ## ## Bilateral Left Right <NA> ## No 0 0 0 2080 ## Unavailable 0 0 0 43 ## Yes 27 52 70 9 ## <NA> 0 0 0 0 # relabel effusion sites df<-df %>% mutate(R_CXR_effusionSite=case_when( R_CXR_effusion=='No' & is.na(R_CXR_effusionSite)~ "Nil", R_CXR_effusion=='Unavailable' & is.na(R_CXR_effusionSite) ~ "Unavailable", R_CXR_effusion=="Yes" & is.na(R_CXR_effusionSite) ~"Effusion but ?site", T~as.character(R_CXR_effusionSite) )) (table(df$R_CXR_effusion, df$R_CXR_effusionSite, useNA = "always")) ## ## Bilateral Effusion but ?site Left Nil Right Unavailable <NA> ## No 0 0 0 2080 0 0 0 ## Unavailable 0 0 0 0 0 43 0 ## Yes 27 9 52 0 70 0 0 ## <NA> 0 0 0 0 0 0 0 # remove R_CXR_effusion df<-df %>% select(- R_CXR_effusion)
On CT chest (R_CT_effusion
, R_CT_effusionSite
)
Repeat the same for effusion related variables on CT chest.
(table(df$R_CT_effusion, df$R_CT_effusionSite, useNA = "always")) ## ## Bilateral Left Right <NA> ## No 0 0 0 45 ## Unavailable 0 0 0 2205 ## Yes 15 6 10 0 ## <NA> 0 0 0 0 # relabel effusion site df<-df %>% mutate(R_CT_effusionSite=case_when( R_CT_effusion=='No' & is.na(R_CT_effusionSite)~ "Nil", R_CT_effusion=='Unavailable' & is.na(R_CT_effusionSite) ~ "Unavailable", T~as.character(R_CT_effusionSite) )) (table(df$R_CT_effusion, df$R_CT_effusionSite, useNA = "always")) ## ## Bilateral Left Nil Right Unavailable <NA> ## No 0 0 45 0 0 0 ## Unavailable 0 0 0 0 2205 0 ## Yes 15 6 0 10 0 0 ## <NA> 0 0 0 0 0 0 #remove CT_effusion df<-df %>% select(-R_CT_effusion)
4 SS_
Category related to signs and symptoms of CAP
99
days of respiratory symptoms SS_daysOfRespSymp
are outliers and likely represents missing values Relabel 99
as NA
. The usage of 99
occurs frequently for numeric variables in this dataset, more examples to follow in later sections.
(dtype(df, "SS")) ## tibble [2,281 x 6] (S3: tbl_df/tbl/data.frame) ## $ SS_cough : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ... ## $ SS_phlegm : chr [1:2281] "No" "Yes" "Yes" "Yes" ... ## $ SS_lungSounds : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ... ## $ SS_temp : chr [1:2281] "Yes" "Yes" "No" "Yes" ... ## $ SS_breathing : chr [1:2281] "Yes" "Yes" "Yes" "Yes" ... ## $ SS_daysOfRespSymp: num [1:2281] 1 2 5 3 3 4 3 4 NA NA ... ## NULL (eda_c(df, "SS")) ## $SS_cough ## .x ## No Yes <NA> ## 145 2136 0 ## ## $SS_phlegm ## .x ## No Yes <NA> ## 536 1745 0 ## ## $SS_lungSounds ## .x ## No Yes <NA> ## 267 2014 0 ## ## $SS_temp ## .x ## No Yes <NA> ## 798 1483 0 ## ## $SS_breathing ## .x ## No Yes <NA> ## 532 1749 0 ## ## $SS_daysOfRespSymp ## .x ## 0 1 2 3 4 5 6 7 8 9 10 12 13 14 15 20 ## 2 203 414 397 224 267 69 357 43 1 69 2 1 3 55 12 ## 21 22 23 30 99 <NA> ## 1 1 1 6 4 149 df<-df %>% mutate(SS_daysOfRespSymp=na_if(SS_daysOfRespSymp, 99))
5 Hx_
medical history category
(dtype(df, "Hx")) ## tibble [2,281 x 17] (S3: tbl_df/tbl/data.frame) ## $ Hx_mass : chr [1:2281] "No" "No" "No" "No" ... ## $ Hx_heart : chr [1:2281] "No" "No" "Yes" "Yes" ... ## $ Hx_stroke : chr [1:2281] "No" "No" "No" "No" ... ## $ Hx_kidney : chr [1:2281] "No" "No" "No" "No" ... ## $ Hx_liver : chr [1:2281] "No" "No" "No" "No" ... ## $ Hx_brainMental : chr [1:2281] "No" "No" "No" "No" ... ## $ Hx_diabetes : chr [1:2281] "No" "No" "No" "No" ... ## $ Hx_pastCAP : chr [1:2281] "No" "No" "No" "No" ... ## $ Hx_asp : chr [1:2281] "Yes" "No" "No" "No" ... ## $ Hx_alcohol : chr [1:2281] "No" "No" "No" "No" ... ## $ Hx_immune : chr [1:2281] "No" "No" "No" "No" ... ## $ Hx_COPD : chr [1:2281] "No" "Yes" "No" "No" ... ## $ Hx_heart_type : chr [1:2281] NA NA NA NA ... ## $ Hx_HIV : chr [1:2281] "No" "No" "No" "No" ... ## $ Hx_HIV_CD4 : num [1:2281] NA NA NA NA NA NA NA NA NA NA ... ## $ Hx_HIV_viralLoad: chr [1:2281] NA NA NA NA ... ## $ Hx_HIV_Medicine : chr [1:2281] "Unavailable" "Unavailable" "Unavailable" "Unavailable" ... ## NULL (eda_c(df,"Hx")) ## $Hx_mass ## .x ## No Uncertain Yes <NA> ## 2152 9 117 3 ## ## $Hx_heart ## .x ## No Uncertain Yes <NA> ## 1273 19 983 6 ## ## $Hx_stroke ## .x ## No Uncertain Yes <NA> ## 2105 11 161 4 ## ## $Hx_kidney ## .x ## No Uncertain Yes <NA> ## 2110 7 156 8 ## ## $Hx_liver ## .x ## No Uncertain Yes <NA> ## 2214 4 58 5 ## ## $Hx_brainMental ## .x ## No Uncertain Yes <NA> ## 1907 28 341 5 ## ## $Hx_diabetes ## .x ## No Uncertain Yes <NA> ## 1907 13 358 3 ## ## $Hx_pastCAP ## .x ## No Uncertain Yes <NA> ## 1985 5 287 4 ## ## $Hx_asp ## .x ## No Uncertain Yes <NA> ## 2202 15 59 5 ## ## $Hx_alcohol ## .x ## No Uncertain Yes <NA> ## 2117 18 135 11 ## ## $Hx_immune ## .x ## No Uncertain Yes <NA> ## 2130 4 139 8 ## ## $Hx_COPD ## .x ## No Uncertain Yes <NA> ## 1879 50 343 9 ## ## $Hx_heart_type ## .x ## Arrhythmia CHF Hypertension ## 23 43 220 ## Isquemic cardiomyopathy Other <NA> ## 20 1 1974 ## ## $Hx_HIV ## .x ## No Unavailable Yes <NA> ## 2134 105 42 0 ## ## $Hx_HIV_CD4 ## .x ## 14 41 74 99 127 175 245 291 324 349 485 493 844 1104 <NA> ## 1 1 1 3 1 1 1 1 1 1 1 1 1 1 2265 ## ## $Hx_HIV_viralLoad ## .x ## < Detection limit > Detection limit Value <NA> ## 5 3 4 2269 ## ## $Hx_HIV_Medicine ## .x ## No Unavailable Yes <NA> ## 596 1671 14 0
HIV details
Remove Hx_HIV_CD4
& Hx_HIV_viralLoad
as there are too many NA
. Remove Hx_HIV_Medicine
as there are too many Unavailable
.
df<-df %>% select(-contains("Hx_HIV_"))
Heart disease
Hx_heart
indicates whether the patient has heart disease or not. Hx_heart_type
provides details on the type of heart disease. Both the variables can be integrated into a variable.
1273 observations for heart disease details, Hx_heart_type
were labelled as NA
not because they are truly missing but because these patients have No
heart disease. These values will be labelled as None
. 19 observations for heart disease details were labelled as NA
when the heart disease status is Uncertain
. These values will be labelled as Query heart disease
. 676 observations for heart disease details were labelled as NA
but were classified to have a heart disease. As these patients have heart diseases but no details can be obtained, the NA
values will be treated as Other
types of heart disease.
After using Hx_heart
to expand Hx_heart_type
, Hx_heart
will be dropped.
(table(df$Hx_heart, df$Hx_heart_type, useNA="always")) ## ## Arrhythmia CHF Hypertension Isquemic cardiomyopathy Other <NA> ## No 0 0 0 0 0 1273 ## Uncertain 0 0 0 0 0 19 ## Yes 23 43 220 20 1 676 ## <NA> 0 0 0 0 0 6 # relabel df<- df %>% mutate(Hx_heart_type=case_when( Hx_heart== "No" & is.na(Hx_heart_type) ~ "None", Hx_heart=="Uncertain" & is.na(Hx_heart_type) ~ "Query heart disease", Hx_heart=="Yes" & is.na(Hx_heart_type) ~ "Other", T~ Hx_heart_type)) (table(df$Hx_heart, df$Hx_heart_type, useNA="always")) ## ## Arrhythmia CHF Hypertension Isquemic cardiomyopathy None Other ## No 0 0 0 0 1273 0 ## Uncertain 0 0 0 0 0 0 ## Yes 23 43 220 20 0 677 ## <NA> 0 0 0 0 0 0 ## ## Query heart disease <NA> ## No 0 0 ## Uncertain 19 0 ## Yes 0 0 ## <NA> 0 6 # remove `hx_heart` df <- df %>% select (-Hx_heart)
6 Social_
social history category
(dtype(df, "Social")) ## tibble [2,281 x 4] (S3: tbl_df/tbl/data.frame) ## $ Social_drugs : chr [1:2281] "No" "No" "No" "No" ... ## $ Social_overcrowded : chr [1:2281] "No" "No" "No" "No" ... ## $ Social_smoke : chr [1:2281] "No" "Yes" "No" "No" ... ## $ Social_smoke_duration: chr [1:2281] NA "Previous to the last 5 years" NA NA ... ## NULL (eda_c(df, "Social")) ## $Social_drugs ## .x ## No Uncertain Yes <NA> ## 2263 3 10 5 ## ## $Social_overcrowded ## .x ## No Uncertain Yes <NA> ## 2193 7 51 30 ## ## $Social_smoke ## .x ## No Unavailable Yes <NA> ## 1276 175 830 0 ## ## $Social_smoke_duration ## .x ## current In the last 5 years ## 394 200 ## Previous to the last 5 years <NA> ## 236 1451
smoking
Social_smoke
indicates if the patient smokes or not. Social_smoke_duration
records how long the patients was smoking. These variables contain different facets of the same information; the values can be integrated into a single column.
1276 observations for smoking duration, Social_smoke_duration
were labelled as NA
but these values were not truly missing. These patients did not smoke. The values will be relabelled as non-smoker
. 175 observations for smoking duration were labelled as NA
but the information if they smoked was Unavailable
. These values will be relabelled as Unavailable
. All the patients who smoked had the duration of their smoking habit recorded. The bins of smoking duration were relabelled to terms that are more intuitive. current
was relabelled as still smokes
, In the last 5 years
was relabelled as smoked in the last 5y
, Previous to the last 5 years
was relabelled as smoked >5y ago
.
After using Social_smoke
to expand Social_smoke_duration
, Social_smoke
is dropped.
(table(df$Social_smoke, df$Social_smoke_duration, useNA = "always")) ## ## current In the last 5 years Previous to the last 5 years <NA> ## No 0 0 0 1276 ## Unavailable 0 0 0 175 ## Yes 394 200 236 0 ## <NA> 0 0 0 0 # relabel df<-df %>% mutate(Social_smoke=case_when( Social_smoke=="Yes" & Social_smoke_duration=="current"~ "still smokes", Social_smoke=="Yes" & Social_smoke_duration=="In the last 5 years"~ "smoked in last 5y", Social_smoke=="Yes" & Social_smoke_duration=="Previous to the last 5 years"~ "smoked >5y ago", T~as.character(Social_smoke) )) (df %>% count(Social_smoke)) ## # A tibble: 5 x 2 ## Social_smoke n ## <chr> <int> ## 1 No 1276 ## 2 smoked >5y ago 236 ## 3 smoked in last 5y 200 ## 4 still smokes 394 ## 5 Unavailable 175 # remove df <- df %>% select(-Social_smoke_duration)
7 HCAP_
healthcare associated pneumonia category
No data cleaning is needed for this category.
(dtype(df,"HCAP")) ## tibble [2,281 x 5] (S3: tbl_df/tbl/data.frame) ## $ HCAP_hospStay: chr [1:2281] "Yes" "No" "No" "No" ... ## $ HCAP_IVAbx : chr [1:2281] "No" "No" "No" "No" ... ## $ HCAP_Chemo : chr [1:2281] "No" "No" "No" "No" ... ## $ HCAP_diaylsis: chr [1:2281] "No" "No" "No" "No" ... ## $ HCAP_injury : chr [1:2281] "No" "No" "No" "No" ... ## NULL (eda_c(df, "HCAP")) ## $HCAP_hospStay ## .x ## No Uncertain Yes <NA> ## 2039 3 236 3 ## ## $HCAP_IVAbx ## .x ## No Uncertain Yes <NA> ## 2074 1 203 3 ## ## $HCAP_Chemo ## .x ## No Uncertain Yes <NA> ## 2240 3 33 5 ## ## $HCAP_diaylsis ## .x ## No Uncertain Yes <NA> ## 2234 1 41 5 ## ## $HCAP_injury ## .x ## No Uncertain Yes <NA> ## 2206 3 63 9
8 PE_
observations during physical examination category
(dtype(df, "PE")) ## tibble [2,281 x 7] (S3: tbl_df/tbl/data.frame) ## $ PE_AMS : chr [1:2281] "No" "No" "No" "Unavailable" ... ## $ PE_HR : num [1:2281] 88 92 100 95 95 110 85 85 110 85 ... ## $ PE_RR : num [1:2281] 26 24 48 30 30 28 26 25 26 28 ... ## $ PE_BP_S: num [1:2281] 100 110 140 140 120 140 100 120 120 100 ... ## $ PE_BP_D: num [1:2281] 50 60 80 80 60 90 60 60 80 60 ... ## $ PE_temp: num [1:2281] 38 38 36 37 36 39 37 37 40 37 ... ## $ PE_O2 : chr [1:2281] NA NA NA NA ... ## NULL # explore categorical (eda_c(df, "PE_AMS")) ## $PE_AMS ## .x ## No Unavailable Yes <NA> ## 1832 29 420 0
Oxygen levels, PE_O2
are calculated in the form of percentage. In this case, there is a mixture of pure numbers and numbers ending with %
resulting in the variable to be treated as a character variable. %
will be omitted and the variable will be converted to a numeric variable.
(eda_c(df, "PE_O2")) ## $PE_O2 ## .x ## 100 37 55 58 ## 3 1 1 1 ## 60 63 65 67 ## 3 1 3 1 ## 7 70 73 74 ## 1 9 3 6 ## 75 76 77 78 ## 3 2 3 10 ## 79 80 80.099999999999994 82 ## 4 23 1 18 ## 83 84 85 85 % ## 12 17 23 2 ## 86 86,7 86.599999999999994 87 ## 21 1 1 19 ## 88 88% 88,9 88.900000000000006 ## 55 1 1 1 ## 89 90 91 91.299999999999997 ## 50 116 62 1 ## 92 92% 92.400000000000006 93 ## 124 1 1 107 ## 93 % 93.400000000000006 93.599999999999994 94 ## 1 1 1 153 ## 94 % 94.299999999999997 94.5 95 ## 1 1 1 153 ## 95.5 95.609999999999999 95.900000000000006 96 ## 1 1 1 191 ## 96 % 96.5 96.700000000000003 97 ## 1 1 1 142 ## 97.5 98 99 <NA> ## 1 97 35 784 # clean up `PE_O2` df<-df %>% mutate(PE_O2= as.numeric(str_replace_all(PE_O2,pattern="%", replacement = ""))) ## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
Missing PE_
values
Now, all the PE_
variables are in numeric form and the proportion of NA
can be appropriately calculated. Oxygen levels PE_O2
has the highest proportion of missing values, >30% values are missing. PE_O2
will be dropped.
(eda_n_NAplt(df,"PE"))
df<-df %>% select(-PE_O2)
Outlier PE_
values
The following variables have unrealistic outliers:
- Temperature,
PE_temp
. Outliers >50’C will be explored - Breathing rate,
PE_RR
. Outliers of >50 breaths per minute will be explored - Diastolic blood pressure,
PE_BP_D
. There is only one observation with a diastolic blood pressure >300, this observation will be removed.
pe_selected<-eda_n_NAcutoff(df, "PE", 0, 0.3) (eda_n_outlier(df,pe_selected)) ## Warning: All elements of `...` must be named. ## Did you want `data = c(values)`? ## Warning: Removed 176 rows containing non-finite values (stat_boxplot).
## Warning: Removed 201 rows containing non-finite values (stat_boxplot).
## Warning: Removed 167 rows containing non-finite values (stat_boxplot).
## Warning: Removed 168 rows containing non-finite values (stat_boxplot).
## Warning: Removed 167 rows containing non-finite values (stat_boxplot).
## NULL df<-df %>% filter(PE_BP_D<300)
Further investigation of outliers
The 90-ish values stand out for both PE_temp
and PE_RR
. A frequency count will be done.
(df %>% filter(PE_temp>50) %>% ggplot(aes(PE_temp)) + geom_histogram(bins=round(sqrt(nrow(df)))))
(df %>% filter(PE_RR>50) %>% ggplot(aes(PE_RR)) + geom_histogram(bins=round(sqrt(nrow(df)))))
99
is the most common value. Similar to previous numeric variables where 99
is an outlier, it will be converted to NA
. For PE_temp
, the values 363
and 384
, are likely missing a decimal point (It is more likely your body’s temperature is 36.3’C instead of 363’C) .
# PE_temp (df %>% filter(PE_temp>50) %>% group_by(PE_temp) %>% summarise(n(), .groups="drop")) ## # A tibble: 6 x 2 ## PE_temp `n()` ## <dbl> <int> ## 1 67 1 ## 2 80 1 ## 3 92 1 ## 4 99 98 ## 5 363 1 ## 6 384 1 # PE_RR (df %>% filter(PE_RR>50) %>% group_by(PE_RR) %>% summarise(n(), .groups="drop")) ## # A tibble: 10 x 2 ## PE_RR `n()` ## <dbl> <int> ## 1 70 2 ## 2 80 1 ## 3 84 1 ## 4 90 1 ## 5 99 138 ## 6 100 1 ## 7 114 1 ## 8 120 1 ## 9 126 1 ## 10 130 1
The rest of the outliers will take a plausible maximum value based on the 90th-95th percentile.
# 90-ish percentile (quantile(df$PE_temp, probs = seq(0,1,.05), na.rm = T)) ## 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% ## 32 36 36 36 36 36 36 37 37 37 37 37 37 38 38 38 ## 80% 85% 90% 95% 100% ## 38 38 39 40 384 (quantile(df$PE_RR, probs = seq(0,1,.05), na.rm = T)) ## 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% ## 7 16 16 18 18 20 20 20 21 22 23 24 24 24 26 27 ## 80% 85% 90% 95% 100% ## 28 30 35 99 130 # clean up PE_temp and PE_RR df<-df %>% mutate(PE_temp=na_if(PE_temp, 99), PE_RR=na_if(PE_RR, 99), PE_temp=if_else(PE_temp==363, 36.3, PE_temp), PE_temp=if_else(PE_temp==384, 38.4, PE_temp), PE_temp=if_else(PE_temp>50, 40, PE_temp), PE_RR=if_else(PE_RR>50, 35, PE_RR))
To be continued….
Before this post becomes too long, EDA for the remaining categories
- lab findings
Lab_
- cultures
CS_
- antibiotics
Abx_
- continiuum of care state
Care_
- vaccine
V_
will be done in the next post.
save(df, file = "CAP_EDA1.RData")
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.