Linear Regression with Healthcare Data for Beginners in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Category
Tags
In this post I will show how to build a linear regression model. As an example, for this post, I will evaluate the association between vitamin D and calcium in the blood, given that the variable of interest (i.e., calcium levels) is continuous and the linear regression analysis must be used. I will also construct multivariable-adjusted models to account for confounders.
Let's start loading the packages:
library(tidyverse) library(RNHANES) library(ggplot2)
Variables selected for this analysis include age, sex, plasma levels of vitamin D, and plasma levels of calcium. All variables are assessed from NHANES 2007 to 2010 wave.
d07 = nhanes_load_data("DEMO_E", "2007-2008") %>% select(SEQN, cycle, RIAGENDR, RIDAGEYR) %>% transmute(SEQN=SEQN, wave=cycle, RIAGENDR, RIDAGEYR) %>% left_join(nhanes_load_data("VID_E", "2007-2008"), by="SEQN") %>% select(SEQN, wave, RIAGENDR, RIDAGEYR, LBXVIDMS) %>% transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD=LBXVIDMS) %>% left_join(nhanes_load_data("BIOPRO_E", "2007-2008"), by="SEQN") %>% select(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, LBXSCA) %>% transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium = LBXSCA) d09 = nhanes_load_data("DEMO_F", "2009-2010") %>% select(SEQN, cycle, RIAGENDR, RIDAGEYR) %>% transmute(SEQN=SEQN, wave=cycle, RIAGENDR, RIDAGEYR) %>% left_join(nhanes_load_data("VID_F", "2009-2010"), by="SEQN") %>% select(SEQN, wave, RIAGENDR, RIDAGEYR, LBXVIDMS) %>% transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD=LBXVIDMS) %>% left_join(nhanes_load_data("BIOPRO_F", "2009-2010"), by="SEQN") %>% select(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, LBXSCA) %>% transmute(SEQN, wave, RIAGENDR, RIDAGEYR, vitD, Calcium = LBXSCA) dat = rbind(d07, d09) all = dat %>% # exclude missings filter(!is.na(vitD), !is.na(Calcium)) %>% mutate(Gender = recode_factor(RIAGENDR, `1` = "Males", `2` = "Females")) head(all) ## SEQN wave RIAGENDR RIDAGEYR vitD Calcium Gender ## 1 41475 2007-2008 2 62 58.8 9.5 Females ## 2 41477 2007-2008 1 71 81.8 10.0 Males ## 3 41479 2007-2008 1 52 78.4 9.0 Males ## 4 41482 2007-2008 1 64 61.9 9.1 Males ## 5 41483 2007-2008 1 66 53.3 8.9 Males ## 6 41485 2007-2008 2 30 39.1 9.3 Females
The dataset is complete. Before running the regression analysis, the linear model, I will check the assumption, that the distribution of the dependent variable (levels of calcium) is normal.
Distribution of calcium level:
ggplot(data = all) + geom_histogram(aes(Calcium), binwidth = 0.2)
It is a normal distribution.
Note: If the distribution is not normal, the dependant variable should be log transform by using log(Calcium)
.
The model
I will use the function lm()
to create a linear regression model. In the first model I will not adjust for confunders, insted, I will do a univariate model.
fit1 <- lm(Calcium ~ vitD, data = all)
To see the results, estimates, pvalues etc use summary
function.
summary(fit1) ## ## Call: ## lm(formula = Calcium ~ vitD, data = all) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.51254 -0.23398 -0.00581 0.22943 2.64876 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 9.3517792 0.0087769 1065.50 <2e-16 *** ## vitD 0.0016522 0.0001315 12.56 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.3683 on 12389 degrees of freedom ## Multiple R-squared: 0.01258, Adjusted R-squared: 0.0125 ## F-statistic: 157.8 on 1 and 12389 DF, p-value: < 2.2e-16
The 95% confidence interval:
confint(fit1) ## 2.5 % 97.5 % ## (Intercept) 9.334575125 9.368983370 ## vitD 0.001394404 0.001910026
Intepretation
From the results, I find that vitamin D is associated with calcium in the blood because the p-value is less than 0.05. Next, I see the direction of the association. The positive beta estimate (\(\beta\) = 0.0016) indicate that with increasing vitamin D in the blood, the levels of calcium also increases.
To visualize this association I will use the ggplot
and the function geom_smooth
. See below:
ggplot(all, aes(x = vitD, y = Calcium)) + geom_point() + geom_smooth(method="lm")
The plot shows an increase of the levels of Calcium with the increase of vitamin D in the blood.
Multivariable adjusted models
Often, a significant association could be explained by confounders. According to Wikipedia, a confounder is a variable that influences both the dependent variable and independent variable, causing a spurious association. Therefore, it is important to adjust for major confounders such as age and gender. The levels of vitamin D in the blood are dependent to age because older adults have lower vitamin D in blood compared to young adults.
To conduct a multivariable-adjusted model I add other variables to the model, in this example, I will add age and gender.
fit2 <- lm(Calcium ~ vitD + Gender + RIDAGEYR, data = all) summary(fit2) ## ## Call: ## lm(formula = Calcium ~ vitD + Gender + RIDAGEYR, data = all) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.50114 -0.22824 -0.00857 0.22354 2.69352 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 9.4686333 0.0109933 861.307 <2e-16 *** ## vitD 0.0019034 0.0001310 14.526 <2e-16 *** ## GenderFemales -0.0653111 0.0065383 -9.989 <2e-16 *** ## RIDAGEYR -0.0022455 0.0001581 -14.204 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.3639 on 12387 degrees of freedom ## Multiple R-squared: 0.03619, Adjusted R-squared: 0.03596 ## F-statistic: 155.1 on 3 and 12387 DF, p-value: < 2.2e-16
The association between vitamin D and calcium remained significant after adjustment, suggesting that the association is independent (e.g., not explained) by age and gender.
Stratifing analysis
To evaluate the association separately in men and women is necessary to conduct a stratified analysis. For this, I need to separate men and women into two different datasets and run linear regression for each group.
allfem = all %>% filter(Gender == "Females") allmal = all %>% filter(Gender == "Males")
Linear regression in women and men
fitfem <- lm(Calcium ~ vitD + RIDAGEYR, data = allfem) summary(fitfem) ## ## Call: ## lm(formula = Calcium ~ vitD + RIDAGEYR, data = allfem) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.03557 -0.24115 -0.01084 0.22396 2.61555 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 9.2764092 0.0145412 637.940 <2e-16 *** ## vitD 0.0019577 0.0001729 11.321 <2e-16 *** ## RIDAGEYR 0.0005348 0.0002307 2.318 0.0205 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.3727 on 6254 degrees of freedom ## Multiple R-squared: 0.02247, Adjusted R-squared: 0.02216 ## F-statistic: 71.89 on 2 and 6254 DF, p-value: < 2.2e-16 fitmal <- lm(Calcium ~ vitD + RIDAGEYR, data = allmal) summary(fitmal) ## ## Call: ## lm(formula = Calcium ~ vitD + RIDAGEYR, data = allmal) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.42787 -0.21555 -0.00506 0.21384 2.70896 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 9.6027158 0.0150801 636.78 <2e-16 *** ## vitD 0.0016591 0.0001973 8.41 <2e-16 *** ## RIDAGEYR -0.0049452 0.0002105 -23.49 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.3451 on 6131 degrees of freedom ## Multiple R-squared: 0.08713, Adjusted R-squared: 0.08684 ## F-statistic: 292.6 on 2 and 6131 DF, p-value: < 2.2e-16
The interpretation of results should be as above.
Thats all.
Related Post
- Multiple Linear Regression in Python
- Linear regression in Python
- Logistic Regression with Python
- Linear Regression with Python
- Bayesian Statistics: Analysis of Health Data
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.