Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is important for them to have some basic knowledge of data science. This series aims to help people working in and around the medical field enhance their data science skills.
We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.
This is the tenth part of the series. It aims to cover the very basics of the correlation coefficient and principal components analysis, two methods that illustrate how variables are related.
In my opinion, researchers need a sense of how variables relate to one another, both to spot potential cause-and-effect relationships and to remove unnecessary variables. Any such causal relationship remains hypothetical: a high correlation between two variables is not, by itself, evidence that one causes the other. In particular, we will go through the Pearson correlation coefficient, confidence intervals by the bootstrap, and principal component analysis.
Before proceeding, it might be helpful to look over the help pages for ggplot, cor, cor.test, boot.cor, quantile, eigen, princomp, summary, plot, and autoplot.
Moreover please load the following libraries.
install.packages("ggplot2")
library(ggplot2)
install.packages("ggfortify")
library(ggfortify)
Please run the code below in order to load the data set and transform it into a proper data frame format:
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
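# Use col_names rather than names, which would shadow the base R function names()
col_names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- col_names
# Drop rows with a recorded mass of 0 (logical indexing is safe even if no row matches)
data <- data[data$mass != 0, ]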
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Compute the value of the correlation coefficient for the variables age and preg.
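As a rough illustration of what cor computes, here is a minimal sketch on synthetic data. The vectors x and y are made-up stand-ins, not columns of the Pima data, so the numbers it prints say nothing about age and preg.

```r
# Synthetic stand-ins for two numeric columns (assumption: not the Pima data)
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Pearson correlation coefficient (the default method of cor)
r <- cor(x, y)

# The same quantity from its definition: covariance over the product of SDs
r_manual <- cov(x, y) / (sd(x) * sd(y))

print(c(r = r, r_manual = r_manual))
```

The two values should agree up to floating-point error, which is a quick way to check that you understand what the function returns.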
Exercise 2
Construct the scatterplot for the variables age and preg.
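One possible shape for a ggplot2 scatterplot is sketched below. It assumes ggplot2 is installed, and the data frame df with columns x and y is synthetic and purely illustrative.

```r
library(ggplot2)

# Synthetic data frame (assumption: illustrative only, not the Pima data)
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 0.5 * df$x + rnorm(100)

# Map the two variables to the axes and draw one point per observation
p <- ggplot(df, aes(x = x, y = y)) +
  geom_point(alpha = 0.6) +
  labs(title = "Scatterplot of y against x", x = "x", y = "y")

print(p)
```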
Exercise 3
Apply a correlation test to the variables age and preg, with the null hypothesis that the correlation is zero and the alternative that it is different from zero.
hint: cor.test
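The sketch below shows how a two-sided test of zero correlation might look with cor.test. The vectors are synthetic stand-ins, and the choice of the Pearson method and a 95% level is spelled out explicitly even though both are the defaults.

```r
# Synthetic stand-ins (assumption: not the Pima data)
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# H0: correlation is zero; H1: correlation differs from zero
res <- cor.test(x, y,
                alternative = "two.sided",
                method = "pearson",
                conf.level = 0.95)

print(res$estimate)  # sample correlation
print(res$p.value)   # p-value for the two-sided test
print(res$conf.int)  # confidence interval for the correlation
```

The returned object also carries the test statistic and degrees of freedom, which print(res) displays together.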
Exercise 4
Construct a 95% confidence interval for the correlation by the bootstrap. First, find the correlation by bootstrap.
hint: mean
Exercise 5
Now that you have found the correlation, find the 95% confidence interval.
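A minimal sketch of a percentile bootstrap for a correlation follows. It is one common way to do this, not necessarily the approach used on the solutions page; the data are synthetic, and the number of resamples B = 2000 is an arbitrary choice.

```r
# Synthetic stand-ins (assumption: not the Pima data)
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

B <- 2000  # number of bootstrap resamples (arbitrary choice)

# Resample (x, y) pairs with replacement and recompute the correlation each time
boot_r <- replicate(B, {
  idx <- sample.int(n, replace = TRUE)
  cor(x[idx], y[idx])
})

# Bootstrap estimate of the correlation
boot_mean <- mean(boot_r)

# Percentile 95% confidence interval
ci <- quantile(boot_r, probs = c(0.025, 0.975))

print(boot_mean)
print(ci)
```

Resampling row indices, rather than x and y separately, keeps each pair together; shuffling the two vectors independently would destroy the very relationship being measured.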
Exercise 6
Find the eigenvalues and eigenvectors for the data set (exclude the class variable).
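As a sketch of the mechanics, the code below applies eigen to the correlation matrix of a small synthetic data frame. Using the correlation matrix rather than the covariance matrix is an assumption made here; the exercise does not say which one is intended.

```r
# Synthetic numeric data frame (assumption: illustrative only)
set.seed(1)
n <- 200
X <- data.frame(a = rnorm(n), b = rnorm(n))
X$c <- X$a + 0.3 * rnorm(n)

# Correlation matrix of the numeric columns
R <- cor(X)

# Eigen decomposition: values are returned in decreasing order,
# and the columns of $vectors are the matching eigenvectors
e <- eigen(R)

print(e$values)
print(e$vectors)
```

For a correlation matrix, the eigenvalues sum to the number of variables, so each one can be read as a share of the total variance.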
Exercise 7
Compute the principal components for the dataset used above.
Exercise 8
Show the importance of each principal component.
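The following sketch shows how princomp and summary fit together on a synthetic data frame. Setting cor = TRUE (PCA on the correlation matrix) is an assumption made here; it is often preferred when variables are on different scales.

```r
# Synthetic numeric data frame (assumption: illustrative only)
set.seed(1)
n <- 200
X <- data.frame(a = rnorm(n), b = rnorm(n))
X$c <- X$a + 0.3 * rnorm(n)
X$d <- X$b - 0.5 * X$a + 0.5 * rnorm(n)

# Principal components computed from the correlation matrix
pca <- princomp(X, cor = TRUE)

# Standard deviation, proportion of variance, and cumulative proportion
print(summary(pca))

# The proportion of variance can also be computed by hand from the sdev slot
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
print(prop_var)
```

With cor = TRUE, the squared standard deviations match the eigenvalues of the correlation matrix, which ties this step back to the eigen decomposition.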
Exercise 9
Plot the principal components using an elbow graph.
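An elbow (scree) graph can be drawn with base R's screeplot, as in the sketch below. The data are synthetic, and the use of cor = TRUE is again an assumption.

```r
# Synthetic numeric data frame (assumption: illustrative only)
set.seed(1)
n <- 200
X <- data.frame(a = rnorm(n), b = rnorm(n))
X$c <- X$a + 0.3 * rnorm(n)
X$d <- X$b - 0.5 * X$a + 0.5 * rnorm(n)

pca <- princomp(X, cor = TRUE)

# Variance explained by each component, largest first
variances <- pca$sdev^2
print(variances)

# Line-type scree plot: look for the point where the curve flattens out
screeplot(pca, type = "lines", main = "Scree plot")
```

Calling plot on a princomp object draws the same kind of graph, with bars by default.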
Exercise 10
Construct a scatterplot with the first component on the x-axis and the second component on the y-axis. If possible, also draw the eigenvectors on the plot.
hint: autoplot
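A hedged sketch of the autoplot route is shown below. It assumes both ggplot2 and ggfortify are installed and uses a synthetic data frame; the loadings arguments ask ggfortify to overlay the variable loadings (the eigenvector directions) as labeled arrows.

```r
library(ggplot2)
library(ggfortify)

# Synthetic numeric data frame (assumption: illustrative only)
set.seed(1)
n <- 200
X <- data.frame(a = rnorm(n), b = rnorm(n))
X$c <- X$a + 0.3 * rnorm(n)
X$d <- X$b - 0.5 * X$a + 0.5 * rnorm(n)

pca <- princomp(X, cor = TRUE)

# First component on the x-axis, second on the y-axis, with loading arrows
p <- autoplot(pca, loadings = TRUE, loadings.label = TRUE)

print(p)
```

If ggfortify is not available, base R's biplot(pca) gives a comparable picture.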