Learning inferential statistics using R
[This article was first published on R-posts.com, and kindly contributed to R-bloggers].
Imagine you need to find the average height of 20-year-olds. One way is to go around and measure each person individually, but that is quite a bit of work. Luckily, there’s a better way. Inferential statistics allows us to use samples to draw conclusions about the population. In other words, we can measure a small group of people and use their characteristics to estimate the characteristics of the entire group.
To see how this works in practice, let’s take a look at a dataset from Kaggle. This platform provides a wealth of data sets from various fields, each offering unique challenges for R users. Here, we’ll be using a dataset on Cardiovascular diseases compiled by Jocelyn Dumlao.
This dataset originates from a renowned multispecialty hospital situated in India and covers a range of health-related information. It consists of 1000 rows and 14 columns, and is intended to support the early detection of cardiovascular disease.
Let us see how to import this into RStudio. The dataset is in .csv format, so it can be read with read.csv, which is part of base R; loading the ‘readr’ library is only necessary if you prefer its read_csv function. Replace “File path” with the path of your downloaded dataset.
library(readr)
cardio <- read.csv("File path")
Just type in the name of the variable you used to import the dataset to view the entire dataset in RStudio.
cardio
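Printing a whole data frame is rarely the quickest way to get an overview. The following is a minimal sketch, assuming a tiny stand-in CSV written to a temporary file (the `patientid` and `age` columns and all values are hypothetical; only `serumcholestrol` is a column name taken from this post). It shows how `str` and `summary` give a compact description after `read.csv`.

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV in a temp location
tmp_path <- tempfile(fileext = ".csv")
writeLines(c("patientid,age,serumcholestrol",
             "1,53,250",
             "2,40,229",
             "3,49,142"), tmp_path)

# read.csv() is part of base R and returns a data.frame
cardio_demo <- read.csv(tmp_path)

# str() lists each column with its type and first few values
str(cardio_demo)

# summary() reports min, quartiles, mean and max of a numeric column
summary(cardio_demo$serumcholestrol)
```

With the real dataset, the same two calls can be applied to `cardio` directly.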
The first 6 rows of the dataset can be viewed using the ‘head’ function.
top_6 <- head(cardio)
top_6
Similarly, the last 6 rows of the dataset can be viewed using the ‘tail’ function.
bottom_6 <- tail(cardio)
bottom_6
The dimension of the dataset (number of rows and columns) can be found out using the ‘dim’ function.
dimension <- dim(cardio)
dimension
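The ‘head’ and ‘tail’ functions accept an `n` argument when six rows is not what you want, and `nrow` and `ncol` return the two parts of `dim` separately. A small sketch on a hypothetical stand-in data frame (the `demo` object and its columns are invented for illustration):

```r
# Hypothetical stand-in data frame with 10 rows and 2 columns
demo <- data.frame(id = 1:10, value = seq(10, 100, by = 10))

head(demo, n = 3)   # first 3 rows instead of the default 6
tail(demo, n = 3)   # last 3 rows

# dim() returns c(rows, columns); nrow() and ncol() return each part
dims   <- dim(demo)
n_rows <- nrow(demo)
n_cols <- ncol(demo)
```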
Here we treat the entire dataset as the population, so the population parameters can be computed directly. The mean of a target variable in the population is calculated by the ‘mean’ function. Below, we choose serumcholestrol as the target variable.
mean_chol <- mean(cardio$serumcholestrol)
mean_chol
So, the average serumcholestrol level in the patient population taken from the hospital is 311.447.
There also exists a function to calculate the standard deviation of a dataset.
std_chol <- sd(cardio$serumcholestrol)
std_chol
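This value tells us that serumcholestrol values typically deviate from the mean by about 132.4438. Note that ‘sd’ divides by n - 1 rather than n, so strictly speaking it returns the sample standard deviation; with 1000 observations the difference is negligible.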
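R's ‘sd’ function uses the n - 1 denominator. The sketch below, on a small made-up vector, contrasts it with the population formula that divides by n; the variable names are illustrative only.

```r
# Small made-up vector so the arithmetic is easy to follow
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# sd() divides the sum of squared deviations by n - 1
sd_sample <- sd(x)

# Population standard deviation divides by n instead
sd_population <- sqrt(mean((x - mean(x))^2))

# The two differ only by the factor sqrt((n - 1) / n)
sd_rescaled <- sd_sample * sqrt((n - 1) / n)
```

As n grows, the factor sqrt((n - 1) / n) approaches 1, which is why the distinction hardly matters for a dataset with 1000 rows.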
We now take a random sample of size 100 from our target variable, serumcholestrol. The third argument of ‘sample’ controls replacement: give it as TRUE if you want to sample with replacement. Here, we’re taking a sample without replacement.
sample_1 <- sample(cardio$serumcholestrol, 100, FALSE)
sample_1
mean_sample_chol <- mean(sample_1)
mean_sample_chol
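Because ‘sample’ draws at random, each run yields different values, so the sample mean you obtain will generally differ from the figure quoted in this post. One way to make a draw repeatable is `set.seed`. This is a sketch on a hypothetical stand-in population (the integers 1 to 1000), not the authors' original code:

```r
# Hypothetical stand-in for cardio$serumcholestrol
population <- 1:1000

# Fixing the seed before sampling makes the draw repeatable
set.seed(42)
s1 <- sample(population, 100, replace = FALSE)
set.seed(42)
s2 <- sample(population, 100, replace = FALSE)
identical(s1, s2)   # the two draws match

# Without replacement no value can be picked twice
anyDuplicated(s1)   # 0 means no duplicates

# With replacement the same value may appear more than once
s3 <- sample(population, 100, replace = TRUE)
```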
The mean of the sample that we selected is 317.51. This mean can be used to calculate a test statistic, which in turn is used to decide whether to reject or fail to reject the null hypothesis.
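The post does not say which test is intended. As one possibility, here is a sketch of a one-sample t-test with `t.test`, assuming simulated stand-in data and a null hypothesis that the sample comes from a population with mean `mu0` (both are assumptions for illustration):

```r
# Simulated stand-in for the cholesterol column (not the real data)
set.seed(1)
population <- rnorm(1000, mean = 300, sd = 130)
smp <- sample(population, 100)

# Null hypothesis: the sample comes from a population with this mean
mu0 <- mean(population)

# Test statistic by hand: (sample mean - mu0) / standard error
t_manual <- (mean(smp) - mu0) / (sd(smp) / sqrt(length(smp)))

# The same statistic, plus a p-value, from the built-in test
result <- t.test(smp, mu = mu0)
result$statistic
result$p.value
```

A small p-value would lead us to reject the null hypothesis; otherwise we fail to reject it.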
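Calculating the standard deviation and standard error of the sample
The standard deviation describes the spread of the data around the mean. The standard deviation of the sampling distribution of a statistic is called its standard error; for the sample mean, it is estimated by dividing the sample standard deviation by the square root of the sample size.
sd_sample_1 <- sd(sample_1)
sd_sample_1
std_error <- sd_sample_1 / sqrt(length(sample_1))
std_error
The mean and the standard deviation of the sample are close to the population mean and standard deviation. The standard error is much smaller, because it measures the uncertainty of the sample mean rather than the spread of individual values.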
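The standard error of the mean is commonly estimated as the sample standard deviation divided by the square root of the sample size. The sketch below applies that formula to simulated stand-in data and uses it for an approximate 95% confidence interval; the normal quantile is an assumption that is reasonable for a sample of this size.

```r
# Simulated stand-in for a sample of 100 cholesterol values
set.seed(7)
smp <- rnorm(100, mean = 310, sd = 130)
n <- length(smp)

# Spread of the individual values
sample_sd <- sd(smp)

# Standard error of the mean: sd divided by sqrt(n)
std_error <- sample_sd / sqrt(n)

# Approximate 95% confidence interval for the population mean
ci <- mean(smp) + c(-1, 1) * qnorm(0.975) * std_error
ci
```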
Plotting the sampling distribution as a histogram, with cholesterol levels on the x-axis and frequency on the y-axis
To get a sampling distribution, we draw 1000 samples and record the mean of each. This is done using the replicate function, which evaluates an expression a given number of times. Note that these samples are drawn with replacement.
samp_dist_1 <- replicate(1000, mean(sample(cardio$serumcholestrol, 100, replace = TRUE)))
samp_dist_1
hist(samp_dist_1, main = "Sampling distribution of serum cholesterol", xlab = "Cholesterol Levels", ylab = "Frequency", col = "skyblue")
The resulting histogram resembles a normal distribution. That is, values near the mean occur more frequently than values far from the mean. Now let's calculate the variance of the sampling distribution using the var function.
variance_sample_1 <- var(samp_dist_1)
variance_sample_1
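When samples of size n are drawn with replacement, the variance of the sample mean is the population variance divided by n. The following sketch checks this on simulated stand-in data (the population here is generated with `rnorm`, not taken from the dataset):

```r
# Simulated stand-in population (not the real cholesterol values)
set.seed(123)
population <- rnorm(1000, mean = 310, sd = 130)
n <- 100

# Empirical sampling distribution of the mean, as in the post
samp_dist <- replicate(1000, mean(sample(population, n, replace = TRUE)))
empirical_var <- var(samp_dist)

# Theoretical value: population variance (denominator N) divided by n
pop_var <- mean((population - mean(population))^2)
theoretical_var <- pop_var / n

# A ratio close to 1 means simulation and theory agree
ratio <- empirical_var / theoretical_var
ratio
```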
Now let us see how increasing the sample size affects the variance of the sampling distribution.
Increasing the sample size to 200
sample_2 <- sample(cardio$serumcholestrol, 200, FALSE)
sample_2
Calculating the mean of sample 2
mean_sample_chol <- mean(sample_2)
mean_sample_chol
The mean of sample 2, with sample size 200, is 308.875.
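Calculating the standard deviation and standard error of sample 2
sd_sample_2 <- sd(sample_2)
sd_sample_2
std_error <- sd_sample_2 / sqrt(length(sample_2))
std_error
The standard deviation of sample 2 is 135.9615, which gives a standard error of the mean of about 9.61 (135.9615 divided by the square root of 200).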
We repeat the previous steps to obtain a sampling distribution.
samp_dist_2 <- replicate(1000, mean(sample(cardio$serumcholestrol, 200, replace = TRUE)))
samp_dist_2
Now we plot it like before.
hist(samp_dist_2, main = "Sampling distribution of serum cholesterol", xlab = "Cholesterol Levels", ylab = "Frequency", col = "skyblue")
variance_sample_2 <- var(samp_dist_2)
variance_sample_2
The variance of the sampling distribution for sample size 200 is 84.513, which is smaller than the variance obtained for sample size 100. Hence we can conclude that as the sample size increases, the variance of the sample mean, and with it the standard error, decreases. In other words, precision increases with an increase in sample size.
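Theory gives a rough check on these figures: the variance of the sample mean is about σ²/n. With the population standard deviation of 132.4438 reported earlier, that is about 175.4 for n = 100 and about 87.7 for n = 200, close to the 84.513 observed. The sketch below repeats the experiment for several sample sizes on simulated stand-in data (the population and the choice of sizes are assumptions for illustration):

```r
# Simulated stand-in population (not the real cholesterol values)
set.seed(2023)
population <- rnorm(1000, mean = 310, sd = 130)

# Sample sizes to compare
sizes <- c(50, 100, 200, 400)

# Variance of 1000 sample means for each sample size
vars <- sapply(sizes, function(n) {
  var(replicate(1000, mean(sample(population, n, replace = TRUE))))
})

# Variance and standard error both shrink as n grows
data.frame(sample_size = sizes, variance = vars, std_error = sqrt(vars))
```

Each doubling of the sample size roughly halves the variance, so the standard error falls by a factor of about the square root of 2.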
Authors: Aadith Joseph Mathew, Amrutha Paalathara, Devika S Vinod, Jyosna Philip
Learning inferential statistics using R was first posted on December 16, 2023 at 7:55 pm.