Chi-square test of independence in R

R on Stats and R

2 years ago

This article was first published on R on Stats and R, and kindly contributed to R-bloggers.

Introduction

This article explains how to perform the Chi-square test of independence in R and how to interpret its results. To learn more about how the test works and how to do it by hand, I invite you to read the article “Chi-square test of independence by hand”.

To briefly recap what has been said in that article, the Chi-square test of independence tests whether there is a relationship between two categorical variables. The null and alternative hypotheses are:

$H_0$ : the variables are independent, there is no relationship between the two categorical variables. Knowing the value of one variable does not help to predict the value of the other variable.
$H_1$ : the variables are dependent, there is a relationship between the two categorical variables. Knowing the value of one variable helps to predict the value of the other variable.

The Chi-square test of independence works by comparing the observed frequencies (the frequencies observed in your sample) to the expected frequencies if there were no relationship between the two categorical variables (the frequencies expected if the null hypothesis were true).
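In symbols, this comparison is usually written as follows. The notation is introduced here for illustration and is not taken from the original article: $O_{ij}$ and $E_{ij}$ are the observed and expected frequencies in row $i$ and column $j$ of the contingency table, $R_i$ and $C_j$ are the row and column totals, $n$ is the total number of observations, and $r$ and $c$ are the numbers of rows and columns.

$$E_{ij} = \frac{R_i \times C_j}{n}, \qquad \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Under the null hypothesis, this statistic approximately follows a Chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom, so a large value of $\chi^2$ is evidence against independence.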

Example

Data

For our example, let’s reuse the dataset introduced in the article “Descriptive statistics in R”. This dataset is the well-known iris dataset, slightly enhanced. Since it contains only one categorical variable and the Chi-square test requires two categorical variables, we add the variable size, which is small if the length of the sepal is smaller than the median of all flowers, and big otherwise:

dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
  "small", "big"
)

We now create a contingency table of the two variables Species and size with the table() function:

table(dat$Species, dat$size)
##             
##              big small
##   setosa       1    49
##   versicolor  29    21
##   virginica   47     3

The contingency table gives the observed number of cases in each subgroup. For instance, there is only one big setosa flower, while there are 49 small setosa flowers in the dataset.
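Raw counts can be easier to read alongside totals and within-group proportions. The following is a minimal sketch using base R's addmargins() and prop.table(); the object names tab and row_props are chosen here for illustration and do not appear in the original article.

```r
# Rebuild the data and the contingency table used in the article
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
  "small", "big"
)
tab <- table(dat$Species, dat$size)

# Observed counts with row and column totals added
addmargins(tab)

# Proportion of big and small flowers within each species (each row sums to 1)
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)
```

In this sample, 98% of the setosa flowers are small whereas 94% of the virginica flowers are big, which already hints at a relationship between the two variables.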

It is also good practice to draw a bar plot representing the data:

library(ggplot2)

ggplot(dat) +
 aes(x = Species, fill = size) +
 geom_bar() +
 scale_fill_hue() +
 theme_minimal()
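A possible variant, not part of the original article, is to stack the bars to 100% so that the share of each size within a species can be compared directly. This sketch assumes ggplot2 is installed; the object name p and the axis label are illustrative choices.

```r
library(ggplot2)

# Rebuild the data used in the article
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
  "small", "big"
)

# position = "fill" rescales each bar to proportions within the species
p <- ggplot(dat) +
  aes(x = Species, fill = size) +
  geom_bar(position = "fill") +
  labs(y = "Proportion") +
  theme_minimal()
p
```

Here every species has 50 flowers, so the two plots tell the same story. The proportional version is mostly useful when the groups differ in size.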

Chi-square test of independence

For this example, we are going to test in R if there is a relationship between the variables Species and size. For this, the chisq.test() function is used:

test <- chisq.test(table(dat$Species, dat$size))
test
## 
##  Pearson's Chi-squared test
## 
## data:  table(dat$Species, dat$size)
## X-squared = 86.035, df = 2, p-value < 2.2e-16

Everything you need appears in this output: the title of the test, the variables used, the test statistic, the degrees of freedom and the $p$-value of the test.¹ You can also retrieve the $\chi^2$ test statistic and the $p$-value with:

test$statistic # test statistic
## X-squared 
##  86.03451
test$p.value # p-value
## [1] 2.078944e-19
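To see where these numbers come from, the statistic can be recomputed from the observed and expected frequencies. This is a sketch of the standard Pearson calculation rather than the internals of chisq.test(); the names observed, expected, chi2, df and p_value are illustrative.

```r
# Rebuild the data and the contingency table used in the article
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
  "small", "big"
)
observed <- table(dat$Species, dat$size)

# Expected frequency of each cell under independence:
# (row total * column total) / total number of observations
n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson's Chi-square statistic and its degrees of freedom
chi2 <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the Chi-square distribution
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)

chi2
df
p_value
```

The three values should agree with the statistic, degrees of freedom and $p$-value reported by chisq.test(). No continuity correction is involved here because the table is larger than 2 x 2.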

If you need to find the expected frequencies, use test$expected.
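As a short illustration, the expected frequencies can be printed and checked against the usual rule of thumb that each of them should be at least 5. The threshold is the one mentioned in the footnote of this article; the check itself is an addition.

```r
# Rebuild the data and rerun the test from the article
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
  "small", "big"
)
test <- chisq.test(table(dat$Species, dat$size))

# Expected frequencies under the null hypothesis of independence
test$expected

# TRUE if every expected frequency is at least 5
all(test$expected >= 5)
```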

Conclusion and interpretation

From the output and from test$p.value we see that the $p$-value is less than the significance level of 5%. As with any other statistical test, if the $p$-value is less than the significance level, we can reject the null hypothesis.

$\Rightarrow$ In our context, rejecting the null hypothesis for the Chi-square test of independence means that there is a significant relationship between the species and the size. Therefore, knowing the value of one variable helps to predict the value of the other variable.
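The test itself only says that a relationship exists, not which cells are responsible for it. One common way to explore this, which goes slightly beyond the original article, is to inspect the residuals stored in the object returned by chisq.test(): residuals holds the Pearson residuals and stdres the standardized residuals.

```r
# Rebuild the data and rerun the test from the article
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
  "small", "big"
)
test <- chisq.test(table(dat$Species, dat$size))

# Pearson residuals: (observed - expected) / sqrt(expected)
round(test$residuals, 2)

# Standardized residuals, which also account for the row and column totals
round(test$stdres, 2)
```

A positive residual means more flowers were observed in a cell than expected under independence, and a negative residual means fewer. Cells with large absolute residuals contribute most to the test statistic.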

Thanks for reading. I hope the article helped you to perform the Chi-square test of independence in R and interpret its results. If you would like to learn how to do this test by hand, read “Chi-square test of independence by hand”. As always, if you find a mistake/bug or if you have any questions, do not hesitate to let me know in the comment section below, raise an issue on GitHub or contact me.

¹ If a warning such as “Chi-squared approximation may be incorrect” appears, it means that at least one expected frequency is lower than 5. To avoid this issue, you can either: (i) gather some levels (especially those with a small number of observations) to increase the number of observations in the subgroups, or (ii) use Fisher’s exact test (which does not have this assumption) with the function fisher.test(). This test is similar to the Chi-square test in terms of hypotheses and interpretation of the results.
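For illustration only, here is how fisher.test() would be called on the same contingency table. The data in this article do not trigger the warning, so the example just shows the syntax; the names tab and fisher_res are illustrative.

```r
# Rebuild the data and the contingency table used in the article
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
  "small", "big"
)
tab <- table(dat$Species, dat$size)

# Fisher's exact test on the same table
fisher_res <- fisher.test(tab)
fisher_res$p.value
```

As with the Chi-square test, a $p$-value below the significance level leads to rejecting the null hypothesis of independence.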
