Site icon R-bloggers

To eat or not to eat! That’s the question? Measuring the association between categorical variables

[This article was first published on Stories Data Speak, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

1. Introduction

I serve as a reviewer to several ISI and Scopus indexed journals in Information Technology. Recently, I was reviewing an article, wherein the researchers had made a critical mistake in data analysis. They converted the original categorical data to continuous without providing a rigorous statistical treatment, nor, any justification to the loss of information if any. Thus, my motivation to develop this study, is borne out of their error.

We know the standard association measure between continuous variables is the product-moment correlation coefficient introduced by Karl Pearson. This measure determines the degree of linear association between continuous variables and is both normalized to lie between -1 and +1 and symmetric: the correlation between variables x and y is the same as that between y and x. the best-known association measure between two categorical variables is probably the chi-square measure, also introduced by Karl Pearson. Like the product-moment correlation coefficient, this association measure is symmetric, but it is not normalized. This lack of normalization provides one motivation for Cramer’s V, defined as the square root of a normalized chi-square value; the resulting association measure varies between 0 and 1 and is conveniently available via the assocstats function in the vcd package. An interesting alternative to Cramer’s V is Goodman and Kruskal’s tau, which is not nearly as well known and is asymmetric. This asymmetry arises because the tau measure is based on the fraction of variability in the categorical variable y that can be explained by the categorical variable x. 1

The data for this study is sourced from UCI Machine Learning repository. As it states in the data information section, “This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The guide clearly states that there is no simple rule for determining the edibility of a mushroom;

Furthermore, the possible research questions, I want to explore are;

2. Making data management decisions

As a first step, I imported the data in R environment as;

# Import data from UCI ML repo
> theURL<- "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
# Explicitly adding the column headers from the data dictionary
> mushroom.data<- read.csv(file = theURL, header = FALSE, sep = ",",strip.white = TRUE,
                     stringsAsFactors = TRUE, 
                     col.names = c("class","cap-shape","cap-surface","cap-color","bruises",
                                   "odor","gill-attachment","gill-spacing","gill-size",
                                   "gill-color","stalk-shape","stalk-root","stalk-surface-above-ring",
                                   "stalk-surface-below-ring","stalk-color-above-ring","stalk-color-below-ring",
                                   "veil-type","veil-color","ring-number","ring-type","spore-print-color",
                                   "population","habitat"))

Next, I quickly summarize the dataset to get a brief glimpse. The reader’s should note that the data has no missing values.

# Calculate number of levels for each variable
> mushroom.data.levels<-cbind.data.frame(Variable=names(mushroom.data), Total_Levels=sapply(mushroom.data,function(x){as.numeric(length(levels(x)))}))
> print(mushroom.data.levels)
                                         Variable Total_Levels
class                                       class            2
cap.shape                               cap.shape            5
cap.surface                           cap.surface            3
cap.color                               cap.color           10
bruises                                   bruises            2
odor                                         odor            8
gill.attachment                   gill.attachment            1
gill.spacing                         gill.spacing            2
gill.size                               gill.size            3
gill.color                             gill.color           10
stalk.shape                           stalk.shape            3
stalk.root                             stalk.root            6
stalk.surface.above.ring stalk.surface.above.ring            4
stalk.surface.below.ring stalk.surface.below.ring            5
stalk.color.above.ring     stalk.color.above.ring            7
stalk.color.below.ring     stalk.color.below.ring            8
veil.type                               veil.type            2
veil.color                             veil.color            2
ring.number                           ring.number            3
ring.type                               ring.type            5
spore.print.color               spore.print.color            7
population                             population            7
habitat                                   habitat            8

As we can see, the variable, gill.attachement has just one level. Nor, does it make any significant contribution to the target class so dropping it.

# dropping variable with constant variance
> mushroom.data$gill.attachment<- NULL

The different levels are uninterpretable in their current format. I will use the data dictionary and recode the levels into meaningful names.

> levels(mushroom.data$class)<- c("edible","poisonous")
> levels(mushroom.data$cap.shape)<-c("bell","conical","flat","knobbed","sunken","convex") 
> levels(mushroom.data$cap.surface)<- c("fibrous","grooves","smooth","scaly")
> levels(mushroom.data$cap.color)<- c("buff","cinnamon","red","gray","brown","pink","green","purple","white","yellow")
> levels(mushroom.data$bruises)<- c("bruisesno","bruisesyes")
> levels(mushroom.data$odor)<-c("almond","creosote","foul","anise","musty","nosmell","pungent","spicy","fishy")
> levels(mushroom.data$gill.attachment)<- c("attached","free")
> levels(mushroom.data$gill.spacing)<- c("close","crowded")
> levels(mushroom.data$gill.size)<-c("broad","narrow")
> levels(mushroom.data$gill.color)<- c("buff","red","gray","chocolate","black","brown","orange","pink","green","purple","white","yellow")
> levels(mushroom.data$stalk.shape)<- c("enlarging","tapering")
> table(mushroom.data$stalk.root) # has a missing level coded as ?
   	?    b    c    e    r 
2480 3776  556 1120  192 
> levels(mushroom.data$stalk.root)<- c("missing","bulbous","club","equal","rooted")
> levels(mushroom.data$stalk.surface.above.ring)<-c("fibrous","silky","smooth","scaly")
> levels(mushroom.data$stalk.surface.below.ring)<-c("fibrous","silky","smooth","scaly")
> levels(mushroom.data$stalk.color.above.ring)<- c("buff","cinnamon","red","gray","brown",                "orange","pink","white","yellow")
> levels(mushroom.data$stalk.color.below.ring)<- c("buff","cinnamon","red","gray","brown",      "orange","pink","white","yellow")
> levels(mushroom.data$veil.type)<-c("partial")
> levels(mushroom.data$veil.color)<- c("brown","orange","white","yellow")
> levels(mushroom.data$ring.number)<-c("none","one","two")
> levels(mushroom.data$ring.type)<- c("evanescent","flaring","large","none","pendant")
> levels(mushroom.data$spore.print.color)<- c("buff","chocolate","black","brown","orange","green","purple","white","yellow")
> levels(mushroom.data$population)<- c("abundant","clustered","numerous","scattered","several","solitary")
> levels(mushroom.data$habitat)<-c("woods","grasses","leaves","meadows","paths","urban","waste")

3. Initial data visualization

a. Is there a relationship between cap-surface and cap-shape of a mushroom?
> p<- ggplot(data = mushroom.data, aes(x=cap.shape, y=cap.surface, color=class))
> p + geom_jitter(alpha=0.3) + scale_color_manual(breaks = c('edible','poisonous'),values=c('darkgreen','red'))

Fig-1: Mushroom cap-shape and cap-surface

From Fig-1, we can easily notice, the mushrooms with a, flat cap-shape and scaly, smooth or fibrous cap-surface are poisonous. While, the mushrooms with a, bell,knob or sunken cap-shape and are fibrous, smooth or scaly are edible. A majority of flat cap-shaped mushrooms with scaly or smooth cap surface are poisonous.

b. Is mushroom habitat and its population related?
> p<- ggplot(data = mushroom.data, aes(x=population, y=habitat, color=class))
> p + geom_jitter(alpha=0.3) +  
  	scale_color_manual(breaks = c('edible','poisonous'),values=c('darkgreen','red'))

Fig-2: Mushroom cap-shape and cap-surface

From Fig-2, we see that mushrooms which are clustered or scattered in population and living in woods are entirely poisonous. Those that live in grasses, wasteland, meadows, leaves, paths and urban area’s are edible.

c. What’s the deal with living condition and odor?
> p<- ggplot(data = mushroom.data, aes(x=habitat, y=habitat, color=class))
> p + geom_jitter(alpha=0.3) +scale_color_manual(breaks = c('edible','poisonous'),values=c('darkgreen','red'))

Fig-3: Mushroom habitat and odor

From Fig-3, we notice, the mushrooms with fishy, spicy,pungent, foul, musty and creosote odor are clearly marked poisonous irrespective of there habitat. Whereas the one’s with almond, anise odour are edible mushrooms. We also notice, that a minority of no odour mushrooms are poisonous while the one’s living in meadows are entirely poisonous in nature.

Although, there could be many other pretty visualizations but I will leave that as a future work.

I will now focus on exploratory data analysis.

4. Exploratory data analysis

a. Correlation detection & treatment for categorical predictors

If we look at the structure of the dataset, we notice that each variable has several factor levels. Moreover, these levels are unordered. Such unordered categorical variables are termed as nominal variables. The opposite of unordered is ordered, we all know that. The ordered categorical variables are called, ordinal variables.

“In the measurement hierarchy, interval variables are highest, ordinal variables are next, and nominal variables are lowest. Statistical methods for variables of one type can also be used with variables at higher levels but not at lower levels.”, see Agresti

I found this cheat-sheet that can aid in determining the right kind of test to perform on categorical predictors (independent/explanatory variables). Also, this SO post is very helpful. See the answer by user gung.

For categorical variables, the concept of correlation can be understood in terms of significance test and effect size (strength of association)

The Pearson’s chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data. It is a significance test. Given two categorical random variables, X and Y, the chi-squared test of independence determines whether or not there exists a statistical dependence between them. Formally, it is a hypothesis test. The chi-squared test assumes a null hypothesis and an alternate hypothesis. The general practice is, if the p-value that comes out in the result is less than a pre-determined significance level, which is 0.05 usually, then we reject the null hypothesis.

H0: The The two variables are independent

H1: The The two variables are dependent

The null hypothesis of the chi-squared test is that the two variables are independent and the alternate hypothesis is that they are related.

To establish that two categorical variables (or predictors) are dependent, the chi-squared statistic must have a certain cutoff. This cutoff increases as the number of classes within the variable (or predictor) increases.

In section 3a, 3b and 3c, I detected possible indications of dependency between variables by visualizing the predictors of interest. In this section, I will test to prove how well those dependencies are associated. First, I will apply the chi-squared test of independence to measure if the dependency is significant or not. Thereafter, I will apply the Goodman’s Kruskal Tau test to check for effect size (strength of association).

i. Pearson’s chi-squared test of independence (significance test)
> chisq.test(mushroom.data$cap.shape, mushroom.data$cap.surface, correct = FALSE)

Pearson's Chi-squared test

data:  mushroom.data$cap.shape and mushroom.data$cap.surface
X-squared = 1011.5, df = 15, p-value < 2.2e-16

since the p-value is < 2.2e-16 is less than the cut-off value of 0.05, we can reject the null hypothesis in favor of alternative hypothesis and conclude, that the variables, cap.shape and cap.surface are dependent to each other.

> chisq.test(mushroom.data$habitat, mushroom.data$odor, correct = FALSE)

Pearson's Chi-squared test

data:  mushroom.data$habitat and mushroom.data$odor
X-squared = 6675.1, df = 48, p-value < 2.2e-16    

Similarly, the variables habitat and odor are dependent to each other as the p-value < 2.2e-16 is less than the cut-off value 0.05.

ii. Effect size (strength of association)

The measure of association does not indicate causality, but association–that is, whether a variable is associated with another variable. This measure of association also indicates the strength of the relationship, whether, weak or strong.

Since, I’m dealing with nominal categorical predictor’s, the Goodman and Kruskal’s tau measure is appropriate. Interested readers are invited to see pages 68 and 69 of the Agresti book. More information on this test can be seen here

> library(GoodmanKruskal)
> varset1<- c("cap.shape","cap.surface","habitat","odor","class")
> mushroomFrame1<- subset(mushroom.data, select = varset1)
> GKmatrix1<- GKtauDataframe(mushroomFrame1)
> plot(GKmatrix1, corrColors = "blue")

In Fig-4, I have shown the association plot. This plot is based on the corrplot library. In this plot the diagonal element K refers to number of unique levels for each variable. The off-diagonal elements contain the forward and backward tau measures for each variable pair. Specifically, the numerical values appearing in each row represent the association measure τ(x,y)τ(x,y) from the variable xx indicated in the row name to the variable yy indicated in the column name.

The most obvious feature from this plot is the fact that the variable odor is almost perfectly predictable (i.e. τ(x,y)=0.94) from class and this forward association is quite strong. The forward association suggest that x=odor (which has levels “almond”, “creosote”, “foul”, “anise”, “musty”, “nosmell”, “pungent”, “spicy”, “fishy”) is highly predictive of y=class (which has levels “edible”, “poisonous”). This association between odor and class is strong and indicates that if we know a mushroom’s odor than we can easily predict its class being edible or poisonous.

On the contrary, the reverse association y=class and x=odor(i.e. τ(y,x)=0.34; is a strong association and indicates that if we know the mushroom’s class being edible or poisonous than its easy to predict its odor.

Earlier we have found cap.shape and cap.surface are dependent to each other (chi-squared significance test). Now, let’s see if the association is strong too or not. Again, from Fig-4, both the forward and reverse association suggest that x=cap shape is weakly associated to y=cap surface (i.e.τ(x,y)=0.03) and (i.e.τ(y,x)=0.01). Thus, we can safely say that although these two variables are significant but they are association is weak; i.e. it will be difficult to predict one from another.

Similarly, many more associations can be interpreted from plot-4. I invite interested reader’s to explore it further.

5. Conclusion

The primary objective of this study was to drive the message, do not tamper the data without providing a credible justification. The reason I chose categorical data for this study to provide an in-depth treatment of the various measures that can be applied to it. From my prior readings of statistical texts, I could recall that significance test alone was not enough justification; there had to be something more. It is then, I found the about the different types of association measures and it sure did cleared my doubts. In my next post, I will continue the current work by providing inferential and predictive analysis. For interested reader’s, I have uploaded the complete code on my Github repository in here

To leave a comment for the author, please follow the link and comment on their blog: Stories Data Speak.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.