Introduction
‘Happiness in intelligent people is the rarest thing I know’
A character in Ernest Hemingway’s novel “The Garden of Eden”
Greetings, humanists, social and data scientists!
In this lesson, we will learn how to evaluate the relationship between two variables with R. Check out the video below for a short introduction.
Data source
The Guerry dataset is provided by the R package HistData. To know more about this package, please refer to our lesson ‘Uncovering History with R – A Look at the HistData Package’.
Coding the past: the relationship between literacy and suicides in 1830s France
1. Exploring Andre-Michel Guerry’s Pioneering Data: Moral Statistics of 1830s France
Andre-Michel Guerry was a French lawyer who was passionate about statistics. He is considered to be the founder of moral statistics and had a major influence on the development of modern social science. His work “Essay on the Moral Statistics of France” includes data on several social variables of 86 French departments in the 1830s.
To access this data, we need to load the HistData package. After doing so, we can use the command help(Guerry) to see the description of the dataset and the details about each of the 23 variables. Variables include information such as population, crime, literacy, suicide, wealth, and location of the 86 French departments.
You can use df <- Guerry to load the data. Feel free to explore the dataset and check the structure of the dataframe with str(df).
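The loading steps described above can be sketched as follows. This is a minimal example that assumes the HistData package is already installed; the object name df follows the text.

```r
# Load the package that ships the Guerry dataset
library(HistData)

# help(Guerry)  # uncomment in an interactive session to read the documentation

# Copy the dataset into a working dataframe
df <- Guerry

# Inspect the structure: one row per department, one column per variable
str(df)
```

The str(df) output lists each column with its type, which is a quick way to confirm that Literacy, Suicides, Wealth and Distance are available for the later steps.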
2. Add a new column to a dataframe in R
In the documentation of the dataset, the author states “Note that most of the variables (e.g., Crime_pers) are scaled so that ‘more is better’ morally.” Thus, suicide, for example, is expressed as the population divided by the number of suicides. In this way, the fewer the suicides, the larger the value in the Suicides column.
To make our analysis easier to interpret, we can calculate the inverse of Suicides, that is, instead of having population/suicides, we will consider suicides/population (suicides per inhabitant). Moreover, to avoid very small numbers, let us multiply this by 100,000 so that we have suicides per 100,000 population. The code below creates this new variable.
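One way to write this transformation is shown below. It is a sketch that assumes the new column is called Suicides_Pop, the name used in the rest of the lesson.

```r
library(HistData)

df <- Guerry

# Suicides holds population per suicide, so its inverse is suicides per inhabitant.
# Multiplying by 100,000 expresses the rate per 100,000 population.
df$Suicides_Pop <- 100000 / df$Suicides

# Quick look at the distribution of the new variable
summary(df$Suicides_Pop)
```

Because the new column is a simple inverse, departments with the largest values in Suicides now have the smallest values in Suicides_Pop.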
3. Use geom_point to create a scatter plot
Now, we’ll examine the relationship between Suicides_Pop and Literacy using a scatter plot. As per the documentation, Literacy represents the “percentage of military conscripts who can read and write” in a department. Keep in mind that the relationships studied in this lesson apply only to this subgroup, which is not representative of the whole population. The code below leverages geom_point to visualize this relationship.
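A minimal version of such a scatter plot might look like this. It assumes ggplot2 is installed; the axis labels are illustrative, and theme_minimal() is used as a stand-in because theme_coding_the_past is defined in another lesson.

```r
library(HistData)
library(ggplot2)

df <- Guerry
df$Suicides_Pop <- 100000 / df$Suicides

# One point per department: literacy on the x axis, suicide rate on the y axis
p_scatter <- ggplot(df, aes(x = Literacy, y = Suicides_Pop)) +
  geom_point(alpha = 0.7) +
  labs(
    x = "Literacy (% of conscripts who can read and write)",
    y = "Suicides per 100,000 population"
  ) +
  theme_minimal()  # assumption: stand-in for theme_coding_the_past()

p_scatter
```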
Please note that the plots in this lesson are styled with the custom function theme_coding_the_past. You can access this theme in the lesson ‘Climate Data Visualization’.
The plot suggests that as literacy percentages rise, suicide rates tend to increase. In the distribution of literacy rates below, we also see that the majority of the French departments recorded literacy rates lower than 50% (indicated by the dashed line). If you count the departments to the right of the dashed line, you will find 24 departments, which represents only 24/86 = 28% of the total departments. Notably, the highest suicide rates are in this subgroup.
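The kind of chart used for the distribution is not recoverable from the text, so the sketch below assumes a histogram with a dashed vertical line at 50%. The bin width, the labels, and the names n_above and share_above are illustrative choices.

```r
library(HistData)
library(ggplot2)

df <- Guerry

# Distribution of literacy rates with a dashed reference line at 50%
p_hist <- ggplot(df, aes(x = Literacy)) +
  geom_histogram(binwidth = 5, boundary = 0, color = "white") +
  geom_vline(xintercept = 50, linetype = "dashed") +
  labs(x = "Literacy (%)", y = "Number of departments") +
  theme_minimal()  # assumption: stand-in for theme_coding_the_past()

p_hist

# Count the departments to the right of the dashed line and their share of the total
n_above <- sum(df$Literacy > 50)
share_above <- n_above / nrow(df)
n_above
round(100 * share_above)
```

The last two lines make the count from the text reproducible without reading it off the figure.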
4. cor.test in R
Having observed a graphical association between Literacy and Suicides_Pop, let’s use cor.test to find this association analytically. This function takes two arguments, x and y, and returns a Pearson correlation coefficient (by default) and its statistical significance. As explained in the lesson R programming for climate data analysis and visualization, “correlation measures how much two variables change together. It ranges from 1 to -1, where 1 means perfect positive correlation, 0 means no correlation at all and -1 means perfect negative correlation”.
Using cor.test(x = df$Literacy, y = df$Suicides_Pop) we obtain a correlation coefficient of 0.4, which means a moderate positive correlation. As literacy increases, so does the suicide rate. The p-value is less than 0.01, meaning there is a statistically significant association between Literacy and Suicides_Pop. Framed differently, under the hypothesis that there is no correlation between the two variables, the probability of finding a coefficient at least as extreme as 0.4 would be less than 1%. So we can reject the null hypothesis.
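The test described above can be run as follows. The sketch stores the result in an object called res (a name chosen here for illustration) so that the coefficient and the p-value can be inspected separately.

```r
library(HistData)

df <- Guerry
df$Suicides_Pop <- 100000 / df$Suicides

# Pearson correlation test (the default method) between literacy and suicide rate
res <- cor.test(x = df$Literacy, y = df$Suicides_Pop)

res            # full test output
res$estimate   # correlation coefficient
res$p.value    # p-value of the two-sided test
res$conf.int   # 95% confidence interval for the coefficient
```

The confidence interval is a useful complement to the p-value, since it shows the range of correlation values compatible with the data.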
5. Linear models with R
To further study the relationship between these two variables, let’s fit three linear regression models. To know more about linear regression, check out the lesson R programming for climate data analysis and visualization.
The first model includes only Suicides_Pop as the dependent variable and Literacy as the independent variable. Use summary(lm(Suicides_Pop ~ Literacy, data = df)) to see the results of this model. The literacy coefficient tells us that a one percentage point increase in the literacy rate is associated with about 0.11 more suicides per 100,000 population. Put differently, a 10 percentage point increase in literacy is associated with around 1 more suicide per 100,000 population. This estimate is statistically significant.
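The first model can be fitted and inspected as in the sketch below. The object name model1 is an illustrative choice; the formula is the one given in the text.

```r
library(HistData)

df <- Guerry
df$Suicides_Pop <- 100000 / df$Suicides

# Simple linear regression: suicide rate explained by literacy
model1 <- lm(Suicides_Pop ~ Literacy, data = df)

# Coefficients, standard errors, p-values and R-squared
summary(model1)

# Change in suicides per 100,000 associated with a 10 percentage point rise in literacy
10 * coef(model1)[["Literacy"]]
```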
In the code below, we use geom_smooth to plot the regression line describing the positive link between literacy and suicides. The method argument tells ggplot to use a linear model (lm) to depict the relationship.
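A possible version of this plot is sketched below. As before, theme_minimal() stands in for theme_coding_the_past, and the labels are illustrative.

```r
library(HistData)
library(ggplot2)

df <- Guerry
df$Suicides_Pop <- 100000 / df$Suicides

# Scatter plot with a fitted regression line and its confidence band
p_fit <- ggplot(df, aes(x = Literacy, y = Suicides_Pop)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(
    x = "Literacy (%)",
    y = "Suicides per 100,000 population"
  ) +
  theme_minimal()  # assumption: stand-in for theme_coding_the_past()

p_fit
```

Setting formula = y ~ x explicitly avoids the message that geom_smooth prints when it has to guess the formula.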
Note that we cannot say that higher literacy rates directly cause more suicides, as factors beyond literacy rates might influence suicide rates. In the next section, we will check whether wealth and the distance to Paris influence suicides as well. Moreover, we will determine if the association between literacy and suicides holds even after controlling for these variables. To show the results, we will use stargazer, a very handy package designed for displaying linear model results.
6. How to use stargazer in R
The stargazer package offers a very neat and practical way of presenting the results of several linear models. Users can set it up to produce LaTeX or HTML outputs using the type argument. In the code that follows, we configure it to generate HTML, making it suitable for this blog post. First, we create three models adding variables indicating the wealth and distance to Paris of each department. Second, we pass these models to stargazer.
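The two steps might be written as follows. This sketch assumes the stargazer package is installed and that the wealth and distance variables are the Wealth and Distance columns of the Guerry dataset; the object names model1 to model3 are illustrative.

```r
library(HistData)
library(stargazer)

df <- Guerry
df$Suicides_Pop <- 100000 / df$Suicides

# Step 1: fit three nested models, adding one control variable at a time
model1 <- lm(Suicides_Pop ~ Literacy, data = df)
model2 <- lm(Suicides_Pop ~ Literacy + Wealth, data = df)
model3 <- lm(Suicides_Pop ~ Literacy + Wealth + Distance, data = df)

# Step 2: print the three models side by side as an HTML table
stargazer(model1, model2, model3, type = "html")
```

For a quick look in the console, type = "text" produces a plain-text version of the same table.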
The stargazer table can be seen below. Note in model 2 that Wealth appears to influence Suicides_Pop negatively, meaning that richer areas are associated with fewer suicides. The coefficient of Literacy decreases a bit but remains statistically significant. Finally, model 3 includes the distance to Paris as an additional variable. The coefficient of Literacy decreases again but remains statistically significant. Moreover, being close to Paris is associated with more suicides.
Dependent variable: Suicides_Pop

| | (1) | (2) | (3) |
|---|---|---|---|
| Literacy | 0.112*** | 0.080*** | 0.064** |
| | (0.027) | (0.026) | (0.025) |
| Wealth | | -0.080*** | -0.059*** |
| | | (0.018) | (0.018) |
| Distance | | | -0.014*** |
| | | | (0.004) |
| Constant | 0.645 | 5.347*** | 7.901*** |
| | (1.168) | (1.489) | (1.604) |
| Observations | 86 | 86 | 86 |
| R2 | 0.167 | 0.329 | 0.408 |
| Adjusted R2 | 0.157 | 0.313 | 0.386 |
| Residual Std. Error | 4.360 (df = 84) | 3.938 (df = 83) | 3.720 (df = 82) |
| F Statistic | 16.826*** (df = 1; 84) | 20.321*** (df = 2; 83) | 18.841*** (df = 3; 82) |

Note: *p<0.1; **p<0.05; ***p<0.01. Standard errors are shown in parentheses.
Like all social phenomena, the incidence of suicide is shaped by a multitude of factors. While we cannot definitively claim that literacy directly caused suicides in 19th-century France, our analysis above does indicate an association between these variables. Delving deeper into the contextual nuances of France in the 1830s might shed light on whether literacy indeed influenced the decision to commit suicide. For instance, check the article by Lisa Lieberman, “Romanticism and the Culture of Suicide in Nineteenth-Century France”.
If you are interested in this topic, The Sorrows of Young Werther, by Johann Wolfgang Goethe, is a literary representation of a particular view on suicide that would influence the Romantic movement in 19th-century Europe.
If you have any questions or would like to share your thoughts on this topic, please feel free to ask in the comments below.
Conclusions
- Association between two variables can be identified with a scatter plot;
- It can also be explored analytically with cor.test;
- Linear regression helps us further understand the relationship between two variables, given other relevant variables.