Statistical inference with the General Social Survey Data
Setup
Load packages
Load data
Part 1: Data
Background
The General Social Survey (GSS) is a sociological survey used to collect information and keep a historical record of the concerns, experiences, attitudes, and practices of residents of the United States. GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior. The dataset used for this project is an extract of the GSS Cumulative File 1972-2012. It consists of 57061 observations of 114 variables. Each variable corresponds to a specific question asked of the respondent.
Methodology
According to Wikipedia, the GSS is conducted face-to-face, through in-person interviews, by NORC at the University of Chicago. The target population is adults (18+) living in households in the United States. Respondents are randomly sampled from a mix of urban, suburban, and rural geographic areas. Participation in the study is strictly voluntary.
The scope of inference
The sample should allow us to generalize to the population of interest: it is a survey of 57061 U.S. adults aged 18 or older, based on random sampling. However, no causation can be established, because the GSS is an observational study and can only establish correlation/association between variables. In addition, there is potential non-response bias, because this is a voluntary in-person survey that takes approximately 90 minutes, and some potential respondents may choose not to participate.
Part 2: Research questions
Research Question 1: Do Americans distrust the media? That is, is there an association between how often people read the news and their confidence in the press?
According to the Financial Times, public trust in traditional media has fallen to an all-time low as people increasingly favour their friends and contacts on the internet as sources of news. This prompted me to look at how the frequency of reading news relates to a person’s confidence in the press. The analysis will be based on the following variables:
- news – How often does respondent read newspaper.
- conpress – Confidence in press.
To reflect the most recent trends, I will keep only the data from 2010 and remove all “NA”s.
Exploratory data analysis for Research Question 1
The data I will be using for this analysis has 1398 observations. Here is the summary and two-way table.
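The filtering and tabulation described above can be sketched in a few lines. The original post is written in R and its code is not reproduced here; this is a minimal Python sketch using only the standard library. The records, the level labels and the helper name `two_way_table` are hypothetical stand-ins, not the GSS extract.

```python
from collections import Counter

# Hypothetical records standing in for the GSS extract (NOT real data).
# None marks a missing answer, the equivalent of NA.
records = [
    {"year": 2010, "news": "Everyday", "conpress": "Only Some"},
    {"year": 2010, "news": "Never", "conpress": "Hardly Any"},
    {"year": 2010, "news": "Everyday", "conpress": None},
    {"year": 2008, "news": "Once A Week", "conpress": "A Great Deal"},
    {"year": 2010, "news": "Everyday", "conpress": "Only Some"},
]

def two_way_table(rows, year, row_var, col_var):
    """Count (row_var, col_var) pairs for one survey year, skipping missing values."""
    kept = [
        (r[row_var], r[col_var])
        for r in rows
        if r["year"] == year and r[row_var] is not None and r[col_var] is not None
    ]
    return Counter(kept)

table = two_way_table(records, 2010, "news", "conpress")
for (news, conpress), count in sorted(table.items()):
    print(f"{news:12s} {conpress:12s} {count}")
```

Each key of `table` is one cell of the two-way table, and the total count is the number of observations left after filtering.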
Mosaic plots provide a way to visualize contingency tables. If we compare the mosaic plot to the table of counts, the sizes of the boxes are proportional to the counts in the table. We can see that the number of respondents who read every day and have only some confidence in the press is the highest, and the number of respondents who read a few times a week and have a great deal of confidence in the press is the lowest.
Observations:
- The number of respondents who have a great deal of confidence in the press is consistently the lowest across all groups of news readers.
- The number of respondents who have hardly any confidence in the press is the highest in three groups of news readers: “Never”, “Less than Once Wk” and “Once A Week”.
Inference for Research Question 1
State hypothesis
Null Hypothesis: Frequency of reading news and confidence in press are independent. Frequency of reading news does not vary by confidence in press.
Alternative Hypothesis: Frequency of reading news and confidence in press are dependent. Frequency of reading news does vary by confidence in press.
Check conditions
- The survey respondents were randomly sampled, so we can assume independence.
- If sampling without replacement, n must be less than 10% of the population. The 1398 observations meet this requirement.
- Each case contributes to only one cell in the table. Because each respondent is recorded once, this condition is met as well.
- Each cell must have at least 5 expected cases. From the table below, it is clear that this requirement is met.
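The expected-count condition can be checked directly: under independence, each expected count is the row total times the column total, divided by the grand total. This is a minimal Python sketch of that check, not the R code from the original post. The 2x3 table is hypothetical and the function name `expected_counts` is illustrative.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical 2x3 table of counts (NOT the GSS counts).
observed = [
    [20, 30, 50],
    [30, 20, 50],
]

expected = expected_counts(observed)

# The condition requires every expected count to be at least 5.
all_cells_ok = all(cell >= 5 for row in expected for cell in row)
print(expected, all_cells_ok)
```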
State method to be used and why and how
The method used here is the chi-square test of independence, since we are evaluating the relationship between two categorical variables by quantifying how different the observed counts are from the counts expected under independence.
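To make "how different the observed counts are from the expected counts" concrete: the statistic sums (observed - expected)^2 / expected over every cell, with (rows - 1) x (columns - 1) degrees of freedom. The sketch below is in Python rather than the post's R, uses a hypothetical 2x2 table, and the function name is illustrative.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count for this cell if the two variables were independent.
            exp = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - exp) ** 2 / exp

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

# Hypothetical 2x2 table (NOT the GSS counts).
statistic, df = chi_square_independence([[10, 20], [20, 10]])
print(statistic, df)
```

With five levels of news and three levels of conpress, the same degrees-of-freedom rule gives (5 - 1) x (3 - 1) = 8.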
Perform inference
The chi-square statistic is 8.6459 with 8 degrees of freedom, and the associated p-value is larger than the significance level of 0.05.
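The p-value can be recovered from the reported statistic and degrees of freedom. When the degrees of freedom are even, the upper tail of the chi-square distribution has a closed form, exp(-x/2) times a short sum of powers of x/2, so no statistics library is needed. This is a minimal Python sketch under that even-df assumption, not the R call used in the original post. The function name is illustrative.

```python
import math

def chi_square_p_value_even_df(statistic, df):
    """Upper-tail chi-square probability; this closed form holds only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half = statistic / 2.0
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Statistic and degrees of freedom as reported for Research Question 1.
p_value = chi_square_p_value_even_df(8.6459, 8)
print(f"p-value = {p_value:.3f}")
```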
Interpret results
We fail to reject the null hypothesis: the data do not provide convincing evidence that the frequency of reading news and confidence in the press are associated.
Research Question 2: Does marital status have an impact on financial satisfaction?
According to a report from The National Center for Biotechnology Information (NCBI), people who are financially satisfied tend to be more stable in their marriages. I am interested to find out whether satisfaction in financial situation is associated with marital status, by analyzing GSS data. I will be using the following variables for this analysis:
- marital - Marital status
- satfin - Satisfaction with financial situation
I will remove “NA”s and keep data from all years.
Exploratory data analysis for Research Question 2
The data I will be using for this analysis has 52441 observations. Here is the summary and two-way table.
It is interesting to see that the number of respondents who are “Married” and “Satisfied” is not the highest across all the groups. It seems people are hard to please.
Observations:
- The number of respondents who are more or less satisfied with their financial situation is the highest in every marital status group except “Separated”.
- The number of respondents who are not at all satisfied with their financial situation is the lowest in the “Married” and “Widowed” groups.
- The number of respondents who are satisfied with their financial situation is the lowest in the “Never Married”, “Separated” and “Divorced” groups.
Inference for Research Question 2
State hypothesis
Null Hypothesis: Marital status and satisfaction with financial situation are independent. Satisfaction with financial situation does not vary by marital status.
Alternative Hypothesis: Marital status and satisfaction with financial situation are dependent. Satisfaction with financial situation does vary by marital status.
Check conditions
- The independence and sample size (n < 10% of the population) conditions were discussed earlier.
- Each cell must have at least 5 expected cases. From the table below, it is clear that this requirement is met.
State method to be used and why and how
Again, the method used here is the chi-square test of independence, since we are evaluating the relationship between two categorical variables by quantifying how different the observed counts are from the counts expected under independence.
Perform inference
The chi-square statistic is 1728.2 with 8 degrees of freedom, and the associated p-value is almost 0.
Interpret results
With such a small p-value, we reject the null hypothesis in favor of the alternative hypothesis and conclude that satisfaction with financial situation and marital status are dependent. The study is observational, so we can only establish association, not causation, between these two variables.
Research Question 3: Income disparity among races
The existence of a racial wage gap in America has been documented by many scholars and written about in many journals. Some research finds that the racial wage gap is concentrated in the private sector. What about U.S. residents in general? I am interested in drawing statistical inference from GSS data to compare family income between different races. I will be using the following variables for this analysis:
- race: Race of respondent
- coninc: Total family income in constant dollars
Again, I will remove all “NA”s and keep data from all years.
Exploratory data analysis for Research Question 3
The data I will be using for this analysis has 51232 observations and 2 variables.
Observations:
- The distribution of total family income is right-skewed in every race group.
- The White group has the highest total family income in first quartile, median, mean and 3rd quartile.
- The Black group has the least variability in total family income.
Inference for Research Question 3
State hypothesis
Null Hypothesis: The average total family income is the same across all the race groups.
Alternative Hypothesis: The average total family income differs between at least one pair of the race groups.
Check conditions
- The survey respondents were randomly sampled and each sample is less than 10% of the population, so the condition of independence within groups is met.
- We have three race groups and they are independent of each other.
- The distribution of the response variable in each race group should be approximately normal. Unfortunately, the QQ plot below shows that all three groups diverge considerably from normality in the upper tail.
- Variability should be consistent across groups. From the boxplot above, the variability appears consistent across the “White” and “Other” groups, but it is much lower for the “Black” group.
Although the ANOVA assumptions are not fully met, I will still perform ANOVA, but I have to be very careful in interpreting the results.
State method to be used and why and how
We are comparing average family income across more than two groups: “White”, “Black” and “Other”. We want to know whether those three means are so far apart that the observed differences cannot all reasonably be attributed to sampling variability. So we use ANOVA.
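ANOVA makes "so far apart" precise by comparing the variability between group means with the variability within groups; the ratio of the two mean squares is the F statistic. Below is a minimal Python sketch of that calculation, not the R output from the original post. The three small income samples (in thousands of dollars) are hypothetical and the function name is illustrative.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F statistic, df between groups, df within groups) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])

    # Variability of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical family incomes in thousands of dollars (NOT GSS data).
white = [52, 61, 48, 75, 66]
black = [31, 40, 36, 29, 44]
other = [45, 58, 39, 70, 51]

f_stat, df_between, df_within = one_way_anova_f([white, black, other])
print(f"F = {f_stat:.2f} on {df_between} and {df_within} degrees of freedom")
```

A large F means the group means differ by more than the within-group spread would suggest; the p-value is the upper tail of the F distribution at that value.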
Interpret results
With such a small p-value (almost 0), we reject the null hypothesis. We have sufficient evidence that the population average family income differs for at least one pair of race groups, but we do not yet know which pair.
Since the ANOVA result was significant, we can continue with pairwise comparisons to find out which pairs of groups have different incomes.
Using the Bonferroni adjustment, all pairwise comparisons (White-Black, White-Other and Black-Other) are statistically significant. Even with this more stringent, conservative correction we still get very small p-values, so we can reject the null hypothesis of equal means for each pair.
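The Bonferroni adjustment divides the significance level by the number of comparisons; with three groups there are three pairs, so each pairwise test is judged against 0.05 / 3. This is a minimal Python sketch of that rule, not the R call from the original post. The p-values are hypothetical and chosen only to show one comparison failing the adjusted threshold, and the function name is illustrative.

```python
from itertools import combinations

def bonferroni_decisions(p_values, alpha=0.05):
    """Compare each p-value with alpha divided by the number of comparisons."""
    adjusted_alpha = alpha / len(p_values)
    decisions = {pair: p < adjusted_alpha for pair, p in p_values.items()}
    return decisions, adjusted_alpha

groups = ["White", "Black", "Other"]
pairs = list(combinations(groups, 2))  # 3 groups give 3 pairwise comparisons

# Hypothetical unadjusted p-values (NOT the values from the GSS analysis).
p_values = dict(zip(pairs, [1e-10, 0.001, 0.03]))

decisions, adjusted_alpha = bonferroni_decisions(p_values)
for pair, reject in decisions.items():
    print(pair, "reject" if reject else "fail to reject")
```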
Confidence Interval
As a last step, I calculate a confidence interval for one pairwise group difference: “White” versus “Other”.
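One common way to build such an interval is the difference in sample means plus or minus a critical value times the standard error. This is a minimal Python sketch under stated assumptions: it uses the normal critical value, which is reasonable for large samples, and unpooled variances. The original analysis may have used a t critical value or a pooled estimate instead. The data are hypothetical and the function name is illustrative.

```python
import math
from statistics import NormalDist, mean, variance

def diff_of_means_ci(sample_a, sample_b, confidence=0.95):
    """Large-sample confidence interval for mean(sample_a) - mean(sample_b)."""
    diff = mean(sample_a) - mean(sample_b)
    # Standard error of the difference, using each group's own sample variance.
    se = math.sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff + z * se

# Hypothetical family incomes in thousands of dollars (NOT GSS data).
white = [52, 61, 48, 75, 66, 58, 49, 70]
other = [45, 58, 39, 70, 51, 47, 42, 63]

low, high = diff_of_means_ci(white, other)
print(f"95% CI for the difference: {low:.1f} to {high:.1f}")
```

If the interval excludes 0, the difference between the two group means is significant at the matching level.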
Conclusion
We are 95% confident that the average total family income of the “White” group was $3042 to $6140 higher than the “Other” group per year.
Here we performed ANOVA to assess whether any significant difference exists among the groups, and then used pairwise comparisons to determine which group differences are statistically significant. Although our data diverge significantly from normality in all three race groups, ANOVA is generally considered robust to departures from the normality assumption.
References:
Pearson’s Chi-squared test for count data
Statistical inference with the General Social Survey Data was originally published by Susan Li at Susan Li | Data Ninja on June 07, 2017.