[This article was first published on My R Nightmares, and kindly contributed to R-bloggers].
While writing a paper on sparse principal component analysis (SPCA), I came across an old dataset containing 1990s socio-economic data and the rate of violent crime for 1,994 communities in the US. I am not a sociologist, so my analysis may be superficial, but I found the results interesting with respect to Mr Trump’s political views. It turns out that the traditional approach of considering only the largest loadings of the principal components (PCs) seems to support the view that immigrants are a major cause of violent crime. Applying SPCA instead gives an entirely different view of the problem: it identifies mainly socio-economic characteristics, rather than being an immigrant or speaking poor English, as drivers of crime. Naturally, the two things are correlated, but the causal inference may be different.
The dataset is called Communities and Crime and can be downloaded from the UCI Machine Learning Repository. I deleted 26 variables with missing values, ending up with 99 explanatory variables, and ran principal component analysis (PCA) on their correlation matrix. The first and second PCs explain 25.3% and 17% of the variance of the data, respectively. Their ordered contributions (loadings scaled so that their absolute values sum to one) are plotted below.
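The PCA step and the scaling of loadings into contributions can be sketched in base R. This is only an illustration: the simulated matrix `X` and its column names are made up and stand in for the 99 variables of the real dataset.

```r
# Illustrative only: simulated data stand in for the crime dataset variables.
set.seed(1)
n <- 200; p <- 8
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + 0.3 * rnorm(n)   # induce some correlation
X[, 3] <- X[, 1] + 0.5 * rnorm(n)
colnames(X) <- paste0("v", seq_len(p))

# PCA on the correlation matrix: centre and scale every column first.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of the total variance explained by each PC.
vexp <- pca$sdev^2 / sum(pca$sdev^2)

# Contributions: each column of loadings divided by the sum of its absolute values.
contrib <- sweep(pca$rotation, 2, colSums(abs(pca$rotation)), "/")

# Contributions to the first PC, ordered by absolute value (largest first).
pc1 <- contrib[, 1]
pc1_sorted <- pc1[order(abs(pc1), decreasing = TRUE)]
print(round(100 * pc1_sorted, 1))
```

Keeping the first few entries of `pc1_sorted` corresponds to choosing a threshold on the absolute contributions.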
Clearly it is difficult to choose a threshold that isolates the most important contributions. I selected the ten largest contributions, corresponding to a threshold of about 2%. Then I ran a sparse PCA algorithm of my invention*, requiring that each sparse component explain at least 95% of the variance explained by the corresponding PC. Both sets of loadings are shown in Table 1 below. Rj-sq is the R-squared obtained by regressing a variable on all the others in the set (the one used to compute the variance inflation factor in regression): the closer its value is to one, the more strongly the variable is (multiply) correlated with the others.
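For readers who want to reproduce the Rj-sq column, here is a minimal sketch on simulated data (the variables `a`, `b` and `c` are hypothetical). It computes Rj-sq from its definition and through the variance inflation factor, which is the j-th diagonal element of the inverse correlation matrix.

```r
# Illustrative only: two nearly collinear variables and one independent variable.
set.seed(2)
n <- 300
z <- rnorm(n)
X <- cbind(a = z + 0.1 * rnorm(n),
           b = z + 0.1 * rnorm(n),
           c = rnorm(n))

# Direct definition: regress each variable on all the others in the set.
rj_sq_lm <- sapply(seq_len(ncol(X)), function(j) {
  summary(lm(X[, j] ~ X[, -j]))$r.squared
})

# Shortcut: VIF_j = j-th diagonal element of the inverse correlation matrix,
# and Rj-sq = 1 - 1 / VIF_j.
vif <- diag(solve(cor(X)))
rj_sq_vif <- 1 - 1 / vif

print(round(cbind(rj_sq_lm, rj_sq_vif), 3))
```

The two columns agree up to rounding error. The shortcut avoids fitting one regression per variable, but it needs an invertible correlation matrix.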
TABLE 1
Contribution | Variable | Rj-sq |
---|---|---|
PCA | ||
-2.2% | median family income (differs from household income) | 0.98 |
-2.2% | median household income | 0.98 |
-2.1% | percentage of kids in family housing with two parents | 0.98 |
-2.1% | percentage of households with investment / rent income in 1989 | 0.83 |
2.1% | percentage of people under the poverty level | 0.84 |
-2.1% | percentage of families (with kids) that are headed by two parents | 0.98 |
-2.0% | percent of kids 4 and under in two parent households | 0.90 |
-2.0% | per capita income | 0.92 |
2.0% | percentage of households with public assistance income in 1989 | 0.75 |
2.0% | percent of occupied housing units without phone (in 1990, this was rare!) | 0.76 |
SPCA | ||
-51% | median family income (differs from household income) | 0.51 |
-37% | percentage of kids in family housing with two parents | 0.52 |
12% | percent of family households that are large (6 or more) | 0.09 |
It turns out that the ten most influential variables in the first PC are highly correlated and are indicators either of income or of families with two parents. They all have more or less the same weight. The sparse component, with only three variables, explains 96.5% of the variance explained by the PC. Its contributions, like those of the PC, indicate that income (51%) and families with two parents (37%) are influential, but they also add a 12% contribution from the percentage of large families. This variable is hardly correlated with the others, and its contribution to the PC ranks 46th in absolute value (so it would not have been picked as influential in an analysis based on the largest loadings).
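To give an idea of how a figure such as "96.5% of the variance explained by the PC" can be obtained, here is a rough sketch on simulated data. It is not my SPCA algorithm. The sparse loadings come from a simple stand-in (a least-squares fit of the PC scores on two hand-picked variables), and I assume that the variance explained by a component is the variance of the data reproduced by regressing all the variables on it. All names are hypothetical.

```r
# Illustrative only: one common factor drives v1..v4, v5 and v6 are noise.
set.seed(3)
n <- 500
f <- rnorm(n)
X <- cbind(v1 = f + 0.3 * rnorm(n),
           v2 = f + 0.3 * rnorm(n),
           v3 = f + 0.3 * rnorm(n),
           v4 = f + 0.3 * rnorm(n),
           v5 = rnorm(n),
           v6 = rnorm(n))
Z <- scale(X)      # standardised data, i.e. PCA on the correlation matrix
S <- cor(X)

# Assumed criterion: variance of Z reproduced by regressing it on Z %*% a.
vexp <- function(a, S) drop(t(a) %*% S %*% S %*% a) / drop(t(a) %*% S %*% a)

a_pc <- eigen(S)$vectors[, 1]      # full-cardinality loadings of the first PC
pc1 <- drop(Z %*% a_pc)

# Stand-in sparse component: least-squares fit of the PC on two variables.
keep <- c("v1", "v2")
a_sp <- setNames(numeric(ncol(X)), colnames(X))
a_sp[keep] <- coef(lm(pc1 ~ Z[, keep] - 1))
sp1 <- drop(Z %*% a_sp)

rel_vexp <- vexp(a_sp, S) / vexp(a_pc, S)   # share of the PC's variance retained
r_with_pc <- abs(cor(sp1, pc1))             # correlation with the PC
contrib_sp <- a_sp / sum(abs(a_sp))         # sparse contributions

print(round(c(rel_vexp = rel_vexp, r_with_pc = r_with_pc), 3))
```

Under this criterion the first PC attains the maximum, so `rel_vexp` cannot exceed one.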
OK, nothing really new here. I only managed to pick up that living conditions are of some importance. But let’s look at the second components. The ten largest contributions to the second PC and those of the sparse component are given in Table 2.
TABLE 2
Contribution | Variable | Rj-sq |
---|---|---|
PCA | ||
2.7% | percent of population who have immigrated within the last 10 years | 1.00 |
2.7% | percent of population who have immigrated within the last 8 years | 1.00 |
2.6% | percent of population who have immigrated within the last 5 years | 1.00 |
2.6% | percent of population who have immigrated within the last 3 years | 0.98 |
2.6% | percent of people foreign born | 0.96 |
-2.3% | percent of people who speak only English | 0.95 |
2.3% | percent of people who do not speak English well | 0.94 |
2.1% | percent of persons in dense housing (more than 1 person per room) | 0.86 |
2.0% | percentage of population that is of asian heritage | 0.63 |
2.0% | percentage of population that is of hispanic heritage | 0.90 |
SPCA | ||
42% | percent of population who have immigrated within the last 10 years | 0.11 |
-15% | percentage of population that is 65 and over in age | 0.06 |
15% | owner occupied housing – upper quartile value | 0.55 |
14% | percent of family households that are large (6 or more) | 0.47 |
13% | number of people living in areas classified as urban | 0.32 |
Wow! Nine of the ten variables with the largest contributions to the second PC relate to immigrants! These are highly correlated (the first seven have an Rj-sq of 0.94 or higher). That’s right, Mr Trump, immigrants do make a difference: the relative weight of their contributions among the ten most influential variables is about 91%.
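The 91% figure is simple arithmetic on the PCA rows of Table 2. The check below uses short labels that I made up for the ten variables, and it assumes that dense housing is the one variable not counted as immigration-related.

```r
# Absolute contributions (in %) of the ten largest loadings of PC2, from Table 2.
# The short names are hypothetical labels for the table rows.
contrib <- c(imm10 = 2.7, imm8 = 2.7, imm5 = 2.6, imm3 = 2.6,
             foreign_born = 2.6, only_english = 2.3, poor_english = 2.3,
             dense_housing = 2.1, asian = 2.0, hispanic = 2.0)

# Assumption: every variable except dense housing counts as immigration-related.
immigration <- setdiff(names(contrib), "dense_housing")

share <- sum(contrib[immigration]) / sum(contrib)
print(round(100 * share, 1))
```

That is 21.8 out of a total of 23.9, or roughly 91%.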
But let’s look at the sparse contributions: the story is quite different. The percentage of the population that immigrated recently has a contribution of 42%, but other socio-economic aspects are now influential. We again find large families (which is obviously correlated with living in dense housing), but also the percentage of elderly residents and the number of urban residents. Surprisingly, owner-occupied housing has a positive contribution. This may be due to families for which the cost of the mortgage is a substantial part of their income. Note that the variables in the sparse component are not highly correlated.
PCA and SPCA are not designed for predicting an external variable but for summarising the variation in a set of variables with a few components. However, if we regress the violent crime rate on the components, the sparse components turn out to be somewhat better at predicting it than the full-cardinality PCs, as shown by the R-squared coefficients in Table 3, which also gives other summary statistics for the SPCA results.
TABLE 3
Statistic | Comp1 | Comp2 | Comp3 | Comp4 | Comp5 |
---|---|---|---|---|---|
cumulative vexp | 24.4% | 40.8% | 49.8% | 57.2% | 62.7% |
relative cumulative vexp (sPC/PC) | 96.5% | 96.5% | 96.5% | 96.6% | 96.6% |
Cardinality | 3 | 5 | 7 | 9 | 8 |
Correlation with PC | 0.97 | 0.98 | 0.942 | 0.833 | 0.92 |
R-sq log(crime rate) on PCs | 44.0% | 49.0% | 50.9% | 51.3% | 53.3% |
R-sq log(crime rate) on sparse comp. | 45.7% | 51.9% | 56.4% | 57.2% | 57.2% |
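The two R-squared rows of Table 3 come from regressing log(crime rate) on the first k components, for k = 1, 2 and so on. A minimal sketch of that comparison, on simulated data with a made-up response `y` and hand-picked sparse loadings (not the ones from my analysis), could look like this:

```r
# Illustrative only: simulated predictors and a hypothetical response.
set.seed(4)
n <- 400
X <- matrix(rnorm(n * 6), n, 6, dimnames = list(NULL, paste0("v", 1:6)))
X[, 2] <- X[, 1] + 0.4 * rnorm(n)
X[, 4] <- X[, 3] + 0.4 * rnorm(n)
y <- 0.8 * X[, 1] - 0.5 * X[, 3] + rnorm(n)

Z <- scale(X)
pcs <- prcomp(Z)$x                 # scores of the full-cardinality PCs

# Hypothetical sparse loadings: a single variable per component.
A_sp <- matrix(0, 6, 2, dimnames = list(colnames(X), c("Comp1", "Comp2")))
A_sp["v1", "Comp1"] <- 1
A_sp["v3", "Comp2"] <- 1
sps <- Z %*% A_sp                  # scores of the sparse components

# R-squared from regressing y on the first k components, for each k.
cum_r2 <- function(scores, y) {
  sapply(seq_len(ncol(scores)), function(k) {
    summary(lm(y ~ scores[, 1:k, drop = FALSE]))$r.squared
  })
}

r2 <- rbind(PCs = cum_r2(pcs[, 1:2], y), sparse = cum_r2(sps, y))
colnames(r2) <- c("Comp1", "Comp2")
print(round(100 * r2, 1))
```

Because the models are nested, each row can only increase from left to right, as it does in Table 3. Which row comes out higher depends entirely on the data.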
In conclusion, SPCA does a better job of identifying uncorrelated socio-economic characteristics that make a difference in a community, while PCA tends to identify clusters of highly correlated variables sharing those characteristics. The sparse components are also (slightly) better at predicting the crime rate. After all, non-immigrants too may have a low income and live crammed into a house.
I imagine that, on seeing the PCA results, Mr Trump would immediately conclude that recent immigrants who don’t have a good command of English and maybe live in dense housing are criminals.
The data used are quite dated. If anybody has more recent data, please let me know.
* […] Zealand Journal of Statistics 57, 391–429. My R package for SPCA is available on GitHub.