Chances of going to college based on parent’s income

[This article was first published on Decision Science News » R, and kindly contributed to R-bloggers.]

INCOME PERCENTILES AND INCOMES PAINT DIFFERENT PICTURES

[Figure: blank "You Draw It" chart from The Upshot]

The amazing team at the New York Times have a “You Draw It” feature in The Upshot in which readers try their hands at drawing a graph. The graph should show the probability of a child going to college based on their parents’ percentile in the income distribution.

As a cool added feature, after readers made their guesses, they could see the guesses other people had made, shown in shades of red, and compare them to the actual graph.

SPOILER ALERT: You can see the true answer at the bottom of this post. If you want to try your hand at guessing, click through to the NY Times and guess before proceeding.

The true relationship turns out to be linear. One way to interpret this is that for every percentile a parent moves up in the income distribution, the chance that their child goes to college increases by a constant amount, which might seem somewhat surprising. Even the NY Times editors were surprised by this linear relationship, and the data they collected showed that other people were, too.
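To make the "constant amount" concrete, here is a minimal sketch in R. It assumes the slope and intercept that the R code for this post guesstimates from the chart (2/3 of a percentage point per percentile, intercept 27). The function name `college_pct` is ours, for illustration only, and the numbers are eyeballed rather than official estimates.

```r
# Eyeballed linear fit: percent of children attending college
# as a function of parent income percentile (slope and intercept are guesstimates)
college_pct <- function(percentile) 2/3 * percentile + 27

# Predicted attendance at the 10th, 50th, and 90th percentiles
at_percentiles <- college_pct(c(10, 50, 90))
print(round(at_percentiles, 1))

# Moving up one percentile adds the same amount anywhere on the line
step_low  <- college_pct(11) - college_pct(10)
step_high <- college_pct(91) - college_pct(90)
print(c(step_low, step_high))
```

Under this fit, each percentile is worth about two thirds of a percentage point, whether the parent starts near the bottom or near the top.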

Jake Hofman and I wondered “what if people didn’t take the X-axis literally, what if they thought about it as something like log income or income (instead of percentile in the income distribution)?” Percentiles are tricky. They’re buckets with equal numbers of people, but those people can have very different incomes. What would the graph look like if the X-axis were income? Would this relationship be more intuitive to readers?
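To see how uneven those buckets are in dollar terms, here is a rough sketch using a few rows copied from the income table in the R code for this post (bottom-of-percentile incomes, 2010). The data frame and column names are ours, chosen for illustration.

```r
# A few (percentile, dollars) pairs copied from the post's income table
band <- data.frame(
  percentile = c(10, 11, 50, 51, 90, 91, 98, 99),
  dollars    = c(9235, 9832, 42327, 43564, 154131, 160864, 360435, 506553)
)

# Pair each starting percentile with the one directly above it
starts <- band[c(1, 3, 5, 7), ]
ends   <- band[c(2, 4, 6, 8), ]

# Dollars needed to move up exactly one percentile at each starting point
gap <- data.frame(
  from           = starts$percentile,
  to             = ends$percentile,
  dollars_needed = ends$dollars - starts$dollars
)
print(gap)
```

On these rows, one percentile costs a few hundred dollars near the bottom of the distribution and well over $100,000 at the very top, even though each step moves past the same number of people.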

Jake scraped some income percentile data from whatsmypercent.com and we eyeballed the probability data from the chart at the NY Times. This enabled us to look at the probability of going to college based on income, which tells a somewhat different story. In these plots, the size of each point corresponds to the number of people in it.
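The point sizes come from grouping percentiles into income bins and counting how many percentiles land in each bin. A minimal base-R sketch of that idea follows, assuming bins formed by rounding to the nearest $10,000 as in the R code for this post, and using a handful of rows copied from its income table. The variable names are ours.

```r
# Bottom-of-percentile incomes for percentiles 10, 12, 17, 18, 50, 95, and 96
# (copied from the post's income table)
dollars <- c(9235, 10482, 14447, 15064, 42327, 200026, 235687)

# Round each income to the nearest $10,000 to form its bin
dollars_bin <- as.integer(round(dollars / 10000) * 10000)

# Count how many percentiles fall in each bin; this count drives the point size
bin_sizes <- table(dollars_bin)
print(bin_sizes)
```

Since every percentile holds the same number of people, a bin that collects more percentiles represents more people and gets a bigger point.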

Adding $10,000 to a family’s income matters considerably for those earning between $10,000 and $100,000 (which the vast majority of Americans do), and matters much less outside that range. At the same time, it’s considerably more difficult for lower income parents to increase their income by this amount.
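As a back-of-the-envelope check, the sketch below estimates how many percentage points of college attendance $10,000 buys at different income levels. It assumes the eyeballed slope from the R code for this post (2/3 of a point per percentile) and uses rows copied from its income table. The segment choices and variable names are ours.

```r
# Segments between rows of the post's income table:
# starting and ending percentile, with the bottom-of-percentile income for each
seg <- data.frame(
  from_pct     = c(26, 81, 95, 98),
  to_pct       = c(38, 83, 96, 99),
  from_dollars = c(19964, 99424, 200026, 360435),
  to_dollars   = c(29999, 106770, 235687, 506553)
)

# Eyeballed slope: percentage points of college attendance per percentile
slope <- 2/3

# Percentage points gained across each segment, then scaled to a $10,000 change
points_gained <- slope * (seg$to_pct - seg$from_pct)
per_10k <- points_gained / ((seg$to_dollars - seg$from_dollars) / 10000)
print(data.frame(from_dollars = seg$from_dollars, per_10k = round(per_10k, 2)))
```

On these rows, $10,000 is worth roughly 8 percentage points for a family moving from about $20,000 to $30,000, and well under one point for families above $200,000.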

Probability of going to college vs log income:

[Figure: dollars_college_log10]

Probability of going to college vs income:

[Figure: dollars_college]

SPOILER ALERT – BELOW YOU WILL SEE THE ANSWER FROM THE NY TIMES

Probability of going to college vs income percentile:

[Figure: nyt]

R CODE FOR YOUR CODING PLEASURE

#!/bin/bash
#
# scrape_income_dist.sh
#
# Scrape income distribution data from whatsmypercent.com
#
# Output is in incomes.csv (percentile,income)
#
# start at $100 / year
income=100
# loop over all 100 percentiles
for f in {1..100}
do
    # grab the bottom of the next percentile
    income=$(curl --silent "http://whatsmypercent.com/incomeRank.php?income=${income}&status=All+Filers" | grep 'The next percentile begins at:' | awk -F"[<>]" '{print $9}')
    # strip the dollar sign and every thousands separator
    income=${income/\$/}
    income=${income//,/}
    # grab the percentile
    percentile=$(curl --silent "http://whatsmypercent.com/incomeRank.php?income=${income}&status=All+Filers" | grep 'Your percentile is:' | awk -F"[<>]" '{print $9}')
    # strip the percent sign
    percentile=${percentile/\%/}
    echo "$percentile,$income"
# drop rows at percentile 0 and write output to csv file
done | grep -v '^0,' > incomes.csv
#
# we_draw_it.R
#
# Compare various plots of child college attendance by parent income
#
# Inspired by the interactive NYT piece "You Draw It" at http://www.nytimes.com/interactive/2015/05/28/upshot/you-draw-it-how-family-income-affects-childrens-college-chances.html
#
library(ggplot2)
library(scales)
library(dplyr)  # provides %>%, mutate, group_by, and summarize used below
# income distribution data (2010) from scrape_income_dist.sh
incomes <- structure(list(percentile = 2:99, dollars = c(2451L, 4134L, 5184L,
6028L, 6922L, 7626L, 8226L, 8764L, 9235L, 9832L, 10482L, 11366L,
12207L, 12999L, 13732L, 14447L, 15064L, 15736L, 16358L, 16992L,
17659L, 18204L, 18768L, 19375L, 19964L, 20860L, 22013L, 23034L,
23873L, 24675L, 25505L, 26311L, 27033L, 27811L, 28560L, 29306L,
29999L, 30999L, 32188L, 33281L, 34272L, 35295L, 36253L, 37194L,
38051L, 39064L, 39953L, 41113L, 42327L, 43564L, 44769L, 45871L,
46956L, 48095L, 49225L, 50353L, 51922L, 54282L, 57213L, 59670L,
61654L, 63469L, 65192L, 66639L, 68140L, 69658L, 71150L, 72539L,
73866L, 75296L, 77160L, 79838L, 83011L, 85811L, 88317L, 90794L,
93165L, 95174L, 97298L, 99424L, 102060L, 106770L, 117025L, 125260L,
131032L, 136231L, 141453L, 147725L, 154131L, 160864L, 168227L,
177123L, 187412L, 200026L, 235687L, 290860L, 360435L, 506553L
)), .Names = c("percentile", "dollars"), class = "data.frame", row.names = c(NA,
-98L))
# create a column for the percent of children who attend college at each percentile
# (slope and intercept guesstimated from Chetty et al.)
incomes <- transform(incomes, college=2/3*percentile + 27)
# plot college attendance vs parent income percentile
qplot(data=incomes, x=percentile, y=college) +
  xlab('Parent income percentile') +
  ylab('Percent of children who attend college') +
  ylim(c(0,100))
ggsave('percentile_college.png', width=4, height=4)
# plot college attendance vs parent income
qplot(data=incomes, x=dollars, y=college) +
  xlab('Parent income') +
  ylab('Percent of children who attend college') +
  ylim(c(0,100)) +
  scale_x_continuous(labels=comma)
ggsave('dollars_college.png', width=4, height=4)
# plot college attendance vs parent income, with log scale
qplot(data=incomes, x=dollars, y=college) +
  xlab('Parent income') +
  ylab('Percent of children who attend college') +
  ylim(c(0,100)) +
  scale_x_log10(labels=comma)
ggsave('dollars_college_log10.png', width=4, height=4)
# plot college attendance vs parent income, showing population at each income
incomes %>%
  mutate(dollars_bin=round(dollars/10000)*10000) %>%
  group_by(dollars_bin) %>%
  summarize(size=n(), college=mean(college)) %>%
  qplot(data=., x=dollars_bin, y=college, size=size) +
  xlab('Parent income') +
  ylab('Percent of children who attend college') +
  ylim(c(0,100)) +
  theme(legend.position="none") +
  scale_x_continuous(labels=comma)
ggsave('dollars_college_sized.png', width=4, height=4)
# plot college attendance vs parent income, with log scale showing population at each income
incomes %>%
  mutate(dollars_bin=round(dollars/10000)*10000) %>%
  group_by(dollars_bin) %>%
  summarize(size=n(), college=mean(college)) %>%
  qplot(data=., x=dollars_bin, y=college, size=size) +
  xlab('Parent income') +
  ylab('Percent of children who attend college') +
  ylim(c(0,100)) +
  theme(legend.position="none") +
  scale_x_log10(labels=comma)
ggsave('dollars_college_log10_sized.png', width=4, height=4)

The post Chances of going to college based on parent’s income appeared first on Decision Science News.
