Philadelphia loses a TIGER — Bad omen or statistical inevitability?
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Lions, TIGERs, US DOT Funding, oh my.
For the first time in seven years, Philly has lost a TIGER. WHAT?! Not to worry, there are no escapees from the Philadelphia Zoo. No, TIGER has nothing to do with the animal, or the baseball team from Detroit, or the US Census Bureau’s geographic shapefiles. Today we’ll be talking about Transportation Investment Generating Economic Recovery (TIGER). As WHYY’s Newsworks reports, Philadelphia’s winning streak ended this year, coinciding with the recent change in mayoral administrations.
Quick Background on TIGER Grants
TIGER grants are distributed by the US Department of Transportation every year to US cities and municipalities to make — you’ve probably guessed it! — transportation and infrastructural improvements. As the name might imply, the grant was started in response to the Great Recession of ‘07-’09, as a way to generate jobs (hence, there is a stated preference for “shovel-ready” projects) as well as reduce US cities’ energy dependence. For much more detail and history on the grant, see the grant’s website and Wikipedia. The grants are quite competitive. As Newsworks.org reports, about $9.3 billion in applications are submitted, with half a billion dollars ultimately distributed.
Data!
Well, thanks to the open data revolution, the US DOT has kindly published a dataset of all winning projects since the grant’s inception. I’ve taken the dataset and run a quick analysis to see what these grants look like, and where Philadelphia stacks up against other localities. For those in the analytic weeds, I’m using the R programming language for this analysis and will post in some of the relevant code throughout the post. If you’re not interested, skip over the gray code boxes. My full code is in this github repo.
First, let’s look at where Philly stacks up in terms of its award amounts. The graph below shows the distribution of award amounts, and where Philadelphia specifically stands.
## Load in required packages mods=c('data.table','raster','<span class="mceItemHidden" data-mce-bogus="1"><span class="hiddenSpellError" pre="" data-mce-bogus="1">rgeos</span></span>','<span class="hiddenSpellError" pre="" data-mce-bogus="1">ggplot2</span>','<span class="hiddenSpellError" pre="" data-mce-bogus="1">ggthemes</span>','<span class="hiddenSpellError" pre="" data-mce-bogus="1">ggmap</span>','sp','leaflet','<span class="hiddenSpellError" pre="" data-mce-bogus="1">maptools</span>') invisible(<span class="hiddenSpellError" pre="" data-mce-bogus="1">lapply</span>(mods, function(x) require(x,character.only=TRUE,quietly = TRUE))) ## Load in TIGER grant dataset tg=fread('https://www.transportation.gov/sites/dot.gov/files/docs/tiger_allUpdate.csv') ## note to self, tweet <span class="hiddenSpellError" pre="tweet " data-mce-bogus="1">USDOT</span> & tell them not to put spaces in column headings... <span class="hiddenSpellError" pre="" data-mce-bogus="1">setnames</span>(tg, tg[,<span class="hiddenSpellError" pre="" data-mce-bogus="1">gsub</span>('[^[:<span class="hiddenSpellError" pre="" data-mce-bogus="1">alnum</span>:]]','',names(tg))]) ## Create a column for year of award tg[,Year := as.integer(gsub("[^0-9]","",Round))] #<span style="line-height: 1.7;" data-mce-style="line-height: 1.7;"># change award amount to numeric </span>tg[,Amount:=as.numeric(gsub("[\\$,]","",Amount))] # And a column to show whether we're us or not tg[,Phila:=<span class="hiddenSpellError" pre="" data-mce-bogus="1">ifelse</span>(<span class="hiddenSpellError" pre="" data-mce-bogus="1">grepl</span>("*[Pp]hila*|Southeastern Pennsylvania Transportation Authority|*SEPTA*|Center City District",Applicant)==TRUE,1,0)] # Create indicator for Philadelphia tg_p=tg[Phila==1,] ## Plot the data <span class="hiddenSpellError" pre="" data-mce-bogus="1">ggplot</span>(data=tg, aes(x=as.character(Year),y=Amount/<span class="hiddenSpellError" pre="" data-mce-bogus="1">1e6</span>,fill=as.character(Year))) + geom_boxplot() + geom_point(data=tg_p, aes(x=as.character(Year), y=Amount/1e6), size=3,color="darkred")+ geom_text(data=tg_p,aes(x=as.character(Year),y=Amount/1e6,label=paste0('$',round(Amount/1e6,1),'M')),vjust=-1,<span class="hiddenSpellError" pre="" data-mce-bogus="1">fontface</span>=('bold.italic'))+ scale_y_continuous(labels = scales::dollar_format(prefix="$",suffix="M"))+ labs(x="Funding Year", y="Distribution of Amounts in Millions", title="Philly's award amounts vs. all other applicants")+ guides(fill=FALSE)
Since 2010, Philly has earned on average a cool $10.1M per funding round.
By the way, here are Philly’s project names and the project types they fell under:
amt=tg[Phila==1,list(Year,<span class="mceItemHidden" data-mce-bogus="1"><span class="hiddenSpellError" pre="Year " data-mce-bogus="1">ProjectName</span></span>,<span class="hiddenSpellError" pre="ProjectName " data-mce-bogus="1">ProjectType</span>,Amount=paste0('$',Amount/<span class="hiddenSpellError" pre="" data-mce-bogus="1">1e6</span>,'M'))] amt[order(Year),] Year | ProjectName | ProjectType | Amount 2009 | Philadelphia Area Pedestrian and Bicycle Network (PA and NJ) | Bicycle and Pedestrian | $23M 2010 | Dilworth Plaza and Concourse Improvements | Transit | $15M 2011 | IMPaCT Philadelphia | Transit | $10M 2012 | Wayne Junction Substation Replacement | Transit | $12.9M 2013 | SEPTA-CSX Separation Project | Rail | $10M 2014 | Roosevelt Boulevard Multimodal Study | Regional Planning | $2.5M 2015 | Closing the Gaps | Road | $10.3M
And the total number of awards distributed each year:
# count number of awards by year yr = tg[,.N,Year] # plot <span class="mceItemHidden" data-mce-bogus="1"><span class="hiddenSpellError" pre="" data-mce-bogus="1">ggplot</span></span>(data=yr,aes(x=as.character(Year),y=N)) + geom_bar(stat=’identity’,fill=’gray40') + geom_text(aes(y=N,label=N),vjust=1.5,color=’white’,fontface=’bold’)+ labs(x=”Funding Year” ,y=”Number of Applications Awarded” ,title=”Number of successful applications by funding year”)
Okay, Philadelphia has been consistent, but what does it mean that the new Office of Transport and Infrastructure (OTIS) didn’t win an award? It’s a let-down, but does it really reflect poorly on the new mayoral administration? As the article states, although the high-level leadership changed, most of the grant-writing staff probably stuck around. To answer this question, it’s worth looking at how (a)typical it is to win 7 times.
This was a little bit tricky from a data perspective. Applicants can be either municipalities (e.g. the City of Philadelphia), agencies (e.g. SEPTA, PENNDOT), or states (e.g. — yeah…). So identifying the “unique” winners, is kind of confusing. And as the Newsworks article mentions, Philadelphia won in a way, through SEPTA’s relationship with the Delaware Transit Corporation (DTC, which won an award, reimburses SEPTA for certain operating costs). To get a clean and consistent dataset, I overlayed the latitude/longitude points of each award site on a county-level map of the US, which brings the analysis to the county level. The next graph assumes that if a project site landed in a particular county, the award is attributed to or otherwise purposed for that county. The approximation should be close enough for, you know, government work…
## Download the Census county file <span class="mceItemHidden" data-mce-bogus="1"><span></span>download.file('ftp://ftp2.census.gov/geo/tiger/<span class="mceItemHidden" data-mce-bogus="1"><span class="hiddenSpellError" pre="" data-mce-bogus="1">TIGER2016</span></span>/COUNTY/tl_2016_us_county.zip','cty_shp.zip')</span> # Create a folder shp <- readOGR('C:/Users/danieladdyson/Desktop/cty_shp',layer='tl_2016_us_county') # Census county shapefile # turn TIGER lat/long data to spatial points crs = shp@proj4string@projargs pts=SpatialPointsDataFrame(coords=tg[,list(Longitude,Latitude)], data=tg, proj4string=CRS(crs) ) # overlay grant sites on map of US counties sj = data.table(over(pts,shp)) sj=cbind(tg,sj) sj[,NAME:=as.character(NAME)] cty_wins = as.data.table(sj[,table(GEOID)]) wins_summary = cty_wins[,list(freq=.N),by=list(n_events=N)] wins_summary[,rate:=n_events*freq] # marginal probabilities # rate of wins per county cty_wins = as.data.table(sj[,table(GEOID)]) # Number of wins by county -- the counties may be slightly off due to map overlay errors # cty_wins[!duplicated(GEOID),.N] 269 unique counties wins_summary = cty_wins[,list(freq=.N),by=list(n_events=N)] # Number of counties (269) X number of wins (381) wins_summary[,rate:=n_events*freq] # marginal probabilities # mean number of wins mu_wins = wins_summary[,sum(rate)/sum(freq)] # standard deviation of mean number of wins sd_wins = sqrt(mu_wins) sd_rng = c(mu_wins+sd_wins, # 1 standard deviation upper mu_wins+sd_wins*2, # 2 standard deviation upper mu_wins+sd_wins*3 # 3 standard deviation upper) txt=data.frame(lab=c('Average\n# wins','1 Std Dev','2 Std Dev','3 Std Dev'), x=c(round(mu_wins,1),round(sd_rng,1)), y=rep(100) ) ## Exact confidence intervals ex_ci=poisson.test(sum(wins_summary$rate),sum(wins_summary$freq))$conf.int[c(1,2)] ## plot out the counts + SDs & exact CIs ci_pal = brewer.pal(3,'Set1') sd_pal = brewer.pal(6,'Reds') hi_cty = data.frame(cty = c('King, WA','Philadelphia, PA','Los Angeles, CA','Cook, IL'), X=c(6,7,7,8),Y=c(10,22,10,10)) ggplot(data=wins_summary,aes(x=n_events,y=freq))+ geom_bar(stat='identity',color='grey58')+ scale_x_continuous(breaks = c(1:8), labels = c(1:8))+ geom_vline(xintercept = mu_wins,color=ci_pal[2],size=1)+ geom_vline(xintercept = ex_ci,color=ci_pal[3],linetype='longdash',size=.7)+ geom_vline(xintercept = sd_rng,color=sd_pal[c(4:6)],size=1)+ geom_text(data=txt,aes(label=paste0(lab,':\n',x),x=x,y=y,hjust=-.001),size=3.5)+ geom_text(data=hi_cty,aes(x=X,y=Y,label=cty),size=3,angle=35,hjust=-.01)+ labs(x='Number of grants won',y='Number of counties',title= "Statistical deviation of grant wins")
So, Philadelphia has been really, really good at winning, and it looks like that’s a pretty hard feat, given that the over half the counties only won a single award. And this next graph demonstrates the probability of winning 1,2,…7 times:
## calculate probabilities p=cbind(lapply(c(1:8), ppois, lambda=mu_wins, lower.tail=FALSE)) probs=data.frame(p=unlist(cbind(lapply(c(1:8), ppois, lambda=mu_wins, lower.tail=FALSE))),wins=c(1:8)) ## Plot probability data ggplot(data=probs,aes(x=wins,y=p))+geom_line()+ scale_y_continuous(labels = scales::percent,limits = c(0,.45))+ geom_text(aes(label=sprintf(“%1.2f%%”, 100*p)),hjust=-.1,vjust=-.5)+ labs(x=’Number of wins’,y=’Probability of achieving # wins’)+ scale_x_continuous(breaks=c(1:8))
To conclude,
My analysis shows that among the winners, Philadelphia’s winning streak was highly aberrant (in a good way, of course). From a federal perspective, this is probably a good thing, as it would look strange if just a few localities were winning over and over (I have more to show on that below!). And although the loss could have been from some major change in this year’s grant writing procedures or application quality that were consequences of Jim Kenney’s new administration, it would be hard to discern that, given that the city’s winning streak is so highly statistically significant.
I should strongly caution, that this analysis is based on only the grant winners. We don’t have any data on the losers. And because of that, we don’t know how rare any one success is. But considering that $500M in disbursements is only about 5.4% of $9.3B in applications, I’m going to say it ain’t easy. And that could potentially make Philly’s wins even more statistically significant (because you’d be adding a really tall bar at 0 for all the counties who never won). So don’t feel bad, Mr. Kenney! There’s always next year.
Geographic locations of projects
After mapping to the counties to look at probabilities of wins, I aggregated to a higher level, this time US Census Divisions and Regions, to see where projects were getting awarded by year.
## Download the US Census Division shapefile download.file('<a class="markup--anchor markup--pre-anchor" href="http://www2.census.gov/geo/tiger/GENZ2015/shp/cb_2015_us_division_500k.zip%27,%27USCB_div.zip%27" target="_blank" rel="nofollow" data-href="http://www2.census.gov/geo/tiger/GENZ2015/shp/cb_2015_us_division_500k.zip%27,%27USCB_div.zip%27">http://www2.census.gov/geo/tiger/GENZ2015/shp/cb_2015_us_division_500k.zip','USCB_div.zip'</a>) unzip('USCB_div.zip',exdir='div_shp') # unzip & create a folder div <- readOGR('C:/Users/A770213/Desktop/div_shp',layer='cb_2015_us_division_500k') # Census county shapefile crs=div@proj4string@projargs # set coordinate reference system pts=SpatialPointsDataFrame(coords=tg[,list(Longitude,Latitude)], data=tg, proj4string=CRS(crs)) # match points & shapefile’s CRS # div@proj4string@projargs == pts@proj4string@projargs TRUE site_div = data.table(over(pts,div)) site_div=data.table(cbind(pts@data,site_div)) xtabs(~site_div$NAME+site_div$Year) Year Division Name 2009 2010 2011 2012 2013 2014 2015 East North Central 8 9 6 6 4 6 4 East South Central 4 1 2 3 4 4 3 Middle Atlantic 2 9 6 5 4 6 5 Mountain 7 7 3 5 7 9 5 New England 6 7 4 5 6 6 5 Pacific 9 11 7 6 7 10 8 South Atlantic 5 16 7 7 8 14 3 West North Central 5 10 5 5 5 10 4 West South Central 5 4 4 5 6 7 2
I aggregated divisions into Census regions and came out with the following:
tab=site_div[,list(cnt=.N,amt=sum(Amount)),by=list(div_name=NAME,year=Year)] tab[,':='(yr_cnt=sum(cnt),yr_amt=sum(amt)),by=year] tab[,':='(cnt_pct=round(cnt/yr_cnt,3),amt_pct=round(amt/yr_amt,3))] tab_m=melt(tab,id.vars = c('year','div_name'),measure.vars=c('cnt_pct','amt_pct')) tab_m[,variable := factor(variable,levels=c('amt_pct','cnt_pct'),labels=c('Percent of Dollars Awarded','Percent of Awards Given'))] tab_m[,reg_name:=ifelse(div_name %in% c('New England','Middle Atlantic'),'Northwest', ifelse(div_name %in% c('East North Central','West North Central'),Midwest',ifelse(div_name %in% c('South Atlantic','East South Central','West South Central'),'South’,'West')))] tab_m2 = tab_m[,list(value=sum(value)),by=list(year,reg_name,variable)] ggplot(data=tab_m2,aes(x=year,color=reg_name,y=value))+ geom_line(stat=’identity’,size=1)+ scale_x_continuous(breaks=c(2009:2015),labels=c(2009:2015))+ scale_y_continuous(labels=scales::percent)+ labs(x='Funding Year',y='Percent of Total Awarded',title='Regional share of total awards & award amounts', color='US Region')+ theme(axis.text=element_text(size=10,color='black'),axis.text.x=element_text(angle = 90,vjust=.5),strip.text=element_text(size=10,color='black',face='bold'),legend.text=element_text(size=10))+ facet_grid(variable~.)
The South has really been killing it on wins over the past few years, although it looks like there was some kind of serious shift in 2015. Again, without knowing the total number of submissions (i.e. wins vs losses), it’s hard to judge what these numbers really represent. The South has the highest number of states of the Census Regions and thus more potential applicants; it could all just be a matter of scale. Southern states also may have generally lower economies and rely more heavily on federal funding, which could also drive up the number of applicants. There’s a blog post from wallethub that shows a state-level breakdown of federal funding, although we’d need to join states and regions together for an accurate view. For my next data adventure, I’ll have to pull that data and join it to the TIGER funding data.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.