Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Stoltzmaniac is going local in today’s blog post! I dug into the City of Fort Collins open data and published my findings below.
The data was surprisingly clean and laid out in JSON format on Amazon S3. This post shows my exploration of the data along with some observations.
As always, I loaded the required libraries. I also loaded a custom ‘cleaningData.R’ script, which I wrote specifically for this project.
source("cleaningData.R") library(ggplot2) library(lubridate) library(dplyr) library(scales)
I created a custom function to fetch, clean, and merge the data (code provided at the end of this post). I also created columns to help with the time series analysis.
The data is from The City of Fort Collins Website
data = fetchAndCleanData() data$year = year(data$gldate) data$month = month(data$gldate,label=TRUE) data$day = day(data$gldate) data$dayOfWeek = wday(data$gldate,label=TRUE)
When are vendors being paid by the City of Fort Collins?
- The overwhelming majority of payments are made at the end of the month regardless of the year or month in question (Fig. 1 & 2)
- For example, in January and February, between 15-20% of the payments were on the final day of the month (Fig. 2)
- Almost all payments are made during the week (no surprise there) (Fig. 3)
- Strangely, Thursday seems to have fewer payments than other days
- It is substantially lower than other weekdays in both years
- However, there is a relatively simple explanation: the end of the month didn’t land on Thursday very often
- For example, Friday would have been the final business day of the month in 6 out of 12 months of 2014 whereas Thursday was only the last day of 1 out of 12 months in 2014
p = ggplot(data, aes(day, fill = factor(year))) p + geom_bar(position='dodge') + scale_y_continuous(name="Daily Payments", labels = comma) + ggtitle('Fig 1. Payments by Day of Month') + scale_fill_discrete(name = "Year")
p = ggplot(data, aes(day, fill = month)) p + geom_density(alpha = 0.5) + scale_y_continuous(name="% of Payments by Day of Month", labels = percent) + ggtitle('Fig 2. Payments by Day of Month')
p = ggplot(data, aes(dayOfWeek,fill=factor(year))) p + geom_bar(position='dodge') + scale_y_continuous(name="Payment Count", labels = comma) + scale_fill_discrete(name = "Year") + ggtitle('Fig 3. Payments by Day of the Week 2014 vs 2015') + theme(legend.position='bottom')
Which departments spent the most money?
- L&P Operations Service Unit spent almost $100M each year!
- L&P stands for Light & Power
- That’s roughly 4x the next highest spending department (Engineering) (Fig. 4)
df = data %>% group_by(year,DEPTNAME) %>% summarise(payments = sum(glamount)) %>% arrange(desc(payments)) %>% top_n(10) p = ggplot(df,aes(x=reorder(DEPTNAME,payments),y=payments,fill=factor(year))) p + geom_bar(stat='identity',position='dodge') + labs(title='Fig 4. Spending By Department - Top 10',x='') + scale_y_continuous(name="", labels = dollar) + scale_fill_discrete(name = "Year") + coord_flip()
Where was L&P spending all of that money?
- Almost all of the money goes to Platte River Power Authority (Fig. 5)
- After removing this outlier, the rest of the data is easier to visualize (Fig. 6)
- The most interesting observation: “fountain, city of” cost over $1.25M in 2014!
df = data %>% filter(DEPTNAME == 'L&P Operations Service Unit') %>% group_by(year,glvendor) %>% summarise(payments = sum(glamount)) %>% arrange(desc(payments)) %>% top_n(10) p = ggplot(df,aes(x=reorder(glvendor,payments),y=payments,fill=factor(year))) p + geom_bar(stat='identity',position='dodge') + labs(title='Fig 5. Spending By Vendor - Top 10',x='') + scale_y_continuous(name="", labels = dollar) + scale_fill_discrete(name = "Year") + coord_flip()
df = data %>% filter(DEPTNAME == 'L&P Operations Service Unit') %>% filter(glvendor != 'platte river power authority (') %>% group_by(year,glvendor) %>% summarise(payments = sum(glamount)) %>% arrange(desc(payments)) %>% top_n(10) p = ggplot(df,aes(x=reorder(glvendor,payments),y=payments,fill=factor(year))) p + geom_bar(stat='identity',position='dodge') + labs(title='Fig 6. Spending By Vendor - Top 10',x='') + scale_y_continuous(name="", labels = dollar) + scale_fill_discrete(name = "Year") + coord_flip()
Which vendors received the highest average payments?
- The Platte River Power Authority ranks number one (Fig. 7)
- New Flyer of America came in second (Fig. 7)
- New Flyer of America was the vendor for the MAX buses that were purchased in 2015
- New Flyer of America was the vendor for the MAX buses that were purchased in 2015
vendorSummary = data %>% group_by(year,glvendor) %>% summarize( payments = sum(glamount), count = length(glvendor), averagePayment = round(sum(glamount)/ length(glvendor),2)) %>% arrange(desc(averagePayment)) vendorTop10 = top_n(vendorSummary, 10, payments) p = ggplot(vendorTop10, aes(x=reorder(glvendor,averagePayment), y=averagePayment,fill=factor(year))) p + geom_bar(stat='identity',position='dodge') + coord_flip() + labs(title='Fig 7 Average Payments to Vendor - Top 10',x='') + scale_y_continuous(name="", labels = dollar) + scale_fill_discrete(name = "Year") + coord_flip()
After diving into a few different departments and finding relatively unremarkable spending patterns I stumbled across ‘Patrol’ and it piqued my interest.
What vendors had the highest payment totals in the ‘Patrol’ department?
- Redflex Traffic Systems Inc. was the number one vendor (Fig. 8)
- They are a maker of red light cameras
- There is a lot of controversy around this company and red light cameras in general
- The company’s former CEO resigned due to being tied to bribes paid to elected officials
- Chicago and Columbus were the cities in which bribery was being conducted – News Article
- The financial return on these cameras is constantly being scrutinized in the US – News Article – the costs associated with the cameras are relatively high
- Multiple cities, lead by Chicago, are questioning how effective these cameras are at reducing accidents News Article
df = data %>% filter(DEPTNAME == 'Patrol') %>% group_by(glvendor) %>% summarise(payments = sum(glamount)) %>% arrange(desc(payments)) %>% top_n(20) p = ggplot(df,aes(x=reorder(glvendor,payments),y=payments)) p + geom_bar(stat='identity') + labs(title='Fig 8. Payments to Vendor - Top 20',x='') + scale_y_continuous(name="", labels = dollar) + coord_flip()
Conclusion
- Overall, spending appears to align with what we see in Fort Collins
- It will be interesting to see if Redflex continues to be on the payroll after what happened in Chicago & Columbus (not to mention the ROI and efficacy studies)
- There’s not enough data to see a trend or attempt to make a forecast for future spending
- Spending on schools and education has been a big topic in Fort Collins. I’ll attempt to find changes in this as we move forward and the data is updated
Code used in this post is on my GitHub
The fetchAndCleanData() function is below. I had some difficulties with binding the rows upon reading in JSON. Chip Oglesby helped me out, thanks Chip for introducing me to bind_rows!
library(rvest) library(jsonlite) library(gdata) library(dplyr) fetchAndCleanData = function(){ landingPage <- read_html("http://www.fcgov.com/opendata/") links <- landingPage %>% html_nodes('a') %>% html_attr('href') glServiceArea <- fromJSON("https://s3.amazonaws.com/fortcollins-opendata/glservicearea.json") glFund <- fromJSON("https://s3.amazonaws.com/fortcollins-opendata/glfund.json") glDepartment <- fromJSON("https://s3.amazonaws.com/fortcollins-opendata/gldepartment.json") glLedgerLinks <- links[grepl("https://s3.amazonaws.com/fortcollins-opendata/generalledger-",links) == TRUE] data <- fromJSON(glLedgerLinks[1]) for(i in 2:length(glLedgerLinks)){ data <- bind_rows(data, fromJSON(glLedgerLinks[i])) } # Lowercase column names names(data) <- tolower(names(data)) # Lowercase vendor info data$glvendor <- tolower(data$glvendor) data$glexplanation <- tolower(data$glexplanation) # Clean up data types data$glfund <- as.factor(data$glfund) data$glamount <- as.numeric(data$glamount) data$gldeptno <- as.factor(data$gldeptno) data$glserviceareaid <- as.factor(data$glserviceareaid) data$glpo <- as.factor(data$glpo) data$gldate <- as.Date(data$gldate) # Columns are null and can be dropped data$glvisadesc <- NULL data$glvisacardholder <- NULL library(gdata) data = merge(data,glServiceArea,by.x='glserviceareaid',by.y='SERVICEAREAID') data$SERVICEAREANAME = trim(data$SERVICEAREANAME,'right') data = merge(data,glFund,by.x='glfund',by.y='GLFUND') data$GLFUND = trim(data$GLFUND,'right') data = merge(data,glDepartment,by.x='gldeptno',by.y='DEPTNO') data$DEPTNAME = trim(data$DEPTNAME,'right') return(data) }
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.