Data Mining the California Solar Statistics with R: Part III
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Data Mining the California Solar Statistics with R: Part III
Today I want to combine the California solar statistics with information about the annual solar insolation in each county as well as information about the population and median income. These can then be used as predictors in the models I'll build in the next post. Information about the annual solar insolation can be found in a NASA data set, http://eosweb.larc.nasa.gov/sse/. This data set lists the annual insolation incident on a horizontal surface in kWh/m2 /day for the entire globe at a 1 degree latitude/longitude precision. The population and median income data can be downloaded from the American Community Survey which is run by the census bureau, http://www.census.gov/acs/www/.
Annual insolation in California by county
Ideally, I would like to average the annual solar insolation over the area of the county and assign that insolation to the county, however, the insolation data is much too sparse for that; not to mention that finding a way to average anything over those odd county shapes would be very difficult. What I settled on is to assign the annual solar insolation of each county's county seat to the entire county, this should capture most of the variation across the state. To find the latitude and longitude of each county seat I used the Google Maps API that is built into a great package called ggmaps. Once I have the latitude and longitude of each county seat then I just look up the average insolation at that latitude and longitude.
require(ggmap) require(RColorBrewer) require(grid) require(plyr) load("CaSolarII.RData") ##load text file containing solar insolation data avgAnnualInsolation=read.csv("nasaSolarData.txt",sep = " ", skip = 13 , header = T) ##remove monthly totals, only need lat,lon and annual insolation avgAnnualInsolation = avgAnnualInsolation[,c(1,2,15)] ##subset the data so that only california area is included avgAnnualInsolation = subset(avgAnnualInsolation, Lat > 32 & Lat < 43 & Lon > -125 & Lon < -113)
The following code is what I wrote to look up the latitdue and longitude using the Google Maps API
##I saved the CA county seats in this text file, read it into a variable countySeats=read.csv("caCountySeats.csv") ##remove "county" from the end of the county name countySeats$County=tolower(gsub(" County", "" ,countySeats$County )) ##add latitide,longitude and annual solar insolation variables to the county seat data frame countySeats$lat=rep(0,nrow(countySeats)) countySeats$lon=rep(0,nrow(countySeats)) countySeats$insolation=rep(0,nrow(countySeats)) ##This for loop uses the geocode funciton of ggmaps to look up the latitude and longitude of each county seat ##then the instolation from the nasa data set at that lat,lon pair is assigned to the county for ( counties in 1:nrow(countySeats)){ seat = geocode(paste(countySeats$Seat[counties],", CA")) countySeats$lat[counties] = seat$lat countySeats$lon[counties] = seat$lon countySeats$insolation[counties]=avgAnnualInsolation$Ann[avgAnnualInsolation$Lat == round(seat$lat) & avgAnnualInsolation$Lon == round(seat$lon)] print(paste(seat$lat," ",seat$lon,countySeats$insolation[counties])) } ##save the data to csv file so that i only have to do this once write.csv(countySeats,"caCountySeats_wlatlon.csv")
Now that I have the insolation per county I can merge that data with my California map and plot the results.
##load the data countySeats=read.csv("caCountySeats_wlatlon.csv") ##merge the insolation data with our state map from post 1 CASUN=merge(CA,countySeats[,c(2,6)],"County") ##plot the data ggplot(CASUN ,aes(x = long, y = lat,group=group,fill=insolation)) + geom_polygon(colour = "white", size = 0.1) + theme_bw(base_size = 16)+ scale_fill_gradientn(name=expression(paste("kWh/"*m^2*"/day")),colours = brewer.pal(8,"YlOrRd"),limits=c(min(CASUN$insolation),max(CASUN$insolation)))+ ggtitle("Insolation incident on a horizontal surface")+ theme(axis.text.x = element_text( angle = 25,vjust=-0.1))+ coord_fixed(ratio = 1)
As you would expect, southern California gets much more sun that northern California and the coast get's less sun than inland areas because the coast is often more cloudy.
Population and median income in California by county
As I mentioned before, this data is available from the American Community Survey. They have data for each county available from 2009-2013, the data sets from 2007 and 2008 did not include every county. The data was a bit tedious to get. I had to search around to find it and then download by year one year at a time and then combine them in an excel file. I'll put these files in my public dropbox folder so that anyone else interested doesn't have to go through the same tiresome exercise that I did.
##load population by county, data collected from american community survey - census.gov, data available form 2009-2013 pop=read.csv("population.csv") pop2013=subset(pop,year == 2013) ###Susbset out 2013 as an example to plot caPop2013 = merge(CA,pop2013,all.x=TRUE,sort=FALSE, by="County") ##must be sorted after merging for plot to work caPop2013=caPop2013[order(caPop2013$sort),] ##load income by county, data collected from american community survey - census.gov, data available form 2009-2013 income=read.csv("income.csv") income2013=subset(income,year == 2013) caInc2013 = merge(CA,income2013,all.x=TRUE,sort=FALSE, by="County") caInc2013=caInc2013[order(caInc2013$sort),] ##create plots using grid and ggplot2 packages caPopPlot = ggplot(caPop2013 ,aes(x = long, y = lat,group=group,fill=pop)) + geom_polygon(colour = "black", size = 0.1) + theme_bw(base_size = 16)+ scale_fill_gradientn(name="Population",colours = brewer.pal(8,"Blues"),limits=c(min(caPop2013$pop),max(caPop2013$pop)))+ ggtitle("2013 population by county")+ theme(axis.text.x = element_text( angle = 25,vjust=-0.1))+ coord_fixed(ratio = 1) caIncPlot = ggplot(caInc2013 ,aes(x = long, y = lat,group=group,fill=income)) + geom_polygon(colour = "black", size = 0.1) + theme_bw(base_size = 16)+ scale_fill_gradientn(name="Mediann Income ($)",colours = brewer.pal(8,"Greens"),limits=c(min(caInc2013$income),max(caInc2013$income)))+ ggtitle("2013 median income by county")+ theme(axis.text.x = element_text( angle = 25,vjust=-0.1))+ coord_fixed(ratio = 1) ##create a new page for plot grid.newpage() ##tell viewport that you want 2 rows x 3 cols pushViewport(viewport(layout = grid.layout(1, 2))) vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y) ##specify where each plot should be located print(caPopPlot, vp = vplayout(1, 1)) print(caIncPlot, vp = vplayout(1, 2))
I did not realize how concentrated the population in California is around Los Angeles until I saw this plot. Financially, the bay area has the highest median salary.Regions around Los Angeles, San Diego and Sacramento also have elevated incomes compared to the rest of the state.
The effective cost of solar by county
In my last post we looked at how the cost of solar decreased over the duration of this program. Now I would like to calculate what I am calling the “effective cost” which is the cost of the system minus the incentive received. I should also note that the owners of these system would have been available for the federal invesment tax credit (ITC) as well , but the incentive amounts for that are not included in this data set and are thus not inlcuded in my analysis.
##create new vairable for effective cost which subtracts incentive from total cost solarData$Effective.Cost=(solarData$Total.Cost-solarData$Incentive.Amount)/(solarData$CEC.PTC.Rating*1000) ##group installs by quarter and year, calculate average cost for each quarter by year costByQuarter = ddply(solarData, .(year,quarter),summarise, cost=mean(Effective.Cost,na.rm=TRUE)) colnames(solarData)[20]="County" solarData$County=tolower(solarData$County) ##use ddply to group install kW/county/quarter installsByYearCountyQuarter=ddply(solarData, .(year, County, quarter),summarise, installs=length(na.omit(quarter)), Total.kW = sum(na.omit(CSI.Rating))) ##only include 2009-2013, because that is the range we have all the variables for installsByYearCountyQuarter=subset(installsByYearCountyQuarter,year > 2008 & year < 2014) ##Merge our external data sets with the california solar statistics set installsByYearCountyQuarter=merge(installsByYearCountyQuarter,income,c("year","County")) installsByYearCountyQuarter=merge(installsByYearCountyQuarter,pop,c("year","County")) installsByYearCountyQuarter=merge(installsByYearCountyQuarter,costByQuarter,c("year","quarter")) installsByYearCountyQuarter=merge(installsByYearCountyQuarter,countySeats[,c(2,6)],c("County"))
Now that we've got our combined data set together i want to look for correlation between the variables
pairs(installsByYearCountyQuarter[,c(5,6,7,8,9)])
Looking at the correlation plot, the trends seem to be what I would expect; total install kW of PV increases with increasing population, increasing insolation and decreasing cost. The trend between installed kW and income is not as clear as I would have guessed.
Wrapping up
Well, it's been a lot of work through these last 3 posts, but now I'm ready to perform some machine learning algorithms on this data set I've pulled together. I am eager to see how accurately I'll be able to predict the totaled installed residential PV per quarter using these predictors. Check in next time to see the results!
Downloads
population by year/county – https://dl.dropboxusercontent.com/u/18368377/population.csv
median income by year/county – https://dl.dropboxusercontent.com/u/18368377/income.csv
CA county seats – https://dl.dropboxusercontent.com/u/18368377/caCountySeats.csv
This R markdown file – https://dl.dropboxusercontent.com/u/18368377/CaSolarIII.Rmd
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.