Data Mining the California Solar Statistics with R: Part I
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Data Mining the California Solar Statistics with R: Part I
Intro
Today I’m taking a look at the data set available from California Solar Statistics availalbe from https://www.californiasolarstatistics.ca.gov/. This data set lists all the applications for state incentives for both residential and commercial systems, it contains information about the PV (Photovoltaic) system size, location, cost, incentive amount, panel and inverter manufacturer and a lot more, if you want more details check out their web page. I’m interested in taking a look at this data set and seeing what we can learn about residential solar installations in CA.
Getting Started – Cleaning the data
The working data sets have already been screened for input errors but the data still isn’t in the format I am looking for, I would like to see the total installed kW of solar by county by year. Additionally, the input which contains the install county has inconsistent nomenclature. For example, some values for county will say “Los Angles County” while others will just say “Los Angeles”, this is a problem because it will create more counties than exist when I try to group by county. Also, the data currently contains canceled applications, I’m only interested in taking a look at installed systems. In this first block of code I’m going to get the data in a more usable format for my purposes.
##First I'll load the packages I plan on using require(plyr) require(lubridate) require(RColorBrewer) require(grid) require(ggplot2) require(ggmap) ##load data, this may take a min solarData=read.csv(file = "WorkingDataSet_4-15-2015.csv") ## We're interested in the residential data solarData=subset(solarData,Host.Customer.Sector == "Residential") ## same labels are redundant, remove "county" to avoid this using gsub command solarData$Host.Customer.Physical.Address.County=gsub( " County","",solarData$Host.Customer.Physical.Address.County) ##There are two instances where no county was listed, I am removing them here solarData=solarData[solarData$Host.Customer.Physical.Address.County != "",] #remove cancelled applications, we're insteresed installed systems solarData=solarData[solarData$First.Cancelled.Date == "",] #only keep installed applications, if people have filed to receieve their incentives, I am counting it as installed solarData=solarData[solarData$First.Incentive.Claim.Request.Review.Date != "",] #extract install year and month variables solarData$First.Incentive.Claim.Request.Review.Date=as.Date(solarData$First.Incentive.Claim.Request.Review.Date) ##using a package called lubridate with the functions year and month to extract year and month as variables form the data solarData$year=year(solarData$First.Incentive.Claim.Request.Review.Date) solarData$month=month(solarData$First.Incentive.Claim.Request.Review.Date)
Next I want to group the data by county and year. The plyr package is great for aggregating data like this, it can be achieved with just a few lines of code. For more about the plyr package see http://plyr.had.co.nz/.
#Get data by county countyData = ddply(solarData, .(tolower(Host.Customer.Physical.Address.County),year ),summarize, Systems = length(na.omit(year)), Total.Size = sum(na.omit(CSI.Rating))) ##rename column name to "County" from "Host.Customer.Physical.Address.County" this will be important when I want to merge that data set with another one colnames(countyData)[1] ="County"
Now that I’ve got the data set in the format I’m looking for, I’ll use the gglpot2 package to load a county map of CA. Later We’ll be able to plot the solar install data we have per county on this map. I create a variable called “sort” so that after I merge this data set with the solar data I can put it back in the original order from my CA data frame. The merge function doesn’t keep the order of either your data frames. This is important because I won’t be able to plot the data by county correctly if it isn’t in the same order that it came from the ggplot package in.
Loading the California Map from ggplot2
CA=map_data('county') CA=subset(CA,region=="california") CA$sort=rep(0,nrow(CA)) ##create a sort variable so that after we merge the data sets, we can put in proper order to plot for (i in 1:nrow(CA)){ CA$sort[i]=i } colnames(CA)[6] ="County"
I also need to subset the data so that each year is it’s own data set if I want to compare state wide solar install plots by year. To do this I’m merging the county labels from my CA data frame with the county labels from my countyData data frame.
countyData2007=subset(countyData,year == 2007) CA2007 = merge(CA,countyData2007,all.x=TRUE,sort=FALSE, by="County") CA2007=CA2007[order(CA2007$sort),] countyData2008=subset(countyData,year == 2008) CA2008 = merge(CA,countyData2008,all.x=TRUE,sort=FALSE, by="County") CA2008=CA2008[order(CA2008$sort),] countyData2009=subset(countyData,year == 2009) CA2009 = merge(CA,countyData2009,all.x=TRUE,sort=FALSE, by="County") CA2009=CA2009[order(CA2009$sort),] countyData2010=subset(countyData,year == 2010) CA2010 = merge(CA,countyData2010,all.x=TRUE,sort=FALSE, by="County") CA2010=CA2010[order(CA2010$sort),] countyData2011=subset(countyData,year == 2011) CA2011 = merge(CA,countyData2011,all.x=TRUE,sort=FALSE, by="County") CA2011=CA2011[order(CA2011$sort),] countyData2012=subset(countyData,year == 2012) CA2012 = merge(CA,countyData2012,all.x=TRUE,sort=FALSE, by="County") CA2012=CA2012[order(CA2012$sort),] countyData2013=subset(countyData,year == 2013) CA2013 = merge(CA,countyData2013,all.x=TRUE,sort=FALSE, by="County") CA2013=CA2013[order(CA2013$sort),] countyData2014=subset(countyData,year == 2014) CA2014 = merge(CA,countyData2014,all.x=TRUE,sort=FALSE, by="County") CA2014=CA2014[order(CA2014$sort),] ##Data set for the installs statewide CA_comb_data=ddply(countyData,.(year),summarise,CAtotal=sum(Total.Size))
Plotting the data with ggplot2 and grid
Next I want to visualize the data, for this I’ll use two great packages, ggplot2 along with the viewport function of the grid package which makes it easy to make very flexible panel plots.
## create countywide plots for 2007, 2010 and 2013 p2007 = ggplot(CA2007 ,aes(x = long, y = lat,group=group,fill=Total.Size)) + geom_polygon(colour = "white", size = 0.1) + theme_bw(base_size = 16)+ scale_fill_gradientn(name="Installed nkW/year",colours = brewer.pal(8,"YlOrRd"),limits=c(0,max(countyData$Total.Size)))+ ggtitle("2007 by county")+ theme(axis.text.x = element_text( angle = 25,vjust=-0.1)) p2010 = ggplot(CA2010 ,aes(x = long, y = lat,group=group,fill=Total.Size)) + geom_polygon(colour = "white", size = 0.1) + theme_bw(base_size = 16)+ scale_fill_gradientn(name="Installed nkW/year",colours = brewer.pal(8,"YlOrRd"),limits=c(0,max(countyData$Total.Size)))+ ggtitle("2010 by county")+ theme(axis.text.x = element_text( angle = 25,vjust=-0.1)) p2013 = ggplot(CA2013 ,aes(x = long, y = lat,group=group,fill=Total.Size)) + geom_polygon(colour = "white", size = 0.1) + theme_bw(base_size = 16)+ scale_fill_gradientn(name="Installed nkW/year",colours = brewer.pal(8,"YlOrRd"),limits=c(0,max(countyData$Total.Size)))+ ggtitle("2013 by county")+ theme(axis.text.x = element_text( angle = 25,vjust=-0.1)) ## create statewide plot pCA = ggplot(CA_comb_data,aes(x=year,y=CAtotal/1000))+ geom_bar(stat="identity",fill="red")+ xlim(2007,2015)+ theme_bw(base_size=16)+ ylab("CA annually installedn residential solar (MW)")
Below is the code required to generate my plot
##create a new page for plot grid.newpage() ##tell viewport that you want 2 rows x 3 cols pushViewport(viewport(layout = grid.layout(2, 3))) vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y) ##specify where each plot should be located print(p2007, vp = vplayout(1, 1)) print(p2010, vp = vplayout(1, 2)) print(p2013, vp = vplayout(1, 3)) ##plots can take up more that one spot in your layout print(pCA, vp=vplayout(2,1:3))
Looking at the plots by county it’s clear that the applications are more focused in southern CA. In a later post we’ll try and figure out how much of this is due to the high population in southern CA and how much is due to the increased solar insolation. One thing that lookes really strange/surprising to me is the drop in applications in 2014, during this period of time the cost of solar was dropping and the economy was improving. I reached out to the California Energy Comission to ask them about the drop in applications for incentives and they explained that while the CSI (California Solar Incentive) program was intended to last until 2016, the total budget for the program was starting to dry up in 2014 due the larger than anticipated growth in solar. So the drop in applications isn’t related to a decline in PV, just a lack of budget. It would be interesting to see how much the incentives running out changed the number of PV system installs but I haven’t been able to find a source for that data.
In my next post we’ll take a closer look at this data set and see what else we can learn.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.