[This article was first published on Blag's bag of rants, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Usually…when I use R…I try to use it with SAP HANA as well…and the most simple way to make them work is for sure an ODBC connection….fast and simple…Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
First, I went to my SAP HANA Studio to get the query…which is basically…get all the seats (First, Business, Economy) from all flights that happened between 2010 and 2012.
Query from HANA Studio |
---|
SELECT FLDATE, SEATSOCC, SEATSOCC_B, SEATSOCC_F FROM SFLIGHT.SFLIGHT WHERE YEAR(FLDATE) BETWEEN '2010' AND '2012' |
The result will look like this…
But…we want to have all seats summed up in one variable and also organized by month/year. And while we can do that using SQL…that will take away the fun from R…
Getting and formatting data |
---|
library("RODBC") format_date<-function(p_date){ p_date<-as.Date(as.character(p_date),"%Y%m%d") p_date<-format(p_date,"%Y%m") return(p_date) } ch<-odbcConnect("HANA",uid="SYSTEM",pwd="********") result<-sqlQuery(ch,"SELECT FLDATE, SEATSOCC, SEATSOCC_B, SEATSOCC_F FROM SFLIGHT.SFLIGHT WHERE YEAR(FLDATE) BETWEEN '2010' AND '2012'") odbcCloseAll() dates<-result$FLDATE dates<-format_date(dates) result$FLDATE = dates result_agg<-aggregate(cbind(SEATSOCC,SEATSOCC_B,SEATSOCC_F)~.,data=result,FUN=sum) result_total<-data.frame(FLDATE=result_agg$FLDATE,SEATS=result_agg$SEATSOCC+ result_agg$SEATSOCC_B+result_agg$SEATSOCC_F,stringsAsFactors=FALSE) |
First, by using the RODBC library we connect to our SAP HANA Server. Then, we grab the dates and using a custom function, convert them to Month/Year. We do an aggregation to get the sum of all seats and then built a data frame to hold the data.
When we print out the final result…we will realize something that for sure, escaped our eyes before…
For 2010, the months start on April, meaning that from January to March there’s no information…and the same happens to 2012 where the information end on April and from June to December there’s nothing.
As we want to do a Forecasting using Neural Network…this incomplete data is never going to work…so we need to do something to fix it…
One thing to do at least for 2012, it’s the Moving Average…which is basically grab the values from January to April, sum them, divide them by the number of months and then assign this value to May (201205)…then…grab the value from February to May and do the same thing for June…and so on -:)
For 2010 seems look a little bit more complicated…but it’s almost the same…I used something that I like to call Backward Average…despite what the real name is -:P Basically, we grab the values from December to April, sum them, divide the value by the number of months and determine the value for March…and so on…
Let’s see the code…
Moving and Backward Average |
---|
library("RODBC") format_date<-function(p_date){ p_date<-as.Date(as.character(p_date),"%Y%m%d") p_date<-format(p_date,"%Y%m") return(p_date) } moving_average<-function(p_values,year_start,month_start,year_end,month_end){ month<-as.numeric(month_start) - 1 init_month<-"01" if(length(month)==1){ init_date<-paste(year_start,"0",month,sep='') } base_date<-paste(year_start,init_month,sep='') counter<-as.numeric(month_end) - as.numeric(month_start) values<-p_values for(i in 0:counter){ values<-subset(values, FLDATE <= init_date & FLDATE >= base_date) new_value<-floor(mean(values$SEATS)) new_values<-data.frame(FLDATE=paste(year_start,month_start,sep=''),SEATS=new_value) p_values<-rbind(p_values,new_values) month_start<-as.numeric(month_start) + 1 if(nchar(month_start)==1){ month_start<-paste("0",month_start,sep='') } values<-rbind(values,new_values) month<-month + 1 init_month<-as.numeric(init_month) + 1 if(nchar(month)==1){ init_date<-paste(year_start,"0",month,sep='') }else{ init_date<-paste(year_start,month,sep='') } if(nchar(init_month)==1){ base_date<-paste(year_start,"0",init_month,sep='') }else{ base_date<-paste(year_start,init_month,sep='') } } return(p_values) } backward_average<-function(p_values,year_start,month_start,year_end,month_end){ month<-as.numeric(month_start) - 1 init_month<-"12" if(length(month)==1){ init_date<-paste(year_start,"0",month,sep='') } base_date<-paste(year_start,init_month,sep='') counter<-as.numeric(month_start) - as.numeric(month_end) values<-p_values for(i in 0:counter){ values<-subset(values, FLDATE <= base_date & FLDATE >= init_date) new_value<-floor(mean(values$SEATS)) new_values<-data.frame(FLDATE=paste(year_start,month_start,sep=''),SEATS=new_value) p_values<-rbind(p_values,new_values) month_start<-as.numeric(month_start) - 1 if(nchar(month_start)==1){ month_start<-paste("0",month_start,sep='') } values<-rbind(values,new_values) month<-month + 1 init_month<-as.numeric(init_month) - 1 if(nchar(month)==1){ init_date<-paste(year_start,"0",month,sep='') }else{ init_date<-paste(year_start,month,sep='') } if(nchar(init_month)==1){ base_date<-paste(year_start,"0",init_month,sep='') }else{ base_date<-paste(year_start,init_month,sep='') } } return(p_values) } ch<-odbcConnect("HANA",uid="SYSTEM",pwd="********") result<-sqlQuery(ch,"SELECT FLDATE, SEATSOCC, SEATSOCC_B, SEATSOCC_F FROM SFLIGHT.SFLIGHT WHERE YEAR(FLDATE) BETWEEN '2010' AND '2012'") odbcCloseAll() dates<-result$FLDATE dates<-format_date(dates) result$FLDATE = dates result_agg<-aggregate(cbind(SEATSOCC,SEATSOCC_B,SEATSOCC_F)~.,data=result,FUN=sum) result_total<-data.frame(FLDATE=result_agg$FLDATE,SEATS=result_agg$SEATSOCC+ result_agg$SEATSOCC_B+result_agg$SEATSOCC_F,stringsAsFactors=FALSE) result_total<-moving_average(result_total,"2012","05","2012","12") result_total<-backward_average(result_total,"2010","03","2010","01") |
When we run and execute this…we will see that in fact…we now have all the 3 years completed -;)
Now…we can finally use the Forecasting -;)
Neural_Network_Forecasting.R |
---|
library("RODBC") library("forecast") format_date<-function(p_date){ p_date<-as.Date(as.character(p_date),"%Y%m%d") p_date<-format(p_date,"%Y%m") return(p_date) } moving_average<-function(p_values,year_start,month_start,year_end,month_end){ month<-as.numeric(month_start) - 1 init_month<-"01" if(length(month)==1){ init_date<-paste(year_start,"0",month,sep='') } base_date<-paste(year_start,init_month,sep='') counter<-as.numeric(month_end) - as.numeric(month_start) values<-p_values for(i in 0:counter){ values<-subset(values, FLDATE <= init_date & FLDATE >= base_date) new_value<-floor(mean(values$SEATS)) new_values<-data.frame(FLDATE=paste(year_start,month_start,sep=''),SEATS=new_value) p_values<-rbind(p_values,new_values) month_start<-as.numeric(month_start) + 1 if(nchar(month_start)==1){ month_start<-paste("0",month_start,sep='') } values<-rbind(values,new_values) month<-month + 1 init_month<-as.numeric(init_month) + 1 if(nchar(month)==1){ init_date<-paste(year_start,"0",month,sep='') }else{ init_date<-paste(year_start,month,sep='') } if(nchar(init_month)==1){ base_date<-paste(year_start,"0",init_month,sep='') }else{ base_date<-paste(year_start,init_month,sep='') } } return(p_values) } backward_average<-function(p_values,year_start,month_start,year_end,month_end){ month<-as.numeric(month_start) - 1 init_month<-"12" if(length(month)==1){ init_date<-paste(year_start,"0",month,sep='') } base_date<-paste(year_start,init_month,sep='') counter<-as.numeric(month_start) - as.numeric(month_end) values<-p_values for(i in 0:counter){ values<-subset(values, FLDATE <= base_date & FLDATE >= init_date) new_value<-floor(mean(values$SEATS)) new_values<-data.frame(FLDATE=paste(year_start,month_start,sep=''),SEATS=new_value) p_values<-rbind(p_values,new_values) month_start<-as.numeric(month_start) - 1 if(nchar(month_start)==1){ month_start<-paste("0",month_start,sep='') } values<-rbind(values,new_values) month<-month + 1 init_month<-as.numeric(init_month) - 1 if(nchar(month)==1){ init_date<-paste(year_start,"0",month,sep='') }else{ init_date<-paste(year_start,month,sep='') } if(nchar(init_month)==1){ base_date<-paste(year_start,"0",init_month,sep='') }else{ base_date<-paste(year_start,init_month,sep='') } } return(p_values) } ch<-odbcConnect("HANA",uid="SYSTEM",pwd="********") result<-sqlQuery(ch,"SELECT FLDATE, SEATSOCC, SEATSOCC_B, SEATSOCC_F FROM SFLIGHT.SFLIGHT WHERE YEAR(FLDATE) BETWEEN '2010' AND '2012'") odbcCloseAll() dates<-result$FLDATE dates<-format_date(dates) result$FLDATE = dates result_agg<-aggregate(cbind(SEATSOCC,SEATSOCC_B,SEATSOCC_F)~.,data=result,FUN=sum) result_total<-data.frame(FLDATE=result_agg$FLDATE,SEATS=result_agg$SEATSOCC+ result_agg$SEATSOCC_B+result_agg$SEATSOCC_F,stringsAsFactors=FALSE) result_total<-moving_average(result_total,"2012","05","2012","12") result_total<-backward_average(result_total,"2010","03","2010","01") result_total <- result_total[order(result_total$FLDATE),] result_ts<-ts(result_total$SEATS,frequency=12,start=c(2010,1)) fit <- nnetar(result_ts) fcast <- forecast(fit) plot(fcast) |
Let’s take a lot at the generated plot…
As we can see…the prediction for 2013 to 2015 is very low…but that’s due to the fact that 2012 was a very low year…
Greetings,
Blag.
To leave a comment for the author, please follow the link and comment on their blog: Blag's bag of rants.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.