Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I do a lot of my modelling in RStudio hosted on EC2 instances. If you don’t use it, I would highly recommend it. A brilliant tool. Kudos to the RStudio team. I have made a personal and professional pledge to obsessively use version control. I hope to show a quick example of how to use version control in the modelling context, even if you are not tweaking the Linux kernel. I know RStudio is getting version control as an enhancement very soon and I am eagerly awaiting its release, like a virgin on prom night (perhaps, given the nerdy crowd, that could bring up painful memories).
The first and obvious use of version control is to keep track of scripts over time. This is exactly what it is designed to do. Even if you came to R as a non-programmer (like me), it is good to adopt the best parts of the programmer’s toolkit. If you are at all familiar with version control this should make plenty of sense. If you are not, here are some high-level tutorials, better than I could write:
http://chronicle.com/blogs/profhacker/a-gentle-introduction-to-version-control/23064
A more technical introduction to SVN:
A great Git intro (I use SVN professionally and have just started with Git; the jump can be a little steep):
http://library.edgecase.com/git_immersion/index.html
Now to why you are reading this article: the R. So you crunch for an afternoon and add some clever backflip to your R project. You should save it, then flip over to the command line and commit to your repo. This will hopefully get even easier with the aforementioned RStudio enhancement. But what about your models, the actual R objects? If you are just exploring and hacking on a data set, then the longevity of your models is not too important: you got the insight and you keep modelling. However, when you are constantly tweaking and comparing similar models, tracking these changes is incredibly important.
library(randomForest)

seed <- 123
set.seed(seed)
data(iris)  # Load data

start <- Sys.time()
iris_rf <- randomForest(Species ~ ., iris, ntree = 50, norm.votes = FALSE)
end <- Sys.time()
# Elapsed time as a plain number of seconds
time <- as.numeric(difftime(end, start, units = "secs"))

# Save the model
save(iris_rf, file = "~/R/model/iris_rf")

# Now we add it to the SVN repo
system("svn add ~/R/model/iris_rf")

# Build comment and commit
svn.comment <- paste("RandomForest for Iris Data with seed of: ", seed,
                     " and run time of ", time, " Secs", sep = "")
eval(parse(text = as.expression(
  paste("system(\"svn commit -m '", svn.comment, "' ~/R/model/iris_rf\")", sep = "")
)))

# Now get the SVN revision number.
# The awk command grabs the version number; 'intern = TRUE' captures the
# output of the system command in your object instead of sending it to stdout,
# a brilliantly useful tool on a Linux system.
model_ver <- system("svn info ~/R/model/iris_rf | awk '/Last Changed Rev: /{print $4}'", intern = TRUE)
print(model_ver)  # Now we have our revision number
This is clearly a toy example but it illustrates the use. Because I do a lot of database development, something is not useful to me until it is stored in a database. What you store depends on what you are going to use the models for, but I like to store out-of-sample error rates, cross-validation results, the specific parameters used in each model run and, of course, the revision number from the SVN repo. That ties it all together. Now for a further example, say we want to tweak the model.
args1 <- commandArgs(TRUE)[1]
library(randomForest)

seed <- as.integer(args1)  # command-line arguments arrive as strings
set.seed(seed)

system("svn update ~/R/model/")
load("~/R/model/iris_rf")
data(iris)  # Load data

start <- Sys.time()
iris_rf2 <- randomForest(Species ~ ., iris, ntree = 50, norm.votes = FALSE)
end <- Sys.time()
# Elapsed time as a plain number of seconds
time <- as.numeric(difftime(end, start, units = "secs"))

# We are combining the old model with the new model to add trees to the forest
rf.all <- combine(iris_rf, iris_rf2)

# Save the model under its original name so the next run can load() it as iris_rf
iris_rf <- rf.all
save(iris_rf, file = "~/R/model/iris_rf")

# Now we add it to the SVN repo (svn only warns if the file is already versioned)
system("svn add ~/R/model/iris_rf")

# Build comment and commit
svn.comment <- paste("RandomForest for Iris Data with seed of: ", seed,
                     " and run time of ", time, " Secs", sep = "")
eval(parse(text = as.expression(
  paste("system(\"svn commit -m '", svn.comment, "' ~/R/model/iris_rf\")", sep = "")
)))

# Now get the SVN revision number
model_ver <- system("svn info ~/R/model/iris_rf | awk '/Last Changed Rev: /{print $4}'", intern = TRUE)
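The awk pipeline at the end of both scripts can also be done in R itself, which helps if awk is not on the machine. Here is a minimal sketch, assuming `svn info` prints a line of the form `Last Changed Rev: N` (the same thing the awk pattern relies on). The helper name `parse_last_changed_rev` is made up for illustration, and the sample output is canned, not taken from a real repo.

```r
# Hypothetical helper: pull the revision number out of `svn info` output.
# `info_lines` is a character vector, one element per output line,
# e.g. the result of system("svn info <path>", intern = TRUE).
parse_last_changed_rev <- function(info_lines) {
  hit <- grep("^Last Changed Rev: ", info_lines, value = TRUE)
  if (length(hit) == 0) {
    return(NA_integer_)  # no such line: unversioned file or unexpected output
  }
  as.integer(sub("^Last Changed Rev: ", "", hit[1]))
}

# Canned example input, so this runs without an SVN working copy
sample_info <- c("Path: iris_rf",
                 "Revision: 17",
                 "Last Changed Rev: 15")
rev_number <- parse_last_changed_rev(sample_info)
print(rev_number)
```

In the scripts above this would be called as `parse_last_changed_rev(system("svn info ~/R/model/iris_rf", intern = TRUE))`.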
I have added a little tweak to the top of this second script that really opens up the power of this approach.
commandArgs(TRUE)[1]
This function returns the command-line arguments so they can be used within your R script; passing TRUE keeps only the arguments supplied after the script name. Say we saved this file as rf_Iris.R. You could call it from the command line or from within a shell script (where this REALLY gets fun; OK, maybe I have a tainted idea of what’s fun) like:
Rscript ./rf_Iris.R 456
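Arguments always arrive as strings, and the script may be run with no argument at all, so a little checking goes a long way. A minimal sketch follows; `parse_seed` is a hypothetical helper, and the default of 123 is simply the seed from the first script.

```r
# Hypothetical helper: turn the first command-line argument into an integer seed.
parse_seed <- function(args, default = 123L) {
  if (length(args) < 1) {
    return(default)  # no argument supplied: fall back to the default seed
  }
  seed <- suppressWarnings(as.integer(args[1]))
  if (is.na(seed)) {
    stop("seed must be an integer, got: ", args[1])
  }
  seed
}

# In the real script this would be:
#   seed <- parse_seed(commandArgs(trailingOnly = TRUE))
seed <- parse_seed(c("456"))
set.seed(seed)
```

Failing early with a clear message beats a cryptic error halfway through a long batch run.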
Now this example of resetting the seed is a bit of a toy; however, you could pass in the target variable, a model name for rerunning/updating, or parametrize a SQL call. It gets cool, I promise you. It turns R from an interactive scripting tool into a batch process. I will post some really cool shell scripts that allow you to spin up EC2 machines to remotely execute R scripts in the cloud. Combining the randomForest combine technique used above with a couple of EC2 machines, you can build a distributed training ‘cluster’, or, as more accurately described by Mike Driscoll, “Bash Reduce”. At minimum you can expand your computing power from the comfort of your own command line. (It is also a really useful story for picking up girls at the bar. Little-known fact: talk of distributed cloud-based modelling really gets the women.)
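The "reduce" half of that idea can be sketched in a few lines. This assumes each worker has saved exactly one randomForest object to its own file with save(), as the scripts above do. The helper names `load_saved_object` and `combine_saved_forests` are made up for illustration, and the randomForest package must be installed before `combine_saved_forests` is called.

```r
# Hypothetical helper: load() a file written by save() and return the
# first object in it, without touching the global environment.
load_saved_object <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)  # load() returns the object names
  get(obj_names[1], envir = env)
}

# Hypothetical reduce step: merge the forests saved by several workers.
# randomForest::combine() accepts any number of forests as arguments.
combine_saved_forests <- function(paths) {
  forests <- lapply(paths, load_saved_object)
  do.call(randomForest::combine, forests)
}
```

Loading into a private environment means it does not matter what each worker named its model object.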
The eval(parse()) construct below can be used to dynamically build R statements from command-line arguments. I also like the three-line split because you can just run the paste command on its own to see what would run. It is totally kludgey, though: the more complicated the statement, the more difficult it gets to debug. If someone has a more elegant solution I would love to hear it.
eval(parse(text = as.expression(
  paste("system(\"svn commit -m '", svn.comment, "' ~/R/model/iris_rf\")", sep = "")
)))
A bit of WARNING: this type of concatenated command building should set your DBA/SysAdmin alarms going off. You should not expose any script this hacky to a web server or let it be run by a potentially malicious soul on your machine. It is just asking for a string injection attack. If you are planning to access these scripts through rApache or make them public-facing, you should NOT be concatenating system commands. If you are skilled enough to do that, then you probably don’t need me to tell you.
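One less kludgey option, offered as a sketch and not a complete fix: skip eval(parse()) entirely and hand the arguments to base R’s system2(), quoting each one with shQuote(). On Unix-alikes system2() still pastes its arguments into one command line, so the quoting matters. The names `svn_commit_args` and `svn_commit` are hypothetical helpers, not part of any package.

```r
# Hypothetical helper: build the argument vector for `svn commit`.
svn_commit_args <- function(message, path) {
  c("commit",
    "-m", shQuote(message),
    # Expand "~" ourselves: once quoted, the shell will not do it
    shQuote(path.expand(path)))
}

# Hypothetical wrapper that actually runs the commit.
svn_commit <- function(message, path) {
  system2("svn", svn_commit_args(message, path))
}

# Inspect what would run, without running it
msg <- "RandomForest for Iris Data with seed of: 123"
print(svn_commit_args(msg, "~/R/model/iris_rf"))
```

This keeps the "print it first to see what would run" habit, and a stray quote in the commit message no longer ends the command early. It narrows the injection risk but is no substitute for keeping these scripts off public-facing servers.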
Cheers