
R in production systems

[This article was first published on Erehweb's Blog » r, and kindly contributed to R-bloggers.]

R is great for prototyping models.  Not so great if those models have to run a business.  Here are some tips to help with that:

  1. Validate, alert, and monitor
  2. Sink
  3. Use 64-bit Linux
  4. Write your own functions
  5. tryCatch

Validate, alert, and monitor:  Sooner or later something is going to go wrong with your model.  Maybe some parameter will get the wrong sign and it will recommend selling iPads for a nickel.  You can guard against this by constrained optimization, but really you need to have an automated check on any results before they go into production.  If model results change a lot between runs, you should be automatically notified.  And even if the model is running fine, you should produce summaries of its performance, and how it’s changing over time.  To email yourself the string message_text with subject my_subject in Unix, do:

# shQuote() protects quotes and other shell metacharacters in the text
string_to_execute <- paste("echo -e ", shQuote(message_text), " | mutt -s ", shQuote(my_subject), " erehweb@madeupemail.com", sep = "")
system(string_to_execute)
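
As a rough sketch of what such an automated check might look like, the function below compares the coefficients from the current run with those from the last accepted run. The function name, the 50% threshold and the example numbers are all made up for illustration; a real check would use whatever rules make sense for your model.

```r
# Hypothetical helper: returns a character vector of problems (empty if all is well).
# max_rel_change and must_be_negative are illustrative assumptions, not fixed rules.
check_coefficients <- function(new_coefs, old_coefs,
                               max_rel_change = 0.5,
                               must_be_negative = character(0)) {
  problems <- character(0)

  # Sign check: e.g. a price coefficient that must never be zero or positive.
  for (name in must_be_negative) {
    if (is.na(new_coefs[name]) || new_coefs[name] >= 0) {
      problems <- c(problems,
                    paste0(name, " should be negative but is ", new_coefs[name]))
    }
  }

  # Stability check: relative change since the last accepted run.
  shared <- intersect(names(new_coefs), names(old_coefs))
  rel_change <- abs(new_coefs[shared] - old_coefs[shared]) /
    pmax(abs(old_coefs[shared]), 1e-8)
  for (name in shared[which(rel_change > max_rel_change)]) {
    problems <- c(problems,
                  paste0(name, " changed by ", round(100 * rel_change[name]),
                         "% since the last run"))
  }

  problems
}

# Made-up numbers: the price coefficient has flipped sign between runs.
old_coefs <- c(intercept = 10, price = -2.0)
new_coefs <- c(intercept = 10.5, price = 0.3)

problems <- check_coefficients(new_coefs, old_coefs, must_be_negative = "price")
if (length(problems) > 0) {
  # Email this text to yourself, and hold the results back from production.
  message_text <- paste(problems, collapse = "\n")
}
```

The point of returning a list of problems, rather than stopping at the first one, is that the alert email can then describe everything that looks wrong with the run.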

Sink: When things go wrong, you’ll need to debug your code.  Make it easier by writing all R output from print, cat and errors to a log file – e.g.

log_file <- file("/mydir/my_log_file.txt", open = "wt")    # Open for writing before sinking to it
sink(log_file)                      # Output from print and cat
sink(log_file, type = "message")    # So you catch the errors as well

# Your code goes here

sink(type = "message")
sink()
close(log_file)

If you want to get fancy, you can build the date/time into the file name:

log_time <- Sys.time()

# e.g. "031512_0930" for 9:30 on 15 March 2012
file_suffix <- format(log_time, "%m%d%y_%H%M")

log_file <- file(paste("/mydir/my_log_file_", file_suffix, ".txt", sep = ""), open = "wt")
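
One thing the pattern above does not handle is code that fails part-way, which leaves the sinks in place. A possible way round that, sketched here under the assumption that you are happy to wrap your code in a function, is to undo the sinks with on.exit. The name run_with_log is made up, and the temporary file stands in for a real log path.

```r
# Hypothetical wrapper: run `code` with output and messages diverted to log_path.
run_with_log <- function(log_path, code) {
  log_file <- file(log_path, open = "wt")
  sink(log_file)
  sink(log_file, type = "message")

  # Runs when the function exits, whether normally or because of an error.
  on.exit({
    sink(type = "message")
    sink()
    close(log_file)
  })

  # `code` is only evaluated here, after the sinks are in place.
  force(code)
}

log_path <- tempfile(fileext = ".txt")   # stand-in for a real log file
run_with_log(log_path, {
  cat("starting run\n")
  message("a diagnostic message")
})

log_lines <- readLines(log_path)
```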

Use 64-bit Linux: R is bad at memory management.  It keeps every object in RAM and often copies an object when you modify it.  You can try to use smaller structures, call the garbage collector (gc), and rm structures you no longer need, but the best solution is to run it on 64-bit Linux with lots of memory.  Anything else is gambling.
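
For what it's worth, here is a small sketch of the housekeeping mentioned above: checking that R is a 64-bit build, seeing how big an object is, and freeing it once you have what you need. The object names and sizes are made up for illustration.

```r
# TRUE on a 64-bit build of R (pointers are 8 bytes wide).
is_64_bit <- .Machine$sizeof.pointer == 8

# A large intermediate object: one million doubles.
big <- matrix(rnorm(1e6), ncol = 100)
print(object.size(big), units = "MB")

# Keep only the small summary you actually need...
col_means <- colMeans(big)

# ...then drop the large object and run the garbage collector.
rm(big)
invisible(gc())
```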

Write your own functions: R has a lot of functions.  Unfortunately, many of them are buggy.  For example, here’s what the author of bayesglm has to say about it and glm:

… glm(), and its workhorse, glm.fit(), are a mess: They’re about 10 lines of functioning code, plus about 20 lines of necessary front-end, plus a couple hundred lines of naming, exception-handling, repetitions of chunks of code, pseudo-structured-programming-through-naming-of-variables, and general buck-passing. I still don’t know if my modifications [to produce bayesglm] are quite right–I did what was needed to the meat of the function but no way can I keep track of all the if-else possibilities.

Do you really want that code in a production system?  Copy it and call it my_glm or my_bayesglm.  That way it’s under your control, and will be easier to debug and fix.
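
One possible way to take such a copy is sketched below, using glm.fit as the example. The file location is a stand-in (tempdir() rather than a real project directory), and resetting the function's environment is an extra step I am assuming you will need for functions like this one that call internal helpers from their package.

```r
# Take a private copy of the fitting function from the installed version of R.
my_glm_fit <- stats::glm.fit

# Write its source to a file you can keep under version control and edit.
src_file <- file.path(tempdir(), "my_glm_fit.R")
dump("my_glm_fit", file = src_file)

# In the production script: load your copy rather than the installed function.
source(src_file)

# The copy still refers to internal helpers from the stats package,
# so point it back at that namespace.
environment(my_glm_fit) <- asNamespace("stats")

# glm() accepts a user-supplied fitting function through its method argument.
fit <- glm(dist ~ speed, data = cars, method = my_glm_fit)
```

Once the file is under your control, an upgrade of R no longer changes the fitting code underneath you, though you also stop getting any upstream fixes.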

tryCatch: Wrap your code in tryCatch.  That way, if you do run into an error, you can at least send yourself a nice email saying what went wrong and where, which is a little more elegant than just relying on your log file.
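
A minimal sketch of the idea, assuming you split the run into named steps so the alert can say where things broke. The notify and run_step functions are made up; here notify only writes a message so the example runs anywhere, and in practice you would replace its body with the mutt call from earlier.

```r
# Hypothetical notifier: swap the body for your email command.
notify <- function(subject, body) {
  message(subject, ": ", body)
}

# Run one named step. On error, report which step failed and why,
# then return NULL so the caller can decide whether to carry on.
run_step <- function(step_name, expr) {
  tryCatch(
    expr,
    error = function(e) {
      notify(paste("Model run failed in", step_name), conditionMessage(e))
      NULL
    }
  )
}

# A step that works, using the built-in cars data set...
fit <- run_step("fit model", lm(dist ~ speed, data = cars))

# ...and one that fails, which triggers the notification.
bad <- run_step("load data", stop("input file is missing"))
```

Because R evaluates function arguments lazily, the expression passed to run_step only runs inside the tryCatch, so its errors are caught there.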

So should you use R in a production system?  Well, it’s free, and quick to develop in, so go ahead, but definitely keep your eyes open.

