Adopting R for experienced developers
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I am sure some people will disagree with at least some of my views here. For many of these points there may be no technically correct answer; this is a subjective outlook, so keep in mind it is just one person’s view.
A brief history of R, and why it has become so popular in certain circles
Should I use R?
I made a small flowchart on draw.io to assist with this question.
General Points
- Get some books. The two I would recommend are the R Cookbook by Paul Teetor and Advanced R by Hadley Wickham. The latter in particular is the type of documentation people with programming experience will value. If you try to get yourself up to speed with just the built-in documentation, you are making things a lot harder for yourself.
- The principle of least surprise does not apply. The type system and scoping rules take some getting used to, and the way R works might be quite different from what you are used to.
Don’t assume anything and be sure to check details. If you do any serious work with R you will likely run into weird type or scope related errors at some point. Hadley’s Advanced R book is very valuable here.
- R will not teach you statistics. For example, if you don’t know what variance or homoscedasticity is and why it matters (as I didn’t when I was starting out), you will find limited value in R or any other statistics/machine learning software.
I had the luxury of being able to go off and do an MSc in Math/Stats, which I whole-heartedly recommend if possible. Nowadays there are some great books and MOOCs. There’s no quick or easy answer to this one unfortunately. You can learn statistics with R, but you can’t expect R to fill in the blanks for you.
- Sometimes R tries to be helpful. Statisticians probably appreciate this, but programmers probably won’t. For example, in R versions before 4.0.0 it automatically converts strings to the factor data type by default when reading in data. It may also do implicit type conversion where other languages would generate an error. Understand the basic types and how to check your assumptions about types. I wrote about this a bit here.
- Defensive programming is not common. Many packages do not check that data is in the format or type they expect. This can lead to weird and/or inscrutable error messages.
There are some built-in debugging tools that can help, but you might need to dive into the source now and then to find out what’s going on. I describe some here, and there is some official documentation here.
- Think in terms of tables, vectors and lists. At a high level you are doing statistics on data, and this is generally what R is expecting. I found working with R got a bit easier once I made this cognitive shift.
- Learn to use apply and friends. I wrote an intro to this here. It is a more “functional” way of working. You have your data as a table or whatever, and you apply functions to the rows and columns.
- People use ‘.’ in variable names. This is annoying but the best advice is to just get over it. I believe this is becoming less prevalent, but you will likely still come across it.
- The OO system is a bit of a mess. There are three different types of OO in R (S3, S4 and Reference Classes). Yes, really. The whole way it works feels a bit like smoke and mirrors, and I am not really a fan. Reference classes are probably the closest conceptually to OO in other languages.
I personally would not develop a big OO system in R; however, other people certainly have, and have been successful with it. Again, Hadley’s book is a great resource.
- Read R Bloggers. This is a great resource where people write and share code about all sorts of cool stuff. Search for package names or technologies (like “hadoop” or “ec2”) to get going.
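To make the points above about types, checking assumptions and the apply family a bit more concrete, here is a minimal sketch using only base R. The inline CSV, the variable names and the mean_score helper are all made up for illustration; treat it as one way to poke at these behaviours, not a definitive guide.

```r
# Illustrative only: the inline CSV and all names below are made up.

# 1. Strings vs. factors when reading data. Passing stringsAsFactors
#    explicitly avoids depending on the default of your R version.
csv_text <- "name,score\nalice,1\nbob,2"
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_factor$name)  # "factor"
class(as_char$name)    # "character"

# 2. Implicit conversion: mixing types in a vector silently coerces
#    everything to the most general type instead of raising an error.
mixed <- c(1, "2", TRUE)
class(mixed)           # "character"

# 3. Check assumptions explicitly rather than trusting the input.
mean_score <- function(scores) {
  stopifnot(is.numeric(scores))  # fail early with a clear message
  mean(scores)
}
mean_score(as_char$score)        # 1.5

# 4. apply and friends: run a function over rows, columns or list elements.
m <- matrix(1:6, nrow = 2)       # filled column by column
row_sums  <- apply(m, 1, sum)    # 9 12
col_means <- apply(m, 2, mean)   # 1.5 3.5 5.5
sizes <- sapply(list(a = 1:3, b = letters), length)  # a = 3, b = 26
```

Calling mean_score("x") stops with an error at the stopifnot line, which is usually easier to diagnose than a failure several calls deeper inside a package.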
Packages
- The package system is really good. There are all sorts of packages that do all sorts of things. In general, the distribution and update mechanisms all work pretty well.
Packaging is a hard and thankless task. The big Linux distros might have hundreds of people solely dedicated to packaging and distribution, and R does a very good job of it with relatively few people. The task views provide an overview of the packages available for various types of tasks (time series, machine learning and so on).
- Often the answer to “how do I do x” is “install package y to do it for you.” Sometimes the way base R works can seem a bit convoluted or difficult. Usually, someone has written a package that makes it a lot easier. Just install the package, use it and move on.
- You will end up using a lot of packages. This used to really bother me. In the enterprise world, using third party packages usually required a lengthy approval process as lawyers checked the licensing and potential IP conflicts, as well as a separate process to get things installed and deployed to production by the admin teams.
If you are working in such an environment, be aware you are likely going to end up using a bunch of packages. Thankfully the packages mostly use standard free licenses, which may ease the process somewhat.
Another consideration is that this can make dependency management a bit involved. Typically though, most packages are quite small and do only a few specific things.
- Package code quality can be variable. Many of these are developed by academic statisticians. As a rule these are smart people but their code might not live up to your internal standards for software engineering practice. Many of the packages are excellent, and most that you are likely to use are very stable.
This can become more of an issue as you get into the more obscure areas of statistics, at which point you might want to look over the package source before committing to it.
- Embrace the “Hadleyverse.” Hadley Wickham has written a bunch of great packages that are very useful when working with R day to day. He is now a member of the R core. Someone else wrote more about the Hadleyverse here.
- If you are doing machine learning, use caret. In a nutshell, it provides an abstraction over the various machine learning algorithms, and a whole bunch of useful stuff for model building, tuning and evaluation. It has good docs, and there is also a great book, Applied Predictive Modeling, that you should check out if you are doing ML in R. I reviewed it here.
- Rcpp lets you easily use snippets of C++ in R. It’s really cool. I wrote a cheatsheet for common linear algebra related R operations here. Some people say R is slow, and there is a small element of truth there, but in general I feel if someone complains that a language is slow they should probably write better programs and/or buy a faster computer. Rcpp can help though. There is a good book by the package author (who is also an R core developer now) available as well.
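On the licensing and dependency point above: if you do have to justify a pile of packages to an approval process, base R can at least tell you what is installed and under which license. The following is a rough sketch under that assumption; the variable names are made up, and the output depends entirely on your own library.

```r
# Illustrative sketch: list installed packages with version and license.
# installed.packages() returns a character matrix; drop = FALSE keeps it
# a matrix even if only a single package were installed.
pkg_matrix <- installed.packages()[, c("Package", "Version", "License"),
                                   drop = FALSE]
rownames(pkg_matrix) <- NULL  # row names would just repeat the package names
pkgs <- as.data.frame(pkg_matrix, stringsAsFactors = FALSE)

head(pkgs)

# Count how many installed packages use each license string.
license_counts <- sort(table(pkgs$License), decreasing = TRUE)
head(license_counts)

# Details for a single package (the base "stats" package is always present).
packageDescription("stats")$License
```

A table like pkgs is a reasonable starting point for a conversation with whoever approves third party software, since it shows at a glance how few distinct licenses are usually involved.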
Outro