Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
By Andy Nicholls, Head of Consulting
As companies increasingly look beyond the scope of what is logistically possible in Excel more and more companies are approaching Mango looking for help with connecting to Excel from R. With over 6,500 packages now on CRAN it should come as no surprise that there are quite a few packages that have be written in order to connect to Excel from R. So which is the best? Unfortunately it really does depend on what you want to do but here’s a quick guide to some of the main packages available from CRAN.
The “All-Rounders”
Overview
There are four “all-rounder” packages that can both read and write to Excel: XLConnect written and maintained by Mirai Solutions; xlsx by Adrian Dragulescu; openxlsx by Alexander Walker; excel.link by Gregory Demin. With each of these packages it is possible to follow a structured workflow in which a user connects to a workbook, extracts data, processes the data in R and then writes back data, graphics or even Excel formulae back to the workbook. It is possible to add new sheets, recolour, add formulae and so on. There are however some important differences that users should be aware of.
Formal Comparison of “All-Rounders”
At Mango we regularly run tests to assess developments around these packages in order to assess their suitability for various projects. The test is fairly straightforward and simply involves connecting to a 4MB file, reading in some different data formats and writing back data and graphics to the spreadsheet. The results are shown in the table below.
General Observations
For the general consumer I personally think XLConnect is the best all-rounder though functionally there’s not much difference between this and xlsx. Both XLConnect and xlsx depend on rJava but the Java elements are hidden away from the user with XLConnect and primarily for that reason I have a slight personal preference for XLConnect. If your workbook is full of “=sum()” or other formulae then it is also worth noting that XLConnect will read in the result of the calculation whereas xlsx interprets the formula as NA. Personally I am not a big fan of the XLConnect header formatting when writing data to Excel however. If you’re an xlsx pro user though there’s certainly no reason that I can think of to drop the package and switch to XLConnect.
One potential downside of both XLConnect and xlsx is their Java dependency. Users with medium to large workbooks may soon encounter “Java heap space” error messages as memory is consumed in huge quantities. Those working for companies where IT have ultimate rule over your laptop may also find it difficult to get set up in the first place by ensuring that Java is installed and available in the right location. This is essentially a binary hurdle that you can either overcome or you can’t. If you can it’s worth continuing as these are geat packages.
openxlsx is fairly new and a package that I hadn’t tried until recently. I have to say I’ve been impressed with what I’ve seen thus far. Crucially it does not have a dependency on Java though Windows users will need a zip application (eg Rtools). The functionality and consistency is not quite at the level of XLConnect or xlsx yet. For example the read.xlsx has startRow argument but no startCol. You can use the argument cols to specify start and end columns but it feels more like a workaround in my opinion. Dates are read in as numeric by default as well which most users will find frustrating (unlike XLConnect or xlsx which convert dates to POSIXct). Plus it’s noticabley slower than the aforementioned packages for large workbooks. Despite all of that however, it is the only all-rounder package that I’ve used that can easily connect to and import a protected sheet from a large “xlsm” test file that we have at Mango, pumped full of VBA. And for what it’s worth it is also the only all-rounder that interprets a “#NUM” in Excel as NaN in R; XLConnect and xlsx read such values as NA whilst excel.link fails with an error. Graphics may also be written directly to Excel without having to generate an intermediate file (though often an intermediate file can be a useful output).
That leaves excel.link. I left this one until last as it’s very different from the other three all-rounder packages. excel.link uses RDCOMClient and any edits to a workbook are live edits in an open workbook, making it easier to develop a script. It also passes Mango’s “xlsm” test, albeit only after the protected sheet is unprotected. However it’s quite tough to pick up for the lay-user and for some reason it doesn’t appear to be able to read hidden sheets. Speed is also an issue that users will notice if their script is particularly long and the workbook large. That said it certainly offers something different to the other packages however and if you learn it well then it’s a powerful ally.
Reading Structured Data from Excel
The all-rounders are great but if you are fortunate enough to have structured data, which in this context means your datasets are stored in the top left-hand corner of separate tabs, then there are a few other options to consider which may be much faster than the ‘all-rounder’ packages for reading in data. The multi-purpose RODBC is very mature and really easy to use but some users can be limited by their driver set-up. RJDBC is a viable, albeit slower alternative that has it’s own (Java) restriction. Hadley Wickham has a habit of finding problems that need solving and for those who are struggling with either of these packages readxl is the new kid on the block for structured data that everyone will probably be using by the end of the year.
What Else is There?
There are further specific packages available such as Guido van Steen’s dataframes2xls package which uses Python’s pyexcelerator to write to xls and Marc Schartz’s WriteXLS which uses Perl to write to xls and xlsx files. Another Perl implemntation is Gregory Warnes’s gdata which can be used to read data from Excel. And the list goes on but I have to stop writing at some point!
Conclusion
If you’ve found a good package stick to it! If you’re starting out it’s worth considering the structure of your data and the end users of your code. Are they going to have all the freedom you have to configure Java, install drivers and so on? There are plenty of packages available and most of them are pretty good so long as you understand their limitations.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.