R in Business Intelligence
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Business Intelligence (BI) can be simply described as extracting useful informations from the data. This is quite a broad process as the source data structure (and quality) can vary, as well the useful information structure can vary. More technically process of such transformation can be described as ETL (extract, transform, load), plus presentation of the useful information.
The idea to review usefulness of R in those processes came after I saw Wes McKinney's slides about python pandas and BI: Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013).
I recommend to preview the slides, just for better understanding where pandas is (been in 2013) in BI. Python pandas is presented more or less as BI ready processing engine.
R in BI
In my opinion R is fully capable (even more than pandas) to serve as engine for BI related processes.
R has naturally (developed for decades) broad range of statistical tools available (multiple repositories with thousands of packages). I will skip this enormous feature of R and just focus on simple BI case of extraction, transformation, loading and presentation.
Below are listed packages which directly address the steps in basic BI process.
Extraction
- DBI – native database drivers for multiple vendors, top performance.
- RODBC – ODBC database driver connection.
- RJDBC – JDBC database driver connection.
- data.table's fread – very fast csv files reader.
- tons other packages to support different format of data (e.g.: xlsx, xml, json, sas, spss, stata).
Good overview of R database interfaces available on Burns Statistics.
Transformation
- data.table – powerful data transformation tool, uses
from[ where, select|update, group by ][...]
syntax. - dplyr – also powerful, but less scalable, data transformation tool, uses
from %>% where %>% group by %>% select %>% ...
syntax. Pivot and unpivot (cast and melt) are located in tidyr package.
Loading
- DBI – native database drivers for multiple vendors, top performance.
- RODBC – ODBC database driver connection.
- RJDBC – JDBC database driver connection.
Presentation
Presentation of useful information is totallly different task than ETL process, it can be easily outsourced to any BI dashboard tool by simply populating the data structure expected by particular tool. Yet when using R you don't even need to push prepared data to external presentation tool. You can produce a web application dashboard directly from R.
- shiny – Web Application Framework for R.
- opencpu – HTTP API to R.
- httpuv – HTTP and WebSocket server library, also the core of shiny package.
- Rook – web server interface for R.
Using mentioned packages you are capable to host interactive web applications. Those can generate interactive plots, interactivly query the data. All is working on engine of R session, can execute R functions so it can also utilize huge plotting and statistical analysis capabilities available in any R packages. As of now I haven't heard of any open source end-to-end dashboard product which could allow to simply populate data marts and query against them in shiny, but as you can see in Extras section below it is not so hard to build a fully customized one. Yet there exists some big dashboard products for R, e.g. Mango Navigator, but I'm not sure how well they can cover BI needs.
Extras
Packages mentioned above are core “engines” for each of the section, there are multiple packages which simply wraps functions in those packages, unify or simplify its usage. Those are listed below, surely not all which exists:
- ETLUtils – read (only) data from multiple database. Unify DBI, RODBC, RJDBC calls.
- db.r – database schema explorer.
- dplyr – it can use DBI database connections behind the scene.
- sqldf – helper to operate on R data.frame using SQL statements, not so efficient as data.table or dplyr, yet may be quite useful for SQL folks.
- dwtools – data warehouse related functions. Database queries details logging (statement, nrow in, nrow out, timing in ns, etc.), processing details logging. Unify DBI, RODBC calls. (self-promotion)
- shinyBI – shiny application as package, can perform interactive aggregation on user provided dataset (using data.table as backend for top performance), and interactive plotting of the aggregated data. (self-promotion)
The sentence from pandas slides “Stop believing in the 'one tool to rule them all'” is very true, but if we consider R as container for tools, which are the packages, then we might end up with one container to rule them all.
Crucial factors for business to adopt R
Performance and scalability
Extraction / Loading
I doubt if DBI performance of extraction or loading can be easily beaten. Still the most common interface for ETL tools is ODBC connection which is dramatically slower than native drivers available via DBI. Similarly data.table's fread (fast read for csv files) can be hard to beat.
Detailed benchmark of DBI-compliant package ROracle is available at Oracle blogs:
- R to Oracle Database Connectivity: Use ROracle for both Performance and Scalability – Jun 2013 – ROracle (DBI) vs RODBC vs RJDBC.
Transformation
This may suprise you but R beats python pandas in terms of performance. There are benchmarks in the web which may state the opposite, but in fact they are old, or they were made against the bug in the opposite tool.
Linking the recent benchmarks of R vs pandas:
- Grouping benchmark up to 2 000 000 000 rows by Matt Dowle – Oct 2014 – grouping benchmark.
- Very basic benchmark by Szilard Pafka – Dec 2014 – among others: join benchmark.
- R beats Python! R beats Julia! anyone else wanna challenge R? by Norman Matloff – May 2014 – R vs Python vs Julia review.
Presentation
Measuring performance of web apps seems to be totally different story which I will not cover. Important factor is that within you web application you can make all the computation in any R package you prefer as the web application is already working in R. Good example is shinyBI shiny app mentioned above in Extras which uses data.table backend for best performance.
Support and licensing
Open source is often considered as problem in terms of support and licensing.
On the enterprise level it is usually necessary to implement the solutions for which support is already contracted.
It is very resonable, but people should be aware of very good alternatives.
This gap has been filled by companies like Revolution Analytics or RStudio.
Besides of big R companies there are other ways to get (even more) effective support like contracting support from:
- packages authors, contributors
- stackoverflow top answerers for a tag
- tons of R consultancy companies
In terms of licensing, there are many products which are available under dual license. If there is no such for a product you are interested in, then you can try to integrate to the current license, it is often possible, or ask the author for a dual license version.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.