R in big data pipeline
R is my favorite tool for research. There are still quite a few things that only R can do, or that R does quicker and more easily than anything else.
But unfortunately, a lot of people think R loses its power at the production stage, where you really need to make sure that everything runs as planned against the incoming big data.
Personally, what makes R special in the data field is its ability to make friends with many other tools. R can easily call on JavaScript for data visualization, node.js for interactive web apps, and data pipeline tools and databases for a production-ready big data system.
In this post I show how to use R reliably in combination with other tools in a big data pipeline, without losing its awesomeness.
tl;dr
You’ll see how to include R in luigi, a lightweight Python data workflow management library. You can keep using R’s awesomeness in a complex big data pipeline while handing the heavy big data tasks to other, more appropriate tools.
I’m not covering luigi basics in this post. Please refer to the luigi website if necessary.
Simple pipeline
Here is a very simple example:
- HiveTask1: Wait for an external Hive data task (a table named "externaljob", partitioned by timestamp)
- RTask: Run the awesome R code as soon as the pre-aggregation finishes
- HiveTask2: Upload the result back to Hive as soon as the above job finishes (a table named "awesome", partitioned by timestamp)
And you want to run this job every day, in an easily debuggable fashion, with a fancy workflow UI.
That’s super easy; just run:
python awesome.py --HiveTask1-timestamp 2015-08-20
This runs the Python file called awesome.py. The option --HiveTask1-timestamp 2015-08-20 sets 2015-08-20 as the timestamp argument of the HiveTask1 class.
Yay, all the above tasks are now connected in the luigi task UI!
Notice our workflow goes from bottom to top.
You can see there is an error in the very first HiveTask2, but this is just by design.
Code
Let’s take a look at awesome.py.
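The full code is linked at the end of the post; below is only a minimal skeleton of its structure. Treat it as an illustrative sketch rather than the original file: the class bodies are simplified, and for readability each task carries its own DateParameter here instead of the global timestamp variable discussed further down.

import datetime
import subprocess

import luigi
from luigi.contrib.hdfs import HdfsTarget


class HiveTask1(luigi.ExternalTask):
    """Wait for the external Hive table 'externaljob', partitioned by timestamp."""
    timestamp = luigi.DateParameter(default=datetime.date.today())

    def output(self):
        return HdfsTarget('/user/storage/externaljob/timestamp=%s'
                          % self.timestamp.strftime('%Y%m%d'))


class RTask(luigi.Task):
    """Run the awesome R code once the external data is available."""
    timestamp = luigi.DateParameter(default=datetime.date.today())

    def requires(self):
        return HiveTask1(timestamp=self.timestamp)

    def output(self):
        return luigi.LocalTarget('awesome_is_here_%s.txt'
                                 % self.timestamp.strftime('%Y%m%d'))

    def run(self):
        # call the R script, handing over the timestamp (details below)
        subprocess.call('Rscript awesome.R %s'
                        % self.timestamp.strftime('%Y%m%d'), shell=True)


class HiveTask2(luigi.Task):
    """Upload the R output back to Hive: table 'awesome', partitioned by timestamp."""
    timestamp = luigi.DateParameter(default=datetime.date.today())

    def requires(self):
        return RTask(timestamp=self.timestamp)

    def output(self):
        return HdfsTarget('/user/hive/warehouse/awesome/timestamp=%s'
                          % self.timestamp.strftime('%Y%m%d'))

    def run(self):
        subprocess.call('Rscript update2hive.R %s'
                        % self.timestamp.strftime('%Y%m%d'), shell=True)


if __name__ == '__main__':
    luigi.run(main_task_cls=HiveTask2)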
Basically the file only contains three classes (HiveTask1, RTask and HiveTask2), and their dependencies are specified by each class's requires() method.
luigi checks the dependencies and outputs of each step, so it checks for the existence of:
'/user/storage/externaljob/timestamp=%s' % self.timestamp.strftime('%Y%m%d')
'awesome_is_here_%s.txt' % self.timestamp.strftime('%Y%m%d')
'/user/hive/warehouse/awesome/timestamp=%s' % self.timestamp.strftime('%Y%m%d')
The most important thing here is the use of Python's subprocess module with shell=True, so you can run your R file exactly as you would from the command line.
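Stripped of everything else, the heart of RTask's run() method is a single call (a sketch; I assume Rscript is the executable here, but R CMD BATCH would work just as well):

    def run(self):
        # run the R script exactly as you would on the command line
        subprocess.call('Rscript awesome.R', shell=True)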
The timestamp argument you gave at the very beginning is stored as a global variable timestamp (well, this is not necessarily the coolest option) and can then be used in the other tasks as well, for instance when building the partition paths listed above. Moreover, you can pass timestamp on to the R file.
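One way to do this is to append it to the command string as a trailing argument; a sketch (the exact formatting of the call is an assumption):

# 'timestamp' is the date object set from the command-line argument
subprocess.call('Rscript awesome.R %s' % timestamp.strftime('%Y%m%d'), shell=True)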
Then let's take a look at awesome.R. On the R side, you can receive the timestamp argument you passed from Python with commandArgs().
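A minimal sketch of what awesome.R could look like; only the argument handling and the output file name follow the conventions above, and the analysis itself is a placeholder:

# awesome.R -- illustrative sketch
args <- commandArgs(trailingOnly = TRUE)
timestamp <- args[1]   # e.g. "20150820", handed over by the luigi task

# ... fetch the pre-aggregated data for this partition and do your awesome R magic ...
result <- data.frame(timestamp = timestamp, answer = 42)   # placeholder result

# write the file that RTask's output() is waiting for
write.table(result, file = sprintf("awesome_is_here_%s.txt", timestamp),
            sep = "\t", row.names = FALSE, quote = FALSE)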
Similarly, update2hive.R can look like the following.
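Again just a sketch: here the upload happens by shelling out to the hive CLI with a LOAD DATA statement, and the table layout is an assumption rather than the original script:

# update2hive.R -- illustrative sketch
args <- commandArgs(trailingOnly = TRUE)
timestamp <- args[1]

# load the R output into the 'awesome' Hive table, partitioned by timestamp
query <- sprintf("LOAD DATA LOCAL INPATH 'awesome_is_here_%s.txt' INTO TABLE awesome PARTITION (timestamp='%s');",
                 timestamp, timestamp)
system(sprintf('hive -e "%s"', query))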
One last thing you might want to do is set up a cronjob.
0 1 * * * python awesome.py --HiveTask1-timestamp `date --date='+1 days' +\%Y-\%m-\%d`
This one, for example, runs the whole thing at 1 a.m. every day.
Conclusion
In this post I’ve shown a simple example of how to quickly turn your research R project into a solid, deployable product. This is not limited to simple R-Hive integration: you can let R, Spark, databases, Stan/BUGS, H2O, Vowpal Wabbit and millions of other data tools dance together as you wish, and you’ll see that R still plays a central role in the show.
Code
The full code is available here.
R in big data pipeline was originally published by Kirill Pomogajko at Opiate for the masses on August 16, 2015.