ggplot2 in Python: A major barrier broken
I have been working with Python recently and I have to say, I love it. There’s a learning curve, of course, which has been frustrating. However, once I got comfortable with it (and I am still getting more comfortable), I found that working with dataframes in Python (via pandas) is easy and fast, and that pandas can do some awesome things in a pretty simple manner. That’s a subject for another time.
One other thing (among many) that I like about Python is matplotlib, the plotting module. It makes great plots in very few lines of code. Its default settings are great, so it doesn’t take a whole lot of tweaking to make a publishable plot. It is also very intuitive.
There are two major barriers that stop me from adopting Python all out. The first is the infancy of its stats packages, and since I do mostly data analysis as opposed to modeling, this is a problem. Python has a large number of basic stats modules, including some that allow a formula interface akin to R. However, it lacks things like statistical non-linear regression (beyond simple curve fitting) and mixed effects models. R’s stats functions are much more numerous and under much more rapid development for specialized applications like animal movement, species distribution modeling, econometrics, etc. Given the rate of development in Python, I doubt that this problem will last much longer. In the meantime, the recent implementation of Bayesian modelling via Stan in Python (as PyStan) by Andrew Gelman and his team has removed the major issue, as I can now run almost any analysis (linear, non-linear, multi-level) for which I can write a Bayesian MCMC model. More on this later, because PyStan is awesome but still young. In particular, implementing mcmcplots for Stan fits would be amazing.
The second major barrier was trellis plots. R has ggplot2, which makes trellis plotting ludicrously simple. It also allows me to plot summary functions (like means and S.E.) without having to aggregate these values myself first. This isn’t a problem on a simple plot, but it rapidly becomes cumbersome on a multi-panel plot wherein these summary statistics need to be calculated for each panel. ggplot2 allows me to circumvent this. There is an active port of ggplot2 to Python ongoing, but it too is still young and many functions are incomplete (e.g. boxplots). The good news is that I’ve discovered rpy2, which allows me to call R functions (like ggplot!) in Python. I can do all of my data sorting, stats, etc. in Python, then use rpy2/ggplot to make the plots.
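To give a sense of the bookkeeping that ggplot2 spares me, here is a minimal sketch of computing a mean and standard error for each panel by hand. The function name `summarize_by_panel` and the toy data are made up for illustration, and it uses only the standard library so it runs anywhere; with a real dataset I would more likely do this with a pandas groupby.

```python
import math
import statistics
from collections import defaultdict


def summarize_by_panel(rows):
    """Return {panel: (mean, standard_error)} for (panel, value) pairs.

    Hypothetical helper for illustration only.
    """
    # Collect the values belonging to each panel.
    groups = defaultdict(list)
    for panel, value in rows:
        groups[panel].append(value)

    summary = {}
    for panel, values in groups.items():
        mean = statistics.mean(values)
        # Standard error = sample standard deviation / sqrt(n).
        # It is undefined for a single observation, so report NaN there.
        if len(values) > 1:
            se = statistics.stdev(values) / math.sqrt(len(values))
        else:
            se = float("nan")
        summary[panel] = (mean, se)
    return summary


# Toy data: two panels, "a" and "b".
rows = [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)]
for panel, (mean, se) in sorted(summarize_by_panel(rows).items()):
    print(panel, round(mean, 3), round(se, 3))
```

Every panel needs its own pass like this before anything can be drawn, which is exactly the step that gets tedious as the number of panels grows.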
A short example:
    # Import the necessary modules
    import numpy as np
    import pandas as pd
    import rpy2.robjects as robj
    import rpy2.robjects.pandas2ri  # for dataframe conversion
    from rpy2.robjects.packages import importr

    # First, make some random data
    x = np.random.normal(loc = 5, scale = 2, size = 10)
    y = x + np.random.normal(loc = 0, scale = 2, size = 10)

    # Make these into a pandas dataframe. I do this because
    # more often than not, I read in a pandas dataframe, so this
    # shows how to use a pandas dataframe to plot in ggplot
    testData = pd.DataFrame( {'x':x, 'y':y} )

    # it looks just like a dataframe from R
    print(testData)

    # Next, make an robject containing a function that makes the plot.
    # The language in the function is pure R, so it can be anything.
    # Note that the R environment is blank to start, so ggplot2 has to be
    # loaded
    plotFunc = robj.r("""
     library(ggplot2)

     function(df){
      p <- ggplot(df, aes(x, y)) +
       geom_point( )

      print(p)
     }
    """)

    # import graphics devices. This is necessary to shut the graph off
    # otherwise it just hangs and freezes python
    gr = importr('grDevices')

    # convert the testData to an R dataframe
    # (py2ri is the name in the rpy2 version used here; rpy2 3.x renamed it py2rpy)
    robj.pandas2ri.activate()
    testData_R = robj.conversion.py2ri(testData)

    # run the plot function on the dataframe
    plotFunc(testData_R)

    # ask for input. This requires you to press enter, otherwise the plot
    # window closes immediately
    # (raw_input is Python 2; on Python 3 use input() instead)
    raw_input()

    # shut down the window using dev_off()
    gr.dev_off()

    # you can even save the output once you like it
    plotFunc_2 = robj.r("""
     library(ggplot2)

     function(df){
      p <- ggplot(df, aes(x, y)) +
       geom_point( ) +
       theme(
        panel.background = element_rect(fill = NA, color = 'black')
       )

      ggsave('rpy2_magic.pdf', plot = p, width = 6.5, height = 5.5)
     }
    """)

    plotFunc_2(testData_R)
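Because the R code is handed over as a plain string, it can also be assembled in Python first, so the column names and the output file are not hard-coded. The sketch below is one hypothetical way to do that. `make_plot_function_source` and its parameters are illustrative names of my own, not part of rpy2, and the name check is a deliberately conservative assumption about what makes a safe R column name. It only builds the string, so it runs without R installed.

```python
import re


def make_plot_function_source(x_col, y_col, outfile=None):
    """Return R source for a function(df) that scatter-plots two columns.

    Hypothetical helper for illustration only. If outfile is None the R
    function prints the plot; otherwise it saves the plot with ggsave.
    """
    # Conservative check (an assumption, not R's full naming rules):
    # start with a letter, then letters, digits, underscores or dots.
    for name in (x_col, y_col):
        if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_.]*", name):
            raise ValueError("unsafe column name: %r" % (name,))

    lines = [
        "library(ggplot2)",
        "function(df){",
        "  p <- ggplot(df, aes(%s, %s)) + geom_point()" % (x_col, y_col),
    ]
    if outfile is None:
        lines.append("  print(p)")
    else:
        # Quotes or backslashes would break the single-quoted R string.
        if "'" in outfile or "\\" in outfile:
            raise ValueError("unsafe file name: %r" % (outfile,))
        lines.append(
            "  ggsave('%s', plot = p, width = 6.5, height = 5.5)" % outfile
        )
    lines.append("}")
    return "\n".join(lines)


# Show the generated R code for the save-to-file case.
print(make_plot_function_source("x", "y", outfile="rpy2_magic.pdf"))
```

The returned string could then be passed to robj.r(...) in the same way as the hard-coded strings in the example above.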
So there you have it. Python now has the capability to access ggplot2, although it can be a bit cumbersome. rpy2 is pretty great, and allows access to any R function, albeit with a bit of work. Thus, Python can serve as a main platform which can access R functions for statistics and graphics on an as-needed basis. My hope is that the yhat team finishes the port soon, and I’ll bet the statistics packages catch up in the next few years.