Site icon R-bloggers

Using PL/R and PL/Python in Postgres

[This article was first published on Data Science Riot!, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve recently been exploring options to calculate median and quartiles in my Postgres database. If you’re familiar with quartiles you know how handy they can be. There’s a few different options in the Postgres universe to accomplish this, so I figured I would give them all a whirl and see which was the friendliest (and fastest) on my CPU.

The Data

I’m using the “batting” table for Sean Lahman’s baseball database as my proof of concept. The table has just under 100,000 rows. Not too big, but a good test case. For my example here, I’m using the “r” column, which indicates total runs scored for a season.

The R Method

PL/R is a popular Postgres extension. If you haven’t checked it out, I would highly recommend it.

CREATE OR REPLACE FUNCTION r_quartile(ANYARRAY) RETURNS ANYARRAY AS $$
quantile(arg1, probs = seq(0, 1, 0.25), names = FALSE)
$$ LANGUAGE 'plr';

CREATE AGGREGATE quartile (ANYELEMENT) (
sfunc = array_append,
stype = ANYARRAY,
finalfunc = r_quartile,
initcond = '{}');

The Python Method

I’m a big fan of Python in general. It’s currently one of my favorite ETL languages. Here I’m using yet another great Postgres extension called PL/Python.

CREATE TYPE boxplot_values AS (
  min       numeric,
  q1        numeric,
  median    numeric,
  q3        numeric,
  max       numeric
);

CREATE OR REPLACE FUNCTION public._final_boxplot(strarr numeric[])
  RETURNS boxplot_values AS
$BODY$
    x = strarr
    a.sort()
    i = len(a)
    return ( a[0], a[i//4], a[i//2], a[i*3//4], a[-1] )
$BODY$
  LANGUAGE plpythonu IMMUTABLE
  COST 100;
 
CREATE AGGREGATE boxplot(numeric) (
  SFUNC=array_append,
  STYPE=numeric[],
  FINALFUNC=_final_boxplot,
  INITCOND='{}'
);

The C Method

Everyone remember C from your CS-101 class in college? Yeah, that’s why no one likes to write it. Fortunately, this is a pre-packaged Postgres extension written in C called Quantile. I’m not going to post the mile-long C code here, but you can see it on the GitHub repo.

To leave a comment for the author, please follow the link and comment on their blog: Data Science Riot!.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.