A room full of Julians
Despite winter rain, I was delighted to head uptown last week to Skills Matter on the old Goswell Road for the first ever London Julia meetup. The first thing I learnt was that Julia’s friends are called Julians.
If you don’t know it yet, Julia is a pretty new programming language for fast numerical computing (v0.3 is current as I write). Everything is designed from the ground up for speed, by some very clever people. They claim speeds consistently close to compiled machine code, which is generally the upper limit, like the speed of light. But a few facts make it potentially Revolutionary Computing: you don’t have to compile it before running, you can mess about in a command-line interface to learn it, it’s free and open source, you can call C functions directly from inside normal Julia code – and vice versa – and the syntax is light as eiderdown, with Lisp-ish metaprogramming underneath (there are some nice comparative examples of this on the homepage).
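That C point deserves a quick demonstration: calling a function in a shared C library needs no wrapper or glue code at all. A minimal sketch, assuming the standard C maths library is on your system:

ccall((:sqrt, "libm"), Float64, (Float64,), 64.0)    # call C’s sqrt directly; returns 8.0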
The focus was on getting started, and the room was packed. Personally, I spent some time playing with it last year and then let it lapse, but now with v0.3 out there it seems to be time to get back up to speed.
For stats people, there are a few important packages to install: Distributions, Stats, DataFrames, HypothesisTests, and possibly Optim and MCMC, depending on your own interests. That’s all pretty straightforward, but when you start up Julia or load one of the packages like this:
using HypothesisTests
it takes a noticeable while to get ready. This is an artefact of the just-in-time (JIT) compiler and of the fact that almost all of the packages, and the standard library too, are written in Julia itself: when you first need something, it gets compiled, and after that it should be super-fast. Apparently a package is on the way to supply a pre-compiled standard library, to improve startup speeds.
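You can watch the warm-up happen for yourself; here is a quick sketch (exact timings will vary):

using HypothesisTests
@time pvalue(EqualVarianceTTest(randn(10), randn(10)))   # first call: includes JIT compilation
@time pvalue(EqualVarianceTTest(randn(10), randn(10)))   # second call: already compiled, much faster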
Here’s a little power simulation I tried out afterwards:
using HypothesisTests

starttime = time()
nsig = 0
for i in 1:100000
    xx = 140 + 15*randn(10)    # sample of 10: mean 140, SD 15
    yy = 135 + 15*randn(10)    # sample of 10: mean 135, SD 15
    sig = pvalue(EqualVarianceTTest(xx, yy)) < 0.05 ? 1 : 0
    nsig = nsig + sig
end
time() - starttime
This does 100,000 simulations of an independent-samples t-test with sample size 10 per group, means 140 and 135, and SD 15. Once the package was loaded, it took 5.05 seconds on a teeny weeny Samsung N110 ‘netbook’ with a 1.6GHz Atom CPU and 1GB RAM (not what you would normally use!).
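Incidentally, Julia regulars would advise wrapping a loop like this in a function, because the JIT compiler optimises function bodies much better than code run in global scope. A sketch of the same simulation (power_sim is a hypothetical name of my own):

using HypothesisTests

function power_sim(nsim)
    nsig = 0
    for i in 1:nsim
        xx = 140 + 15*randn(10)
        yy = 135 + 15*randn(10)
        if pvalue(EqualVarianceTTest(xx, yy)) < 0.05
            nsig += 1
        end
    end
    return nsig/nsim    # estimated power
end

@time power_sim(100000)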
In R, you could do this at least two ways. First, a supposedly inefficient looped form:
Sys.time()
nsig <- 0
for (i in 1:100000) {
  xx <- rnorm(10, mean=140, sd=15)
  yy <- rnorm(10, mean=135, sd=15)
  if (t.test(xx, yy)$p.value < 0.05) {
    nsig <- nsig + 1
  }
}
Sys.time()
print(nsig)
Next, a supposedly more efficient vectorized form:
tp <- function(x) {
  return(t.test(x[,1], x[,2])$p.value)
}
Sys.time()
xx <- array(c(rnorm(1000000, mean=140, sd=15),
              rnorm(1000000, mean=135, sd=15)),
            dim=c(100000, 10, 2))
pp <- apply(xx, 1, tp)
ppsig <- (pp < 0.05)
table(ppsig)
nsig <- sum(ppsig)
# or in one line: nsig <- sum(apply(xx, 1, tp) < 0.05)
Sys.time()
print(nsig)
In fact, the looped version was slightly quicker, at 2 minutes 3 seconds compared to 2 minutes 35; presumably apply() is just a loop under the bonnet, and the time is dominated by the t.test() call itself. While we’re about it, let’s run it in Stata too:
clear all
timer on 1
set obs 10
local p = 0
gen x = .
gen y = .
forvalues i = 1/1000 {
    qui replace x = rnormal(140,15)
    qui replace y = rnormal(135,15)
    qui ttest x == y, unpaired
    if r(p)<0.05 local p = `p'+1
}
dis `p'
timer off 1
timer list

Note that this loop runs only 1,000 simulations: it took 30 seconds, so we are looking at about 50 minutes for the whole 100,000. Stata black belts would complain that the standard language is not the best tool for this sort of heavy-duty number-crunching, though, so I asked top clinical trial statistician Dan Bratton for some equivalent code in the highly optimised Mata language:
timer clear 1
timer on 1
mata:
reps = 100000
n = (10 \ 10)
m = (140 , 135)
s = (15 , 15)
pass = 0
for (i=1; i<=reps; i++) {
    X = rnormal(10, 2, m, s)    // 10x2 draw: one column per group
    mhat = mean(X)
    v = variance(X)
    df = n[1] + n[2] - 2
    t = (mhat[1]-mhat[2]) / sqrt((1/n[1]+1/n[2])*((n[1]-1)*v[1,1]+(n[2]-1)*v[2,2])/df)
    p = 2*ttail(df, abs(t))     // two-sided p-value
    if (p<0.05) pass = pass+1
}
pass/reps
end
timer off 1
timer list 1
… which clocked in at 7 seconds. I’m not going to try anything more esoteric because I’m interested in the speed for those very pragmatic simulations such as sample size calculations, which the jobbing statistician must do quite often. (Actually, there is an adequate approximation formula for t-tests that means you would never do this simulation.)
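For the record, that approximation is simple: with true difference δ, common SD σ, n per group and two-sided α = 0.05, power ≈ Φ(δ/(σ√(2/n)) − 1.96). A quick sketch of the arithmetic in Julia, ignoring the small-sample t correction:

using Distributions
delta = 140 - 135     # true difference in means
se = 15 * sqrt(2/10)  # SE of the difference: SD 15, n = 10 per group
power = ccdf(Normal(), quantile(Normal(), 0.975) - delta/se)   # roughly 0.11

So these simulated tests should come up significant about 11% of the time.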
That time difference surprised me, to say the least. It means that Julia is an option to take very seriously indeed for heavy-duty statistical calculations, and it really isn’t hard to learn. However, I don’t know of any scientific papers published yet that used Julia instead of more established software. Perhaps a version number of 0.x would worry editors and reviewers, but surely v1.0 is not far away now.