Writing Julia functions in R with examples
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The Julia programming language is growing fast and its efficiency and speed is now well-known. Even-though I think R is the best language for Data Science, sometimes we just need more. Modelling is an important part of Data Science and sometimes you may need to implement your own algorithms or adapt existing models to your problems.
If performance is not essential and the complexity of your problem is small, R alone is enough. However, if you need to run the same model several times on large datasets and available implementations are not suit to your problem, you will need to go beyond R. Fortunately, you can go beyond R in R, which is great because you can do your analysis in R and call complex models from elsewhere. The book “Extending R” from John Chambers presents interfaces in R for C++, Julia and Python. The last two are in the XRJulia and in the XRPython packages, which are very straightforward.
XRJulia (Small example)
Now I will show how to write a small function in Julia and call it from R. You can also call existing Julia functions and run Julia packages (see the book and the package documentation for more information).
First let’s see if there is an existing Julia application in the system:
library(XRJulia) findJulia(test = TRUE) ## [1] TRUE
Great! My Julia is up and running. A good way to define your Julia functions is though the JuliaEval (you can also use JuliaSource to run scripts). This function evaluates a Julia code and returns an object. However, this does not allow us to call the Julia function. Next we need to tell R that this object is a function with JuliaFunction. Note that everything I am doing here is in Julia 0.5.2 and there may be some differences if you use older or newer versions.
# = Define a function in Julia for a linear regression = # regjl <- juliaEval(" function reg(x,y) n=size(x,1) xreg=hcat(ones(size(x)[1],1),x) k=size(xreg,2) p1=((xreg'xreg)^(-1)) b=p1*xreg'y r=y-xreg*b sig=(r'r)/(n-k) vmat=sig[1]*p1 sigb=sqrt(diag(vmat)) t=b./sigb return (b,t) end ") # = Tell R regjl is a function = # regjl_function=JuliaFunction(regjl)
Now we can just call the Julia function as if it was an R function. The data will automatically be converted for you by the package. However, there are some possible pitfalls. For example, Julia’s coercion is different from R. It does not understand that a round float may be interpreted as an integer. If, for example, one of the arguments in the Julia function is used to index variables, it needs to be integer and you may have to convert in the Julia function or use L in R (0L, 1L).
Let us generate some goofy data just to run a regression. We have 30 explanatory variables. Once we call the Julia regression function we will get an object that prints something like this: Julia proxy object Server Class: Array{Array{Float64,1},1}; size: 2.
# = Generate Data = # set.seed(1) x=matrix(rnorm(9000),300,30) y=rnorm(300) # = Run Julia Regression = # test=regjl_function(x,y) # = Convert Julia object to R = # JLreg=juliaGet(test)[[1]] # = Run regression using lm and compare = # Rreg=coef(lm(y~x)) cbind(Rreg,JLreg) ## Rreg JLreg ## (Intercept) 0.039077527 0.039077527 ## x1 -0.007701335 -0.007701335 ## x2 -0.056821632 -0.056821632 ## x3 0.105108373 0.105108373 ## x4 0.067983306 0.067983306 ## x5 0.008417105 0.008417105 ## x6 0.078409799 0.078409799 ## x7 0.125869518 0.125869518 ## x8 -0.083682827 -0.083682827 ## x9 0.071740308 0.071740308 ## x10 -0.005219314 -0.005219314 ## x11 -0.031737503 -0.031737503 ## x12 -0.059554921 -0.059554921 ## x13 -0.006208602 -0.006208602 ## x14 -0.020050620 -0.020050620 ## x15 0.008620698 0.008620698 ## x16 -0.030422210 -0.030422210 ## x17 -0.070605635 -0.070605635 ## x18 0.085396818 0.085396818 ## x19 -0.039982253 -0.039982253 ## x20 0.066555544 0.066555544 ## x21 -0.059725849 -0.059725849 ## x22 -0.037880127 -0.037880127 ## x23 -0.009742695 -0.009742695 ## x24 0.074308087 0.074308087 ## x25 0.067308613 0.067308613 ## x26 0.049848935 0.049848935 ## x27 -0.020548783 -0.020548783 ## x28 -0.014468850 -0.014468850 ## x29 -0.038258085 -0.038258085 ## x30 0.102003013 0.102003013
A big Example: Performance
If you just needed an example to adapt to your own problems you do not need to read the rest of this post. However, it is interesting to see how a complex Julia call performs in R. The main question is: is it worthy to use Julia in R even with all the data conversions we need to do from R to Julia and Julia to R? I will answer with an example!!!
The model I will estimate is the Complete Subset Regression (CSR). Basically, it is a combination of many (really many!!) linear regressions. If you want more information on the CSR click here. In this design, we will need to estimate 4845 regressions. The Julia function for the CSR is in the chunk below and the R function can be downloaded from my github here or using install_github(“gabrielrvsc/HDeconometrics”) (devtools must be loaded).
# = Load Combinatorics package in Julia. You need to install it. = # # = In julia, type: Pkg.add("Combinatorics") comb=juliaEval("using Combinatorics") # = Evaluate the CSR function = # csrjl <- juliaEval(" function csr(x,y,k,K,fixed_controls) fixed_controls=floor(Int,fixed_controls) if fixed_controls!=0 w=x[:,fixed_controls] nonw=setdiff(1:size(x,2),fixed_controls) end if fixed_controls[1]==0 nonw=1:size(x,2) end save_stat=zeros(size(x,2),2) save_stat[:,2]=1:size(x,2) for i in nonw if fixed_controls[1]==0 (betas,tstat)=reg(x[:,i],y) save_stat[i,1]=abs(tstat[2]) end if fixed_controls!=0 (betas,tstat)=reg(hcat(w,x[:,i]),y) save_stat[i,1]=abs(tstat[end]) end end t_ord = sortperm(save_stat[:,1],rev=true) save_stat=save_stat[t_ord,:] selected=save_stat[1:K,2] comb=collect(Base.combinations(selected,k)) m=size(comb)[1] final_coef=zeros(m,size(x,2)) final_const=zeros(m) if fixed_controls!=0 for i in 1:m xr=hcat(x[:,fixed_controls],x[:,floor(Int,comb[i])]) (model,tstat)=reg(xr,y) final_const[i]=model[1] model1=model[2:end] if length(fixed_controls)>1 final_coef[i,fixed_controls]=model1[1:(size(fixed_controls)[1])] model2=model1[size(fixed_controls)[1]+1:end] else final_coef[i,fixed_controls]=model1[1] model2=model1[2:end] end final_coef[i,floor(Int,comb[i])]=model2 end elseif fixed_controls[1]==0 for i in 1:m (model,tstat)=reg(x[:,floor(Int,comb[i])],y) final_const[i]=model[1] final_coef[i,floor(Int,comb[i])]=model[2:end] end end aux=hcat(final_const,final_coef) result=[mean(aux[:,i]) for i in 1:size(aux)[2]] end ") # = Define it as a function to R = # csrjl_function=JuliaFunction(csrjl)
Finally, lets compare the R implementation and the Julia implementation. The R CSR is not as good as it could be, I still have some improvements to do. However, the Julia CSR is also far from perfect given my skills in Julia.
# = Run Julia CSR = # t1=Sys.time() test=csrjl_function(x,y,4L,20L,0) t2=Sys.time() t2-t1 ## Time difference of 0.5599275 secs # = Run R CSR = # t1=Sys.time() testR=HDeconometrics::csr(x,y) t2=Sys.time() t2-t1 ## Time difference of 5.391584 secs
As you can see, the Julia implementation is nearly 10 times faster than the R implementation. Most fast machine learning functions implemented in R such as glmnet or RandomForest are not written in R. Instead, these functions calls other languages like C++ and Fortran. Now you can do these things yourself with Julia and Python, which are much easier to learn. The last chunk below just shows that the two functions return the same result.
cbind("R"=colMeans(coef(testR)),"Julia"=juliaGet(test)) ## R Julia ## intersect 0.033839276 0.033839276 ## 0.000000000 0.000000000 ## -0.010053924 -0.010053924 ## 0.020620605 0.020620605 ## 0.011568725 0.011568725 ## 0.011807227 0.011807227 ## 0.010009517 0.010009517 ## 0.028981984 0.028981984 ## -0.008858019 -0.008858019 ## 0.013655475 0.013655475 ## 0.000000000 0.000000000 ## -0.008555735 -0.008555735 ## -0.014146991 -0.014146991 ## 0.000000000 0.000000000 ## 0.000000000 0.000000000 ## 0.000000000 0.000000000 ## 0.000000000 0.000000000 ## -0.010730430 -0.010730430 ## 0.015761635 0.015761635 ## -0.009543594 -0.009543594 ## 0.015337589 0.015337589 ## -0.008913380 -0.008913380 ## -0.009944097 -0.009944097 ## 0.000000000 0.000000000 ## 0.008933383 0.008933383 ## 0.010226204 0.010226204 ## 0.000000000 0.000000000 ## 0.000000000 0.000000000 ## 0.000000000 0.000000000 ## -0.007886220 -0.007886220 ## 0.021236322 0.021236322
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.