R for actuarial science

Posted on January 10, 2013 by arthur charpentier in Uncategorized | 0 Comments

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As mentioned in the Appendix of Modern Actuarial Risk Theory, “R (and S) is the ‘lingua franca’ of data analysis and statistical computing, used in academia, climate research, computer science, bioinformatics, pharmaceutical industry, customer analytics, data mining, finance and by some insurers. Apart from being stable, fast, always up-to-date and very versatile, the chief advantage of R is that it is available to everyone free of charge. It has extensive and powerful graphics abilities, and is developing rapidly, being the statistical tool of choice in many academic environments.”

R is based on the S statistical programming language developed by Joe Chambers at Bell labs in the 80’s. To be more specific, R is an open-source implementation of the S language, developed by Robert Gentlemn and Ross Ihaka. It is a vector based language, which makes it extremely interesting for actuarial computations. For instance, consider some Life Tables,

> TD[39:52,]       > TV[39:52,]
     Age    Lx         Age    Lx
  39  38 95237          38 97753
  40  39 94997          39 97648
  41  40 94746          40 97534
  42  41 94476          41 97413
  43  42 94182          42 97282
  44  43 93868          43 97138
  45  44 93515          44 96981
  46  45 93133          45 96810
  47  46 92727          46 96622
  48  47 92295          47 96424
  49  48 91833          48 96218
  50  49 91332          49 95995
  51  50 90778          50 95752
  52  51 90171          51 95488

Those (French) Life Tables can be found here

> TD <- read.table(
+ "http://perso.univ-rennes1.fr/arthur.charpentier/TD8890.csv",sep=";",header=TRUE)
> TV <- read.table(
+ "http://perso.univ-rennes1.fr/arthur.charpentier/TV8890.csv",sep=";",header=TRUE)

From those vectors, it is possible to construct the matrix of death probabilities, $http://latex.codecogs.com/gif.latex?\boldsymbol{P}=[\text{%20}_{k}p_x]$ , using for instance

>  Lx <- TD$Lx
>  m <- length(Lx)
>  p <- matrix(0,m,m); d <- p
>  for(i in 1:(m-1)){
+  p[1:(m-i),i] <- Lx[1+(i+1):m]/Lx[i+1]
+  d[1:(m-i),i] <- (Lx[(1+i):(m)]-Lx[(1+i):(m)+1])/Lx[i+1]}
>  diag(d[(m-1):1,]) <- 0
>  diag(p[(m-1):1,]) <- 0
>  q <- 1-p

One can compute easily, e.g., the (curtate) expectation of life defined as

$http://latex.codecogs.com/gif.latex?e_x%20=\mathbb{E}(K_x)=\sum_{k=1}^\infty%20k\cdot%20\text{%20}_{k|1}q_x%20=%20\sum_{k=1}^\infty%20\text{%20}_{k}p_x$

and one can compute the vector of life expectancy, at various ages $http://latex.codecogs.com/gif.latex?\boldsymbol{e}=[e_x]$ , as

> life.exp = function(x){sum(p[1:nrow(p),x])}
> e = Vectorize(life.exp)(1:m)

An actually, any kind of actuarial quantity can be derived from those matrices. The expected present value (or actuarial value) of a temporary life annuity-due is, for instance,

$http://latex.codecogs.com/gif.latex?\ddot{a}_{x:\overline{n}|}=\sum_{k=0}^{n-1}%20\nu^k%20\cdot%20{}_{k}p_x%20=\frac{1-A_{x:\overline{n}|}}{1-\nu}$

The code to compute those functions is here

> for(j in 1:(m-1)){ adots[,j]<-cumsum(1/(1+i)^(0:(m-1))*c(1,p[1:(m-1),j])) }

or consider the expected present value of a term insurance

$http://latex.codecogs.com/gif.latex?%20A^1_{x:\overline{n}|}%20=\sum_{k=0}^{n-1}%20\nu^{k+1}%20\cdot%20\text{%20}_{k|}q_x$

with the following code

> for(j in 1:(m-1)){ A[,j]<-cumsum(1/(1+i)^(1:m)*d[,j]) }

Some more details can be found in the first part of the notes of the crash courses of last summer, in Meielisalp. Vector – or matrices – are extremely convenient to work with, when dealing with life contingencies. It is also possible to model prospective mortality. Here, the mortality is not only function of the age $http://latex.codecogs.com/gif.latex?x$ , but also time $http://latex.codecogs.com/gif.latex?t$ ,

> t(DTF)[1:10,1:10]
    1899  1900  1901  1902  1903  1904  1905  1906  1907  1908
0  64039 61635 56421 53321 52573 54947 50720 53734 47255 46997
1  12119 11293 10293 10616 10251 10514  9340 10262 10104  9517
2   6983  6091  5853  5734  5673  5494  5028  5232  4477  4094
3   4329  3953  3748  3654  3382  3283  3294  3262  2912  2721
4   3220  3063  2936  2710  2500  2360  2381  2505  2213  2078
5   2284  2149  2172  2020  1932  1770  1788  1782  1789  1751
6   1834  1836  1761  1651  1664  1433  1448  1517  1428  1328
7   1475  1534  1493  1420  1353  1228  1259  1250  1204  1108
8   1353  1358  1255  1229  1251  1169  1132  1134  1083   961
9   1175  1225  1154  1008  1089   981  1027  1025   957   885

Thus, we now have a force of mortality matrix $http://latex.codecogs.com/gif.latex?\boldsymbol{\mu}=[\mu_{x,t}]$ , or surface

It is also possible to use R packages to estimate a Lee-Carter model of the mortality rate,

$http://latex.codecogs.com/gif.latex?\log%20\mu%20_{x,t}%20=\alpha%20_{x}%20+\beta%20_{x}%20\cdot%20\kappa_{t}%20+\varepsilon%20_{x,t}$

> library(demography)
> MUH =matrix(DEATH$Male/EXPOSURE$Male,nL,nC)
> POPH=matrix(EXPOSURE$Male,nL,nC)
> BASEH <- demogdata(data=MUH, pop=POPH, ages=AGE, years=YEAR, type="mortality",
+ label="France", name="Hommes", lambda=1)
> RES=residuals(LCH,"pearson")

One can easily study residuals, for instance as a function of the age,

or a function of the year,

Some more details can be found in the second part of the notes of the crash courses of last summer, in Meielisalp.

R is also interesting because of its huge number of libraries, that can be used for predictive modeling. One can easily use smoothing functions in regression, or regression trees,

> TREE = tree((nbr>0)~ageconducteur,data=sinistres,split="gini",mincut = 1)
> age = data.frame(ageconducteur=18:90)
> y1 = predict(TREE,age)
> reg = glm((nbr>0)~bs(ageconducteur),data=sinistres,family="binomial")
> y = predict(reg,age,type="response")

Some practitioners might be scared because the legend claims that R is not as good as SAS to handle large databases. Actually, a lot of functions can be used to import datasets. The most convenient one is probably

> baseCOUT = read.table("http://freakonometrics.free.fr/baseCOUT.csv",
+  sep=";",header=TRUE,encoding="latin1")
>  tail(baseCOUT,4)
     numeropol  debut_pol    fin_pol freq_paiement langue  type_prof alimentation type_territoire
6512     87291 2002-10-16 2003-01-22       mensuel      A Professeur   Vegetarien          Urbain
6513     87301 2002-10-01 2003-09-30       mensuel      A Technicien   Vegetarien          Urbain
6514     87417 2002-10-24 2003-10-21       mensuel      F Technicien   Vegetalien     Semi-urbain
6515     88128 2003-01-17 2004-01-16       mensuel      F     Avocat   Vegetarien     Semi-urbain
             utilisation presence_alarme marque_voiture sexe exposition age duree_permis age_vehicule i   coutsin
6512 Travail-occasionnel             oui           FORD    M  0.2684932  47           29           28 1 1274.5901
6513              Loisir             oui          HONDA    M  0.9972603  44           24           25 1  278.0745
6514 Travail-occasionnel             non     VOLKSWAGEN    F  0.9917808  23            3           11 1  403.1242
6515              Loisir             non           FIAT    F  0.9972603  23            4           11 1  230.9565

But if the dataset is too large, it is also possible to specify which variables might be interesting, using

> mycols = rep("NULL", 18)
> mycols[c(1,4,5,12,13,14,18)] <- NA
> baseCOUTsubC = read.table("http://freakonometrics.free.fr/baseCOUT.csv",
+  colClasses = mycols,sep=";",header=TRUE,encoding="latin1")
> head(baseCOUTsubC,4)
  numeropol freq_paiement langue sexe exposition age    coutsin
1         6        annuel      A    M  0.9945205  42   279.5839
2        27       mensuel      F    M  0.2438356  51   814.1677
3        27       mensuel      F    M  1.0000000  53   136.8634
4        76       mensuel      F    F  1.0000000  42   608.7267

It is also possible (before running a code on the entire dataset) to import only the first lines of the dataset.

> baseCOUTsubCR = read.table("http://freakonometrics.free.fr/baseCOUT.csv",
+  colClasses = mycols,sep=";",header=TRUE,encoding="latin1",nrows=100)
> tail(baseCOUTsubCR,4)
    numeropol freq_paiement langue sexe exposition age   coutsin
97       1193       mensuel      F    F  0.9972603  55  265.0621
98       1204       mensuel      F    F  0.9972603  38 9547.7267
99       1231       mensuel      F    M  1.0000000  40  442.7267
100      1245        annuel      F    F  0.6767123  48  179.1925

It is also possible to import a zipped file. The file itself has a smaller size, and it can usually be imported faster.

> import.zip = function(file){
+ temp = tempfile()
+ download.file(file,temp);
+ read.table(unz(temp, "baseFREQ.csv"),sep=";",header=TRUE,encoding="latin1")}
> system.time(import.zip("http://freakonometrics.free.fr/baseFREQ.csv.zip"))
trying URL 'http://freakonometrics.free.fr/baseFREQ.csv.zip'
Content type 'application/zip' length 692655 bytes (676 Kb)
opened URL
==================================================
downloaded 676 Kb
   user  system elapsed 
      0.762       0.029       4.578 
> system.time(read.table("http://freakonometrics.free.fr/baseFREQ.csv", 
+ sep=";",header=TRUE,encoding="latin1"))
   user  system elapsed 
      0.591       0.072       9.277

Finally, note that it is possible to import any kind of dataset, not only a text file. Even a Microsoft Excel folder. On a Windows computer, one can use SQL queries

> sheet = "c:\\Documents and Settings\\user\\excelsheet.xls"
> connection = odbcConnectExcel(sheet)
> spreadsheet = sqlTables(connection)
> query = paste("SELECT * FROM",spreadsheet$TABLE_NAME[1],sep=" ")
> result = sqlQuery(connection,query)

Then, once the dataset is imported, several functions can be used,

> cost = aggregate(coutsin~ AgeSex,mean, data=baseCOUT)
> frequency = merge(aggregate(nbsin~ AgeSex,sum, data=baseFREQ),
+ aggregate(exposition~ AgeSex,sum, data=baseFREQ))
> frequency$freq = frequency$nbsin/frequency$exposition
> base.freq.cost = merge(frequency, cost)

Finally, R is interesting for its graphical interface. “If you can picture it in your head, chances are good that you can make it work in R. R makes it easy to read data, generate lines and points, and place them where you want them. Its very flexible and super quick. When youve only got two or three hours until deadline, R can be brilliant” as said Amanda Cox, a graphics editor at the New York Times. “R is particularly valuable in deadline situations when data is scant and time is precious.”.
Several cases were considered on the blog http ://chartsnthings.tumblr.com/…. First, we start with a simple graph, here State Government control in the US

Then try to find a nice visual representation, e.g.

And finally, you can just print it in your favorite newspaper,

And you can get any kind of graphs,

And not only about politics,

Graphs are important. “Its not just about producing graphics for publication. Its about playing around and making a bunch of graphics that help you explore your data. This kind of graphical analysis is a really useful way to help you understand what you’re dealing with, because if you cant see it, you cant really understand it. But when you start graphing it out, you can really see what you’ve got” as said Peter Aldhous, San Francisco bureau chief of New Scientist magazine. Even for actuaries. “The commercial insurance underwriting process was rigorous but also quite subjective and based on intuition. R enables us to communicate our analytic results in appealing and innovative ways to non-technical audiences through rapid development lifecycles. R helps us show our clients how they can improve their processes and effectiveness by enabling our consultants to conduct analyses efficiently”, as explained by John Lucker, team of advanced analytics professionals at Deloitte Consulting Principal, in http://blog.revolutionanalytics.com/r-is-hot/. See also Andrew Gelman’s view, on graphs, http://www.stat.columbia.edu/…

So yes, actuaries might be interested to use R for actuarial communication, as mentioned in http ://www.londonr.org/…

The Actuarial Toolkit (see http ://www.actuaries.org.uk/…) stresses the interest of R, “The power of the language R lies with its functions for statistical modelling, data analysis and graphics ; its ability to read and write data from various data sources; as well as the opportunity to embed R in excel or other languages like VBA. In the way SAS is good for data manipulations, R is superior for modelling and graphical output“.

From 2011, Asia Capital Reinsurance Group (ACR) uses R to Solve Big Data Challenges (see http ://www.reuters.com/…). And Lloyd’s uses motion charts created with R to provide analysis to investors (as discussed on http ://blog.revolutionanalytics.com/…)

A lot of information can be found on http ://jeffreybreen.wordpress.com/…

Markus Gesmann mentioned on his blog a lot of interesting graphs used for actuarial reporting, http ://lamages.blogspot.ca/…

Further, R is free. Which can be compared with SAS, $6,000 per PC, or $28,000 per processor on a server (as mentioned on http ://en.wikipedia.org/…)

It is also becoming more and more popular, as a programming language. As mentioned on this month Transparent Language Popularity (see http ://lang-index.sourceforge.net/), R is ranked 12. Far away after C or Java, but before Matlab (22) or SAS (27). On StackOverFlow (see http ://stackoverflow.com/) is also far being C++ (399,232 occurrences) or Java (348,418), but with 21,818 occurrences, it appears before Matlab (14,580) and SAS (899). As mentioned on http ://r4stats.com/articles/popularity/ R is becoming more and more popular, on listserv discussion traffic

It is clearly the most popular software in data analysis, as mentioned by the Rexer Analytics survey, in 2009

What about actuaries ? In a survey (see http ://palisade.com/…), R was not extremely popular.

If we consider only statistical softwares, SAS is still far ahead, among UK and CAS actuaries

But, as mentioned by Mike King, Quantitative Analyst, Bank of America, “I cant think of any programming language that has such an incredible community of users. If you have a question, you can get it answered quickly by leaders in the field. That means very little downtime.” This was also mentioned by Glenn Meyers, in the Actuarial Review “The most powerful reason for using R is the community” (in http ://nytimes.com/…). For instance, http ://r-bloggers.com/ has contributions from more than 425 R users.

As said by Bo Cowgill, from Google “The best thing about R is that it was developed by statisticians. The worst thing about R is that it was developed by statisticians.”

Arthur Charpentier

Arthur Charpentier, professor at UQaM in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1. Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.