R for actuarial science
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
As mentioned in the Appendix of Modern Actuarial Risk Theory, “R (and S) is the ‘lingua franca’ of data analysis and statistical computing, used in academia, climate research, computer science, bioinformatics, pharmaceutical industry, customer analytics, data mining, finance and by some insurers. Apart from being stable, fast, always up-to-date and very versatile, the chief advantage of R is that it is available to everyone free of charge. It has extensive and powerful graphics abilities, and is developing rapidly, being the statistical tool of choice in many academic environments.”
R is based on the S statistical programming language developed by Joe Chambers at Bell labs in the 80’s. To be more specific, R is an open-source implementation of the S language, developed by Robert Gentlemn and Ross Ihaka. It is a vector based language, which makes it extremely interesting for actuarial computations. For instance, consider some Life Tables,
> TD[39:52,] > TV[39:52,] Age Lx Age Lx 39 38 95237 38 97753 40 39 94997 39 97648 41 40 94746 40 97534 42 41 94476 41 97413 43 42 94182 42 97282 44 43 93868 43 97138 45 44 93515 44 96981 46 45 93133 45 96810 47 46 92727 46 96622 48 47 92295 47 96424 49 48 91833 48 96218 50 49 91332 49 95995 51 50 90778 50 95752 52 51 90171 51 95488
Those (French) Life Tables can be found here
> TD <- read.table( + "http://perso.univ-rennes1.fr/arthur.charpentier/TD8890.csv",sep=";",header=TRUE) > TV <- read.table( + "http://perso.univ-rennes1.fr/arthur.charpentier/TV8890.csv",sep=";",header=TRUE)
From those vectors, it is possible to construct the matrix of death probabilities, , using for instance
> Lx <- TD$Lx > m <- length(Lx) > p <- matrix(0,m,m); d <- p > for(i in 1:(m-1)){ + p[1:(m-i),i] <- Lx[1+(i+1):m]/Lx[i+1] + d[1:(m-i),i] <- (Lx[(1+i):(m)]-Lx[(1+i):(m)+1])/Lx[i+1]} > diag(d[(m-1):1,]) <- 0 > diag(p[(m-1):1,]) <- 0 > q <- 1-p
One can compute easily, e.g., the (curtate) expectation of life defined as
and one can compute the vector of life expectancy, at various ages , as
> life.exp = function(x){sum(p[1:nrow(p),x])} > e = Vectorize(life.exp)(1:m)
An actually, any kind of actuarial quantity can be derived from those matrices. The expected present value (or actuarial value) of a temporary life annuity-due is, for instance,
The code to compute those functions is here
> for(j in 1:(m-1)){ adots[,j]<-cumsum(1/(1+i)^(0:(m-1))*c(1,p[1:(m-1),j])) }
or consider the expected present value of a term insurance
with the following code
> for(j in 1:(m-1)){ A[,j]<-cumsum(1/(1+i)^(1:m)*d[,j]) }
Some more details can be found in the first part of the notes of the crash courses of last summer, in Meielisalp. Vector – or matrices – are extremely convenient to work with, when dealing with life contingencies. It is also possible to model prospective mortality. Here, the mortality is not only function of the age , but also time ,
> t(DTF)[1:10,1:10] 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 0 64039 61635 56421 53321 52573 54947 50720 53734 47255 46997 1 12119 11293 10293 10616 10251 10514 9340 10262 10104 9517 2 6983 6091 5853 5734 5673 5494 5028 5232 4477 4094 3 4329 3953 3748 3654 3382 3283 3294 3262 2912 2721 4 3220 3063 2936 2710 2500 2360 2381 2505 2213 2078 5 2284 2149 2172 2020 1932 1770 1788 1782 1789 1751 6 1834 1836 1761 1651 1664 1433 1448 1517 1428 1328 7 1475 1534 1493 1420 1353 1228 1259 1250 1204 1108 8 1353 1358 1255 1229 1251 1169 1132 1134 1083 961 9 1175 1225 1154 1008 1089 981 1027 1025 957 885
Thus, we now have a force of mortality matrix , or surface
It is also possible to use R packages to estimate a Lee-Carter model of the mortality rate,
> library(demography) > MUH =matrix(DEATH$Male/EXPOSURE$Male,nL,nC) > POPH=matrix(EXPOSURE$Male,nL,nC) > BASEH <- demogdata(data=MUH, pop=POPH, ages=AGE, years=YEAR, type="mortality", + label="France", name="Hommes", lambda=1) > RES=residuals(LCH,"pearson")
One can easily study residuals, for instance as a function of the age,
or a function of the year,
Some more details can be found in the second part of the notes of the crash courses of last summer, in Meielisalp.
R is also interesting because of its huge number of libraries, that can be used for predictive modeling. One can easily use smoothing functions in regression, or regression trees,
> TREE = tree((nbr>0)~ageconducteur,data=sinistres,split="gini",mincut = 1) > age = data.frame(ageconducteur=18:90) > y1 = predict(TREE,age) > reg = glm((nbr>0)~bs(ageconducteur),data=sinistres,family="binomial") > y = predict(reg,age,type="response")
Some practitioners might be scared because the legend claims that R is not as good as SAS to handle large databases. Actually, a lot of functions can be used to import datasets. The most convenient one is probably
> baseCOUT = read.table("http://freakonometrics.free.fr/baseCOUT.csv", + sep=";",header=TRUE,encoding="latin1") > tail(baseCOUT,4) numeropol debut_pol fin_pol freq_paiement langue type_prof alimentation type_territoire 6512 87291 2002-10-16 2003-01-22 mensuel A Professeur Vegetarien Urbain 6513 87301 2002-10-01 2003-09-30 mensuel A Technicien Vegetarien Urbain 6514 87417 2002-10-24 2003-10-21 mensuel F Technicien Vegetalien Semi-urbain 6515 88128 2003-01-17 2004-01-16 mensuel F Avocat Vegetarien Semi-urbain utilisation presence_alarme marque_voiture sexe exposition age duree_permis age_vehicule i coutsin 6512 Travail-occasionnel oui FORD M 0.2684932 47 29 28 1 1274.5901 6513 Loisir oui HONDA M 0.9972603 44 24 25 1 278.0745 6514 Travail-occasionnel non VOLKSWAGEN F 0.9917808 23 3 11 1 403.1242 6515 Loisir non FIAT F 0.9972603 23 4 11 1 230.9565
But if the dataset is too large, it is also possible to specify which variables might be interesting, using
> mycols = rep("NULL", 18) > mycols[c(1,4,5,12,13,14,18)] <- NA > baseCOUTsubC = read.table("http://freakonometrics.free.fr/baseCOUT.csv", + colClasses = mycols,sep=";",header=TRUE,encoding="latin1") > head(baseCOUTsubC,4) numeropol freq_paiement langue sexe exposition age coutsin 1 6 annuel A M 0.9945205 42 279.5839 2 27 mensuel F M 0.2438356 51 814.1677 3 27 mensuel F M 1.0000000 53 136.8634 4 76 mensuel F F 1.0000000 42 608.7267
It is also possible (before running a code on the entire dataset) to import only the first lines of the dataset.
> baseCOUTsubCR = read.table("http://freakonometrics.free.fr/baseCOUT.csv", + colClasses = mycols,sep=";",header=TRUE,encoding="latin1",nrows=100) > tail(baseCOUTsubCR,4) numeropol freq_paiement langue sexe exposition age coutsin 97 1193 mensuel F F 0.9972603 55 265.0621 98 1204 mensuel F F 0.9972603 38 9547.7267 99 1231 mensuel F M 1.0000000 40 442.7267 100 1245 annuel F F 0.6767123 48 179.1925
It is also possible to import a zipped file. The file itself has a smaller size, and it can usually be imported faster.
> import.zip = function(file){ + temp = tempfile() + download.file(file,temp); + read.table(unz(temp, "baseFREQ.csv"),sep=";",header=TRUE,encoding="latin1")} > system.time(import.zip("http://freakonometrics.free.fr/baseFREQ.csv.zip")) trying URL 'http://freakonometrics.free.fr/baseFREQ.csv.zip' Content type 'application/zip' length 692655 bytes (676 Kb) opened URL ================================================== downloaded 676 Kb user system elapsed 0.762 0.029 4.578 > system.time(read.table("http://freakonometrics.free.fr/baseFREQ.csv", + sep=";",header=TRUE,encoding="latin1")) user system elapsed 0.591 0.072 9.277
Finally, note that it is possible to import any kind of dataset, not only a text file. Even a Microsoft Excel folder. On a Windows computer, one can use SQL queries
> sheet = "c:\\Documents and Settings\\user\\excelsheet.xls" > connection = odbcConnectExcel(sheet) > spreadsheet = sqlTables(connection) > query = paste("SELECT * FROM",spreadsheet$TABLE_NAME[1],sep=" ") > result = sqlQuery(connection,query)
Then, once the dataset is imported, several functions can be used,
> cost = aggregate(coutsin~ AgeSex,mean, data=baseCOUT) > frequency = merge(aggregate(nbsin~ AgeSex,sum, data=baseFREQ), + aggregate(exposition~ AgeSex,sum, data=baseFREQ)) > frequency$freq = frequency$nbsin/frequency$exposition > base.freq.cost = merge(frequency, cost)
Finally, R is interesting for its graphical interface. “If you can picture it in your head, chances are good that you can make it work in R. R makes it easy to read data, generate lines and points, and place them where you want them. Its very flexible and super quick. When youve only got two or three hours until deadline, R can be brilliant” as said Amanda Cox, a graphics editor at the New York Times. “R is particularly valuable in deadline situations when data is scant and time is precious.”.
Several cases were considered on the blog http ://chartsnthings.tumblr.com/…. First, we start with a simple graph, here State Government control in the US
Then try to find a nice visual representation, e.g.
And finally, you can just print it in your favorite newspaper,
And you can get any kind of graphs,
And not only about politics,
Graphs are important. “Its not just about producing graphics for publication. Its about playing around and making a bunch of graphics that help you explore your data. This kind of graphical analysis is a really useful way to help you understand what you’re dealing with, because if you cant see it, you cant really understand it. But when you start graphing it out, you can really see what you’ve got” as said Peter Aldhous, San Francisco bureau chief of New Scientist magazine. Even for actuaries. “The commercial insurance underwriting process was rigorous but also quite subjective and based on intuition. R enables us to communicate our analytic results in appealing and innovative ways to non-technical audiences through rapid development lifecycles. R helps us show our clients how they can improve their processes and effectiveness by enabling our consultants to conduct analyses efficiently”, as explained by John Lucker, team of advanced analytics professionals at Deloitte Consulting Principal, in http://blog.revolutionanalytics.com/r-is-hot/. See also Andrew Gelman’s view, on graphs, http://www.stat.columbia.edu/…
So yes, actuaries might be interested to use R for actuarial communication, as mentioned in http ://www.londonr.org/…
The Actuarial Toolkit (see http ://www.actuaries.org.uk/…) stresses the interest of R, “The power of the language R lies with its functions for statistical modelling, data analysis and graphics ; its ability to read and write data from various data sources; as well as the opportunity to embed R in excel or other languages like VBA. In the way SAS is good for data manipulations, R is superior for modelling and graphical output“.
From 2011, Asia Capital Reinsurance Group (ACR) uses R to Solve Big Data Challenges (see http ://www.reuters.com/…). And Lloyd’s uses motion charts created with R to provide analysis to investors (as discussed on http ://blog.revolutionanalytics.com/…)
A lot of information can be found on http ://jeffreybreen.wordpress.com/…
Markus Gesmann mentioned on his blog a lot of interesting graphs used for actuarial reporting, http ://lamages.blogspot.ca/…
Further, R is free. Which can be compared with SAS, $6,000 per PC, or $28,000 per processor on a server (as mentioned on http ://en.wikipedia.org/…)
It is also becoming more and more popular, as a programming language. As mentioned on this month Transparent Language Popularity (see http ://lang-index.sourceforge.net/), R is ranked 12. Far away after C or Java, but before Matlab (22) or SAS (27). On StackOverFlow (see http ://stackoverflow.com/) is also far being C++ (399,232 occurrences) or Java (348,418), but with 21,818 occurrences, it appears before Matlab (14,580) and SAS (899). As mentioned on http ://r4stats.com/articles/popularity/ R is becoming more and more popular, on listserv discussion traffic
It is clearly the most popular software in data analysis, as mentioned by the Rexer Analytics survey, in 2009
What about actuaries ? In a survey (see http ://palisade.com/…), R was not extremely popular.
If we consider only statistical softwares, SAS is still far ahead, among UK and CAS actuaries
But, as mentioned by Mike King, Quantitative Analyst, Bank of America, “I cant think of any programming language that has such an incredible community of users. If you have a question, you can get it answered quickly by leaders in the field. That means very little downtime.” This was also mentioned by Glenn Meyers, in the Actuarial Review “The most powerful reason for using R is the community” (in http ://nytimes.com/…). For instance, http ://r-bloggers.com/ has contributions from more than 425 R users.
As said by Bo Cowgill, from Google “The best thing about R is that it was developed by statisticians. The worst thing about R is that it was developed by statisticians.”
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.