Tidbits from the Books that Defined S (and R)
Why R? Because S!
R is the open source implementation (and a pun!) of S, a language for statistical computing that was developed at Bell Labs in the late 1970s. After that, the implementation of S underwent a number of major revisions documented in a series of seminal books, often just referred to by the color of their cover: the Brown Book, the Blue Book, the White Book and the Green Book. To satisfy my techno-historical lusts I recently acquired all these books and I thought I would share some tidbits from them, highlighting how S (and thus R) developed into what we today love and cherish. But first, here are the books in chronological order from left to right:
- S: An Interactive Environment for Data Analysis and Graphics by Richard A. Becker and John M. Chambers (1984), A.K.A. the Brown Book.
- Extending The S System by Richard A. Becker and John M. Chambers (1985).
- The New S Language: A Programming Environment for Data Analysis and Graphics by Richard A. Becker, John M. Chambers and Allan R. Wilks (1988), A.K.A. the Blue Book.
- Statistical Models in S edited by John M. Chambers and Trevor J. Hastie (1992), A.K.A. the White Book.
- Programming with Data: A Guide to the S Language by John M. Chambers (1998), A.K.A. the Green Book.
Most of these are out of print, but all can be bought second hand on, for example, Amazon (which is where I got them and where the links above lead).
S: An Interactive Environment for Data Analysis and Graphics (1984) A.K.A. the Brown Book
by Richard A. Becker and John M. Chambers
This book from 1984 describes not the first version of S, but the second (S2), according to the versioning used here by Chambers. It describes a language that is very similar to modern R (but also very different). We recognize friends like c …
… and plot:
But note that plot was only for scatter plots and was not a generic function producing different types of plots as in modern R. This is because S didn’t yet have objects and classes. S had, however, state-of-the-art graphing capabilities from the start, implementing the plot types described in Graphical Methods for Data Analysis (1983) (also co-written by John M. Chambers and which I’ve written about here). For example, the very useful pairs function was already there:
While many things were similar to modern R, not everything was. For one thing, you could not define your own functions! Instead you would have to rely on macros:
Here ?T in the macro is another macro producing a temporary variable name in order to not clash with any global variable name, crazy!
We also find answers to why some of the peculiarities of modern R exist. Have you ever wondered why many function and parameter names in R are period.separated rather than underscore_separated? Well, because in S2 underscore was an alias for <-!
On to a surprise finding… RStudio is doing great things and for a while it has been possible to make slides using R Markdown in RStudio. Is this great? Sure! Is it new? Nope… :) Slide construction was already easy to do in S anno 1984 using the vu function. This function took a string written in a special markup language…
… and produced slides on the graphic device, such as this:
Unfortunately vu didn’t make it all the way to modern R.
I don’t want to brag, but I’m gonna do it anyway: I recently got my copy of the “Brown Book” signed by John Chambers himself at the useR! 2014 conference! 😀
S: An Interactive Environment for Data Analysis and Graphics on Amazon
Extending the S System (1985)
by Richard A. Becker and John M. Chambers
This book is not part of the color book canon, but I’ll include it for completeness anyway. Published the year after the Brown book, it describes how to implement new functions in S. However, as S only had support for macros, these functions would have to be written in another language (say FORTRAN) and then connected to S using a special interface language:
While not relevant to modern R, this interface language is the “ancestor” of modern day interfaces such as Rcpp and Rcpp11.
Extending The S System on Amazon
The New S Language: A Programming Environment for Data Analysis and Graphics (1988) A.K.A. the Blue Book
by Richard A. Becker, John M. Chambers and Allan R. Wilks
This book introduces S version three (S3) which was a major revision of S2. While S2 was primarily programmed in FORTRAN, S3 was mainly done in C. The interface language was now gone and instead C functions could be directly invoked from S functions. But what’s more, users could now easily define functions themselves!
Functions were also first-class citizens and could be passed around, thus enabling the modern apply-type functions:
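As a rough illustration in modern R (not the book's own listing; `range_width` and the sample data are made up for this sketch), a user-defined function can be handed to `sapply` or `apply` like any other value:

```r
# A user-defined function, stored in an ordinary variable.
range_width <- function(x) max(x) - min(x)

# Hypothetical data: three small numeric vectors in a named list.
samples <- list(a = c(1, 4, 9), b = c(2, 2, 2), c = c(-1, 5))

# The function itself is passed as an argument and applied to each element.
widths <- sapply(samples, range_width)

# The same idea column-wise on a matrix (MARGIN = 2 means "per column").
m <- matrix(1:6, nrow = 2)
col_widths <- apply(m, 2, range_width)

print(widths)
print(col_widths)
```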
Computation on the language was also now possible, for example, by using substitute. Some things were still different from modern day R, take a look at the following statement:
Why lottery.number and lottery.payoff instead of lottery$number and lottery$payoff? Because data.frames didn’t yet exist! (Though it would still have been possible to stick two vectors inside a list.)
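To make the contrast concrete, here is a small sketch that runs in modern R. The values are invented for illustration and are not the lottery data from the book; only the naming pattern is taken from the text above.

```r
# Made-up values standing in for the book's lottery data.
lottery.number <- c(810, 156, 140)
lottery.payoff <- c(190.0, 120.5, 285.5)

# Blue Book style: two free-standing vectors, related only by their names.
best_by_vectors <- lottery.number[which.max(lottery.payoff)]

# The workaround mentioned above: keep both vectors inside one list
# and reach them with the $ operator.
lottery <- list(number = lottery.number, payoff = lottery.payoff)
best_by_list <- lottery$number[which.max(lottery$payoff)]

print(best_by_vectors)
print(best_by_list)
```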
The New S Language: A Programming Environment for Data Analysis and Graphics on Amazon
Statistical Models in S (1992) A.K.A. the White Book
edited by John M. Chambers and Trevor J. Hastie
This book “completes” the specification of S3 with three biggies: (1) data frames, (2) formulas…
… and (3) object orientation:
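A minimal sketch of the three ideas as they look in modern R follows. The data, the `reading` class and its methods are hypothetical names chosen for this example and do not come from the book.

```r
# (1) A data frame: equal-length columns kept together in one object.
trial <- data.frame(dose = c(1, 2, 3, 4), response = c(2.1, 3.9, 6.2, 7.8))

# (2) A formula: an unevaluated description of a model, here passed to lm().
f <- response ~ dose
fit <- lm(f, data = trial)

# (3) S3-style object orientation: a class attribute on an ordinary list,
# plus methods that the generics format() and print() find by name.
reading <- structure(list(value = 21.5, unit = "C"), class = "reading")
format.reading <- function(x, ...) paste(x$value, x$unit)
print.reading <- function(x, ...) cat("<reading>", format(x), "\n")

print(reading)
print(class(fit))
```

Note that `print(fit)` and `print(reading)` call the same generic; the class attribute decides which method runs.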
While the earlier books are more focused on graphics and programming, this book is all about statistical models (the title of the book might be a hint). Here we get introduced to workhorses like glm, gam, nls, tree and, not to forget, lm:
There is, however, no mention of the classical *.test functions such as t.test, binom.test and cor.test (Does anybody know when they appeared in S/R?). The focus is also more on prediction and estimation rather than testing; for example, p-values were not reported as part of summary.lm (which they are in modern R):
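For comparison, a short sketch of where the p-values sit in a modern `summary.lm` result. The data is simulated purely for illustration.

```r
# Simulated data: a clear linear trend plus noise.
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)

fit <- lm(y ~ x)

# The coefficient table of the summary includes a p-value column.
coefs <- coef(summary(fit))
print(colnames(coefs))

# The p-value for the slope is stored under "Pr(>|t|)".
slope_p <- coefs["x", "Pr(>|t|)"]
print(slope_p)
```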
Other things that are new are ?, which can now be used to look up help pages, and a new data type called factor. And already from the start read.table converted all strings to factors by default. 🙂 All in all, this book was interesting to read and is still, I believe, a very good introduction to the formula interface and the lm/glm/gam type functions.
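On the read.table point: in R the string-to-factor conversion is controlled by the `stringsAsFactors` argument, whose default changed to `FALSE` in R 4.0.0. The sketch below sets it explicitly both ways so that it behaves the same on any R version; the tiny inline table is made up.

```r
# A tiny made-up table, supplied inline instead of from a file.
raw <- "name score\nann 1.5\nbob 2.5\n"

# Strings become factors, as described for the White Book era.
as_factors <- read.table(text = raw, header = TRUE, stringsAsFactors = TRUE)

# Strings stay character vectors.
as_strings <- read.table(text = raw, header = TRUE, stringsAsFactors = FALSE)

print(class(as_factors$name))
print(class(as_strings$name))
```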
Statistical Models in S on Amazon
Programming with Data: A Guide to the S Language (1998) A.K.A. the Green Book
by John M. Chambers
This book describes S version four and focuses almost exclusively on programming and not so much on stats and graphics. A big change from S3 was the introduction of a new, more formal, system for object oriented programming:
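In R, this formal system lives on in the methods package. A minimal sketch, with an `Account` class and a `deposit` generic invented for this example rather than taken from the book:

```r
library(methods)

# A formal class definition: slot names and their types are declared up front.
setClass("Account", representation(owner = "character", balance = "numeric"))

# A generic function, and a method registered for the Account class.
setGeneric("deposit", function(account, amount) standardGeneric("deposit"))
setMethod("deposit", "Account", function(account, amount) {
  account@balance <- account@balance + amount
  account
})

acc <- new("Account", owner = "hypothetical owner", balance = 100)
acc <- deposit(acc, 50)
print(acc@balance)

# Unlike S3, slot types are checked when the object is created.
bad <- try(new("Account", owner = "x", balance = "oops"), silent = TRUE)
print(inherits(bad, "try-error"))
```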
Other than that there weren’t any eye-catching differences from S version 3. One small thing to note is that = could now be used for assignment instead of <- and is actually used consistently throughout the book:
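The same holds in modern R, as this small sketch with made-up values shows. At the top level both operators assign, while inside a function call `=` names an argument instead.

```r
# At the top level, = and <- both perform assignment.
a = c(1, 5, 9)
b <- c(1, 5, 9)
same <- identical(a, b)

# Inside a call, = matches the argument named x; it is not an assignment.
avg <- mean(x = c(2, 4))

print(same)
print(avg)
```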
Programming with Data: A Guide to the S Language on Amazon
That was all I had. If you are further interested in the history of S and R I also recommend A brief history of S (Becker, 1994), Stages in the Evolution of S (Chambers, 2000) and R: Past and future history (Ihaka, 1998).
All images and quotes included in this review are copyrighted by their respective copyright holders; however, I believe that the inclusion of these quotes and images in this review constitutes fair use.