Computational Statistics

Posted on May 9, 2010 by xi'an in R bloggers | 0 Comments

[This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Do not resort to Monte Carlo methods unnecessarily.

When I received this 2009 Springer-Verlag book, Computational Statistics, by James Gentle a while ago, I briefly took a look at the table of contents and decided to have a better look later… Now that I have gone through the whole book, I can write a short review on its scope and contents (to be submitted). Despite its title, the book aims at covering both computational statistics and statistical computing. (With 752 pages at his disposal, Gentle can afford to do both indeed!)

The book Computational Statistics is separated into four parts:

Part I: Mathematical and statistical preliminaries.
Part II: Statistical Computing (Computer storage and arithmetic.- Algorithms and programming.- Approximation of functions and numerical quadrature.- Numerical linear algebra.- Solution of nonlinear equations and optimization.- Generation of random numbers.)
Part III: Methods of Computational Statistics (Graphical methods in computational statistics.- Tools for identification of structure in data.- Estimation of functions.- Monte Carlo methods for statistical inference.- Data randomization, partitioning, and augmentation.- Bootstrap methods.)
Part IV: Exploring Data Density and Relationship (Estimation of probability density functions using parametric models.- Nonparametric estimation of probability density functions.- Statistical learning and data mining.- Statistical models of dependencies.)

Computational inference, together with exact inference and asymptotic inference, is an important component of statistical methods.

The first part of Computational Statistics is indeed a preliminary containing essentials of math and probability and statistics. A reader unfamiliar with too many topics within this chapter should first consider improving his or her background in the corresponding area! This is a rather large chapter, with 82 pages, and it should not be extremely useful to readers, except to signal deficiencies in their background, as noted above. Given this purpose, I am not certain the selected exercises of this chapter are necessary (especially when considering that some involve tools introduced much later in the book).

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different .

The second part of Computational Statistics is truly about computing, meaning the theory of computation, i.e. of using computers for numerical approximation, with discussions about the representation of numbers in computers, approximation errors, and of course random number generators. While I judge George Fishman’s Monte Carlo to provide a deeper and more complete coverage of those topics, I appreciate the need for reminding our students of those hardware subtleties as they often seem unaware of them, despite their advanced computer skills. This second part is thus a crash course of 250 pages on numerical methods (like function approximations by basis functions and …) and on random generators, i.e. cover the same ground as Gentle’s earlier books, Random Number Generation and Monte Carlo Methods and Numerical Linear Algebra for Applications in Statistics, while the more recent Elements of Computational Statistics looks very much like a shorter entry on the same topics as those of Parts III IV of Computational Statistics. This part could certainly sustain a whole semester undergraduate course while only advanced graduate students could be expected to gain from a self-study of those topics. It is nonetheless the most coherent and attractive part of the book. It constitutes a must-read for all students and researchers engaging into any kind of serious programming. Obviously, some notions are introduced a bit too superficially, given the scope of this section (as for instance Monte Carlo methods, in particular MCMC techniques that are introduced in less than six pages), but I came to realise this is the point of the book, which provides an entry into “all” necessary topics, along with links to the relevant literature (if missing Monte Carlo Statistical Methods!). I however deplore that the important issue of Monte Carlo experiments, whose construction is often a hardship for students, is postponed till the 100 page long appendix. (I suspect that students do not read appendices is another of those folk theorems!)

Monte Carlo methods differ from other methods of numerical analysis in yielding an estimate rather than an approximation.

The third and fourth parts of the book cover methods of computational statistics, including Monte Carlo methods, randomization and cross validation, the bootstrap, probability density estimation, and statistical learning. Unfortunately, I find the level of Part III to be quite uneven, where all chapters are short and rather superficial because they try to be all-encompassing. (For instance, Chapter 8 includes two pages on the RGB colour coding.) Part IV does a better job of presenting machine learning techniques, if not with the thoroughness of Hastie et al.’s The Elements of Statistical Learning: Data Mining, Inference, and Prediction… It seems to me that the relevant sections of Part III would have fitted better where they belong in Part IV. For instance, Chapter 10 on estimation of functions only covers the evaluation of estimators of functions, postponing the construction of those estimators till Chapter 15. Jackknife is introduced on its own in Chapter 12 (not that I find this introduction ultimately necessary) without the bootstrap covered in eight pages in Chapter 13 (bootstrap for non-iid data is dismissed rather quickly, given the current research in the area). The first chapter of Part IV covers some (non-Bayesian) estimation approaches for parametric families, but I find this chapter somehow superfluous as it does not belong to the computational statistic domain (except as an approximation method, as stressed in Section 14.4). While Chapter 16 is a valuable entry on clustering and data-analysis tools like PCA, the final section on high dimensions feels out of context (and mentioning the curse of dimensionality only that close to the end of the book does not seem appropriate). Chapter 17 on dependent data is missing the rich literature on graphical models and their use in the determination of dependence structures.

Programming is the best way to learn programming (read that again) .

In conclusion, Computational Statistics is a very diverse book that can be used at several levels as textbook, as well as a reference for researchers (even if as an entry towards further and deeper references). The book is well-written, in a lively and personal style. (I however object to the reduction of the notion of Markov chains to discrete state-spaces!) There is no requirement for a specific programming language, although R is introduced in a somewhat dismissive way (R most serious flaw is usually lack of robustness since some [packages] are not of high-quality) and some exercises start with Design and write either a C or a Fortran subroutine. BUGS is not mentioned at all. The appendices of Computational Statistics also contain the solutions to some exercises, even though the level of detail is highly variable, from one word (“1″) to one page (see, e.g., Exercise 11.4). The 20 page list of references is preceded by a few pages on available journals and webpages, which could get obsolete rather quickly. Despite the reservations raised above about some parts of Computational Statistics that would benefit from a deeper coverage, I think this book is a reference book that should appear in the shortlist of any computational statistics/statistical computing graduate course as well as on the shelves of any researcher supporting his or her statistical practice with a significant dose of computing backup.