Book Review – Modern Applied Statistics with S by W. N. Venables and B. D. Ripley (Springer 2003)

[This article was first published on Software for Exploratory Data Analysis and Statistical Modelling, and kindly contributed to R-bloggers].

Modern Applied Statistics with S (Fourth Edition) is one of the oldest and most popular books on applied statistics using R and S-plus. It covers a large number of topics in applied statistics and is certainly not for the faint-hearted. A sound knowledge of the statistical methods covered in each chapter is important, and the book includes many examples of using a wide range of techniques.

The book opens with an overview of the S programming language and an introductory analysis session that gets the reader using the system and gives an idea of how an analysis is undertaken and which methods are available to the analyst.

The second chapter introduces objects, which are an important part of the S programming language. It covers a number of common data manipulation tasks as well as the data frame, which is probably the most useful object for the user. A brief treatment of data import and export follows; it could benefit from more examples rather than focussing on describing the function arguments. There is a nice little section on working with subsets of data frames, which is an important topic for analysis. The chapter ends well with creating tables and cross-tabulating data.

The third chapter is an overview of the S programming language and provides some good examples and sensible advice about using the vectorised calculation features of the language rather than loops. This is a neat feature of the S language that is very important for users to get to grips with, as it will simplify and speed up code. There will of course be situations where loops are required, but overall it is best to avoid them where possible. There is only a small section on classes and methods, probably because the authors have a separate book that describes S programming.

Chapter four covers the base graphics system as well as Trellis graphics. This is a very good chapter that provides a large range of examples of common types of displays and highlights investigating multivariate data using the Trellis graphics paradigm. The many code examples can be adapted by the reader to create displays that are suitable for a specific application. There could have been more information about editing various components of the graphs but that would probably have been outside the scope of the book.

The fifth chapter covers a range of univariate statistical methods, starting with probability distributions and generating pseudo-random numbers from the range of distributions available in R. There follows a good section on histograms and the choice of bin width, which matters for getting a good idea of the shape of a set of data. Classical statistical tests are given very short coverage compared to other books, but there is enough for people to get started with other tests. There is a good little section on robust statistical methods, a topic that is infrequently covered in other texts. Density estimation is also covered, and the chapter ends with examples of using the bootstrap for statistical inference.
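
As an illustration of the bootstrap idea mentioned above, here is a minimal sketch of a percentile bootstrap interval for a median. It is written in Python rather than S and is not taken from the book; the function name bootstrap_ci, its parameters and the sample values are all made up for the example.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.median, n_resamples=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for `stat` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    # Resample the data with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Read the interval end points off the sorted bootstrap estimates.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up sample with one unusually large value.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 12.5]
low, high = bootstrap_ci(sample)
print(f"median = {statistics.median(sample):.2f}, interval = ({low:.2f}, {high:.2f})")
```

The percentile method is only one of several bootstrap intervals and is used here because it is the simplest to write down.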

Linear models are covered in chapter six, starting with a reasonably simple example that goes through fitting models, checking the goodness of fit with residual diagnostics and making predictions from a linear model. There is a good little section on robust regression showing how easy it is to move between models of different types. This is followed by an illustration of applying the bootstrap to parameter estimation in a linear model. Fitting analysis of variance models is covered with an example from a designed experiment, which then leads on to variable selection. The chapter ends with a short discussion of multiple comparison tests and post-hoc testing.
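
The fit, check and predict workflow described above can be sketched for the simplest case, a straight line fitted by least squares. This is a Python illustration under my own assumptions, not code from the book: fit_line is a hypothetical helper and the data are invented.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b * x (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented data that lie close to a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)

# Residuals are the raw material for the diagnostic checks.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Prediction at a new x value.
prediction = intercept + slope * 6.0

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print("residuals:", [round(r, 3) for r in residuals])
print(f"prediction at x = 6: {prediction:.3f}")
```

With an intercept in the model the least-squares residuals sum to zero, so any structure left in them (curvature, changing spread) is what the diagnostics are looking for.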

Generalized linear models (GLMs) are covered briefly in chapter seven, starting with a couple of examples of logistic regression for binary data and then Poisson regression for count data. The chapter provides some other useful examples but is rather short given the range of other distributions that suit different kinds of data.

Chapter eight investigates non-linear models, which often arise from theoretical considerations, with details of how to fit them to data and how to analyse the model output to judge the suitability of the model. The issues that arise because non-linear models need an iterative method to converge to a solution are discussed in detail, along with methods for checking the assumptions. The second half of the chapter covers a wide range of extensions and alternatives to multiple linear regression: smoothers, additive models, MARS, projection pursuit regression and neural networks. This provides a taste rather than comprehensive coverage of these topics, which is a general theme of the book.

Tree-based methods are introduced and discussed in chapter nine. The technical details are covered in more depth than for some of the other methods, which may go beyond the interest of some readers. The authors do, however, then move swiftly on to practical examples showing how to fit tree models to data, with suggestions on how to simplify tree models to a manageable size.

The tenth chapter of the book is dedicated to mixed effects models, a framework that allows a standard linear or non-linear model to include a mixture of fixed and random effects. The focus is mainly on the nlme package, which is in frequent use by analysts working in R or S-plus. The authors make a good effort at explaining the use and interpretation of these models, which makes the chapter a good starting point for the Pinheiro and Bates book on mixed effects models that covers the topic in greater detail. Generalized linear mixed effects models are covered briefly at the end of the chapter.

The broad area of exploratory multivariate analysis is addressed in chapter eleven, starting with some projection techniques, from the popular principal component analysis to projection pursuit and multidimensional scaling. Next up are partitioning methods, including cluster analysis, which is used to search for structure in a data set where there is no prior information about grouping. The chapter is rounded off with a discussion of techniques suitable for discrete data, often in the form of a contingency table, such as mosaic plots for investigating association between variables measured in a study.

The topic of classification using statistical methods is covered in chapter twelve of the book. The chapter starts with discriminant analysis, which is one of the earliest techniques used for classification, and touches on robust estimation of means and variances (location and scale), which is used to reduce the impact of unusual data points. There is then coverage of other methods used in classification such as K-means, neural networks and support vector machines. The performance of the competing methods is compared using an example on forensic glass at the end of the chapter.

Chapter thirteen is devoted to survival analysis and covers different approaches to handling survivor curves, such as the Cox proportional hazards model, with a couple of extensive examples to illustrate the different models. As with most chapters in the book, knowledge of the statistical methods discussed in the text is assumed.

Time series analysis is the topic of chapter fourteen, starting with the important topics of autocorrelation functions and partial autocorrelation functions. The frequently used ARIMA models are discussed next, with a short discussion of model selection for these models and forecasting future values. Seasonality is also covered, as would be expected. The chapter ends with short sections on regression with autocorrelated errors and analysis of financial time series.
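
For readers unfamiliar with the autocorrelation function, here is a minimal sketch of the usual sample estimate: the lag-k autocovariance divided by the lag-0 autocovariance. It is written in Python rather than S and is not taken from the book; the function name acf and the toy series are chosen only for illustration.

```python
def acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag (hypothetical helper)."""
    n = len(series)
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    # Lag-0 sum of squares; the 1/n factors cancel in the ratio below.
    c0 = sum(d * d for d in deviations)
    if c0 == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(deviations[t] * deviations[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


# Toy series with a steady upward trend.
values = acf([1, 2, 3, 4, 5], max_lag=2)
print([round(v, 3) for v in values])
```

The lag-0 value is always 1 by construction, so in practice it is the pattern at the higher lags that guides the choice of model.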

Chapter fifteen is a short chapter on spatial statistics with three areas covered: spatial interpolation and smoothing, kriging and point process analysis. There are worked examples showing how to undertake the analysis using S, and the chapter is more of a useful reference for people who understand the theory but need instructions on how to apply the methods.

The final chapter is on functions for performing general optimisation tasks and is slightly out of place compared to many of the other chapters. It is a useful topic but it is not clear how easy the reader will find it to make use of the methods based on the coverage in this chapter.

Overall comment: this is a very good and highly recommended book, but probably not ideal for the beginner, as it covers a very wide range of applied statistics methods and is probably best used as a reference to be dipped into as and when necessary. It does, however, provide a nice overview of the range of statistical applications and modern methods.
