Example 9.9: Simplifying R using the mosaic package (part 1)

Nick Horton

10 years ago

[This article was first published on SAS and R, and kindly contributed to R-bloggers].

While both SAS and R are powerful systems for statistical analysis, they can be frustrating to new users or those learning statistics for the first time.

R
The mosaic package is designed to help simplify the interface for such new users, while allowing them to undertake sophisticated analyses.

As an example of how the package simplifies life for the novice user, consider calculating summary statistics and displaying a density plot for the CESD (a measure of depressive symptoms) scores by substance abuse group in the HELP dataset. Doing this in R without the package would require either mastering a package such as plyr to repeat the calculation for each substance, or typing-intensive syntax to select the rows corresponding to each substance.
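To make the comparison concrete, here is a minimal sketch of what the approach without the package might look like in base R. The data frame toy and its values are made up for illustration and merely stand in for the HELP dataset; only the column names cesd and substance are taken from the example.

```r
# Hypothetical toy data standing in for the HELP dataset (values are made up)
toy <- data.frame(
  substance = c("alcohol", "alcohol", "cocaine", "cocaine", "heroin", "heroin"),
  cesd      = c(30, 40, 20, 30, 35, 45)
)

# Typing-intensive approach: select the rows for each group by hand
mean(toy$cesd[toy$substance == "alcohol"])
mean(toy$cesd[toy$substance == "cocaine"])
mean(toy$cesd[toy$substance == "heroin"])

# More compact base R alternatives, which still require learning
# tapply() or aggregate() in addition to mean()
group_means <- tapply(toy$cesd, toy$substance, mean)
group_means
aggregate(cesd ~ substance, data = toy, FUN = mean)
```

Each of these works, but a novice has to learn either logical row selection or a separate "apply" style function before computing a single group mean.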

ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
library(mosaic)
options(digits=3)

After loading the data and the package, and setting the number of digits to a more reasonable default, we can call the mean() function to easily calculate this statistic (denoted by S in the result) for each of the three substance abuse groups alcohol, cocaine or heroin.

> mean(cesd ~ substance, data=ds)
  substance    S   N Missing
1   alcohol 34.4 177       0
2   cocaine 29.4 152       0
3    heroin 34.9 124       0

Similar results are seen when we calculate the standard deviations per group:

> sd(cesd ~ substance, data=ds)
  substance    S   N Missing
1   alcohol 12.1 177       0
2   cocaine 13.4 152       0
3    heroin 11.2 124       0

The summary() function, called with fun=favstats, calculates a raft of summary statistics for each group and formats them nicely.

> summary(cesd ~ substance, data=ds, fun=favstats)
cesd    N=453
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
|         |       |N  |min |Q1 |median |Q3 |max |mean |sd  |n  |missing |
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
|substance|alcohol|177|4   |26 |36     |42 |58  |34.4 |12.1|177|0       |
|         |cocaine|152|1   |19 |30     |39 |60  |29.4 |13.4|152|0       |
|         |heroin |124|4   |28 |35     |43 |56  |34.9 |11.2|124|0       |
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
|Overall  |       |453|1   |25 |34     |41 |60  |32.8 |12.5|453|0       |
+---------+-------+---+----+---+-------+---+----+-----+----+---+--------+
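For readers curious about what goes into such a table, the following is a rough base R sketch that computes the same kinds of statistics per group. The helper five_plus() is our own illustrative name, not a function from mosaic, and the toy data are made up; the exact quartile values favstats reports may differ depending on the quantile method used.

```r
# Hypothetical toy data (made-up values), including one missing score
toy <- data.frame(
  substance = c("alcohol", "alcohol", "alcohol", "cocaine", "cocaine", "cocaine"),
  cesd      = c(10, 20, 30, 5, 15, NA)
)

# five_plus(): illustrative helper, not part of mosaic
five_plus <- function(x) {
  ok <- x[!is.na(x)]                                   # drop missing values
  q  <- quantile(ok, c(0.25, 0.5, 0.75), names = FALSE)
  c(min = min(ok), Q1 = q[1], median = q[2], Q3 = q[3], max = max(ok),
    mean = mean(ok), sd = sd(ok), n = length(ok), missing = sum(is.na(x)))
}

# One row of statistics per substance group
by_group <- t(sapply(split(toy$cesd, toy$substance), five_plus))
by_group
```

The result is a plain numeric matrix with one row per group, without the formatted borders of the table above.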

These commands allow quick review of the data to ensure, for example, that assumptions of equal variance are justified, or that coding errors or missing values haven’t crept in.
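As one possible way to follow up on the equal variance question, the sketch below compares the largest and smallest group standard deviations and then runs Bartlett's test from base R. The toy data are made up, and the choice of Bartlett's test is our assumption for illustration rather than something the summaries above require. In the results above the corresponding ratio is 13.4/11.2, or about 1.2.

```r
# Hypothetical toy data (made-up values), four scores per group
toy <- data.frame(
  substance = rep(c("alcohol", "cocaine", "heroin"), each = 4),
  cesd      = c(20, 30, 40, 50,  10, 25, 35, 50,  25, 30, 40, 45)
)

# Informal check: ratio of the largest to the smallest group standard deviation
group_sd <- tapply(toy$cesd, toy$substance, sd)
sd_ratio <- max(group_sd) / min(group_sd)
sd_ratio

# Formal check: Bartlett's test of homogeneity of variances (stats package)
bt <- bartlett.test(cesd ~ substance, data = toy)
bt$p.value
```

A ratio close to 1 and a large p-value are both consistent with equal variances, though with very small groups neither check is very informative.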

A graphical depiction using a set of densityplots (shown above) can be created using the command:

densityplot(~ cesd, groups=substance, data=ds, auto.key=TRUE)
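For comparison, a similar overlay can be built by hand in base R graphics. The following is only a sketch using made-up toy data; it estimates one kernel density per group with density() and draws them on shared axes, which is roughly the bookkeeping that the one-line call takes care of.

```r
# Hypothetical toy data (made-up values), five scores per group
toy <- data.frame(
  substance = rep(c("alcohol", "cocaine", "heroin"), each = 5),
  cesd      = c(22, 30, 36, 41, 50,  12, 20, 29, 38, 47,  25, 31, 35, 42, 49)
)

# One kernel density estimate per substance group
dens <- lapply(split(toy$cesd, toy$substance), density)

# Axis limits wide enough to hold every curve
xlim <- range(sapply(dens, function(d) range(d$x)))
ylim <- c(0, max(sapply(dens, function(d) max(d$y))))

# Empty plot, then one line per group, then a legend
plot(xlim, ylim, type = "n", xlab = "cesd", ylab = "Density")
for (i in seq_along(dens)) lines(dens[[i]], col = i)
legend("topright", legend = names(dens), col = seq_along(dens), lty = 1)
```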

SAS
We’re unaware of any similar program that attempts to simplify SAS syntax for educational use. To replicate the above results, we would use the means and sgpanel procedures.

data ds;
set "C:\book\help.sas7bdat";
run;

options ls=80;
proc means data=ds fw=4
  min q1 median q3 max mean std nmiss n;
  class substance;
  var cesd;
run;
                     Analysis Variable : CESD

              N           Lower             Upper               Std
 SUBSTANCE  Obs   Min  Quartile  Median  Quartile   Max  Mean   Dev
 ------------------------------------------------------------------
 alcohol    177  4.00      26.0    36.0      42.0  58.0  34.4  12.1
 cocaine    152  1.00      19.0    30.0      39.0  60.0  29.4  13.4
 heroin     124  4.00      28.0    35.0      43.0  56.0  34.9  11.2
 ------------------------------------------------------------------

                                 N     N
                    SUBSTANCE  Obs  Miss      N
                    ---------------------------
                    alcohol    177     0    177
                    cocaine    152     0    152
                    heroin     124     0    124
                    ---------------------------

After reading the data in, the means procedure can produce any of the desired statistics (plus many others) directly. To replicate the mosaic package in printing a single statistic, list only that statistic in the proc means statement. Note that the overall row of the R table is not included in the SAS output. To replicate that row, you would re-run the above code, omitting the class statement.

To the best of our knowledge, there still does not exist an easy way to plot multiple densities in a single SAS plot. In example 2.6.4 we show how it can be done using proc kde, saving the density estimates, and plotting separately. (Code for this is included at the book web site.) But in the interest of simple code, we show a simpler method using proc sgpanel. The result, shown below, is less useful than the R plot from the mosaic package, but still gets the point across.

proc sgpanel data=ds;
  panelby substance / columns=1;
  density cesd / type=kernel;
run;
