
Easily explore your data using the summarytools package

[This article was first published on Anindya Mozumdar, and kindly contributed to R-bloggers.]

Whenever we start working with data with which we are not familiar, our first step is usually some kind of exploratory data analysis. We may look at the structure of the data using the str function, or use a tool like the RStudio Viewer to examine the data. We might also use the base R function summary or the describe function from the Hmisc package to obtain summary statistics on our data.
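As a quick point of comparison, here is a minimal sketch of the base R approach mentioned above. It uses the built-in iris dataset purely as a stand-in (an assumption for illustration; it is not the dataset used in the rest of this article).

```r
# iris ships with base R and is used here only as a stand-in dataset
str(iris)                     # structure: column types and the first few values

iris_summary <- summary(iris) # per-column summary statistics
iris_summary
```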

In this article, we will look at the summarytools package, which provides a few functions to neatly summarize your data. This is especially useful in a Shiny application or an R Markdown report, as you can render the summary outputs directly to HTML and then display them in the application or report. In fact, I discovered this package while playing around with the Shiny application radiant.

The four main functions in this package are dfSummary, freq, descr and ctable. The print.summarytools function is used to print summarytools objects with the appropriate rendering method.

dfSummary is used to summarize an entire data frame; descriptive statistics, along with plots showing the distribution of the data, are produced automatically by the function. Let us apply it to the tobacco dataset, which is included in the package.

print(dfSummary(tobacco), method = "render")

Data Frame Summary
tobacco
Dimensions: 1000 x 9
Duplicates: 2

No  Variable       Stats / Values                      Freqs (% of Valid)   Valid        Missing
--  -------------  ----------------------------------  -------------------  -----------  -----------
1   gender         1. F                                489 (50.0%)          978 (97.8%)  22 (2.2%)
    [factor]       2. M                                489 (50.0%)

2   age            Mean (sd) : 49.6 (18.3)             63 distinct values   975 (97.5%)  25 (2.5%)
    [numeric]      min < med < max: 18 < 50 < 80
                   IQR (CV) : 32 (0.4)

3   age.gr         1. 18-34                            258 (26.5%)          975 (97.5%)  25 (2.5%)
    [factor]       2. 35-50                            241 (24.7%)
                   3. 51-70                            317 (32.5%)
                   4. 71 +                             159 (16.3%)

4   BMI            Mean (sd) : 25.7 (4.5)              974 distinct values  974 (97.4%)  26 (2.6%)
    [numeric]      min < med < max: 8.8 < 25.6 < 39.4
                   IQR (CV) : 5.7 (0.2)

5   smoker         1. Yes                              298 (29.8%)          1000 (100%)  0 (0%)
    [factor]       2. No                               702 (70.2%)

6   cigs.per.day   Mean (sd) : 6.8 (11.9)              37 distinct values   965 (96.5%)  35 (3.5%)
    [numeric]      min < med < max: 0 < 0 < 40
                   IQR (CV) : 11 (1.8)

7   diseased       1. Yes                              224 (22.4%)          1000 (100%)  0 (0%)
    [factor]       2. No                               776 (77.6%)

8   disease        1. Hypertension                     36 (16.2%)           222 (22.2%)  778 (77.8%)
    [character]    2. Cancer                           34 (15.3%)
                   3. Cholesterol                      21 (9.5%)
                   4. Heart                            20 (9.0%)
                   5. Pulmonary                        20 (9.0%)
                   6. Musculoskeletal                  19 (8.6%)
                   7. Diabetes                         14 (6.3%)
                   8. Hearing                          14 (6.3%)
                   9. Digestive                        12 (5.4%)
                   10. Hypotension                     11 (5.0%)
                   [ 3 others ]                        21 (9.5%)

9   samp.wgts      Mean (sd) : 1 (0.1)                 0.86!: 267 (26.7%)   1000 (100%)  0 (0%)
    [numeric]      min < med < max: 0.9 < 1 < 1.1      1.04!: 249 (24.9%)
                   IQR (CV) : 0.2 (0.1)                1.05!: 324 (32.4%)
                                                       1.06!: 160 (16.0%)
                                                       ! rounded

[Graph column: a small histogram or bar chart per variable, not reproduced here]

Generated by summarytools 0.9.3 (R version 3.6.0)
2019-05-12

As can be seen in the output above, the type of variable is shown along with the variable name. This is followed by basic descriptive statistics for numeric variables or the values for categorical variables. A simple histogram or bar chart is plotted showing the distribution of the variable. Overall, it provides you with a concise summary of the variables in the data. The output can also be controlled using a vast number of arguments, but the defaults may be good enough when quickly exploring data.
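To give a flavour of those arguments, here is a small sketch that switches off two of the output columns and raises the number of distinct values listed for categorical variables. The argument names graph.col, valid.col and max.distinct.values are taken from the dfSummary documentation as I understand it; check ?dfSummary for the version you have installed.

```r
library(summarytools)

# Drop the graph and valid-count columns, and list up to 15 distinct
# values per categorical variable instead of the default number.
compact <- dfSummary(
  tobacco,
  graph.col = FALSE,
  valid.col = FALSE,
  max.distinct.values = 15
)

print(compact, method = "render")
```

Storing the result in a variable before printing makes it easy to try different rendering methods on the same summary.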

To obtain more detailed statistics on numeric variables, use the descr function. In this example, we use the default ASCII text output.

descr(tobacco$BMI)
## Descriptive Statistics  
## tobacco$BMI  
## N: 1000  
## 
##                        BMI
## ----------------- --------
##              Mean    25.73
##           Std.Dev     4.49
##               Min     8.83
##                Q1    22.93
##            Median    25.62
##                Q3    28.65
##               Max    39.44
##               MAD     4.18
##               IQR     5.72
##                CV     0.17
##          Skewness     0.02
##       SE.Skewness     0.08
##          Kurtosis     0.26
##           N.Valid   974.00
##         Pct.Valid    97.40

One useful option supported by this function is the weights argument, so if your data has sample weights, they can be used directly. However, not all statistics are computed when weights are supplied: in the output below, the quartiles, IQR, skewness and kurtosis are omitted. Refer to the documentation for more details.

descr(tobacco$BMI, weights = tobacco$samp.wgts)
## Weighted Descriptive Statistics  
## tobacco$BMI  
## Weights: samp.wgts  
## N: 1000  
## 
##                      BMI
## --------------- --------
##            Mean    25.83
##         Std.Dev     4.48
##             Min     8.83
##          Median    25.68
##             Max    39.44
##             MAD     4.13
##              CV     0.17
##         N.Valid   973.85
##       Pct.Valid    97.38
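descr can also be applied to a whole data frame and limited to a handful of statistics. The sketch below assumes the stats and transpose arguments behave as described in the package documentation (they are not covered elsewhere in this article); non-numeric columns should simply be ignored.

```r
library(summarytools)

# Summarize every numeric column of tobacco, keeping only five statistics.
# transpose = TRUE puts one variable per row, which reads better when
# there are several variables.
num_stats <- descr(
  tobacco,
  stats = c("mean", "sd", "min", "med", "max"),
  transpose = TRUE
)

num_stats
```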

While descr is most useful for numeric variables, freq is most useful for categorical variables. It also supports the weights argument, and reports the frequency of each distinct value along with percentages and cumulative percentages, computed both over valid values and over all values.

freq(tobacco$disease, weights = tobacco$samp.wgts)
## Weighted Frequencies  
## tobacco$disease  
## Type: Character  
## Weights: samp.wgts  
## 
##                            Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## --------------------- --------- --------- -------------- --------- --------------
##                Cancer     33.66     15.03          15.03      3.37           3.37
##           Cholesterol     21.11      9.42          24.45      2.11           5.48
##              Diabetes     14.36      6.41          30.87      1.44           6.91
##             Digestive     11.49      5.13          36.00      1.15           8.06
##               Hearing     14.15      6.32          42.31      1.41           9.48
##                 Heart     19.89      8.88          51.20      1.99          11.46
##          Hypertension     36.52     16.31          67.51      3.65          15.12
##           Hypotension     11.37      5.08          72.58      1.14          16.25
##       Musculoskeletal     19.59      8.75          81.33      1.96          18.21
##          Neurological     10.14      4.53          85.86      1.01          19.23
##                 Other      2.09      0.93          86.79      0.21          19.44
##             Pulmonary     20.29      9.06          95.85      2.03          21.46
##                Vision      9.29      4.15         100.00      0.93          22.39
##                  <NA>    776.07                              77.61         100.00
##                 Total   1000.00    100.00         100.00    100.00         100.00
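The table above is sorted alphabetically and includes the missing values. As a hedged sketch, the order and report.nas arguments (named as in the freq documentation, to the best of my knowledge) can be used to sort by frequency and leave the NA row out:

```r
library(summarytools)

# Sort diseases by how often they occur and omit the <NA> row,
# so the percentages refer to valid values only.
disease_freq <- freq(
  tobacco$disease,
  order = "freq",
  report.nas = FALSE
)

disease_freq
```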

Finally, ctable is used to cross-tabulate frequencies for a pair of categorical variables.

print(
  ctable(tobacco$disease, tobacco$gender, weights = tobacco$samp.wgts),
  method = "render"
)

Cross-Tabulation, Row Proportions
disease * gender
Data Frame: tobacco

                   gender
disease            F               M               <NA>          Total
-----------------  --------------  --------------  ------------  ----------------
Cancer             15.9 (47.2%)    17.8 (52.8%)    0.0 (0.0%)    33.7 (100.0%)
Cholesterol        10.3 (48.8%)    10.8 (51.2%)    0.0 (0.0%)    21.1 (100.0%)
Diabetes           8.2 (57.3%)     5.3 (36.7%)     0.9 (6.0%)    14.4 (100.0%)
Digestive          4.7 (40.9%)     6.8 (59.1%)     0.0 (0.0%)    11.5 (100.0%)
Hearing            5.1 (35.8%)     9.1 (64.2%)     0.0 (0.0%)    14.1 (100.0%)
Heart              8.5 (42.8%)     11.4 (57.2%)    0.0 (0.0%)    19.9 (100.0%)
Hypertension       18.2 (49.7%)    17.3 (47.4%)    1.1 (2.9%)    36.5 (100.0%)
Hypotension        7.3 (64.6%)     4.0 (35.4%)     0.0 (0.0%)    11.4 (100.0%)
Musculoskeletal    8.2 (42.0%)     10.3 (52.6%)    1.1 (5.4%)    19.6 (100.0%)
Neurological       7.2 (70.7%)     3.0 (29.3%)     0.0 (0.0%)    10.1 (100.0%)
Other              1.0 (50.1%)     1.0 (49.9%)     0.0 (0.0%)    2.1 (100.0%)
Pulmonary          9.3 (45.8%)     11.0 (54.2%)    0.0 (0.0%)    20.3 (100.0%)
Vision             6.1 (66.0%)     3.2 (34.0%)     0.0 (0.0%)    9.3 (100.0%)
<NA>               378.9 (48.8%)   378.0 (48.7%)   19.2 (2.5%)   776.1 (100.0%)
Total              488.9 (48.9%)   488.9 (48.9%)   22.2 (2.2%)   1000.0 (100.0%)

Generated by summarytools 0.9.3 (R version 3.6.0)
2019-05-12
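The cross-tabulation above shows row proportions. If column proportions are more relevant, the prop argument can be changed. The following is a sketch under the assumption that prop = "c" selects column proportions, as stated in the ctable documentation; the choice of smoker and diseased as the two variables is mine, for illustration only.

```r
library(summarytools)

# Cross-tabulate smoking status against disease status, showing
# column proportions instead of the default row proportions.
smoker_tab <- ctable(tobacco$smoker, tobacco$diseased, prop = "c")

smoker_tab
```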

It is often desirable to obtain the output of the descr or freq functions as a data frame. For this, the package provides the tb function, which converts the output of these functions into a tibble. In case you are not familiar with it, the tibble is the tidyverse's version of the data frame and is usually nicer to work with than a base R data frame.
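A minimal sketch of tb in use is shown below. The variable names bmi_tbl and disease_tbl are hypothetical, and the exact column names of the resulting tibbles may differ between package versions.

```r
library(summarytools)

# Convert descr and freq results into tibbles for further processing
bmi_tbl     <- tb(descr(tobacco$BMI))
disease_tbl <- tb(freq(tobacco$disease))

bmi_tbl
disease_tbl
```

Once the results are tibbles, they can be filtered, joined or plotted like any other data frame.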

To conclude, whenever you start working with new data, consider generating an HTML report using dfSummary to quickly understand the variables in the data and their distributions. Use the other functions if you need to understand the individual variables further.
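One way to produce such a report, sketched here under the assumption that the file argument of print.summarytools writes HTML when the file name ends in .html, is shown below. The file name and the use of a temporary directory are illustrative choices only.

```r
library(summarytools)

# Write the data frame summary to a standalone HTML file.
# tempdir() is used so the example does not clutter the working directory.
report_file <- file.path(tempdir(), "tobacco_summary.html")
print(dfSummary(tobacco), file = report_file)

report_file  # path of the generated report
```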
