Easily explore your data using the summarytools package
Whenever we start working with data with which we are not familiar, our first step is usually some kind of exploratory data analysis. We may look at the structure of the data using the str function, or use a tool like the RStudio Viewer to examine the data. We might also use the base R function summary or the describe function from the Hmisc package to obtain summary statistics on our data.
In this article, we will look at the summarytools package, which provides a few functions to neatly summarize your data. This is especially useful in a Shiny application or an R Markdown report, as you can render the summary outputs directly to HTML and then display them in the application or report. In fact, I discovered this package while playing around with the Shiny application radiant.
The four main functions in this package are dfSummary, freq, descr and ctable. The print.summarytools function is used to print summarytools objects with the appropriate rendering method.
dfSummary is used to summarize an entire data frame; descriptive statistics, along with plots showing the distribution of the data, are produced automatically by the function. Let us apply it to the tobacco dataset, which is included as part of this package.
library(summarytools)
print(dfSummary(tobacco), method = "render")
Data Frame Summary
tobacco
Dimensions: 1000 x 9
Duplicates: 2

| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|
| 1 | gender [factor] | 1. F; 2. M | | | 978 (97.8%) | 22 (2.2%) |
| 2 | age [numeric] | Mean (sd): 49.6 (18.3); min < med < max: 18 < 50 < 80; IQR (CV): 32 (0.4) | 63 distinct values | | 975 (97.5%) | 25 (2.5%) |
| 3 | age.gr [factor] | 1. 18-34; 2. 35-50; 3. 51-70; 4. 71 + | | | 975 (97.5%) | 25 (2.5%) |
| 4 | BMI [numeric] | Mean (sd): 25.7 (4.5); min < med < max: 8.8 < 25.6 < 39.4; IQR (CV): 5.7 (0.2) | 974 distinct values | | 974 (97.4%) | 26 (2.6%) |
| 5 | smoker [factor] | 1. Yes; 2. No | | | 1000 (100%) | 0 (0%) |
| 6 | cigs.per.day [numeric] | Mean (sd): 6.8 (11.9); min < med < max: 0 < 0 < 40; IQR (CV): 11 (1.8) | 37 distinct values | | 965 (96.5%) | 35 (3.5%) |
| 7 | diseased [factor] | 1. Yes; 2. No | | | 1000 (100%) | 0 (0%) |
| 8 | disease [character] | 1. Hypertension; 2. Cancer; 3. Cholesterol; 4. Heart; 5. Pulmonary; 6. Musculoskeletal; 7. Diabetes; 8. Hearing; 9. Digestive; 10. Hypotension; [ 3 others ] | | | 222 (22.2%) | 778 (77.8%) |
| 9 | samp.wgts [numeric] | Mean (sd): 1 (0.1); min < med < max: 0.9 < 1 < 1.1; IQR (CV): 0.2 (0.1) | | | 1000 (100%) | 0 (0%) |

Generated by summarytools 0.9.3 (R version 3.6.0)
2019-05-12
As can be seen in the output above, each variable's type is shown along with its name. This is followed by basic descriptive statistics for numeric variables or the values for categorical variables. A simple histogram or bar chart is plotted showing the distribution of the variable. Overall, it provides you with a concise summary of the variables in the data. The output can also be controlled using many arguments, but the defaults may be good enough when quickly exploring data.
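As a rough sketch of how the output can be tuned, the call below hides the variable-number and valid-count columns and shrinks the graphs. The argument names used here (varnumbers, valid.col, graph.magnif) are what I understand the 0.9.x series to offer; treat them as assumptions and check ?dfSummary for your installed version.

```r
library(summarytools)

# tobacco ships with summarytools, as noted above
dfs <- dfSummary(
  tobacco,
  varnumbers   = FALSE,  # assumed argument: drop the "No" column
  valid.col    = FALSE,  # assumed argument: drop the "Valid" column
  graph.magnif = 0.75    # assumed argument: scale the graphs down
)

# In an R Markdown document or a Shiny app, render the summary as HTML
print(dfs, method = "render")
```

Storing the summary in an object first means the same result can be printed more than once, for example as HTML in a report and as plain text in the console.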
To obtain more detailed statistics on numeric variables, use the descr function. In this example, we use the default ASCII text output.
descr(tobacco$BMI)
## Descriptive Statistics
## tobacco$BMI
## N: 1000
##
##                        BMI
## ----------------- --------
##              Mean    25.73
##           Std.Dev     4.49
##               Min     8.83
##                Q1    22.93
##            Median    25.62
##                Q3    28.65
##               Max    39.44
##               MAD     4.18
##               IQR     5.72
##                CV     0.17
##          Skewness     0.02
##       SE.Skewness     0.08
##          Kurtosis     0.26
##           N.Valid   974.00
##         Pct.Valid    97.40
One useful option supported by this function is the weights argument: if your data has sample weights, you can pass them in directly. However, not all statistics are adjusted using the weights. In the weighted output below, for instance, the quartiles, IQR, skewness and kurtosis are not reported; refer to the documentation for more details.
descr(tobacco$BMI, weights = tobacco$samp.wgts)
## Weighted Descriptive Statistics
## tobacco$BMI
## Weights: samp.wgts
## N: 1000
##
##                      BMI
## --------------- --------
##            Mean    25.83
##         Std.Dev     4.48
##             Min     8.83
##          Median    25.68
##             Max    39.44
##             MAD     4.13
##              CV     0.17
##         N.Valid   973.85
##       Pct.Valid    97.38
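descr is not limited to a single vector. As a minimal sketch, assuming the stats and transpose arguments behave as described in the package documentation, the call below summarizes every numeric column of the data frame at once and keeps only a handful of statistics. Non-numeric columns are expected to be skipped.

```r
library(summarytools)

# Summarize all numeric columns of tobacco in one call
d <- descr(
  tobacco,
  stats     = c("mean", "sd", "min", "med", "max"),  # assumed statistic names
  transpose = TRUE                                    # one row per variable
)
d
```

Restricting the statistics this way keeps the table narrow enough to read comfortably when the data frame has many numeric columns.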
While descr is most useful for numeric variables, freq is most useful for categorical variables. It also supports the weights argument, and provides the frequency as well as the cumulative frequency for each distinct value.
freq(tobacco$disease, weights = tobacco$samp.wgts)
## Weighted Frequencies
## tobacco$disease
## Type: Character
## Weights: samp.wgts
##
##                            Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## --------------------- --------- --------- -------------- --------- --------------
##                Cancer     33.66     15.03          15.03      3.37           3.37
##           Cholesterol     21.11      9.42          24.45      2.11           5.48
##              Diabetes     14.36      6.41          30.87      1.44           6.91
##             Digestive     11.49      5.13          36.00      1.15           8.06
##               Hearing     14.15      6.32          42.31      1.41           9.48
##                 Heart     19.89      8.88          51.20      1.99          11.46
##          Hypertension     36.52     16.31          67.51      3.65          15.12
##           Hypotension     11.37      5.08          72.58      1.14          16.25
##       Musculoskeletal     19.59      8.75          81.33      1.96          18.21
##          Neurological     10.14      4.53          85.86      1.01          19.23
##                 Other      2.09      0.93          86.79      0.21          19.44
##             Pulmonary     20.29      9.06          95.85      2.03          21.46
##                Vision      9.29      4.15         100.00      0.93          22.39
##                  <NA>    776.07                              77.61         100.00
##                 Total   1000.00    100.00         100.00    100.00         100.00
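When most values are missing, as with disease here, a shorter table can be easier to read. The sketch below assumes that freq accepts the order, report.nas and cumul arguments, as I understand it does in the 0.9.x series; it sorts the categories by frequency and drops both the missing-value row and the cumulative columns.

```r
library(summarytools)

# Unweighted frequencies, most common disease first
f <- freq(
  tobacco$disease,
  order      = "freq",  # assumed argument: sort rows by frequency
  report.nas = FALSE,   # assumed argument: omit the <NA> row and "Total" percentages
  cumul      = FALSE    # assumed argument: omit cumulative columns
)
f
```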
Finally, ctable is used to cross-tabulate frequencies for a pair of categorical variables.
print(
  ctable(tobacco$disease, tobacco$gender, weights = tobacco$samp.wgts),
  method = "render"
)
Cross-Tabulation, Row Proportions
disease * gender
Data Frame: tobacco

| disease | gender: F | gender: M | gender: <NA> | Total |
|---|---|---|---|---|
| Cancer | 15.9 (47.2%) | 17.8 (52.8%) | 0.0 (0.0%) | 33.7 (100.0%) |
| Cholesterol | 10.3 (48.8%) | 10.8 (51.2%) | 0.0 (0.0%) | 21.1 (100.0%) |
| Diabetes | 8.2 (57.3%) | 5.3 (36.7%) | 0.9 (6.0%) | 14.4 (100.0%) |
| Digestive | 4.7 (40.9%) | 6.8 (59.1%) | 0.0 (0.0%) | 11.5 (100.0%) |
| Hearing | 5.1 (35.8%) | 9.1 (64.2%) | 0.0 (0.0%) | 14.1 (100.0%) |
| Heart | 8.5 (42.8%) | 11.4 (57.2%) | 0.0 (0.0%) | 19.9 (100.0%) |
| Hypertension | 18.2 (49.7%) | 17.3 (47.4%) | 1.1 (2.9%) | 36.5 (100.0%) |
| Hypotension | 7.3 (64.6%) | 4.0 (35.4%) | 0.0 (0.0%) | 11.4 (100.0%) |
| Musculoskeletal | 8.2 (42.0%) | 10.3 (52.6%) | 1.1 (5.4%) | 19.6 (100.0%) |
| Neurological | 7.2 (70.7%) | 3.0 (29.3%) | 0.0 (0.0%) | 10.1 (100.0%) |
| Other | 1.0 (50.1%) | 1.0 (49.9%) | 0.0 (0.0%) | 2.1 (100.0%) |
| Pulmonary | 9.3 (45.8%) | 11.0 (54.2%) | 0.0 (0.0%) | 20.3 (100.0%) |
| Vision | 6.1 (66.0%) | 3.2 (34.0%) | 0.0 (0.0%) | 9.3 (100.0%) |
| <NA> | 378.9 (48.8%) | 378.0 (48.7%) | 19.2 (2.5%) | 776.1 (100.0%) |
| Total | 488.9 (48.9%) | 488.9 (48.9%) | 22.2 (2.2%) | 1000.0 (100.0%) |
Generated by summarytools 0.9.3 (R version 3.6.0)
2019-05-12
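The table above shows row proportions, as its heading indicates. As a small sketch, assuming the prop argument accepts "r", "c", "t" or "n" as I understand from the documentation, column proportions can be requested instead. The example uses two other factors from the same dataset, smoker and diseased, and the default text output.

```r
library(summarytools)

# Share of smokers within each level of diseased (column proportions)
ct <- ctable(tobacco$smoker, tobacco$diseased, prop = "c")
ct
```

Which proportion to show depends on the question: row proportions describe how each row category splits across the columns, while column proportions describe the composition of each column.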
It is often desirable to obtain the output of the descr or freq functions as a data frame. To do this, the package provides the tb function, which converts the output of these functions into a tibble. In case you are not aware, the tibble is the data frame of the tidyverse and is usually nicer to work with than a base R data frame.
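A minimal sketch of this conversion, assuming tb is available in your installed version and that the tibble package is installed alongside summarytools. The names stats_tbl and freq_tbl are only illustrative.

```r
library(summarytools)

# Descriptive statistics for all numeric columns, as a tibble
stats_tbl <- tb(descr(tobacco, stats = "common"))

# Frequencies of a categorical variable, as a tibble
freq_tbl <- tb(freq(tobacco$gender, cumul = FALSE, report.nas = FALSE))

stats_tbl
freq_tbl
```

Once the results are tibbles, they can be filtered, joined or plotted like any other data frame, which is handy when the summaries feed into further analysis rather than a report.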
To conclude, whenever you start working with new data, consider generating an HTML report using dfSummary to quickly understand the variables in the data and their distributions. Use the other functions if you need to understand the individual variables further.