Describing Data: Frequently Used Commands
[This article was first published on Coffee and Econometrics in the Morning, and kindly contributed to R-bloggers].
Obtaining a coherent numerical summary of data is a common task, and these summary statistics often need to be ported into a table of results. When I am working interactively with my data, I apply the summary() command to my data frame. For example, the following code loads and summarizes a data frame on yogurt advertising and prices:
library(Ecdat) ## Econometrics Data (useful!)
data(Yogurt) ## Loads Yogurt from Ecdat
summary(Yogurt) ## Summarizes Yogurt
For each quantitative variable, the summary() command provides a five-number summary (min, Q1, median, Q3, max) plus the mean. For categorical variables, the counts of each level are provided. This is an excellent quick look at each variable, but you may prefer a richer set of information (especially when it comes to typing up tables).
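To make the contents of that output concrete, here is a small base-R sketch. The vector x and the factor brand are made-up values for illustration only (they are not taken from Yogurt). It compares summary() with quantile() and mean() computed separately.

```r
## Made-up values, used only for illustration (not from the Yogurt data)
x <- c(2, 4, 4, 5, 7, 9, 10, 12)
brand <- factor(c("a", "b", "a", "c", "a", "b", "a", "c"))

s <- summary(x)     ## Min., 1st Qu., Median, Mean, 3rd Qu., Max.
q <- quantile(x)    ## 0%, 25%, 50%, 75%, 100%: the five-number part
m <- mean(x)        ## the extra statistic that summary() reports

print(s)
print(q)
print(m)

print(summary(brand))  ## for a factor: counts of each level
print(table(brand))    ## the same counts, computed directly
```

On a data frame, summary() applies this column by column, so numeric columns get the six numbers and factor columns get the level counts.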
I recently discovered a great way to obtain a richer set of information on a data frame. This method uses the psych package, which contains the functions describe() and describe.by(). Continuing with the code from above, here is the basic syntax:
library(psych)
describe(Yogurt) ## Describes in more detail the Yogurt data frame
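If psych is not at hand, a few of the same kinds of statistics can be approximated in base R. The sketch below is only that: describe_basic() is a hypothetical helper name, the toy data frame is made up, and the set of statistics is my own choice rather than a reproduction of what describe() reports.

```r
## Hypothetical helper: a base-R stand-in for a richer summary table.
## It is NOT psych::describe(); it only computes a handful of statistics.
describe_basic <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]  ## keep numeric columns only
  stats <- vapply(num, function(x) {
    x <- x[!is.na(x)]                            ## drop missing values
    c(n      = length(x),
      mean   = mean(x),
      sd     = sd(x),
      median = median(x),
      min    = min(x),
      max    = max(x),
      range  = max(x) - min(x),
      se     = sd(x) / sqrt(length(x)))          ## standard error of the mean
  }, numeric(8))
  as.data.frame(t(stats))                        ## one row per variable
}

## Made-up data frame, used only for illustration
toy <- data.frame(
  brand   = factor(c("a", "a", "b", "b", "c", "c")),
  price   = c(8.1, 7.9, 6.5, 6.8, 5.3, 5.2),
  feature = c(0, 1, 0, 0, 1, 0)
)

print(describe_basic(toy))
```

The factor column brand is dropped here on purpose, since means and standard deviations are not meaningful for it.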
Suppose you also want to break your summary statistics into two (or four) tables for comparison's sake (perhaps to illustrate stark differences across select subsets of your data). The describe.by() command is a convenient way to break the data down by the levels of a factor. Here’s an example on the Yogurt data:
describe.by(Yogurt, Yogurt$choice) ## Describes Yogurt by levels of choice; newer psych releases name this function describeBy()
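The idea behind a by-group summary is to split the data on the factor and summarize each piece. A minimal base-R sketch of that idea follows. The toy data frame and its levels are made up, and the chosen statistics are an assumption for illustration rather than what psych reports.

```r
## Made-up data, used only for illustration (not the Yogurt data)
toy <- data.frame(
  choice = factor(c("a", "a", "b", "b", "b", "c", "c", "c")),
  price  = c(8.1, 7.9, 6.5, 6.8, 6.6, 5.3, 5.2, 5.6)
)

## Split the numeric variable by the levels of the factor ...
pieces <- split(toy$price, toy$choice)

## ... then compute the same statistics within each group
by_choice <- t(vapply(pieces, function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), median = median(x))
}, numeric(4)))

print(by_choice)  ## one row per level of choice
```

Stacking the groups as rows of one matrix makes the comparison across levels easy to read and easy to hand to a table-writing step later.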
Finally, you may want to port your data into LaTeX format and/or select particular summary statistics from the list. I wrote a function that serves as a convenience interface to describe.by() and toLatex(). As toLatex() does not work directly on objects created using describe.by(), you might find this helpful.
If you do not care to know the kurtosis of your data, you could read up on the options of describe.by() to learn how to turn it off. If you’re going to port it into a LaTeX table anyway, you could also just modify the code I wrote here to eliminate the summary statistics you don’t want and produce LaTeX output.
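As a rough illustration of the two steps just described (keep only the statistics you want, then emit LaTeX), here is a base-R sketch. It is not the function referred to above: stats_to_latex() is a hypothetical helper, the statistics matrix is made up, and the table layout is just one plausible choice.

```r
## Hypothetical helper: turn a matrix of statistics (variables in rows,
## statistics in columns) into the lines of a LaTeX tabular environment.
stats_to_latex <- function(stats, digits = 2) {
  stopifnot(is.matrix(stats),
            !is.null(rownames(stats)),
            !is.null(colnames(stats)))
  header <- paste(c("Variable", colnames(stats)), collapse = " & ")
  body <- vapply(seq_len(nrow(stats)), function(i) {
    cells <- formatC(stats[i, ], format = "f", digits = digits)
    paste(c(rownames(stats)[i], cells), collapse = " & ")
  }, character(1))
  c(paste0("\\begin{tabular}{l", strrep("r", ncol(stats)), "}"),
    "\\hline",
    paste0(header, " \\\\"),
    "\\hline",
    paste0(body, " \\\\"),
    "\\hline",
    "\\end{tabular}")
}

## Made-up statistics, used only for illustration
stats <- rbind(
  price   = c(mean = 6.63, sd = 1.22, kurtosis = -1.90),
  feature = c(mean = 0.33, sd = 0.52, kurtosis = -1.96)
)

## Drop the columns you do not want before writing the table
keep <- stats[, c("mean", "sd"), drop = FALSE]

writeLines(stats_to_latex(keep))
```

Note that the sketch does not escape special LaTeX characters, so variable names containing an underscore would need to be escaped by hand.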
FYI: Quick R has a nice summary of some other methods for summarizing data. Of the methods at Quick R that I didn’t describe, pastecs looks most like a method I would use.