A codebook is a technical document that provides an overview of and
information about the variables in a dataset. The codebook ensures
that the statistician has the complete background information
necessary to undertake the analysis, and it documents the data so that
they are well understood and reusable in the future. Here we will show
how to create codebooks in R using the dataMaid package.
The help pages for the datasets in R packages usually provide thorough
information, although the level of detail varies quite substantially
from dataset to dataset. As an example we will consider the iris
dataset. Its help page gives decent information, so we will simply use
this dataset to show how we would create a codebook.
help(iris)
Real datasets, however, are messy and not as polished as the datasets found in R packages. A substantial amount of data wrangling, tweaking, cleaning, and custom solutions is necessary to get the data into shape before they are ready for statistical analysis. Creating the finished dataset is not enough: it is also necessary to produce the corresponding data documentation.
We have previously shown how the dataMaid package can produce
automated reports to summarise datasets, to identify potential errors,
and to check the data quality and integrity.
The dataMaid package produces an R Markdown summary document with
information on each variable in the data frame, and the document can
be rendered to a report in HTML, PDF, or Word. The final report can be
given to scientific collaborators since proper data validation often
requires a collaborative effort between an expert in the field and a
data scientist. It is easy to tweak the report generated by dataMaid
to obtain a document that can serve as a codebook for the cleaned
dataset.
The function makeCodebook() accepts a data frame and produces a
document that provides a summary of the data frame and its variables.
library(dataMaid)
makeCodebook(iris)
The result is the 3-page document reproduced below in the two figures. The codebook consists of four parts. The first two parts are tables giving an overview of the data frame and the variables: the number of observations, the number of variables, the class of each variable, and the proportion of missing observations.
Part 3 lists each variable and provides class-dependent summary
statistics and a data visualisation. If a variable is a factor then
the unique factor levels are listed beneath the summary statistics.
Part 4 documents the report generation information (who made it, when, directory, the function call, the operating system platform).
While the information about each variable (pages 1 and 2) serves as a reference manual when doing subsequent analyses, the last page provides meta-information about the codebook to ensure documentable and reproducible research.
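To get a feel for what the two overview tables contain, here is a rough base R approximation. It is only a sketch and not how dataMaid builds its tables; the helper name variable_overview is made up for this illustration.

```r
# Hypothetical helper: a rough base R stand-in for the codebook's
# variable overview (not dataMaid's own implementation).
variable_overview <- function(data) {
  data.frame(
    variable = names(data),
    # first element of class(), so that e.g. ordered factors give one entry
    class = unname(vapply(data, function(x) class(x)[1], character(1))),
    # proportion of missing observations in each variable
    missing = unname(vapply(data, function(x) mean(is.na(x)), numeric(1))),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

dim(iris)  # number of observations and number of variables
overview <- variable_overview(iris)
overview
```

The codebook adds summary statistics, plots, and the report metadata on top of this, which is the part that is tedious to write by hand.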
The codebooks can be improved by adding further information about the
variables, and there are two ways to do so. The first uses the same
approach as the labelled package, where the attribute labels can be
set for a variable in a data frame to contain label information. These
labels are set directly
attr(iris$Sepal.Length, "labels") <- "Sepal length in cm"
or it is possible to use the functions from the labelled package. The
labels attribute is intended for condensed information and it is
particularly useful if the variable names are not meaningful. When
variable names are not self-explanatory we can keep the original
variable names from the raw data but provide meaningful, explanatory
labels through the labels attribute.
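With more than a couple of variables it can be convenient to keep the labels in one named vector and attach them in a loop. The sketch below uses only base R and the attribute name given above. It works on a copy called flowers (a name chosen for this example), and the label texts other than the first are illustrative.

```r
# Work on a copy so the built-in iris data is left untouched
flowers <- iris

# Illustrative label texts, keyed by variable name
var_labels <- c(
  Sepal.Length = "Sepal length in cm",
  Sepal.Width  = "Sepal width in cm",
  Petal.Length = "Petal length in cm",
  Petal.Width  = "Petal width in cm"
)

# Attach each label under the attribute name used in the text
for (v in names(var_labels)) {
  attr(flowers[[v]], "labels") <- var_labels[[v]]
}

# Read the labels back to check that they were stored
vapply(flowers[names(var_labels)], attr, character(1), which = "labels")
```

Keeping the labels in a single vector also makes it easy to store them alongside the raw data and reuse them when the dataset is updated.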
Another type of label is the shortDescription attribute. This is
intended to provide additional details that might come in handy
later. The shortDescription attribute is set similarly to the labels
attribute.
attr(iris$Sepal.Length, "shortDescription") <- "Measured using a line gauge produced by Acme factories."
attr(iris$Species, "shortDescription") <- paste0(
  "Two of the three species were collected in the Gaspé ",
  "Peninsula all from the same pasture, and picked on the ",
  "same day and measured at the same time by the same ",
  "person with the same apparatus")
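Since these are ordinary R attributes, it can be worth checking which variables carry a description before generating the codebook. The sketch below does that on a copy called flowers (a name chosen for this example). It also illustrates that base R row subsetting drops such attributes from plain numeric columns, so it is probably safest to set them after the final wrangling step.

```r
# Work on a copy so the built-in iris data is left untouched
flowers <- iris
attr(flowers$Sepal.Length, "shortDescription") <-
  "Measured using a line gauge produced by Acme factories."

# Names of the variables that currently carry a short description
has_description <- vapply(
  flowers,
  function(x) !is.null(attr(x, "shortDescription", exact = TRUE)),
  logical(1)
)
names(flowers)[has_description]

# Row subsetting rebuilds the numeric column without the custom attribute
first_rows <- flowers[1:10, ]
is.null(attr(first_rows$Sepal.Length, "shortDescription"))
```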
When we run makeCodebook() again (with argument replace=TRUE to
overwrite the report we generated earlier) we can see the additional
information appear in the codebook produced.
makeCodebook(iris, replace=TRUE)
makeCodebook() works by tweaking the arguments for makeDataReport()
from the dataMaid package. The makeDataReport() function is very
versatile and it is possible to change the arguments to modify the
content of the material that goes into the codebook.
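As a hedged illustration of passing such arguments along, the sketch below writes only the R Markdown source of the codebook to a temporary file. It assumes that the installed dataMaid version forwards file, render, and openResult to makeDataReport() as described in its help page; check ?makeDataReport before relying on them. The file name is made up for this example.

```r
# Assumes dataMaid is installed; the call is skipped otherwise
if (requireNamespace("dataMaid", quietly = TRUE)) {
  # Hypothetical output location for this illustration
  rmd_file <- file.path(tempdir(), "codebook_iris.Rmd")

  dataMaid::makeCodebook(
    iris,
    file = rmd_file,    # where the R Markdown source is written
    render = FALSE,     # write the .Rmd only, do not render it
    replace = TRUE,     # overwrite an earlier file of the same name
    openResult = FALSE  # do not open the file when done
  )
}
```

Writing the source without rendering can be handy when the codebook needs manual edits before it is shared. Dropping render = FALSE and adding output = "html" should give an HTML codebook instead.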
Hopefully, the makeCodebook() function in the dataMaid package will
make it easier to create and provide codebooks for small and larger
projects, and will encourage more people to provide documentable and
reproducible research. Comments and suggestions to expand the codebook
possibilities are very welcome.