A codebook is a technical document that gives an overview of, and information about, the variables in a dataset. It ensures that the statistician has the complete background information necessary to undertake the analysis, and it documents the data so that the data is well understood and reusable in the future. Here we will show how to create codebooks in R using the dataMaid package.
The help pages for the datasets in R packages usually provide thorough information, although the level of detail may vary quite substantially from dataset to dataset. As an example we will consider the iris dataset. The help page gives decent information, so we will just use it to show how we would create a codebook.
help(iris)
Real datasets, however, are messy and not as polished as the datasets found in R packages. Substantial data wrangling, tweaking, cleaning, and custom solutions are needed to get the data into shape before it is ready for statistical analysis. Creating the finished dataset is not enough; the corresponding data documentation must be produced as well.
We have previously shown how the dataMaid package can produce automated reports to summarise datasets, to identify potential errors, and to check the data quality and integrity.
The dataMaid package produces an R Markdown summary document with information on each variable in the data frame, and the document can be rendered to a report in HTML, PDF, or Word. The final report can be given to scientific collaborators, since proper data validation often requires a collaborative effort between an expert in the field and a data scientist. It is easy to tweak the report generated by dataMaid to obtain a document that can serve as a codebook for the cleaned dataset.
The function makeCodebook() accepts a data frame and produces a document that provides a summary of the data frame and its variables.
library(dataMaid)
makeCodebook(iris)
The result is the 3-page document reproduced below in the two figures. The codebook consists of 4 parts: the first two parts are tables giving an overview of the data frame and the variables. Here we see the number of observations, the number of variables, their class type, and the proportion of missing observations.
Part 3 lists each variable and provides class-dependent summary statistics and a data visualisation. If a variable is a factor then the unique factor levels are listed beneath the summary statistics.
Part 4 documents the report generation information (who made it, when, directory, the function call, the operating system platform).
While the information about each variable (pages 1 and 2) serves as a reference manual when doing subsequent analyses, the last page provides meta-information about the codebook to ensure documentable and reproducible research.
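To get a feel for what the overview tables in the first two parts contain, the same kind of quantities can be computed by hand in base R. This is only a rough sketch of the information shown, not a description of how dataMaid builds its tables, and the object name overview is ours.

```r
# Per-variable overview: class and proportion of missing values
overview <- data.frame(
  variable = names(iris),
  class = vapply(iris, function(x) class(x)[1], character(1)),
  prop_missing = vapply(iris, function(x) mean(is.na(x)), numeric(1)),
  row.names = NULL
)

nrow(iris)  # number of observations
ncol(iris)  # number of variables
overview    # one row per variable
```

The codebook presents this information in a formatted table, which is easier to hand over to collaborators than console output.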
The codebooks can be improved by adding additional information about the variables. There are two ways to add extra information to the codebook. The first uses the same approach as the labelled package, where the attribute label can be set for a variable in a data frame to contain label information. These labels can be set directly
attr(iris$Sepal.Length, "label") <- "Sepal length in cm"
or it is possible to use the functions from the labelled package. The label attribute is intended for condensed information, and it is particularly useful when the variable names are not self-explanatory: we can keep the original variable names from the raw data but provide meaningful, explanatory labels through the label attribute.
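When many variables need labels, setting them one by one becomes tedious. A minimal sketch, using only base R, keeps the labels in a named character vector and attaches them in a loop. The label texts and the object names iris_labelled and var_labels are hypothetical and chosen for illustration.

```r
# Work on a copy so the built-in dataset is left untouched
iris_labelled <- iris

# Hypothetical label texts, keyed by variable name
var_labels <- c(
  Sepal.Length = "Sepal length in cm",
  Sepal.Width  = "Sepal width in cm",
  Petal.Length = "Petal length in cm",
  Petal.Width  = "Petal width in cm",
  Species      = "Iris species"
)

# Attach each label as the "label" attribute of the matching column
for (v in names(var_labels)) {
  attr(iris_labelled[[v]], "label") <- var_labels[[v]]
}

# Inspect the labels that were set
vapply(iris_labelled, function(x) attr(x, "label"), character(1))
```

Keeping the labels in one vector also means they can be stored next to the raw data and reapplied whenever the dataset is rebuilt.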
Another type of label is the shortDescription attribute. This is intended to provide additional details that might come in handy later. The shortDescription attribute is set in the same way as the label attribute.
attr(iris$Sepal.Length, "shortDescription") <- "Measured using a line gauge produced by Acme factories."
attr(iris$Species, "shortDescription") <- paste0(
  "Two of the three species were collected in the Gaspé ",
  "Peninsula all from the same pasture, and picked on the ",
  "same day and measured at the same time by the same ",
  "person with the same apparatus")
When we run makeCodebook() again (with the argument replace=TRUE to overwrite the report we generated earlier), the additional information appears in the codebook.
makeCodebook(iris, replace=TRUE)
makeCodebook() works by tweaking the arguments for makeDataReport() from the dataMaid package. The makeDataReport() function is very versatile, and it is possible to change the arguments to modify the content of the material that goes into the codebook.
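As a sketch of this flexibility, the call below sets a custom title and writes only the R Markdown source to a temporary directory, so that it can be edited by hand before rendering. It assumes that makeCodebook() forwards extra arguments to makeDataReport(), and that the arguments reportTitle, file, render, and openResult behave as we recall from the help pages; check ?makeCodebook and ?makeDataReport for your installed version. The file name and title are hypothetical.

```r
# Hypothetical output location; any writable path works
codebook_file <- file.path(tempdir(), "codebook_iris.Rmd")

# Only run if dataMaid is installed
if (requireNamespace("dataMaid", quietly = TRUE)) {
  dataMaid::makeCodebook(
    iris,
    reportTitle = "Codebook for the iris data",  # hypothetical title
    file = codebook_file,  # name of the generated R Markdown file
    replace = TRUE,        # overwrite an existing file of the same name
    render = FALSE,        # assumed: write the .Rmd without rendering it
    openResult = FALSE     # assumed: do not open the result automatically
  )
}

# After editing the .Rmd by hand it can be rendered with, for example,
# rmarkdown::render(codebook_file)
```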
We hope that the makeCodebook() function in the dataMaid package makes it easier to create and provide codebooks for small and larger projects, and encourages more people to provide documentable and reproducible research. Comments and suggestions to expand the codebook possibilities are very welcome.