Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
With all of the excitement surrounding cdata style control table based data transforms (the cdata ideas being named as the “replacements” for tidyr’s current methodology, by the tidyr authors themselves!) I thought I would take a moment to describe how they work.
cdata defines two primary data manipulation operators: rowrecs_to_blocks() and blocks_to_rowrecs(). These are the fundamental transforms that convert between data representations. The two representations they convert between are:
- A world where all facts about an instance or record are in a single row (“rowrecs”).
- A world where all facts about an instance or record are in groups of rows (“blocks”).
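As a small illustration of the two representations, here is a sketch that moves a made-up data frame from row records to blocks and back. It assumes the cdata package is installed; the data, the column names (id, AUC, R2, measure, value), and the control table are invented for this example. The control table is an ordinary data frame whose first column holds the keys that identify rows within a block, and whose remaining cells name the row-record columns to place at each position.

```r
# Sketch only: requires the cdata package; data and column names are made up.
library(cdata)

# Row-record form: every fact about an id lives in one row.
row_recs <- data.frame(
  id  = c(1, 2),
  AUC = c(0.6, 0.8),
  R2  = c(0.2, 0.3),
  stringsAsFactors = FALSE
)

# Control table: the first column gives the keys within a block, the other
# cells name the row-record columns whose values go in that position.
control_table <- data.frame(
  measure = c("AUC", "R2"),
  value   = c("AUC", "R2"),
  stringsAsFactors = FALSE
)

# Row records -> blocks: each id becomes a group of rows (one per measure).
blocks <- rowrecs_to_blocks(
  row_recs,
  controlTable  = control_table,
  columnsToCopy = "id"
)
print(blocks)

# Blocks -> row records: each group of rows collapses back into one row.
recovered <- blocks_to_rowrecs(
  blocks,
  keyColumns   = "id",
  controlTable = control_table
)
print(recovered)
```

Here blocks has one row per (id, measure) pair, and recovered has one row per id again.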
It turns out that once you develop the idea of specifying the data transformation as explicit data (an application of Eric S. Raymond’s admonition: “fold knowledge into data, so program logic can be stupid and robust.”), you also have a great tool for reasoning about and teaching data transforms.
For example: rowrecs_to_blocks() does the following. For each row record, make a replica of the control table with values filled in. In relational terms rowrecs_to_blocks() is therefore a join of the data to the control table. Conversely, blocks_to_rowrecs() combines groups of rows into single rows, so in relational terms it is an aggregation or projection. If each of these operations is faithful (keeps enough information around), they are inverses of each other.
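The relational reading above can be sketched in base R, without cdata. This is only an illustration of the semantics under simplifying assumptions (one value column in the control table, made-up data and column names); it is not how cdata is implemented.

```r
# Base-R sketch of the join / aggregation semantics; all names are made up.
row_recs <- data.frame(id = c(1, 2), AUC = c(0.6, 0.8), R2 = c(0.2, 0.3))
control_table <- data.frame(
  measure = c("AUC", "R2"),
  value   = c("AUC", "R2"),
  stringsAsFactors = FALSE
)

# Join: a cross join gives every row record its own copy of the control table.
joined <- merge(row_recs, control_table, by = NULL)

# Fill in: replace each column name held in `value` by that column's value
# taken from the same joined row.
joined$value <- vapply(
  seq_len(nrow(joined)),
  function(i) joined[[joined$value[[i]]]][[i]],
  numeric(1)
)
blocks <- joined[order(joined$id, joined$measure), c("id", "measure", "value")]
rownames(blocks) <- NULL

# Aggregate: collapse each id's group of rows back into a single row,
# using the control table to decide which column each value belongs in.
recovered <- do.call(rbind, lapply(split(blocks, blocks$id), function(block) {
  row <- data.frame(id = block$id[[1]])
  for (k in seq_len(nrow(control_table))) {
    key <- control_table$measure[[k]]
    row[[control_table$value[[k]]]] <- block$value[block$measure == key]
  }
  row
}))
rownames(recovered) <- NULL

print(blocks)
print(recovered)
```

Because the id column is carried through both steps, the round trip keeps enough information to be faithful: recovered holds the same values as row_recs.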
We share some nifty tutorials on the ideas here:
One can build fairly clever illustrations and animations to teach the above.
The most common special cases of the above have been popularized in R as unpivot/pivot (pivot invented by Pito Salas), stack/unstack, melt/cast, or gather/spread. These special cases are handled in cdata by convenience functions unpivot_to_blocks() and pivot_to_rowrecs(). A great example of a “higher order” transform that isn’t one of the common ones is given here.
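For the common special cases, the convenience functions can be called without writing a control table by hand. The following is a sketch assuming the cdata package is installed; the data frame and the column names (id, AUC, R2, measure, value) are made up for illustration.

```r
# Sketch only: requires the cdata package; data and column names are made up.
library(cdata)

wide <- data.frame(id = c(1, 2), AUC = c(0.6, 0.8), R2 = c(0.2, 0.3))

# Unpivot (gather / melt / stack): the named columns become key/value rows.
tall <- unpivot_to_blocks(
  wide,
  nameForNewKeyColumn   = "measure",
  nameForNewValueColumn = "value",
  columnsToTakeFrom     = c("AUC", "R2")
)
print(tall)

# Pivot (spread / cast / unstack): key/value rows become columns again,
# with one output row per distinct id.
wide_again <- pivot_to_rowrecs(
  tall,
  columnToTakeKeysFrom   = "measure",
  columnToTakeValuesFrom = "value",
  rowKeyColumns          = "id"
)
print(wide_again)
```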
Note: the above theory and implementation is joint work of Nina Zumel and John Mount and can be found here. We would really appreciate any citations or credit you can send our way (or even politely correcting those who don’t attribute the work or attribute the work to others, as there are already a lot of mentions without credit or citation).
citation("cdata")

To cite package ‘cdata’ in publications use:

  John Mount and Nina Zumel (2019). cdata: Fluid Data Transformations.
  https://github.com/WinVector/cdata/, https://winvector.github.io/cdata/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {cdata: Fluid Data Transformations},
    author = {John Mount and Nina Zumel},
    year = {2019},
    note = {https://github.com/WinVector/cdata/, https://winvector.github.io/cdata/},
  }