Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The advantages of data_algebra and cdata are:
- The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
- The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
- The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).
Let’s re-work a small R cdata example, using the Python package data_algebra.
An Example
First we import some modules and packages, and import some notional data.
# https://pandas.pydata.org
import pandas
# PyYAML from https://pyyaml.org
import yaml
# pip install https://github.com/WinVector/data_algebra/raw/master/dist/data_algebra-0.1.3.tar.gz
import data_algebra.cdata
import data_algebra.cdata_impl
import data_algebra.data_ops
import data_algebra.yaml
import data_algebra.SQLite
# ask YAML to write simpler structures
data_algebra.yaml.fix_ordered_dict_yaml_rep()
# read our example data
iris = pandas.read_csv('iris_small.csv')
# print it out
iris
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | id | |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0 | 
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 | 
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 2 | 
Or, more concretely, we want our data to look like the following.
answer = pandas.read_csv("answer.csv")
answer
| id | Species | Part | Measure | Value | |
|---|---|---|---|---|---|
| 0 | 0 | setosa | Petal | Length | 1.4 | 
| 1 | 0 | setosa | Petal | Width | 0.2 | 
| 2 | 0 | setosa | Sepal | Length | 5.1 | 
| 3 | 0 | setosa | Sepal | Width | 3.5 | 
| 4 | 1 | setosa | Petal | Length | 1.4 | 
| 5 | 1 | setosa | Petal | Width | 0.2 | 
| 6 | 1 | setosa | Sepal | Length | 4.9 | 
| 7 | 1 | setosa | Sepal | Width | 3.0 | 
| 8 | 2 | setosa | Petal | Length | 1.3 | 
| 9 | 2 | setosa | Petal | Width | 0.2 | 
| 10 | 2 | setosa | Sepal | Length | 4.7 | 
| 11 | 2 | setosa | Sepal | Width | 3.2 | 
This sort of conversion can be called an anti-pivot. In Python these sorts of transforms are specified with pandas.DataFrame.pivot, pandas.pivot_table, pandas.melt,
and others. Many R packages supply related operators: reshape, reshape2, data.table, tidyr, cdata, and more. We feel the data_algebra and cdata methods (which only perform data-reshaping, and not aggregation) offer a number of significant advantages (some of which we have already mentioned, and a few more of which we will see demonstrated in this note).
Back to our example.
To specify our desired transform, we build a structure describing what a data record looks like. We can crib this specification from the answer as follows.
control_table = answer.loc[answer.id == 0, ['Part', 'Measure']]
control_table = control_table.reset_index(inplace=False, drop=True)
control_table["Value"] = [control_table['Part'][i] + '.' + control_table['Measure'][i] for 
                            i in range(control_table.shape[0])]
control_table
| Part | Measure | Value | |
|---|---|---|---|
| 0 | Petal | Length | Petal.Length | 
| 1 | Petal | Width | Petal.Width | 
| 2 | Sepal | Length | Sepal.Length | 
| 3 | Sepal | Width | Sepal.Width | 
For each record we take care to identify what keys identify records (the record_keys) and want parts identify rows within the record (the control_table_keys).  We suppress the record_key when writing the control table, as these are exactly the values that do not vary inside each record.
We combine all of these specification into a RecordSpecification as follows:
record_spec = data_algebra.cdata.RecordSpecification(
    control_table,
    control_table_keys = ['Part', 'Measure'],
    record_keys = ['id', 'Species']
    )
record_spec
RecordSpecification
   record_keys: ['id', 'Species']
   control_table_keys: ['Part', 'Measure']
   control_table:
       Part Measure         Value
   0  Petal  Length  Petal.Length
   1  Petal   Width   Petal.Width
   2  Sepal  Length  Sepal.Length
   3  Sepal   Width   Sepal.Width
control_table_keys are “value stand ins”.  During data transfrom these cells will be replaced by values coming from the columns named by these cells.
Now we can transform our original row-record oriented data into general block records.  To do this we specify a RecordMap using our record specification to describe the outgoing record structure. The incoming record structure is implicitly assumed to be single-row records, unless we specify otherwise (using the blocks_in argument).
mp_to_blocks = data_algebra.cdata_impl.RecordMap(blocks_out=record_spec) print(str(mp_to_blocks))
Transform row records of the form:
  record_keys: ['id', 'Species']
 ['id', 'Species', 'Petal.Length', 'Petal.Width', 'Sepal.Length', 'Sepal.Width']
to block records of structure:
RecordSpecification
   record_keys: ['id', 'Species']
   control_table_keys: ['Part', 'Measure']
   control_table:
       Part Measure         Value
   0  Petal  Length  Petal.Length
   1  Petal   Width   Petal.Width
   2  Sepal  Length  Sepal.Length
   3  Sepal   Width   Sepal.Width
arranged_blocks = mp_to_blocks.transform(iris) arranged_blocks
| id | Species | Part | Measure | Value | |
|---|---|---|---|---|---|
| 0 | 0 | setosa | Petal | Length | 1.4 | 
| 1 | 0 | setosa | Petal | Width | 0.2 | 
| 2 | 0 | setosa | Sepal | Length | 5.1 | 
| 3 | 0 | setosa | Sepal | Width | 3.5 | 
| 4 | 1 | setosa | Petal | Length | 1.4 | 
| 5 | 1 | setosa | Petal | Width | 0.2 | 
| 6 | 1 | setosa | Sepal | Length | 4.9 | 
| 7 | 1 | setosa | Sepal | Width | 3.0 | 
| 8 | 2 | setosa | Petal | Length | 1.3 | 
| 9 | 2 | setosa | Petal | Width | 0.2 | 
| 10 | 2 | setosa | Sepal | Length | 4.7 | 
| 11 | 2 | setosa | Sepal | Width | 3.2 | 
Inverting the transform
An inverse transform is simply expressed by reversing the roles of the blocks_out and blocks_in arguments. In this case the output is row-records, as we didn’t specify an outgoing block structure with blocks_out.
mp_to_rows = data_algebra.cdata_impl.RecordMap(blocks_in=record_spec) print(str(mp_to_rows))
Transform block records of structure:
RecordSpecification
   record_keys: ['id', 'Species']
   control_table_keys: ['Part', 'Measure']
   control_table:
       Part Measure         Value
   0  Petal  Length  Petal.Length
   1  Petal   Width   Petal.Width
   2  Sepal  Length  Sepal.Length
   3  Sepal   Width   Sepal.Width
to row records of the form:
  record_keys: ['id', 'Species']
 ['id', 'Species', 'Petal.Length', 'Petal.Width', 'Sepal.Length', 'Sepal.Width']
arranged_rows = mp_to_rows.transform(arranged_blocks) arranged_rows
| id | Species | Petal.Length | Petal.Width | Sepal.Length | Sepal.Width | |
|---|---|---|---|---|---|---|
| 0 | 0 | setosa | 1.4 | 0.2 | 5.1 | 3.5 | 
| 1 | 1 | setosa | 1.4 | 0.2 | 4.9 | 3.0 | 
| 2 | 2 | setosa | 1.3 | 0.2 | 4.7 | 3.2 | 
Arbitrary transforms
Arbitrary record to record transforms can be specified by setting both blocks_in (to describe incoming structure) and blocks_out (to describe outgoing structure) at the same time.
Transforms in databases
data_algebra also implements all the transform steps in databases using SQL (via row_recs_to_blocks_query() and blocks_to_row_recs_query()).
These queries can be seen below.
db_model = data_algebra.SQLite.SQLiteModel()
print(db_model.row_recs_to_blocks_query(
    source_view=data_algebra.data_ops.describe_pandas_table(
        iris, 'iris'),
    record_spec=record_spec,
    record_view=data_algebra.data_ops.describe_pandas_table(
        record_spec.control_table, "control_table")
))
SELECT a."id" AS "id", a."Species" AS "Species", b."Part" AS "Part", b."Measure" AS "Measure", CASE WHEN b."Value" = 'Petal.Length' THEN a."Petal.Length" WHEN b."Value" = 'Petal.Width' THEN a."Petal.Width" WHEN b."Value" = 'Sepal.Length' THEN a."Sepal.Length" WHEN b."Value" = 'Sepal.Width' THEN a."Sepal.Width" ELSE NULL END AS "Value" FROM ( "iris" ) a CROSS JOIN ( "control_table" ) b ORDER BY a."id", a."Species", b."Part", b."Measure"
print(db_model.blocks_to_row_recs_query(
    source_view=data_algebra.data_ops.describe_pandas_table(
        iris, 'iris'),
    record_spec=record_spec
))
SELECT "id" AS "id", "Species" AS "Species", MAX(CASE WHEN ( "Part" = 'Petal' ) AND ( "Measure" = 'Length' ) THEN "Value" ELSE NULL END) AS "Petal.Length", MAX(CASE WHEN ( "Part" = 'Petal' ) AND ( "Measure" = 'Width' ) THEN "Value" ELSE NULL END) AS "Petal.Width", MAX(CASE WHEN ( "Part" = 'Sepal' ) AND ( "Measure" = 'Length' ) THEN "Value" ELSE NULL END) AS "Sepal.Length", MAX(CASE WHEN ( "Part" = 'Sepal' ) AND ( "Measure" = 'Width' ) THEN "Value" ELSE NULL END) AS "Sepal.Width" FROM ( "iris" ) GROUP BY "id", "Species" ORDER BY "id", "Species"
SQL queries is: the queries can be used to with “CREATE TABLE table_name AS” to materialize transform results in a database, without round-tripping the data in and out of the database.
As complicated as the queries look, they actually expose some deep truths:
- The row_recs_to_blocks_query()is essentially a cross-join of the data to the record description. Each combination of data row and record description row builds a new result row.
- The blocks_to_row_recs_query()is an aggregation. Each set of rows corresponding to a given data record is aggregated into a single result row.
- Just about any arbitrary record shape to arbitrary record shape can be written as a transform from the first record shape to row-records (record sets that have exactly one row per record), followed by a transform from the row-records to the new format. This transform can preserve column types as in the intermediate form each different record entry has its own column. This is an advantage of using a “thin” intermediate form such as RDF triples.
This leads us to believe that transforming to and from single-row records are in fact fundemental operations, and not just implementation details.
The R cdata version
The data_algebra had been previously implemented in
R in the cdata, rquery, and rqdatatable packages.
This packages support both in-memory operations and translation of the transforms to SQL.
We would perform the above transforms in R as follows.
First we write out a copy of the transform.
# convert objects to a YAML string
xform_rep = yaml.dump(mp_to_blocks.to_simple_obj())
# write to file
with open("xform_yaml.txt", "wt") as f:
    print(xform_rep, file=f)
%load_ext rpy2.ipython
%%R
# install.packages("cdata")
library(cdata)
library(yaml)
# https://github.com/WinVector/data_algebra/blob/master/Examples/cdata/cdata_yaml.R
source("cdata_yaml.R")
iris <- read.csv('iris_small.csv')
print(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species id 1 5.1 3.5 1.4 0.2 setosa 0 2 4.9 3.0 1.4 0.2 setosa 1 3 4.7 3.2 1.3 0.2 setosa 2
%%R
r_yaml <- yaml.load_file("xform_yaml.txt")
transform <- convert_yaml_to_cdata_spec(r_yaml)
print(transform)
{
 row_record <- wrapr::qchar_frame(
   "id"  , "Species", "Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width" |
     .   , .        , Petal.Length  , Petal.Width  , Sepal.Length  , Sepal.Width   )
 row_keys <- c('id', 'Species')
 # becomes
 block_record <- wrapr::qchar_frame(
   "id"  , "Species", "Part" , "Measure", "Value"      |
     .   , .        , "Petal", "Length" , Petal.Length |
     .   , .        , "Petal", "Width"  , Petal.Width  |
     .   , .        , "Sepal", "Length" , Sepal.Length |
     .   , .        , "Sepal", "Width"  , Sepal.Width  )
 block_keys <- c('id', 'Species', 'Part', 'Measure')
 # args: c(checkNames = TRUE, checkKeys = FALSE, strict = FALSE, allow_rqdatatable = TRUE)
}
%%R iris %.>% transform
id Species Part Measure Value 1 0 setosa Petal Length 1.4 2 0 setosa Petal Width 0.2 3 0 setosa Sepal Length 5.1 4 0 setosa Sepal Width 3.5 5 1 setosa Petal Length 1.4 6 1 setosa Petal Width 0.2 7 1 setosa Sepal Length 4.9 8 1 setosa Sepal Width 3.0 9 2 setosa Petal Length 1.3 10 2 setosa Petal Width 0.2 11 2 setosa Sepal Length 4.7 12 2 setosa Sepal Width 3.2
Cross-language work
As the record transform specifications, both in Python data_algebra and R cata are simple data structures (just the control table, and a few lists of key column names), they can be moved from one language to another by YAML (as we also demonstrated in the logistic scoring example.
data_algebra supplies a write method, so cross-language interoperation is just a matter of adding additional read/write methods.
print(yaml.dump(mp_to_blocks.to_simple_obj()))
type: data_algebra.cdata_impl.RecordMap
blocks_out:
  type: data_algebra.cdata.RecordSpecification
  record_keys:
  - id
  - Species
  control_table_keys:
  - Part
  - Measure
  control_table:
    Part:
    - Petal
    - Petal
    - Sepal
    - Sepal
    Measure:
    - Length
    - Width
    - Length
    - Width
    Value:
    - Petal.Length
    - Petal.Width
    - Sepal.Length
    - Sepal.Width
Conclusion
The cdata and data_algebra systems yield powerful implementations, and deep understanding of the nature of record transformations.  They allow one to reshape data quickly and conveniently either in R. Python/Pandas, or even SQL.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
