Advanced Data Reshaping in Python and R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The advantages of data_algebra and cdata are:
- The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
- The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
- The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).
Let’s re-work a small R cdata example, using the Python package data_algebra.
An Example
First we import some modules and packages, and import some notional data.
# https://pandas.pydata.org
import pandas
# PyYAML from https://pyyaml.org
import yaml
# pip install https://github.com/WinVector/data_algebra/raw/master/dist/data_algebra-0.1.3.tar.gz
import data_algebra.cdata
import data_algebra.cdata_impl
import data_algebra.data_ops
import data_algebra.yaml
import data_algebra.SQLite
# ask YAML to write simpler structures
data_algebra.yaml.fix_ordered_dict_yaml_rep()
# read our example data
iris = pandas.read_csv('iris_small.csv')
# print it out
iris
Or, more concretely, we want our data to look like the following.
answer = pandas.read_csv("answer.csv")
answer
This sort of conversion can be called an anti-pivot. In Python these sorts of transforms are specified with pandas.DataFrame.pivot, pandas.pivot_table, pandas.melt,
and others. Many R packages supply related operators: reshape, reshape2, data.table, tidyr, cdata, and more. We feel the data_algebra and cdata methods (which only perform data-reshaping, and not aggregation) offer a number of significant advantages (some of which we have already mentioned, and a few more of which we will see demonstrated in this note).
Back to our example.
To specify our desired transform, we build a structure describing what a data record looks like. We can crib this specification from the answer as follows.
control_table = answer.loc[answer.id == 0, ['Part', 'Measure']]
control_table = control_table.reset_index(inplace=False, drop=True)
control_table["Value"] = [control_table['Part'][i] + '.' + control_table['Measure'][i] for
i in range(control_table.shape[0])]
control_table
For each record we take care to identify what keys identify records (the record_keys
) and want parts identify rows within the record (the control_table_keys
). We suppress the record_key
when writing the control table, as these are exactly the values that do not vary inside each record.
We combine all of these specification into a RecordSpecification
as follows:
record_spec = data_algebra.cdata.RecordSpecification(
control_table,
control_table_keys = ['Part', 'Measure'],
record_keys = ['id', 'Species']
)
record_spec
control_table_keys
are “value stand ins”. During data transfrom these cells will be replaced by values coming from the columns named by these cells.
Now we can transform our original row-record oriented data into general block records. To do this we specify a RecordMap
using our record specification to describe the outgoing record structure. The incoming record structure is implicitly assumed to be single-row records, unless we specify otherwise (using the blocks_in
argument).
mp_to_blocks = data_algebra.cdata_impl.RecordMap(blocks_out=record_spec)
print(str(mp_to_blocks))
arranged_blocks = mp_to_blocks.transform(iris)
arranged_blocks
Inverting the transform
An inverse transform is simply expressed by reversing the roles of the blocks_out
and blocks_in
arguments. In this case the output is row-records, as we didn’t specify an outgoing block structure with blocks_out
.
mp_to_rows = data_algebra.cdata_impl.RecordMap(blocks_in=record_spec)
print(str(mp_to_rows))
arranged_rows = mp_to_rows.transform(arranged_blocks)
arranged_rows
Arbitrary transforms
Arbitrary record to record transforms can be specified by setting both blocks_in
(to describe incoming structure) and blocks_out
(to describe outgoing structure) at the same time.
Transforms in databases
data_algebra
also implements all the transform steps in databases using SQL
(via row_recs_to_blocks_query()
and blocks_to_row_recs_query()
).
These queries can be seen below.
db_model = data_algebra.SQLite.SQLiteModel()
print(db_model.row_recs_to_blocks_query(
source_view=data_algebra.data_ops.describe_pandas_table(
iris, 'iris'),
record_spec=record_spec,
record_view=data_algebra.data_ops.describe_pandas_table(
record_spec.control_table, "control_table")
))
print(db_model.blocks_to_row_recs_query(
source_view=data_algebra.data_ops.describe_pandas_table(
iris, 'iris'),
record_spec=record_spec
))
SQL
queries is: the queries can be used to with “CREATE TABLE table_name AS
” to materialize transform results in a database, without round-tripping the data in and out of the database.
As complicated as the queries look, they actually expose some deep truths:
- The
row_recs_to_blocks_query()
is essentially a cross-join of the data to the record description. Each combination of data row and record description row builds a new result row. - The
blocks_to_row_recs_query()
is an aggregation. Each set of rows corresponding to a given data record is aggregated into a single result row. - Just about any arbitrary record shape to arbitrary record shape can be written as a transform from the first record shape to row-records (record sets that have exactly one row per record), followed by a transform from the row-records to the new format. This transform can preserve column types as in the intermediate form each different record entry has its own column. This is an advantage of using a “thin” intermediate form such as RDF triples.
This leads us to believe that transforming to and from single-row records are in fact fundemental operations, and not just implementation details.
The R cdata
version
The data_algebra
had been previously implemented in
R in the cdata
, rquery
, and rqdatatable
packages.
This packages support both in-memory operations and translation of the transforms to SQL.
We would perform the above transforms in R as follows.
First we write out a copy of the transform.
# convert objects to a YAML string
xform_rep = yaml.dump(mp_to_blocks.to_simple_obj())
# write to file
with open("xform_yaml.txt", "wt") as f:
print(xform_rep, file=f)
%load_ext rpy2.ipython
%%R
# install.packages("cdata")
library(cdata)
library(yaml)
# https://github.com/WinVector/data_algebra/blob/master/Examples/cdata/cdata_yaml.R
source("cdata_yaml.R")
iris <- read.csv('iris_small.csv')
print(iris)
%%R
r_yaml <- yaml.load_file("xform_yaml.txt")
transform <- convert_yaml_to_cdata_spec(r_yaml)
print(transform)
%%R
iris %.>% transform
Cross-language work
As the record transform specifications, both in Python data_algebra
and R cata
are simple data structures (just the control table, and a few lists of key column names), they can be moved from one language to another by YAML
(as we also demonstrated in the logistic scoring example.
data_algebra
supplies a write method, so cross-language interoperation is just a matter of adding additional read/write methods.
print(yaml.dump(mp_to_blocks.to_simple_obj()))
Conclusion
The cdata
and data_algebra
systems yield powerful implementations, and deep understanding of the nature of record transformations. They allow one to reshape data quickly and conveniently either in R. Python/Pandas, or even SQL.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.