With enough effort it is possible to fit a square peg into a round hole. But we have all learned — sometimes more than once — that it is much easier if peg and hole have the same shape.
Data managers also need to carefully consider the shape of their data to determine which data structures best describe their situation. Choosing data formats and software tools that match a dataset’s intrinsic structure will allow the data to slide into place with a minimum of hammering.
Far too often, those tasked with managing data are familiar with a fairly small set of tools for getting the job done. In this case, the Law of the Instrument applies to data management just as it does to carpentry:
If all you have is a hammer, everything looks like a nail.
If all you know is SQL, all data look relational.
Many datasets, however, are not relational at all and are better stored in tabular or gridded formats. In this post we will review two of the most popular data structures and describe how they differ and when to choose one over the other. Even if most of your work involves data of one particular type it is a valuable exercise to consider how else data can be structured. And it is always good to expand your knowledge of other tools.
Tabular Data
For most people working with small amounts of data, the data table is the fundamental unit of organization. The data table, arguably the oldest data structure, is both a way of organizing data for processing by machines and a way of presenting data visually for consumption by humans. Elementary students learn how to organize data into rows and columns at a very early age, while high school students master the intricacies of spreadsheets. Even RDBMS (Relational Database Management Systems) have the data table as their fundamental unit of organization.
Let’s review the basic properties that make a dataset intrinsically tabular:
1) Every record shares the same set of variables.
Another way of describing this in terms of rows and columns would be: “Every row has the same set of column headers.” Tabular data are inherently rectangular and cannot have “ragged rows”. If any row is lacking information for a particular column, a missing value must be stored in that cell. (See Zero vs. Missing for a general discussion of missing values.)
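A minimal sketch of this property in R (the site names anticipate the StreamData example later in this post; the depth column and its values are invented for illustration):

```r
# A data frame cannot have ragged rows: every row has every column,
# and a missing 'depth' measurement is stored explicitly as NA.
streams <- data.frame(
  site_name = c("Flat Creek", "Barlow Creek", "Stevens Creek"),
  elevation = c(1571, 1010, 1051),
  depth     = c(0.4, NA, 0.7)   # invented values for illustration
)
str(streams)
```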
2) Typical queries will map a record identifier onto one or more variables.
Here we see how the anticipated use of data affects how the data should be structured. It is best to think of tabular data as being ‘organized by row’ where each row corresponds to a unique identifier such as the time a measurement was made. When data are organized like this it is easy to answer the question: “What set of measurements was collected at time … ?” by simply pulling out a single row of data. Storing data this way also makes it easy to extract data for use in time series and correlation plots by pulling out selected columns.
On the other hand, asking questions about relationships between measurements does not fall out of this structure so easily. This does not mean that data immediately need to be stored in a relational database to answer relational questions; just that some software will have to read all of the data into memory before generating a data subset such as “A where B > C”.
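For example, assuming a CSV file and column names invented for this sketch, both kinds of access look like this in R:

```r
# Read the entire table into memory (hypothetical file and column names).
measurements <- read.csv("measurements.csv")

# 'Organized by row': pull out the full set of measurements for one time.
measurements[measurements$datetime == "2024-06-01 12:00", ]

# A relational question, "A where B > C", can still be answered, but only
# after the whole dataset has been read into memory.
subset(measurements, air_temp > water_temp)
```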
As a general rule, tabular structure and basic formats like CSV are preferred when data are collected as long time series regardless of what you intend to do with the data later.
3) Lack of ‘normalization’ does not unreasonably increase data volumes.
Experienced database designers go to great lengths to follow the principles of database normalization. Even when working with CSV files or spreadsheets, it is important to pay attention to First Normal Form, which specifies “no repeating groups”, and Second Normal Form, which demands that “each column must depend on the primary key”.
While these normalization guidelines are generally worth following for tabular data, real-world situations will sometimes favor the simplicity of a tabular structure even if the table violates Second Normal Form. For example, data on ecological sampling of streams can be completely accommodated in a simple StreamData table even if some columns contain repeated data:
| site_name     | state | elevation | MMI   | EPT_PTAX | data_source | agency |
|---------------|-------|-----------|-------|----------|-------------|--------|
| Flat Creek    | OR    | 1571      | 26.31 | 21.62    | WSA         | EPA    |
| Barlow Creek  | OR    | 1010      | 79.12 | 53.49    | WSA         | EPA    |
| Stevens Creek | WA    | 1051      | 61.31 | 61.54    | WSA         | EPA    |
| …             |       |           |       |          |             |        |
We could treat the data for the entire nation as a single table if this were all the information we needed to store. The only minor issue is that the agency information is associated with the data_source, not the site, and so is unnecessarily repeated in our table. But the overall simplicity of dealing with a single table probably outweighs the minor increase in data volume.
But if we had to store more information about the data_source, such as contact personnel, addresses, phone numbers, and paragraph-long disclaimers, we might start thinking about creating a separate DataSources table with one row per data_source and using a relational database to link our StreamData table with it, rather than repeating all of the information attached to data_source for every site. Ultimately it comes down to complexity and ease-of-use. Which style is easier to use and easier to maintain over the long term? If data volumes are small, a table with a little redundancy can allow you to choose much simpler tools for working with your data. If data volumes will break your simple tools, a relational database is probably the way to go.
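A minimal sketch of that split in R (table contents as above; the contact and phone columns are invented for illustration):

```r
# StreamData keeps only a short key into the sources table.
StreamData <- data.frame(
  site_name   = c("Flat Creek", "Barlow Creek", "Stevens Creek"),
  state       = c("OR", "OR", "WA"),
  elevation   = c(1571, 1010, 1051),
  MMI         = c(26.31, 79.12, 61.31),
  EPT_PTAX    = c(21.62, 53.49, 61.54),
  data_source = c("WSA", "WSA", "WSA")
)

# DataSources stores everything that depends on data_source exactly once.
DataSources <- data.frame(
  data_source = "WSA",
  agency      = "EPA",
  contact     = "Jane Analyst",   # invented for illustration
  phone       = "555-0100"        # invented for illustration
)

# The denormalized single-table view is recovered with a join.
merge(StreamData, DataSources, by = "data_source")
```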
Relational Data
Computer scientist E. F. Codd was working for IBM when he introduced his relational model in a 1970 paper titled “A Relational Model of Data for Large Shared Data Banks”. The original paper is still worth reading for a better understanding of the motivation behind the model and the Structured English Query Language (SEQUEL, later shortened to SQL) that allows for human interaction with it. From the introduction:
The relational view (or model) of data … provides a means of describing data with its natural structure only — that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other.
Clearly, one of the goals of the relational model was to hide the row-and-column structure of data tables and replace it with a query language that allows one to pose English-like questions such as:
SELECT site_name, elevation, MMI FROM StreamData WHERE elevation > 1000 ORDER BY site_name;
With a relational database and SQL, no knowledge of the internal structure of the data store is necessary and no coding is required to subset the data as described in the query above. The structure of rows and columns within the database, after being described by the database designer, is completely invisible to the consumer of data. An additional advantage of the relational model is that it reduces data duplication when the prescriptions of database normalization are carefully followed.
The downside of using an RDBMS is that, unlike simple tables, most people do not learn about the relational model in elementary school. Database design is an advanced skill; doing it well requires both training and experience and commands a commensurately high salary. Perhaps because the relational data model and associated RDBMS have been extremely successful in many business applications, the use of high-end, complex relational databases is often assumed to be a good solution for all types of data. Unfortunately, this is not the case, and we have seen many examples of overly complex systems being built by self-trained data managers for data that could have been described much more simply with one or more CSV tables.
That being said, let’s review the properties of datasets for which an RDBMS is the best choice:
1) Typical queries involve both data and metadata.
What we mean by data in this case is something that has a numeric value and is measured in some specific units. Some arbitrary examples:
- the yield of a crop (kg/hectare)
- the average speed of traffic past an intersection (km/hour)
- the amount paid for an item ($)
Associated metadata for each of these examples link numeric measurements with other information that may be partly numeric but often includes human readable text. Metadata for our three examples above might include:
- year, county, crop, farmer, fertilizer, application strategy, weather info
- date, intersection, neighbor, complaint, mediation strategy, weather info
- date, cashier, item, store, purchaser info, weather info(?)
Of course all datasets need to have metadata identifying at least when and where measurements were taken. But in cases like the examples above, extensive metadata takes on a very data-like life of its own. An agricultural scientist will want to ask questions of the data that involve both measured variables like crop yield and textual information like ‘application strategy’ (and the ever important ‘weather info’). In cases like these, the SQL language makes it very easy to extract data subsets based on any combination of data and metadata, as the sketch below shows.
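Here is a minimal, self-contained sketch of such a mixed query using an in-memory SQLite database from R (the table, columns, and values are all invented for illustration):

```r
library(DBI)
library(RSQLite)

# Build a throwaway in-memory database with invented crop-yield data.
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "CropYields", data.frame(
  year   = c(2019, 2020, 2020),
  county = c("Lane", "Lane", "Benton"),
  crop   = c("winter wheat", "winter wheat", "hay"),
  yield  = c(2800, 3400, 5100)   # kg/hectare
))

# One query mixes a measured variable (yield) with textual metadata (crop).
dbGetQuery(con, "
  SELECT year, county, yield
  FROM   CropYields
  WHERE  crop = 'winter wheat' AND yield > 3000
")

dbDisconnect(con)
```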
Other software tools do exist that can read in large amounts of CSV formatted data and allow the same kind of querying — our favorite is the R Project for Statistical Computing. But you always need to keep in mind the skills and tools of your targeted audience of data consumers. If your target audience is most comfortable with SQL, give them a relational database.
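As an illustration (the CSV file name is hypothetical), the SQL query shown earlier can be reproduced in a few lines of base R:

```r
# Reproduce the earlier SQL query:
#   SELECT site_name, elevation, MMI FROM StreamData
#   WHERE elevation > 1000 ORDER BY site_name;
StreamData <- read.csv("StreamData.csv")   # hypothetical file name
result <- subset(StreamData, elevation > 1000,
                 select = c(site_name, elevation, MMI))
result[order(result$site_name), ]
```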
2) Relational queries are expected AND the total data volume is too large to be stored in memory.
Software that reads in full data tables has a very different memory footprint from an RDBMS. In order to generate the data subset “A where B > C”, most common software tools for working with tabular data will need to read the entire dataset into memory. When the volume of data approaches available memory on your computer this can cause very slow performance as any manipulation of the data will bog down your computer’s paging system. When this happens you are left with one of three main options:
- install more RAM
- split up the data before working with it
- store the data in a relational database
Unlike software that reads in full data tables, an RDBMS can have one or more database indexes. These indexes allow for fast data lookup and retrieval using only a fraction of the space required for the full dataset. An RDBMS will be able to work with data efficiently so long as just the indexes can be read into available memory.
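As a sketch, creating such an index on the StreamData table with SQLite from R might look like this (the database file name is hypothetical):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "streams.sqlite")   # hypothetical file

# The index stores only (elevation, row-pointer) pairs, a small fraction
# of the full table, so queries like "WHERE elevation > 1000" can be
# answered without scanning every row.
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_elevation
                ON StreamData (elevation)")

dbDisconnect(con)
```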
Again, keep your target audience in mind along with this advice: Computer memory is cheaper than human memory in the long run. If your data consumers are comfortable with SQL and relational databases then set the data up in an RDBMS. But if your users have only elementary knowledge of data management you might consider spending money to upgrade the machine(s) they work on. Time is money, after all, and the time spent designing and maintaining a relational database could buy an awful lot of RAM.
Summary
We encourage you to think hard about the shape of your data before you begin designing a data management strategy and to familiarize yourself with a variety of tools for handling data. There are many excellent Open Source software packages for working with every conceivable type of data.
When thinking about data structures, never forget that the providers and users of data may disagree about what best suits their individual needs and about their vision of the shape of the data. (See Data producers vs. data consumers.) At times it may be necessary to provide data subsets in a special format or even alternate versions of an entire dataset. With a clear understanding of the pros and cons of different data structures and some knowledge of the different tools available for working with them, you will be able to ensure that your time is spent hammering away on important and interesting problems.
Best of Luck!
A previous version of this article originally appeared in 2010 at WorkingwithData.