When is a number not a number?
Have you ever asked yourself whether your telephone number is really a number? It’s got numbers in it but does it measure anything?
How about your credit card number? PO Box? Social Security Number? Zip code? What would happen if you subtracted one of these from another?
As it turns out, many of the “numbers” we deal with every day are actually identifiers and not a measure of something. Sadly, too many data managers do not distinguish between the two even though making this distinction is quite simple.
Identification vs. Measurement
We have a check stub on our desk that contains the following information:
- CHECK NO. — 12345
- OUR REF. NO. — 290227
- YOUR INVOICE NO. — 090202
- INVOICE DATE — 01/30/09
- INVOICE AMOUNT — 100.00
- AMOUNT PAID — 100.00
The first three items are used to uniquely identify the check according to three different accounting systems. We can think of this information as metadata associated with the actual measurements that the check keeps track of. Although these items all have ‘NO.’ (number) in their name, they should really be called numeric identifiers, names consisting of numbers.
The last three items, the date and amounts, are actual measurements which have units of days and dollars, respectively.
This check stub is not unlike many data records in that it contains identifiers (the ‘NO.’ numbers), spatio-temporal locators (the date) and measurements (the amounts). The world of scientific data would be a much friendlier place if data managers understood the distinction between these categories and made this distinction obvious in their datasets.
We have already dealt with Dates and Latitudes and Longitudes in earlier posts describing these spatio-temporal locators. This post will focus on the difference between identifiers and measurements.
For most non-statistical datasets, the easiest way to tell the difference between an identifier and a measurement is to ask yourself whether there are any units involved. If there are no units involved then we are usually talking about an identifier, a name, a handle on an individual piece of information. Unique identifiers are important as they allow us to be sure we are talking about a particular piece of information. You don’t want all your invoices to go out with the same invoice number. In fact, you should make every effort to ensure that no two invoices share this number.
Measurements are different. They make no claim to being unique. You may send out an invoice for $100 as many times as you like. Measurements have units. Measurements can be represented as a distance along an axis. Measurements, unlike identifiers, can be used in mathematical equations.
Strings vs. Floats and Ints
To human readers, numeric identifiers and measurements appear to be expressed the same way: as a series of “numbers”. However, when writing software to process data, it is important to differentiate between numeric strings (arrays of ASCII characters from the set [0-9]) and floating point or integer values. In statically typed languages like Fortran, C or Java, this distinction is enforced by the compiler. In dynamically typed languages like python and R, any lack of a clear distinction between numeric identifiers and measurements can lead to some interesting results.
Here is some python code that adds two integers:
>>> a = 7
>>> b = 7
>>> c = a + b
>>> c
14
Here is similar looking code that “adds” (i.e. concatenates) two “numbers”:
>>> a = '7'
>>> b = '7'
>>> c = a + b
>>> c
'77'
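For comparison, here is a quick sketch of the same experiment in R. R refuses to mix the types implicitly, throwing an error rather than silently concatenating, but the lesson is the same: a string of digits is not a number until you say it is.

a <- '7'
b <- '7'
# a + b                        # Error: non-numeric argument to binary operator
paste0(a, b)                   # explicit concatenation yields the string "77"
as.numeric(a) + as.numeric(b)  # explicit conversion yields the number 14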
Some would say that this argues for the use of typed languages when working with scientific data. We do not share this judgement. Agile programming languages like python and R offer so many advantages with respect to programmer efficiency and concise readability that it would be folly to abandon them. Instead, we advocate a more careful approach to data management that can solve the problem for both typed and untyped languages.
Machine Parsable Data
We will introduce the term machine parsable to refer to data files that adhere to certain basic rules that allow software to go beyond simply reading in the data. When good software encounters machine parsable data it can actually make initial steps toward “understanding” what the data represent.
Identifiers and Measurements in a CSV File
Comma Separated Value (CSV) files are the de facto standard for scientific data interchange involving limited amounts of data. CSV is the format of choice for many data warehouses because it can be read in and spat out by any software that purports to work with data. Unfortunately, there is no standard for exactly how to use the CSV format. The only aspect that everyone appears to agree upon is that fields should be delimited by the Unicode character ‘COMMA’ (U+002C).
Along with the use of commas as delimiters, we recommend only two additional rules to make your CSV files machine parsable.
1. Use standard names for spatio-temporal locators
Spatio-temporal locators have well recognized names and abbreviations. Fields named ‘datetime’, ‘lat’, ‘lon’ and ‘depth’ should be recognized as special columns by any data management software. If CSV files always used standard names like ‘lat’ (or ‘latitude’, ‘Lat’ or ‘Latitude’), software could be written that would always be aware of which columns contained spatio-temporal variables. (A sketch of such software appears after these two rules.)
2. Surround identifiers in quotes
Many CSV files use quotes sporadically, often only when they are needed because a character value has an embedded comma. It would be more consistent if quotes surrounded every identifier or native character variable (e.g. ‘name’) in each record. The only elements not enclosed in quotes would be actual numeric measurements. Software parsing CSV data like this would not need to infer data types from the column names. If a value is surrounded by quotes it is obviously of type character. All other values should be interpretable as numeric.
There are distinct advantages to creating CSV files in this manner. When a file like this is imported into a spreadsheet, for instance, character codes with leading zeros will always retain them. You wouldn’t want Boston zip codes to end up with only 4 digits, would you? (James Bond is “007”, not 7.)
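To make rule 1 concrete, here is a minimal sketch of how software might auto-detect locator columns. The function name find_locators() and the list of recognized names are our own illustration, not an existing API:

# Hypothetical helper: detect spatio-temporal locator columns by their
# standardized names, regardless of capitalization.
find_locators <- function(df) {
  locator_names <- c("datetime", "time", "date",
                     "lat", "latitude", "lon", "longitude", "depth")
  names(df)[tolower(names(df)) %in% locator_names]
}

Any dataset following rule 1 would have its locator columns picked out automatically, with no per-dataset configuration.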
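And here is a sketch of the zip code hazard in R. Note that current CSV readers such as readr do not treat quoting as type information, so for now an explicit col_types declaration is still needed to keep an identifier from being mangled into a number:

library(readr)

# A toy record where the zip code is an identifier with a leading zero.
csv_text <- "zip,amount\n02134,100.00\n"

# Default type guessing silently turns the zip code into the number 2134.
read_csv(I(csv_text))

# Declaring the identifier column as character preserves "02134".
read_csv(I(csv_text),
         col_types = cols(zip = col_character(), amount = col_double()))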
That’s all. Just two simple rules to make scientific data more useful to more people. It doesn’t sound difficult because it isn’t.
Examples
Below, we examine two federal government examples of aggregated datasets that do an excellent job of differentiating between identifiers and measurements. In the first example, explicit variable names allow for the simple correction of automated parsing errors. In the second example, automatic parsing works without intervention.
Example 1) Water Quality data
We applaud the USGS and US EPA for working together to provide a publicly available, single source of water quality data through the Water Quality Portal. It is a huge effort to bring together and harmonize datasets from different agencies. Let’s see how easy they have made it to ingest this data programmatically.
From their data page, we can easily look for data within 5 miles of our location and filter for “Site Type: Stream”. Clicking on the Download button delivers a station.csv file. We will use R’s readr package to automatically parse this data:
df <- read_csv("~/Downloads/station.csv")
...
lapply(df, class) %>% str()
List of 36
 $ OrganizationIdentifier                         : chr "character"
 $ OrganizationFormalName                         : chr "character"
 $ MonitoringLocationIdentifier                   : chr "character"
 $ MonitoringLocationName                         : chr "character"
 $ MonitoringLocationTypeName                     : chr "character"
 $ MonitoringLocationDescriptionText              : chr "logical"
 $ HUCEightDigitCode                              : chr "numeric"
 $ DrainageAreaMeasure/MeasureValue               : chr "numeric"
 $ DrainageAreaMeasure/MeasureUnitCode            : chr "character"
 $ ContributingDrainageAreaMeasure/MeasureValue   : chr "logical"
 $ ContributingDrainageAreaMeasure/MeasureUnitCode: chr "logical"
 $ LatitudeMeasure                                : chr "numeric"
 $ LongitudeMeasure                               : chr "numeric"
 $ SourceMapScaleNumeric                          : chr "logical"
 $ HorizontalAccuracyMeasure/MeasureValue         : chr "numeric"
 $ HorizontalAccuracyMeasure/MeasureUnitCode      : chr "character"
 $ HorizontalCollectionMethodName                 : chr "character"
 $ HorizontalCoordinateReferenceSystemDatumName   : chr "character"
 $ VerticalMeasure/MeasureValue                   : chr "numeric"
 $ VerticalMeasure/MeasureUnitCode                : chr "character"
 $ VerticalAccuracyMeasure/MeasureValue           : chr "numeric"
 $ VerticalAccuracyMeasure/MeasureUnitCode        : chr "character"
 $ VerticalCollectionMethodName                   : chr "character"
 $ VerticalCoordinateReferenceSystemDatumName     : chr "character"
 $ CountryCode                                    : chr "character"
 $ StateCode                                      : chr "numeric"
 $ CountyCode                                     : chr "character"
 $ AquiferName                                    : chr "logical"
 $ FormationTypeText                              : chr "logical"
 $ AquiferTypeName                                : chr "logical"
 $ ConstructionDateText                           : chr "logical"
 $ WellDepthMeasure/MeasureValue                  : chr "logical"
 $ WellDepthMeasure/MeasureUnitCode               : chr "logical"
 $ WellHoleDepthMeasure/MeasureValue              : chr "logical"
 $ WellHoleDepthMeasure/MeasureUnitCode           : chr "logical"
 $ ProviderName                                   : chr "character"
What fantastically named fields! They even self-describe as Identifier, Code, Name, Text, or Value.
Some of the fields did not automatically parse into the correct type but, given the excellent naming, we can use the col_types argument to read_csv() to enforce proper parsing: every field ending in Value (along with LatitudeMeasure and LongitudeMeasure) should be parsed as numeric and everything else should be parsed as character.
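Here is one way to do that, a minimal sketch assuming the station.csv file downloaded above; the regular expression picking out the measurement columns is our own reading of the field names:

library(readr)
library(dplyr)

# Read every column as character first: no guessing, no lost leading zeros.
df <- read_csv("~/Downloads/station.csv",
               col_types = cols(.default = col_character()))

# The self-describing names tell us which columns hold measurements:
# anything ending in "Value", plus the latitude/longitude locators.
df <- df %>%
  mutate(across(matches("Value$|^LatitudeMeasure$|^LongitudeMeasure$"),
                as.numeric))

The payoff of the naming convention is that this fix is a single regular expression rather than a hand-maintained list of 36 column types.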
Example 2) Earthquake data
Here, we will look at a USGS dataset of M1.0+ Earthquakes for the past hour.
df <- readr::read_csv("https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/1.0_hour.csv")
lapply(df, class) %>% str()
List of 22
 $ time           : chr [1:2] "POSIXct" "POSIXt"
 $ latitude       : chr "numeric"
 $ longitude      : chr "numeric"
 $ depth          : chr "numeric"
 $ mag            : chr "numeric"
 $ magType        : chr "character"
 $ nst            : chr "numeric"
 $ gap            : chr "numeric"
 $ dmin           : chr "numeric"
 $ rms            : chr "numeric"
 $ net            : chr "character"
 $ id             : chr "character"
 $ updated        : chr [1:2] "POSIXct" "POSIXt"
 $ place          : chr "character"
 $ type           : chr "character"
 $ horizontalError: chr "numeric"
 $ depthError     : chr "numeric"
 $ magError       : chr "numeric"
 $ magNst         : chr "numeric"
 $ status         : chr "character"
 $ locationSource : chr "character"
 $ magSource      : chr "character"
Things look pretty good. By and large, the columns have human understandable names and, for those that are not obvious, USGS provides a page with the field descriptions.
We have well-named spatio-temporal locators: time, latitude, longitude, depth.
Data columns containing non-numeric characters are automatically parsed as character and describe identifiers of one sort or another: magType, net, id, place, type, status, locationSource, magSource.
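As a quick sketch of the payoff, using the data frame parsed above: measurements support arithmetic, while identifiers support grouping and counting.

# Measurements can go into equations...
mean(df$mag, na.rm = TRUE)

# ...while identifiers are names to group and count by, not things to average.
table(df$net)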
Best hopes for creating machine parsable data!