Data corruption in R 3.0.2 when using read.csv
Introduction
It may be old news to some, but I just recently discovered that the automatic type inference system that R uses when parsing CSV files assumes that data sets will never contain 64-bit integer values.
Specifically, if an integer value read from a CSV file is too large to fit in a 32-bit integer field without overflow, the column of data that contains that value will be automatically converted to floating point. This conversion takes place without any warning, even though it may lead to data corruption.
The reason that the automatic conversion of 64-bit integer-valued data to floating point is problematic is that floating point numbers lack sufficient precision to exactly represent the full range of 64-bit integer values: a 64-bit double can represent integers exactly only up to 2^53. As a consequence, two unequal integer values in the input file may be converted to two equal floating point values in the data.frame R uses to represent that data. Subsequent analysis in R will therefore treat unequal values as if they were equal, corrupting any downstream analysis that assumes that the equality predicate can be trusted.
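The snippet below is a minimal illustration of that precision limit. It is not part of the original failure case, but it shows the same collapse of distinct integers into a single double value once you pass 2^53:

```r
# Doubles have a 53-bit significand, so integers above 2^53 cannot all be
# represented exactly; consecutive values start to collide.
2^53 == 2^53 + 1
# [1] TRUE

# The same collision occurs at the magnitudes used in the failure case below.
1e17 == 1e17 + 1
# [1] TRUE

# By contrast, integers within the exactly-representable range stay distinct.
2^53 - 1 == 2^53
# [1] FALSE
```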
Below, I demonstrate this general problem using two specific data sets. The failure case that I outline occurred for me while using R 3.0.2 on an x86_64-apple-darwin10.8.0 laptop, a “MacBook Pro Retina, 13-inch, Late 2013” model.
Failure Case
Consider the following two tables, one containing 32-bit integer values and the other containing 64-bit integer values:
| ID |
|---|
| 1000 |
| 1001 |

| ID |
|---|
| 100000000000000000 |
| 100000000000000001 |
What happens when they are read into R using the read.csv function?
Integer values that fit in 32 bits are parsed correctly, using R’s integer type, and no data corruption occurs:
```r
data <- "MySQLID\n1000\n1001"
ids <- read.csv(text = data)

ids[1, 1] == ids[2, 1]
# [1] FALSE

class(ids$MySQLID)
# [1] "integer"
```
Integer values that require 64 bits are parsed incorrectly, using R’s numeric type, which does lead to data corruption:
```r
data <- "MySQLID\n100000000000000000\n100000000000000001"
ids <- read.csv(text = data)

ids[1, 1] == ids[2, 1]
# [1] TRUE

class(ids$MySQLID)
# [1] "numeric"
```
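If you need to preserve IDs like these today, one possible workaround (a sketch, using read.csv’s standard colClasses argument and the third-party bit64 package, which is not part of base R) is to read the column as character so no precision is lost, and then convert it to a 64-bit integer type yourself:

```r
data <- "MySQLID\n100000000000000000\n100000000000000001"

# Force the ID column to be read as character; the values stay distinct.
ids <- read.csv(text = data, colClasses = c(MySQLID = "character"))
ids[1, 1] == ids[2, 1]
# [1] FALSE

# Optionally convert to a true 64-bit integer type (requires the bit64 package).
library(bit64)
ids$MySQLID <- as.integer64(ids$MySQLID)
class(ids$MySQLID)
# [1] "integer64"
```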
Conclusions
What should one make of this example? At a minimum, it suggests that R’s default behaviors are not well-suited to a world in which more and more people interact with data derived from commercial web sites, where 64-bit integers are commonplace. I hope that R will change the behavior of read.csv in a future release and deprecate any attempt to treat integer literals as anything other than 64-bit integers.
But I would argue that this example also teaches a much more general lesson: the assertion that scientists can safely ignore the distinction between integer and floating point data types is false. In the example I’ve provided, the very real distinction that modern CPUs make between integer and floating point data leads to very real data corruption. How that corruption affects downstream analyses is situation-dependent, but it is conceivable that the effects are severe in some settings. I hope that we will stop asserting that scientists can use computers to analyze data without understanding the inherent limitations of the tools they are working with.