Rounding in R: Common Data Wrangling Frustrations and Workarounds in R, Julia, and Python
If you’ve played with data for long enough, you’re sure to run into some popular dead-ends. There are many of these, from misspelled location names and addresses to placeholder values entering the data pipeline because of a bug. One of the most frustrating things, if not the prime pain point of peril, is rounding numbers.
More often than not, you do not get what you expect, regardless of what language you are working in. The good thing about a team like ours is that we share not only our wins but also the things that make us pull our hair out. This blog post is inspired by one such discussion.
Hey, Does This Look Right To You? The Rounding Issue in Programming
Most things start with an innocent bug, or at least, what looks like one. And that is where this story begins as well.
round(0.5)
If you enter the above in the R Console, unless you know all the context already, you would expect the standard mathematical procedure. The result should be 1, you’d tell yourself, and when you hit Return, the display will flash.
[1] 0
But how? Surely, there is something wrong here. To check, let’s try another example.
round(1.5)
[1] 2
And if you look closely, both of those numbers share a common property. They are even numbers. That is the first thing you learn when you learn about rounding in R, even if you learn about it the hard way: R rounds to Even.
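The same behaviour can be checked outside R. The following is a minimal sketch, assuming Python 3, whose built-in round() also sends ties to the nearest even integer. The variable names are only illustrative.

```python
# Ties (values ending in exactly .5) go to the nearest even integer.
examples = [0.5, 1.5, 2.5, 3.5, -0.5, -1.5]
rounded = [round(x) for x in examples]

for value, result in zip(examples, rounded):
    print(f"round({value}) -> {result}")

# Every result is an even number, whichever side of the tie it lies on.
all_even = all(result % 2 == 0 for result in rounded)
print(all_even)
```

All of these inputs are exactly representable in binary, so the tie rule is the only thing at work here.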
But Why Round to Even in R? Who Gains From This?
Tl;dr: You gain. It is a deterministic way to break ties without biasing your results.
As it turns out, a lot of people do. Round to even is rooted in a standard called IEC 60559 (the international counterpart of IEEE 754). The standard says that a value exactly halfway between two candidates goes to the one whose last digit is even. So, round(0.5) becomes 0 and even round(-1.5) becomes -2. As R’s documentation notes, however, the result also depends on OS services and on representation error, which is where the second problem comes in, but we will get to that. First, we must try to understand the reasoning behind this standard.
For that, let’s go back to a piece of recent history and quote Greg Snow’s famous explanation from 2008.
“The logic behind the round to even rule is that we are trying to represent an underlying continuous value and if x comes from a truly continuous distribution, then the probability that x==2.5 is 0 and the 2.5 was probably already rounded once from any values between 2.45 and 2.54999999999999…, if we use the round up on 0.5 rule that we learned in grade school, then the double rounding means that values between 2.45 and 2.50 will all round to 3 (having been rounded first to 2.5). This will tend to bias estimates upwards. To remove the bias we need to either go back to before the rounding to 2.5 (which is often impossible to impractical), or just round up half the time and round down half the time (or better would be to round proportional to how likely we are to see values below or above 2.5 rounded to 2.5, but that will be close to 50/50 for most underlying distributions). The stochastic approach would be to have the round function randomly choose which way to round, but deterministic types are not comfortable with that, so “round to even” was chosen (round to odd should work about the same) as a consistent rule that rounds up and down about 50/50.
If you are dealing with data where 2.5 is likely to represent an exact value (money for example), then you may do better by multiplying all values by 10 or 100 and working in integers, then converting back only for the final printing. Note that 2.50000001 rounds to 3, so if you keep more digits of accuracy until the final printing, then rounding will go in the expected direction, or you can add 0.000000001 (or other small number) to your values just before rounding, but that can bias your estimates upwards.” (source)
Maybe let’s take a look at a hands-on experiment.
We wanted to show you the effect of different rounding methods in R, Python, and JavaScript, but their built-in rounding functions do not even let you choose the rounding method!
Fortunately, Julia implements several rounding modes, and we can play with them.
The Experiment
Let’s take a vector of a thousand random numbers between 0 and 1. Then let’s round each number in this vector to 1 decimal place in three different ways: round to nearest with ties to even (the default), RoundUp, and RoundDown. Note that Julia’s RoundUp and RoundDown are directed modes: they always round toward positive or negative infinity, like ceiling and floor, so RoundUp is stricter than our school technique of rounding halves up. Finally, we’ll compare which mean is closest to the mean of the original vector.
using Random, Statistics
x = rand(MersenneTwister(0), 1_000)
y1 = round.(x, digits=1)
y2 = round.(x, RoundUp, digits=1)
y3 = round.(x, RoundDown, digits=1)
The Results
So what are the means?
mean(x), mean(y1), mean(y2), mean(y3)
(0.5006018120380458, 0.5012000000000001, 0.5496999999999999, 0.44970000000000004)
We see that the mean of the vector after rounding to nearest (ties to even) is much closer to the mean of the original vector, while always rounding up or down shifts the mean by about 0.05, which is half of the rounding step and roughly 10% of the mean. Rounding to even is a way to deal with rounding ties in a deterministic manner (i.e. without randomness) that has proved to be simple and reliable, even though it might look weird at first.
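The Julia experiment compares directed rounding modes. To isolate the tie-breaking rule itself, here is a minimal Python sketch using the standard decimal module. It assumes, purely for illustration, a data set of every two-decimal value from 0.00 to 9.99, so that exactly one value in ten is a tie. The helper mean() and all variable names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Every two-decimal value from 0.00 to 9.99, stored exactly as decimals.
values = [Decimal(k) / Decimal(100) for k in range(1000)]
step = Decimal("0.1")  # round to one decimal place


def mean(numbers):
    """Arithmetic mean of a list of Decimal values."""
    return sum(numbers) / len(numbers)


mean_original = mean(values)
# Ties such as 0.25 go to the even neighbour (0.2), ties such as 0.35 go to 0.4.
mean_half_even = mean([v.quantize(step, rounding=ROUND_HALF_EVEN) for v in values])
# Ties always go up, as in the school rule.
mean_half_up = mean([v.quantize(step, rounding=ROUND_HALF_UP) for v in values])

print(mean_original, mean_half_even, mean_half_up)
```

On this data, rounding ties to even leaves the mean at 4.995, while rounding ties up pushes it to 5. The shift is small, but it always points in the same direction, which is the bias Greg Snow describes.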
But That’s Not The End Of It!
There’s another reason why R works this way besides the round to even rule. There is another devil at play here, and that’s finite floating-point precision.
Hold on, that’s a lot of words. Okay, let’s take it one at a time. R stores numbers as double-precision floats with 53 binary digits of precision, which corresponds to roughly 15 to 17 significant decimal digits. In other words, anything beyond that is lost and is not accounted for. While this is not a problem for a number like 0.5, which has an exact binary representation, it proves to be a big hassle when a number carries more digits than a double can hold.
This is not a problem specific to R: the limits above come from the double-precision format that most languages share. There is also an infamous R FAQ entry dedicated to it. The following quote is the key point in that answer.
All other numbers are internally rounded to (typically) 53 binary digits accuracy. As a result, two floating point numbers will not reliably be equal unless they have been computed by the same algorithm, and not always even then.
So, overall, unless two numbers are processed in the exact same way, it is impossible to say with good confidence how R will equate them. But, you may be wondering, how does that apply to rounding?
When you write a number with more significant digits than a double can hold, what you see is a representation of it that is not exact, since the extra precision is dropped.
For example:
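> num <- 2.499999999999999999999
> num
[1] 2.5
> round(num)
[1] 2
Here, num is stored as exactly 2.5 because it has far more significant digits than a double can hold, so round() sees a tie and goes to the even 2. However, if we reduce the number of 9s so that the number fits within that precision, the precision is retained.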
It is also connected to the infamous problem (when working in binary):
> 0.1 + 0.2 == 0.3
[1] FALSE
(or when working in decimal):
x = 1/3 ≈ 0.33333, so 3x = 0.99999 ≠ 1
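These limits are not specific to R. The following minimal sketch reproduces both effects in Python, which uses the same double-precision format on common platforms. It relies only on the standard library, and the variable names are illustrative.

```python
import sys
from decimal import Decimal

# Size of a double's significand, in binary digits.
print(sys.float_info.mant_dig)
# Number of decimal digits that are guaranteed to survive a round trip.
print(sys.float_info.dig)

# Far more digits than a double can hold: the value collapses to exactly 2.5.
num = float("2.499999999999999999999")
is_exact_tie = (num == 2.5)
rounded = round(num)  # a tie, so it goes to the even neighbour

# The classic binary representation problem.
sum_matches = (0.1 + 0.2 == 0.3)

# Decimal(float) exposes the exact value that is actually stored for 0.1.
stored_tenth = Decimal(0.1)
print(stored_tenth)
```

Converting a float to Decimal is a handy way to see what the machine actually stores, as opposed to what gets printed.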
floor() and ceiling()
While they are great alternatives, floor() and ceiling() are often not preferred since they round to a whole number at all times. Often, the use-case is to keep some decimal places intact. When we round, we are often looking to reduce precision while keeping a representation of the digits we are letting go of intact. These functions do not preserve that.
Why Not Truncate?
Of course, truncating is an option but if we truncate 1.25 and 1.21 to one decimal place, both would be 1.2, and that would not be a correct representation either. Also, if you look at it, truncate is just round-down for positive numbers and round-up for negative numbers. We’ve seen it’s biased.
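To make the comparison concrete, here is a small Python sketch. The helper truncate() is hypothetical and written only for this illustration. It scales the number, drops the fractional part, and scales back.

```python
import math


def truncate(x, digits):
    """Drop everything after `digits` decimal places (illustrative helper)."""
    factor = 10 ** digits
    return math.trunc(x * factor) / factor


# Both values collapse to the same result, as described above.
print(truncate(1.25, 1), truncate(1.21, 1))

# Truncation moves toward zero: it behaves like floor() for positive
# numbers and like ceil() for negative numbers.
positive_matches_floor = math.trunc(12.5) == math.floor(12.5)
negative_matches_ceil = math.trunc(-12.5) == math.ceil(-12.5)
print(positive_matches_floor, negative_matches_ceil)
```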
Okay, I’ll Just Use Python for My Rounding Needs
This is all a bit too much, isn’t it? But then, life is rarely that simple when you get down to brass tacks. Python is not free of these issues either. Nor is any other language: they come with the IEEE 754 standard.
However, as noted earlier, the outcome is not agnostic to hardware and implementation.
It might be funny, but Python’s built-in rounding procedure works differently from the one in NumPy:
In [1]: import numpy as np

In [2]: np.round(0.15, 1)
Out[2]: 0.2

In [3]: round(0.15, 1)
Out[3]: 0.1
Is this a problem? Usually it’s not, but occasionally it might be. Of course you can find implementation details in the documentation.
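One way to see where the two answers come from is to look at the value that is actually stored for 0.15. The sketch below uses only the standard library. The "scale, round, unscale" line is our assumption about roughly how the NumPy result arises, not a copy of NumPy's implementation.

```python
from decimal import Decimal

# The double closest to 0.15 lies slightly below 0.15.
stored = Decimal(0.15)
below_tie = stored < Decimal("0.15")
print(stored)

# Python's built-in round() works from the stored value, so it goes down.
builtin_result = round(0.15, 1)

# Scaling first: 0.15 * 10 evaluates to exactly 1.5, which is a genuine tie
# and therefore goes to the even neighbour 2.
scaled_result = round(0.15 * 10) / 10

print(builtin_result, scaled_result)
```

Neither answer is wrong. They answer slightly different questions: one rounds the stored number, the other rounds the number after an intermediate multiplication that is itself rounded.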
Well, Let’s Go To JavaScript Then for All Things Math
JavaScript has the Math.round() method to achieve rounding of decimals. It also has the Math.ceil() and Math.floor() methods. Math.round() method rounds to the nearest integer. If the fractional part of the number is greater than or equal to .5, the argument is rounded to the next higher integer. If the fractional part of the number is less than .5, the argument is rounded to the next lower integer.
To round to a specific number of decimal places, the common solution is to multiply the number by 10^x, round the result, and then divide it by 10^x, where x is the number of decimal places to keep.
JavaScript’s Math.round() seems to be closer to the rounding we learned in school: ties are always rounded up, toward positive infinity.
What Should I Do Then? I Need Logical Rounding in R!
First of all, never use floating-point numbers to represent money-like quantities in computer memory. Either use a dedicated decimal type if your language offers one (like Decimal in Python or BigDecimal in Java) or convert the amounts to integers by multiplying by a suitable power of 10, and avoid floats altogether in those cases. With quantities that you usually use floats for, this shouldn’t be an issue. If it is, don’t use floats.
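Both options can be sketched in a few lines of Python. The amounts and variable names below are made up for illustration. The rounding mode ROUND_HALF_UP is our choice for the example, so pick whatever your accounting rules require.

```python
from decimal import Decimal, ROUND_HALF_UP

# As a float, 2.675 is stored slightly below 2.675, so it rounds down.
as_float = round(2.675, 2)

# As a Decimal built from a string, the value is exact and the rounding
# mode is explicit.
price = Decimal("2.675")
as_decimal = price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(as_float, as_decimal)

# Alternative: keep amounts as integer cents and format only for display.
prices_in_cents = [1999, 2675, 5]
total_cents = sum(prices_in_cents)
total = f"{total_cents // 100}.{total_cents % 100:02d}"
print(total)
```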
If you are looking for a function to emulate the true, logical rounding in R, you can go with this alternative we found on StackOverflow.
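true_round <- function(number, digits) {
  # Remember the sign and work with the scaled absolute value
  posneg <- sign(number)
  number <- abs(number) * 10^digits
  # Add one half plus a tiny offset so that ties round away from zero
  number <- number + 0.5 + sqrt(.Machine$double.eps)
  number <- trunc(number)
  number <- number / 10^digits
  number * posneg
}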
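For readers who work in Python, the same idea can be ported almost line by line. This is a rough sketch rather than a vetted implementation. The small offset also means that values lying just below a tie (within about 1.5e-8 after scaling) get rounded up as well.

```python
import math
import sys


def true_round(number, digits=0):
    """Round half away from zero, mirroring the R function above (sketch)."""
    sign = math.copysign(1.0, number)
    # Scale the absolute value, then add one half plus a tiny offset that
    # compensates for representation error just below a tie.
    scaled = abs(number) * 10 ** digits + 0.5 + math.sqrt(sys.float_info.epsilon)
    return sign * math.trunc(scaled) / 10 ** digits


print(true_round(0.5), true_round(2.5), true_round(-1.5))
print(true_round(0.15, 1), true_round(1.005, 2))
```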
Another solution that could be adopted from JavaScript is to multiply the number by 10^x, round it, and divide the result by 10^x, where x is the number of decimal places to keep. This is not perfect and does not always give the desired results, but for some cases it might work.
Note how in the first 2 examples the results are different, but in the last 2 examples the results are consistent.
When the requirement is to check whether decimals are equal up to x decimal digits, we can also simply take the difference and compare it against a threshold, without any rounding at all. So:
If max(abs(y - x)) > threshold, then x and y are not equal. For example:
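A minimal Python sketch of this check follows. The function name approx_equal and the default threshold are arbitrary choices for the illustration, so choose a threshold that matches the precision you actually care about.

```python
def approx_equal(xs, ys, threshold=1e-9):
    """True if no pair of corresponding values differs by more than threshold."""
    return max(abs(y - x) for x, y in zip(xs, ys)) <= threshold


# Exact comparison fails because of representation error...
exact = (0.1 + 0.2 == 0.3)
# ...while the thresholded comparison treats the values as equal.
close = approx_equal([0.1 + 0.2], [0.3])
print(exact, close)

# A difference larger than the threshold is still detected.
different = approx_equal([1.0, 2.0], [1.0, 2.001], threshold=1e-6)
print(different)
```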
Rounding Out R, Julia and Python – Is It Over Yet?
Yes. To conclude: in this article, we dove into the shenanigans of rounding in R and other languages. Overall, things are messier than they appear on the surface. The decisions made to make a language work a certain way are bound to produce bad outcomes for certain use cases. But the good thing about software is that if it doesn’t work for you, there is always a way, or at least some wiggle room, to work around it.
Most languages have something wonky going on within them when it comes to rounding numbers and it is important to keep all this in mind so that when you face a perplexing number the next time during your analysis, you immediately know the usual suspect.
To round it all up (pun intended), stay sharp. It’s not the end of the world yet. It’s just an imprecise number, which may or may not cause it someday.
The post appeared first on appsilon.com/blog/.