[This article was first published on Taking the Pith Out of Performance, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The other day, I stumbled upon the signif function in R, so I thought I’d take a look at what it does and compare it with some results discussed in Chap. 3 “Damaging Digits in Capacity Calculations” of my GCaP book, viz., Example 3.5 on page 31. The measured numbers in that example are reproduced here in Table 1 using read.table in R.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Number Pdd SD 1 50 0 1 2 0.00341 5 3 3 1.0040 4 5 4 50.0005 4 6 5 6.751 3 4 6 0.157 3 3 7 28.0 1 3 8 40300 0 3 9 0.070 3 2 10 30.07 2 4 11 65000 0 2 12 0.0067 4 2 13 6.023*10^23 3 4
The last column, labeled SD, shows the number of significant digits as determined by applying the following recipe (Algorithm 3.1) from Chap. 3:
Let’s check this recipe against the number 314000.000314000000 with all those trailing zeros kept intentionally. Since it has a decimal point, the first non-zero digit (scanning from the left) is the ‘3’ that starts the number. Counting it and all the digits to its right, including all zeros, we get SD = 18 significant digits. If the number was just the integer 314000 with no decimal fraction then, according to the second part of Algorithm 3.1, we would get SD = 3, i.e., the ‘4’ is in position 3 and we ignore the three zeros between it and the decimal point.
Scan the number from left to right.
Is there an explicit decimal point?
Yes:No:
- Scan and locate the first nonzero digit.
- Count it and all digits to its right. (including zeros)
- Append a decimal point.
- Count the position of the last non-zero digit prior to the decimal point.
- All zeros in between should be ignored.
Applying Algorithm 3.1 to the numbers in Table 1, we get the corresponding values in the SD column. Since that’s a pain to do manually, let’s see what R produces using signif() using the following R code:
sigdigs<-function(n) { d<-0 while(signif(n,digits=d) != n) { d<-d+1 next } return(d) }The results are shown in the column labeled Rsn of Table 2.
Idx Number Pdd SD Rsn 1 50 0 1 0 2 0.00341 5 3 3 3 1.0040 4 5 4 4 50.0005 4 6 6 5 6.751 3 4 4 6 0.157 3 3 3 7 28.0 1 3 2 8 40300 0 3 3 9 0.070 3 2 0 10 30.07 2 4 4 11 65000 0 2 2 12 0.0067 4 2 2 13 602299999999999975882752 0 4 4
With this as background, let's examine the number in row 13 of Table 1, 6.023 × 1023, which is Avogadro's number (known in Germany as Loschmidt constant). It's roughly the number of water molecules in an ice cube. But that number is only a crude approximation, which we learn in chemistry as a simple mnemonic (the two 23s). The current best measured value is written more accurately as:
It's an inconceivably large number, although by no means the largest number known in physics or mathematics. To get some idea of its size, it's been said that Intel manufactured on the order of 1018 transistors last year. Even at that impressive rate, however, it would take another million years to reach Avogadro's number of transistors! As written above, you will immediately notice the strange use of parentheses. You never see digits enclosed in parens (in this way) in mathematical numbers, e.g., 3.1415926 representing π as a decimal fraction. That's because those parens indicate measurement uncertainty or lack of resolution in the 9th and 10th decimal places. It's analogous to trying to resolve more and more detail in a microscope image. Even at the highest magnification, the smallest details will remain unclear. No such thing occurs in the decimal representation of π. If you want to know the 9th and 10th decimal digits, you simply calculate them; you don't measure them. The number in row 13 of Table 1 is encoded in scientific notation, which is shorthand for:6.02214179(30) × 1023
What happens to the significant digits when Avogadro's number is expressed as an integer? Despite its resemblance to a decimal fraction, it must be an integer since it counts things like molecules. Writing it out in full and ignoring the imprecise digits (for simplicity), it looks like this:6.02214179(30) × 1.00 000 000 000 000 000 000 000
According to Algorithm 3.1, all the zeros beyond the '9' do not count as significant digits. As shown in Figure 1, they are merely scale zeros, so all is well and SD = 9 in this approximation. But wait! Look at row 13 in Table 2. There, I forced R to express Avogadro's number as an integer and it produced:602 214 179 000 000 000 000 000
which is not the same integer we've just been discussing. Where did all those non-zero digits come from? And according to Algorithm 3.1, if they're not zeros they must be significant, in which case SD = 24. Once again, the problem resides in the distinction between numerical precision and measurement precision. The numbers in Row 13 of Tables 1 and 2 are correct to 4 significant digits. In Table 2, however, R didn't know what to put in the remaining 20 places of Avogadro's integer (it can't compute it, like π), so it essentially inserts random junk that rounds up correctly to602 299 999 999 999 975 882 752
in the chemistry-mnemonic approximation, i.e., SD = 4. Obviously, you can't get extra precision out of the vacuum, even though R makes it seem like you can.602 300 000 000 000 000 000 000
It's not that R is wrong, it's that you might draw the wrong conclusion from R's output.This is a good reminder that you have to remain vigilant when using sophisticated tools like R.
We see now that it's the numerical truncation of trailing zeros that accounts for all the discrepancies between the SD and Rsn in Table 2. To avoid having your measurement zeros truncated or rounded away, those numbers need to be represented as literal strings of characters. In R, this can be accomplished using format(number, nsmall) with nsmall set to the number of digits to the right of the decimal point that you want preserved. That's the role of the Pdd (post-decimal digits) in Tables 1 and 2. Here's the R function I wrote to generate Table 2:
sigdigss<-function(n) { i <- 0 # Check for decimal point is present if(length(grep("\.", numstr[n])) > 0) { # real number # Separate integer and fractional parts intfrac <- unlist(strsplit(numstr[n], "\.")) digstring <- paste(intfrac[1], intfrac[2], sep = "") numfigs <- nchar(digstring) while(i < numfigs) { # Find index of 1st non-zero digit from LEFT if(substr(digstring,i+1,i+1) == "0") { i <- i + 1 next } else { sigfigs <- numfigs - i break } } } else { # must be an integer digstring <- numstr[n] numfigs <- nchar(digstring) while(i < numfigs) { # Find index of 1st non-zero digit from RIGHT if(substr(digstring, numfigs-i, numfigs-i) == "0") { i <- i + 1 next } else { sigfigs <- numfigs - i break } } } return(sigfigs) }You can also compare this R code with the corresponding Mathematica and Perl codes.
To leave a comment for the author, please follow the link and comment on their blog: Taking the Pith Out of Performance.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.