Recently, several performance engineers who have been applying my universal scalability law (USL) to their throughput measurements reported a problem whereby their Excel spreadsheet calculations produced a negative value for the coherency parameter (β < 0) on what otherwise appears to be an application that scales extremely well. (You can download the Excel spreadsheet sscalc.xls from the Guerrilla class materials.) Negative USL parameters are verboten.
Naturally, I’ve seen this kind of negative-numbers no-no before, but usually there has been some other problem with the measurements, and once that was corrected everything worked as advertised. Lately, however, something else seemed to be going on, and that caused me to investigate more closely. I discovered that there is a genuine limitation when applying the USL in Excel, and that’s what I’m going to explain here. It’s a bit subtle, so fasten your seat-belt.
The following set of plots shows a generic situation. The top row shows how things go in Excel, while the bottom row shows some idealized curves that I’ll explain in a minute.
Excel Curves
Referring to the three plots in the top row: the series of dots in Figure A is the throughput X(N) measured at each user load (N on the x-axis). Technically, these data have been normalized to produce the relative capacity C(N) = X(N)/X(1). This is the quantity that we eventually model with the USL function:
C(N) = N/{1 + α (N − 1) + β N(N − 1)}
where α represents the degree of contention, e.g., queueing, in the system and β is the coherency delay incurred, e.g., when keeping multiple data copies consistent across caches.
Most importantly, because these parameters (α, β) have a physical meaning, i.e., they’re not just some curve-fitting parameters, they must ALWAYS be positive-valued (no ifs, ands or buts). It’s obedience to this physical constraint that allows us to know when the performance measurement process is broken (which is more often than you might dare to imagine).
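To make the normalization and the model concrete, here is a minimal R sketch. The data frame input, with columns n and X, is just an illustrative stand-in (the throughput numbers are made up), but the Norm column is the same quantity referenced by the nls calls further down.

# Illustrative (made-up) measurements: n = user load, X = measured throughput X(N)
input <- data.frame(n = c(1, 4, 8, 16, 32, 64),
                    X = c(100, 388, 747, 1385, 2406, 3741))

# Relative capacity C(N) = X(N)/X(1)
input$Norm <- input$X / input$X[input$n == 1]

# USL model for the relative capacity
usl.model <- function(N, alpha, beta) {
  N / (1 + alpha * (N - 1) + beta * N * (N - 1))
}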
The other thing to notice in Figure A is that the data points are very close to linear: they can be statistically fitted to a straight line (the solid line). Prima facie, that looks very good, but such linear data should not be confused with ideal linear scalability in the sense of maximal bang for the buck. Ideal linearity is represented by the dotted line, which has a steeper slope than the solid line.
Figure B shows the corresponding efficiency C(N)/N or the amount of capacity per user. This is the 4th column in the spreadsheet (depending on how you have things set up). As expected, it decreases gracefully in the direction of the x-axis. You can’t have more than 100% efficiency and, as a general rule, efficiency degrades with N because the degree of overhead increases with more users. In this case, the degradation is very small due to the linearity in Figure A.
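Continuing the sketch above, the efficiency column is just:

# Efficiency: capacity per user, C(N)/N (can never exceed 1, i.e., 100%)
input$Eff <- input$Norm / input$n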
Now comes the fun part. In Excel, we must do some handstands to accommodate the USL model. The drawback with Excel is that it still doesn’t know how to fit a Trend Line to a rational function like the USL model. It’s not a dialog-box option. Sigh! Therefore, we have to tip the USL upside-down, fit the denominator as a 2nd-degree polynomial (quadratic equation) with coefficients (a, b, c=0), and then transform those coefficients back to the USL parameters using the definitions on pp. 80 and 104 of my GCaP book, viz., α = b − a and β = a.
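For comparison, the same inversion procedure can be sketched in R (continuing the illustrative input data frame from above; this merely mimics the Excel steps and is not the recommended way to fit the USL). The quadratic is fitted in the variable N − 1 and forced through the origin, which is what the α = b − a, β = a transformation assumes.

# Deviation from linearity: L(N) = N/C(N) - 1
input$Lin <- input$n / input$Norm - 1

# Fit L as a quadratic in (N - 1) through the origin: L = a*(N-1)^2 + b*(N-1)
xfit <- lm(Lin ~ 0 + I((n - 1)^2) + I(n - 1), data = input)
a <- coef(xfit)[1]
b <- coef(xfit)[2]

# Transform back to the USL parameters (GCaP, pp. 80 and 104)
alpha.xls <- b - a
beta.xls  <- a    # a negative value of a is exactly the problem described next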
The result of this inversion procedure in Excel appears in Figure C. It can be interpreted physically as showing the deviation from linearity. Here, I’ve simply labeled it L(N) but it’s the same thing as (N/C) − 1 in the Excel spreadsheet. And this is where the trouble begins.
Notice that the data points tend to bend ever so slightly toward the x-axis as they increase. The denominator of C(N) in the USL model is quadratic in N. A general quadratic function can be written aN² + bN + c and has the shape of a parabola. We constrain our L(N) parabola to go through the origin by setting c = 0 in the Excel dialog box, which leaves only a and b to be calculated by Excel. But when we ask Excel to fit these points to that quadratic, it decides the parabola is upside-down, like a MacDonald’s arch, and produces the dashed line. Whether a parabola looks like a cup (as demanded by the USL) or a cap (a MacDonald’s arch) is controlled by the sign of the a coefficient. A MacDonald’s arch has a < 0 (negative), and that in turn causes β in the USL to incorrectly come out negative!
The problem lies with Excel, not the USL. There is no way to tell Excel that we want only positive a and b coefficients. At least, I don’t know how to do that in Excel, and I don’t plan on wasting time writing a macro either, because I’m already worried about numerical precision issues in Excel (as explained in the Preface to the book). Better to go with a tool like R or Mathematica in this case. So, why is Excel running into trouble?
Idealized Curves
Referring now to the three plots in the bottom row, Figure D shows the near-linear data points represented this time by a continuous red line (it’s the same as the solid blue line in Figure A). The idea is to see what happens to that continuous line as I push it through the same transformations that are applied to the data in Excel.
In Figure E the resulting theoretical efficiency curve looks very similar to Figure B. In fact, it looks like it could be a regression fit to those data. So, that’s cool. It seems to be consistent. The drama occurs in the next plot.
Figure F is the result of applying the Excel transformation to determine the deviation from linearity. Very clearly it is a cap-shaped curve (concave), rather than a cup-shaped curve (as required by the USL). Put another way, the red curve in Figure F looks more like a throughput curve than the inverse of one! But that’s crazy. How can an inverted throughput curve still look like a throughput curve?
- If you start with linear data like the points in Figure A, those same data will tend to lie on the red curve shown in Figure F once you apply the inversion transformation used in Excel. That’s exactly what you see happening in Figure C. (A numerical sketch of this appears after the list.)
- The red curve in Figure F is not a parabola. I won’t bother explaining what function it is because it’s not salient to this discussion.
- Excel doesn’t know any of this.
- When applying the Trend Line fit in Excel, you tell Excel that you want it to fit the data points in Figure C to a parabola. That’s the logical equivalent of telling Excel to fit to the USL model, but distanced by several levels of indirection. You do it by choosing “polynomial of degree 2” in the Excel dialog box.
- When Excel charges off to fulfill your request by doing a linear least-squares fit to a parabola, it senses that the data points in Figure C are “curved” downward and comes back with a chunk of a “MacDonald’s arch”: the dashed line in Figure C.
- Since Excel knows nothing of the USL, it thinks it has done a good job, e.g., R² ≈ 0.99.
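To make the first point in that list concrete, here is a small numerical sketch using perfectly linear synthetic data (the sub-ideal slope of 0.9 is an arbitrary choice): the inverted curve comes out concave, and an unconstrained quadratic fit through the origin duly returns a negative a coefficient, just as Excel does.

# Perfectly linear (synthetic) relative capacity with a sub-ideal slope of 0.9
Nlin <- c(1, 2, 4, 8, 16, 32, 64)
Clin <- 1 + 0.9 * (Nlin - 1)

# Excel-style inversion: deviation from linearity
Llin <- Nlin / Clin - 1    # concave (cap-shaped) and saturating; not a parabola

# Unconstrained quadratic fit through the origin, as Excel's Trend Line does
qfit <- lm(Llin ~ 0 + I((Nlin - 1)^2) + I(Nlin - 1))
coef(qfit)    # the squared-term coefficient (a) comes out negative, hence beta < 0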
What To Do?
Discretion being the better part of valor, I would not do any additional handstands on behalf of Excel. Rather, I suggest that you avoid all such problems with the signs of a and b by avoiding a and b altogether: fit the USL parameters directly to the original data. You can do this in R, for example, using its nls function:
# Do a standard NLS fit ...
usl <- nls(Norm ~ n/(1 + alpha * (n-1) + beta * n*(n-1)), input, start=c(alpha=0.1, beta=0.01))
Simultaneously, you can constrain the values of α and β to be positive. Here's how you do that in R:
# Set positivity constraints on USL parameters ...
usl <- nls(Norm ~ n/(1 + alpha * (n-1) + beta * n*(n-1)), input, start=c(alpha=0.1, beta=0.01), algorithm="port", lower=c(0,0))
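Continuing the illustrative input data frame from earlier, you can then inspect and use the constrained fit in the usual way:

# Inspect the constrained fit
summary(usl)               # parameter estimates and standard errors
coef(usl)                  # fitted alpha and beta, both guaranteed non-negative

# Predicted relative capacity at the measured loads
input$Cfit <- predict(usl)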
You can learn more about applying R to quantitative scalability analysis and other performance management problems in our upcoming GDAT class.
As a final sanity check, scalability measurements that rise linearly (like Figure A) suggest that the overheads are small. In particular, you can expect relatively little sharing of writable data, and therefore the coherency penalty (the β parameter) should be vanishingly small. That doesn't mean your system will scale linearly forever; it means the maximum in the USL scalability curve is still there, but it lies off the scale to the right, somewhere beyond the physically allowed system size.
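If you want to put a number on "off the scale to the right", the load at which the USL curve peaks follows directly from the fitted parameters as N* = sqrt((1 − α)/β), the standard GCaP result. Continuing the illustrative fit above:

# Load at which the USL curve reaches its maximum, Nmax = sqrt((1 - alpha)/beta);
# if beta is fitted as exactly zero, this is Inf, i.e., no maximum at any finite N
alpha.hat <- coef(usl)["alpha"]
beta.hat  <- coef(usl)["beta"]
Nmax <- floor(sqrt((1 - alpha.hat) / beta.hat))
Nmax

With near-linear data and a vanishingly small β, that peak lands far beyond any system size you could actually build, which is the point of the sanity check above.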