Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The 2-parameter USL model
The original USL model, presented in my GCAP book and updated in the blog post How to Quantify Scalability, is defined in terms of two fitting parameters $\alpha$ (contention) and $\beta$ (coherency). \begin{equation} X(N) = \frac{N \, X(1)}{1 + \alpha (N – 1) + \beta N (N – 1)} \label{eqn: usl2} \end{equation}Fitting this nonlinear USL equational model to data requires several steps:
- normalizing the throughput data, $X$, to determine relative capacity, $C(N)$.
- equation (\ref{eqn: usl2}) is equivalent to $X(N) = C(N) \, X(1)$.
- if the $X(1)$ measurement is missing or simply not available—as is often the case with data collected from production systems—the GCAP book describes an elaborate technique for interpolating the value.
- providing each term of the USL with a proper physical meaning, i.e., not treat the USL like a conventional multivariate statistical model (statistics is not math)
- satisfying the von Neumann criterion: minimal number of modeling parameters
The 3-parameter USL model
Going back to equation (\ref{eqn: usl2}), let’s just consider the simplest case where scaling is linear-rising, as would be the case for ideal parallelism. In the linear region, where $\alpha = \beta = 0$, equation (\ref{eqn: usl2}) simplifies to \begin{equation} X(N) = N \, X(1) \label{eqn: usl1} \end{equation}In other words, the overall throughput $X(N)$ increases in simple proportion to $N$. The “single-user” throughput, $X(1)$, doesn’t change and therefore acts like a constant of proportionality.
But what happens when we don’t know the value of $X(1)$? That means the $X(1)$ factor in equations (\ref{eqn: usl2}) and (\ref{eqn: usl1}) is undefined. We might denote this situation by writing
\begin{equation} X(N) = N \, ? \label{eqn: uslx} \end{equation}
Of course, that makes no sense, mathematically speaking. As already mentioned, the conventional way out of this situation is to estimate the value of $X(1)$ using mathematical interpolation. But now, the epiphany.
Rather than using the more complicated interpolation procedure, we can simply appeal to statistical regression! Yes, that’s right, we can treat the USL equation purely as a conventional multivariate statistical model. After all, we’re already using nonlinear statistical regression to determine the $\alpha$ and $\beta$ parameters. More importantly, since statistics is not math, we can replace equation ($\ref{eqn: uslx}$) with a statement about correlation, rather than strict equality. In statistical models, that’s accomplished by introducing another parameter (I’ll call it $\gamma$, since that’s the third letter of the Greek alphabet) to replace the question mark in equation ($\ref{eqn: uslx}$), namely
\begin{equation} X(N) = N \, \gamma \label{eqn: uslg} \end{equation}
The new parameter $\gamma$ is just a constant of proportionality that represents the slope of the line associated with ideal parallel scaling. See the plots below.
And here’s a little piece of magic. If we choose $N = 1$ in equation ($\ref{eqn: uslg}$), it becomes $X(1) = \gamma$. So, when the $\gamma$ parameter is determined by statistical regression, it also tells us the estimated value of $X(1)$, whether it was measured or missing. In other words, we don’t need to do any explicit interpolation because the nonlinear regression procedure does it automatically by fitting the third parameter.
Equation (\ref{eqn: usl2}) is now replaced by a 3-parameter version of the USL model: \begin{equation} X(N) = \frac{N \, \gamma}{1 + \alpha (N – 1) + \beta N (N – 1)} \label{eqn: usl3} \end{equation}
Unlike the 2-parameter USL, equation (\ref{eqn: usl3}) can be fitted directly to your measured throughput data without the need to do data normalization or interpolation. The following examples show the results of fitting the 3-parameter USL model.
Load-test data
These are load-test data and the “single-user” throughput was measured as $X(1) = 955.16$ per unit time. The 3-parameter USL fit is summarized in the following plot.The fitted value of $\gamma = 995.65$, which is the estimated value of $X(1)$. It can also be regarded as the slope of the linear-rising throughput indicated by the sloping red line on the left of the plot.
Production data
These data are from a continuously running production system and thus, no $X(1)$ was ever produced.The fitted value of $\gamma = 3.22$ is also equivalent to the estimated value of $X(1)$. Similarly, it can be regarded as the slope of the linear-rising throughput on the left of the plot. Interestingly, in these data, $\alpha = 0$, while $beta$ is non-zero. That suggests there is no significant contention in the workload but there is some data exchange coherency at play.
One word of caution. Fitting the 3-parameter USL can be more sensitive to the actual data, especially with a large number of production data scatter points. I’ll go into all this, and more, in the upcoming Guerrilla training classes.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.