Exact Complexity of Mergesort, and an R Regression Oddity
It’s nice to be back after a pretty crazy two weeks or so.
Let me start off by stating that this blog post is simply me pondering and may not be correct. Feel free to comment on inaccuracies or improvements!
In preparation for an exam, and owing to my naturally masochistic tendencies, I am forcing myself to work out the exact complexities of some sorting algorithms, and I decided to start with a favorite – mergesort. Mergesort first divides an array or linked list into two halves (or as close to halves as possible) and then recursively divides the successive sublists until it ends up with sublists containing one element each – the base case. Pairs of these single elements are then compared and placed in order, forming sorted two-element sublists.
At successive levels we merge pairs of sorted sublists by repeatedly comparing their leading elements, producing a longer sorted list each time. This process continues up the recursion tree until the entire original list is sorted. For a more comprehensive and precise description, see this article on mergesort.
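To make the description concrete, here is a minimal mergesort sketch in R (written for clarity, not speed; the function names merge_two and mergesort are mine for illustration, not from any package):

# Merge two already-sorted vectors by repeatedly comparing their leading elements.
merge_two <- function(left, right) {
  result <- numeric(length(left) + length(right))
  i <- j <- 1
  for (k in seq_along(result)) {
    if (j > length(right) || (i <= length(left) && left[i] <= right[j])) {
      result[k] <- left[i]; i <- i + 1
    } else {
      result[k] <- right[j]; j <- j + 1
    }
  }
  result
}

# Split in half, recursively sort each half, then merge the two sorted halves.
mergesort <- function(x) {
  if (length(x) <= 1) return(x)   # base case: a single element is already sorted
  mid <- floor(length(x) / 2)
  merge_two(mergesort(x[1:mid]), mergesort(x[(mid + 1):length(x)]))
}

mergesort(c(5, 2, 9, 1, 7))   # 1 2 5 7 9

On a vector of length $n$, the recursion depth of this sketch is $\lceil \log_2 n \rceil$, which is the tree height used in the worst-case argument below.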
Easy: The Worst Case
The worst case is easy, as any CS student will tell you. Looking at the recursion trees below, we see that in the worst case we must perform $\lceil \log_2 n \rceil$ “layers” of “parallel” merges, corresponding to the height of the recursion tree, with each layer performing $O(n)$ comparisons. Then, the worst case complexity of merge sort is $O(n \log n)$.
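For readers who prefer a recurrence, the same bound falls out of the textbook formulation for the worst-case number of element comparisons (a standard result, stated here for reference rather than derived in this post). Merging two sorted lists of total length $n$ takes at most $n - 1$ comparisons, so

$$W(1) = 0, \qquad W(n) = W(\lceil n/2 \rceil) + W(\lfloor n/2 \rfloor) + (n - 1) \quad \text{for } n > 1,$$

and unrolling the $\lceil \log_2 n \rceil$ levels, each contributing at most $n - 1$ comparisons, gives $W(n) = O(n \log n)$.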
Three Cases
Trivially, if the number of elements $n$ is a power of 2 (figure a), all of the sublists at each level of the recursion tree have the same size. This forms a recursion tree that is full and balanced. In this case we get the worst case complexity exactly, because each element is involved in exactly $\log_2 n$ merge operations, each of which costs a constant amount of work per element. In situation b, the number of elements is 1 smaller than a power of 2 (e.g. 3, 7, 15). In situation c, the number of elements is 1 greater than a power of 2 (e.g. 3, 9, 17).
In situation a, the number of merge operations required for each of the $n$ elements is $\log_2 n$, which yields $n \log_2 n$ operations.
In situations b and c, some elements require more merge operations than others; however, the number of merges differs by at most 1. The number of merges for each element is approximately $\log_2 n$ and is exactly either $\lceil \log_2 n \rceil$ or $\lfloor \log_2 n \rfloor$. The total amount of work performed is therefore equal to

(number of elements requiring $\lceil \log_2 n \rceil$ merges) $\cdot \lceil \log_2 n \rceil$ + (number of elements requiring $\lfloor \log_2 n \rfloor$ merges) $\cdot \lfloor \log_2 n \rfloor$.

For example, for $n = 3$ the pair of elements merged first requires $\lceil \log_2 3 \rceil = 2$ merges each, while the lone remaining element requires $\lfloor \log_2 3 \rfloor = 1$, for a total of $2 \cdot 2 + 1 \cdot 1 = 5$.
In situation a, we have $\lceil \log_2 n \rceil = \lfloor \log_2 n \rfloor = \log_2 n$, so the total reduces to $n \log_2 n$, as before.
Then what? We need to find expressions for those two counts. I will call them $a_n$ and $b_n$, so that the total work is $T(n) = a_n \lceil \log_2 n \rceil + b_n \lfloor \log_2 n \rfloor$, with $a_n + b_n = n$ for $n > 1$. I drew out several recursion trees and got the following table of values for small $n$:

  n    1  2  3  4  5  6  7  8  9
 a_n   0  1  2  2  2  4  6  4  2
 b_n   0  1  1  2  3  2  1  4  7
From the table, I extracted the following relationships: if $n$ is a power of 2, then $a_n = b_n = n/2$; if $n$ is one greater than a power of 2, then $a_n = 2$ and $b_n = n - 2$; otherwise, $a_n = a_{n-1} + 2$ and $b_n = b_{n-1} - 1$ (with $a_1 = b_1 = 0$). I will leave the proof by induction to the reader.
This is not quite the exact $n \log_2 n$ that I have heard some people claim, but it is pretty darn close. Using the code below, I simulated the values of $a_n$ and $b_n$ for $n$ up to 2000 and plotted the corresponding $T(n)$.
n <- seq(1, 2000)

# a_n: number of elements needing ceiling(log2 n) merges
# b_n: number of elements needing floor(log2 n) merges
# When n is a power of 2, a_n = b_n = n/2
an <- bn <- ifelse(ceiling(log(n)/log(2)) == floor(log(n)/log(2)), n/2, 0)

# When n is one greater than a power of 2: a_n = 2, b_n = n - 2
i <- seq(1, floor(log(max(n))/log(2)))
bn[2**i + 1] <- n[2**i + 1] - 2
an[2**i + 1] <- 2
an[1] <- 0
bn[1] <- 0

# Otherwise: a_n = a_{n-1} + 2, b_n = b_{n-1} - 1
for (i in 2:length(an)) {
  an[i] <- ifelse(an[i] == 0, 2 + an[i-1], an[i])
  bn[i] <- ifelse(bn[i] == 0, bn[i-1] - 1, bn[i])
}

# Exact total work: T(n) = a_n * ceiling(log2 n) + b_n * floor(log2 n)
T <- an*ceiling(log(n)/log(2)) + bn*floor(log(n)/log(2))

plot(T ~ n, type = "l", main = "Exact Runtime of Mergesort",
     xlab = expression(italic(n)), ylab = expression(italic(T(n))))
curve(x*log(x)/log(2), add = TRUE, col = "red")
legend("topleft", c("Exact Solution", expression(n*log(n))),
       lwd = c(1, 1), col = c("black", "red"))

# Regress the exact running time on n*log2(n)
Nlog2N <- n*log(n)/log(2)
my.model <- lm(T ~ Nlog2N)
TSS <- sum((T - mean(T))^2)       # total sum of squares
RSS <- sum(my.model$residuals^2)  # residual sum of squares
MSS <- TSS - RSS                  # model sum of squares
R2 <- MSS/TSS                     # coefficient of determination
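As a quick sanity check, printing the first few simulated counts (reusing the n, an, and bn vectors just defined) should reproduce the hand-drawn table above:

head(cbind(n, an, bn), 9)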
Note that the two curves in the plot do not overlap perfectly.
The Precise Regression Issue
Naturally, if the exact running time of merge sort were exactly $n \log_2 n$, then I should get a regression model that fits the data perfectly…
Call:
lm(formula = T ~ Nlog2N)

Residuals:
      Min        1Q    Median        3Q       Max 
-90.15504 -13.84243   0.02602  19.70344  44.44878 

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)    
(Intercept) 1.016e+01  1.191e+00     8.526   <2e-16 ***
Nlog2N      1.005e+00  9.825e-05 10224.908   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.46 on 1998 degrees of freedom
Multiple R-squared: 1,     Adjusted R-squared: 1
F-statistic: 1.045e+08 on 1 and 1998 DF,  p-value: < 2.2e-16
Hmm. The $R^2$ value is exactly 1. This confused me for a while, until I realized that this was a precision issue in R, and not a regression issue. Note that if this were a perfect fit, the Residuals section would be all zeros. Also, the base of the logarithm ($e$ or 2) does not matter; the results are exactly the same! So why is $R^2 = 1$?
So why is this an issue? Most people would simply round 0.99999809 up to 1, but 1 is profound to statisticians: it means that the linear model fits the data perfectly. OK, so maybe it is a precision issue or a display option within R. But maybe not… look at the p-value! The question to take home is: why does the p-value get reported with such precision, while the $R^2$ value does not? Personally, if we had to round, I would round the p-value to 0 and keep the precision of $R^2$. Thoughts?
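One way to convince yourself that the printed 1 is a display artifact rather than the stored value (using the my.model object fitted in the code above) is to extract the R-squared from the summary and print it with more digits:

# summary() stores the full-precision value; only the printed table rounds it.
s <- summary(my.model)
s$r.squared                       # shown with the session's default digits
print(s$r.squared, digits = 15)   # the same number with more digits
1 - s$r.squared                   # how far the fit is from perfect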
Conclusion to the Original Purpose of this Post
So, to make a really long story just long: the total running time of mergesort is not exactly $n \log_2 n$; rather, it is $T(n) = a_n \lceil \log_2 n \rceil + b_n \lfloor \log_2 n \rfloor$.
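As a final numerical check (reusing the n, T, and Nlog2N vectors from the simulation code above), we can look at the gap directly:

# Gap between the exact merge count and n*log2(n); by construction it is
# zero at powers of 2 (up to floating-point noise) and positive elsewhere.
gap <- T - Nlog2N
range(gap)
plot(gap ~ n, type = "l", xlab = expression(italic(n)),
     ylab = expression(T(n) - n*log[2](n)))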