Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
It’s nice to be back after a pretty crazy two weeks or so.
Let me start off by stating that this blog post is simply me pondering and may not be correct. Feel free to comment on inaccuracies or improvements!
In preparation for an exam and my natural tendencies to be masochistic, I am forcing myself to find the exact complexities of some sorting algorithms and I decided to start with a favorite – mergesort. Mergesort divides an array or linked list first into two halves (or close to it) and then recursively divides the successive lists into halves until it ends up with two lists containing 1 element each – the base case. The elements are then compared and switched so that they are in order, and form their own list.
At successive levels we compare the last element of the first sublist to the first element of the second sublist and merge them together to form another list. This process continues up the recursion tree until the entire original list is sorted. For a more comprehensive and precise description, see this article on mergesort.
Easy: The Worst Case
The worst case is easy as any CS student will tell you. Looking at the recursion trees below, we see that in the worst case, we must perform
Three Cases
Trivially, if the number of elements is a power of 2 (figure a), all of the sublists at each level of the recursion tree will have the same size. This forms a recursion tree that is full and balanced. In this case, we have the worst case complexity because each element is involved in exactly
In situation a, the number of merge operations required for each of the
In situations b and c, some elements require more merge operations than others; however, the number of merges differs by at most 1. The number of merges for each element is approximately
In situation a, we have
Then what? We need to find expressions for the constants
From the table, I extracted the following relationships. I will leave the proof by induction to the reader.
This is not quite exactly
n <- seq(1,2000) an <- bn <- ifelse(ceiling(log(n)/log(2))== floor(log(n)/log(2)),n/2,0) i <- seq(1,floor(log(max(n))/log(2))) bn[2**i + 1] <- n[2**i + 1] - 2 an[2**i + 1] <- 2 an[1] <- 0 bn[1] <- 0 for (i in 2:length(an)){ an[i] <- ifelse(an[i] == 0, 2 + an[i-1], an[i]) bn[i] <- ifelse(bn[i] == 0, bn[i-1] - 1, bn[i]) } T <- an*ceiling(log(n)/log(2)) + bn*floor(log(n)/log(2)) plot(T~n, type= l ,main="Exact Runtime of Mergesort", xlab=expression(italic(n)), ylab=expression(italic(T(n)))) curve(x*log(x)/log(2), add=TRUE, col="red") legend(topleft, c("Exact Solution" ,expression(n*log(n))),lwd=c(1,1), col=c( "black" , "red" )) Nlog2N <- n*log(n)/log(2) my.model <- lm(T~Nlog2N) TSS <- sum((T - mean(T))^2) RSS <- sum(my.model$residuals^2) MSS <- TSS - RSS R2 <- MSS/RSS
Note that there is not a perfect overlap here.
The Precise Regression Issue
Naturally, if the exact running time of merge sort is
Call: lm(formula = T ~ Nlog2N) Residuals: Min 1Q Median 3Q Max -90.15504 -13.84243 0.02602 19.70344 44.44878 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.016e+01 1.191e+00 8.526 <2e-16 *** Nlog2N 1.005e+00 9.825e-05 10224.908 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 28.46 on 1998 degrees of freedom Multiple R-squared: 1, Adjusted R-squared: 1 F-statistic: 1.045e+08 on 1 and 1998 DF, p-value: < 2.2e-16
Hmm. The
So why is this an issue? Most people do simply round up 0.99999809 to 1, but 1 is profound to statisticians: it means that the linear model has a perfect fit. Ok, so maybe it is a precision issue or option within R. But maybe not… look at the p-value! The question to take home is: why is it that the p-value has such precision, while the
Conclusion to the Original Purpose of this Post
So, to make a really long story just long, the total running time is not exactly
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.