[This article was first published on Wiekvoet, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I was reading Paul Hiemsta’s blogpost on Much more efficient bubble sort in R using the Rcpp and inline packages, went back to his first post Bubble sort implemented in pure R and thought, surely we can do it better in pure R. So I cleaned inner loops, removed a recursion made some variations and finally made a big improvement as I vectorized it. I don’t need to try Rcpp to know that is and will be faster, but surely closed the gap a bit.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Original code
This is how it was originally in the blog in pure R. If there are changes it is layout.larger = function(pair) {
if(pair[1] > pair[2]) return(TRUE) else return(FALSE)
}
swap_if_larger = function(pair) {
if(larger(pair)) {
return(rev(pair))
} else {
return(pair)
}
}
swap_pass = function(vec) {
for(i in seq(1, length(vec)-1)) {
vec[i:(i+1)] = swap_if_larger(vec[i:(i+1)])
}
return(vec)
}
bubble_sort = function(vec) {
new_vec = swap_pass(vec)
if(isTRUE(all.equal(vec, new_vec))) {
return(new_vec)
} else {
return(bubble_sort(new_vec))
}
}
Improve inside of loops
This is essentially the same code, but cleaned the first two functions.
larger1 = function(pair) pair[1] > pair[2]
swap_if_larger1 = function(pair) {
if(larger1(pair)) rev(pair)
else pair
}
swap_pass1 = function(vec) {
for(i in seq(1, length(vec)-1)) {
vec[i:(i+1)] = swap_if_larger1(vec[i:(i+1)])
}
return(vec)
}
bubble_sort1 = function(vec) {
new_vec = swap_pass1(vec)
if(isTRUE(all.equal(vec, new_vec))) {
return(new_vec)
} else {
return(bubble_sort1(new_vec))
}
}
Improve outside loops
I cleaned the outside loop, then decided that the two inside functions were not needed any more, because it was reduced to one statement.
swap_pass2 = function(vec) {
for(i in 1:(length(vec)-1)) {
if (vec[i] > vec[i+1] ) vec[c language=”(i,i+1)”][/c] <- vec[c language=”(i+1,i)”][/c]
}
return(vec)
}
bubble_sort2 = function(vec) {
new_vec = swap_pass2(vec)
if(identical(vec, new_vec)) new_vec
else bubble_sort2(new_vec)
}
No recursion
This was actually not intended a speed update, but R was complaining about too deep recursion. The same swapp_pass2 as before is used.
bubble_sort3 = function(vec) {
new_vec = swap_pass2(vec)
while (!identical(vec, new_vec)) {
vec <- new_vec
new_vec = swap_pass2(vec)
}
new_vec
}
A different setup
The way I understood bubblesort is different from what is programmed so far. So far the sort is completed when no improvements are made, which is checked via a vector comparison. I always understood you don’t check, the first bubble goes to the end, at which point the last element is the maximum. The second bubble then stops at end-1 etc. The final bubble is only the first two elements, after which the algorithm is finished.
swap_pass4 = function(vec,iend) {
for(i in 1:iend) {
if (vec[i] > vec[i+1] ) vec[c language=”(i,i+1)”][/c] <- vec[c language=”(i+1,i)”][/c]
}
return(vec)
}
bubble_sort4 = function(vec) {
for (iend in (length(vec)-1):1) vec <- swap_pass4(vec,iend)
vec
}
Tuning Paul’s original algorithm
I was a bit disappointed with the improvements from using the true bubble sort. Apparently there is some gain by checking if the process is completed. Which is logical since the bubbles do some sorting of the intermediate elements. So, what about bubbles that go up and go down combined with a check on no improvements?
swap_pass3b = function(vec) {
for(i in length(vec):2) {
if (vec[i] < vec[i-1] ) vec[c language=”(i,i-1)”][/c] <- vec[c language=”(i-1,i)”][/c]
}
return(vec)
}
bubble_sort3b = function(vec) {
new_vec = swap_pass2(vec)
while (!identical(vec, new_vec)) {
new_vec = swap_pass2(vec)
vec <- swap_pass3b(new_vec)
}
vec
}
Vectorizing the bubbles
When you look at the bubblesort without checking, it seems it is only assumed that the first step brings the highest element at the last position. This can be achieved by just pulling the highest element and placing it at the end, without the intermediate swaps.
bubble_sort5 = function(vec) {
wm <- which.max(vec)
vec <- c(vec[-wm],vec[wm])
for (iend in ((length(vec)-1):2)) {
wm <- which.max(vec[1:iend])
vec <- c(vec[1:iend][-wm],vec[1:iend][wm],vec[(iend+1):length(vec)])
}
vec
}
Results
First a demonstration that they work;
test_vec = round(runif(10, 0, 100))
cbind(
bubble_sort(test_vec),
bubble_sort1(test_vec),
bubble_sort2(test_vec),
bubble_sort3(test_vec),
bubble_sort4(test_vec),
bubble_sort3b(test_vec),
bubble_sort5(test_vec)
)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 28 28 28 28 28 28 28
[2,] 41 41 41 41 41 41 41
[3,] 42 42 42 42 42 42 42
[4,] 46 46 46 46 46 46 46
[5,] 52 52 52 52 52 52 52
[6,] 68 68 68 68 68 68 68
[7,] 89 89 89 89 89 89 89
[8,] 93 93 93 93 93 93 93
[9,] 98 98 98 98 98 98 98
[10,] 100 100 100 100 100 100 100
Speed
Now the timing. Cleaning the outer loops saves a third of the time. The bubble as I remembered it and the up-and-down bubble improve a bit. We are now at 13% of the original time. The final version, which uses the vectorization, blows them all out of the water.
test_vec = runif(1000, 0, 100)
system.time(bubble_sort(test_vec))
# user system elapsed
# 25.48 0.02 25.69
system.time(bubble_sort1(test_vec))
# user system elapsed
# 17.88 0.00 17.91
system.time(bubble_sort2(test_vec))
# user system elapsed
# 4.75 0.00 4.75
system.time(bubble_sort3(test_vec))
# user system elapsed
# 4.75 0.00 4.74
system.time(bubble_sort4(test_vec))
# user system elapsed
# 3.45 0.00 3.44
system.time(bubble_sort3b(test_vec))
# user system elapsed
# 3.57 0.00 3.59
system.time(bubble_sort5(test_vec))
# user system elapsed
# 0.06 0.00 0.06
How about sort()?
Sort is much faster and scales much better, even more so after subtracting the time for the sample() call.. But I don’t need to do the exercise to know that a not optimal algorithm in an interpreted language cannot compete with modern algorithms in a compiled language.
library(microbenchmark)
vector_size = 100
print(microbenchmark(bubble_sort5(sample(vector_size)),
sort(sample(vector_size)),
sample(vector_size)) )
Unit: microseconds
expr min lq median uq
bubble_sort5(sample(vector_size)) 1730.041 1776.957 1806.2795 1866.758
sort(sample(vector_size)) 104.096 114.359 126.4540 180.334
sample(vector_size) 5.865 8.797 9.8965 11.729
max neval
9978.521 100
233.849 100
21.992 100
To leave a comment for the author, please follow the link and comment on their blog: Wiekvoet.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.