Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
By Jinjing Xie – Consultant, Shanghai.
The challenge here is interesting but also very clear: for a given un-ordered vector of numbers, select from it a longest increasing (or decreasing) subsequence. This subsequence may not be unique.
For example, in the following sequence,
62 49 42 54 59 39 35 48 57 60
One obvious longest increasing subsequence is 48, 57, 60, which has length 3;
The (unique) longest decreasing subsequence is 62, 49, 42, 39, 35, which has length 5.
For a very long vector, say having 10000 numbers, is it the case that all increasing and decreasing sequences are very short? This question was actually answered 80 years ago, the longest increasing or decreasing subsequence has ( O(sqrt{n}) ) length.
Interested readers can find the proof in Erdos P. and Szekeres G., 1935 for following theorem: any sequence of ( n^2+1 ) distinct integers has either an increasing or a decreasing subsequence of length ( n+1 ).
That means for a general vector of length ( m ), there is either an increasing or decreasing subsequence of length at least ( sqrt{m-1} ).
Once we know that there is always a longest increasing or decreasing subsequence of at least a certain size, how do we find it? At a glance this may seem difficult. Actually there is a good dynamic programming algorithm using only ( O(nlog(n)) ) time.
R version
The following code has been implemented following Wiki page.
library(compiler) longest_subseq.R = cmpfun(function(x) { P = integer(length(x)) M = integer(length(x) + 1) L = newL = 0 for (i in seq_along(x) - 1) { lo = 1 hi = L while (lo <= hi) { mid = (lo + hi)%/%2 if (x[M[mid + 1] + 1] < x[i + 1]) { lo = mid + 1 } else { hi = mid - 1 } } newL = lo P[i + 1] = M[newL] if (newL > L) { M[newL + 1] = i L = newL } else if (x[i + 1] < x[M[newL + 1] + 1]) { M[newL + 1] = i } } k = M[L + 1] re = integer(L) for (i in L:1) { re[i] = k + 1 k = P[k + 1] } re }) longest_subseq.R(c(5, 1, 3, 2, 4)) ## [1] 2 4 5
Now let’s play with a larger vector.
x = rnorm(1000) plot(x, main = "Longest Increasing (blue) and Decreasing (red) Subsequencesnfor 1000 random numbers", ylab = "Random Value", pch = 16, col = "#0000004D") # Longest Increasing Sequence ind = longest_subseq.R(x) points(ind, x[ind], pch = 16, col = "blue") # Longest Decreasing Sequence rind = longest_subseq.R(-x) points(rind, x[rind], pch = 16, col = "red")
How about 250,000 length vectors?
y = rnorm(250000) system.time(ind1 <- longest_subseq.R(y)) ## user system elapsed ## 3.36 0.01 3.48
A bit slow, right?
But 250,000 is typically not long for a time series.
Rcpp version
library(Rcpp) ## Warning: package 'Rcpp' was built under R version 3.0.3 cppcode <- "n#include <vector>n#include <Rcpp.h>nusing namespace std;ntypedef int index_type;nn// [[Rcpp::export]]nvector<index_type> longest_subseq(SEXP X){n Rcpp::NumericVector x(X);ntvector<index_type> P(x.size(), 0);ntvector<index_type> M(x.size()+1, 0);ntindex_type L(0), newL;ntfor(index_type i=0; i < x.size(); ++i) {nttindex_type lo(1), hi(L), mid;nttwhile( lo <= hi) {ntttmid = (lo + hi) / 2;ntttif (x[M[mid]] < x[i] ) {nttttlo = mid + 1;nttt} else {ntttthi = mid - 1;nttt}ntt}nttnewL = lo;nttP[i] = M[newL - 1];nttif (newL > L) {ntttM[newL] = i;ntttL = newL;ntt} else if (x[i] < x[M[newL]]) {ntttM[newL] = i;ntt}nt}n vector<index_type> re(L);ntindex_type k(M[L]);ntfor(index_type i=L-1; i>=0; --i){nttre[i] = k + 1;nttk = P[k];nt}n return(re);n}" sourceCpp(code = cppcode)
The sourceCpp
will export the cpp function longest_subseq
as R function longest_subseq
, now we can test it.
system.time(ind2 <- longest_subseq(y)) ## user system elapsed ## 0.03 0.00 0.03 all.equal(ind1, ind2) ## [1] TRUE
That’s a big improvement.
Now we can play with some time series data sets.
data(EuStockMarkets) matplot(EuStockMarkets, main = "Longest Increasing SubsequencenEU Stock Market Data (1991–1998)", col = 2:5, pch = "o", cex = 0.3) for (i in 1:4) { ind = longest_subseq(EuStockMarkets[, i]) points(ind, EuStockMarkets[ind, i], type = "l", col = i + 1) } i = 3 ind = longest_subseq(EuStockMarkets[, i]) for (j in ind) lines(c(j, j), c(0, EuStockMarkets[j, i]), col = i + 1, lty = 3) legend("topleft", legend = colnames(EuStockMarkets), col = 2:5, pch = "o")
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.