Compute longest increasing/decreasing subsequence using Rcpp

Mango Solutions Shop

8 years ago

[This article was first published on Mango Solutions Shop, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

By Jinjing Xie – Consultant, Shanghai.

The challenge here is interesting but also very clear: for a given un-ordered vector of numbers, select from it a longest increasing (or decreasing) subsequence. This subsequence may not be unique.

For example, in the following sequence,

62 49 42 54 59 39 35 48 57 60

One obvious longest increasing subsequence is 48, 57, 60, which has length 3;

The (unique) longest decreasing subsequence is 62, 49, 42, 39, 35, which has length 5.

For a very long vector, say having 10000 numbers, is it the case that all increasing and decreasing sequences are very short? This question was actually answered 80 years ago, the longest increasing or decreasing subsequence has ( O(sqrt{n}) ) length.

Interested readers can find the proof in Erdos P. and Szekeres G., 1935 for following theorem: any sequence of ( n^2+1 ) distinct integers has either an increasing or a decreasing subsequence of length ( n+1 ).

That means for a general vector of length ( m ), there is either an increasing or decreasing subsequence of length at least ( sqrt{m-1} ).

Once we know that there is always a longest increasing or decreasing subsequence of at least a certain size, how do we find it? At a glance this may seem difficult. Actually there is a good dynamic programming algorithm using only ( O(nlog(n)) ) time.

R version

The following code has been implemented following Wiki page.

library(compiler)
longest_subseq.R = cmpfun(function(x) {
    P = integer(length(x))
    M = integer(length(x) + 1)
    L = newL = 0
    for (i in seq_along(x) - 1) {
        lo = 1
        hi = L
        while (lo <= hi) {
            mid = (lo + hi)%/%2
            if (x[M[mid + 1] + 1] < x[i + 1]) {
                lo = mid + 1
            } else {
                hi = mid - 1
            }
        }
        newL = lo
        P[i + 1] = M[newL]
        if (newL > L) {
            M[newL + 1] = i
            L = newL
        } else if (x[i + 1] < x[M[newL + 1] + 1]) {
            M[newL + 1] = i
        }
    }
    k = M[L + 1]
    re = integer(L)
    for (i in L:1) {
        re[i] = k + 1
        k = P[k + 1]
    }
    re
})
longest_subseq.R(c(5, 1, 3, 2, 4))

## [1] 2 4 5

Now let’s play with a larger vector.

x = rnorm(1000)
plot(x, main = "Longest Increasing (blue) and Decreasing (red) Subsequencesnfor 1000 random numbers", 
    ylab = "Random Value", pch = 16, col = "#0000004D")
# Longest Increasing Sequence
ind = longest_subseq.R(x)
points(ind, x[ind], pch = 16, col = "blue")
# Longest Decreasing Sequence
rind = longest_subseq.R(-x)
points(rind, x[rind], pch = 16, col = "red")

How about 250,000 length vectors?

y = rnorm(250000)
system.time(ind1 <- longest_subseq.R(y))

##    user  system elapsed 
##    3.36    0.01    3.48

A bit slow, right?

But 250,000 is typically not long for a time series.

Rcpp version

library(Rcpp)

## Warning: package 'Rcpp' was built under R version 3.0.3

cppcode <- "n#include <vector>n#include <Rcpp.h>nusing namespace std;ntypedef int index_type;nn// [[Rcpp::export]]nvector<index_type> longest_subseq(SEXP X){n    Rcpp::NumericVector x(X);ntvector<index_type> P(x.size(), 0);ntvector<index_type> M(x.size()+1, 0);ntindex_type L(0), newL;ntfor(index_type i=0; i < x.size(); ++i) {nttindex_type lo(1), hi(L), mid;nttwhile( lo <= hi) {ntttmid = (lo + hi) / 2;ntttif (x[M[mid]] < x[i] ) {nttttlo = mid + 1;nttt} else {ntttthi = mid - 1;nttt}ntt}nttnewL = lo;nttP[i] = M[newL - 1];nttif (newL > L) {ntttM[newL] = i;ntttL = newL;ntt} else if (x[i] < x[M[newL]]) {ntttM[newL] = i;ntt}nt}n    vector<index_type> re(L);ntindex_type k(M[L]);ntfor(index_type i=L-1; i>=0; --i){nttre[i] = k + 1;nttk = P[k];nt}n    return(re);n}"

sourceCpp(code = cppcode)

The sourceCpp will export the cpp function longest_subseq as R function longest_subseq, now we can test it.

system.time(ind2 <- longest_subseq(y))

##    user  system elapsed 
##    0.03    0.00    0.03

all.equal(ind1, ind2)

## [1] TRUE

That’s a big improvement.

Now we can play with some time series data sets.

data(EuStockMarkets)
matplot(EuStockMarkets, main = "Longest Increasing SubsequencenEU Stock Market Data (1991–1998)", 
    col = 2:5, pch = "o", cex = 0.3)
for (i in 1:4) {
    ind = longest_subseq(EuStockMarkets[, i])
    points(ind, EuStockMarkets[ind, i], type = "l", col = i + 1)
}
i = 3
ind = longest_subseq(EuStockMarkets[, i])
for (j in ind) lines(c(j, j), c(0, EuStockMarkets[j, i]), col = i + 1, lty = 3)
legend("topleft", legend = colnames(EuStockMarkets), col = 2:5, pch = "o")

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions Shop.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.