Getting indices of top elements from a vector using a priority queue

[This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post is based on a question on Stack Overflow and more precisely on Martin Morgan’s answer.

The problem is to find the indices of top n elements from a vector.

An inefficient way of doing this is to run order on the vector and then only keep the last n values:

top <- function(x, n){
    tail( order(x), n )
}

This is inefficient because it requires sorting the entire vector, which is expensive.

Instead, Martin ‘s idea is to use a priority queue, this is a container adapter from the Standard Template Libary which is designed so that its top element is always the greatest.

So Martin used a priority queue of std::pair<double,int>, using the default implementation of the “greater” operator for pairs which just does lexicographic ordering. The idea is then to only keep n such pairs in the queue and feed it by scanning the vector.

Here is Martin’s full code:

#include <Rcpp.h>
#include <queue>

using namespace Rcpp;
using namespace std;

// [[Rcpp::export]]
IntegerVector top_i_pq(NumericVector v, int n)
{
    typedef pair<double, int> Elt;
    priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
    vector<int> result;

    for (int i = 0; i != v.size(); ++i) {
        if (pq.size() < n)
            pq.push(Elt(v[i], i));
        else {
            Elt elt = Elt(v[i], i);
            if (pq.top() < elt) {
                pq.pop();
                pq.push(elt);
            }
        }
    }

    result.reserve(pq.size());
    while (!pq.empty()) {
        result.push_back(pq.top().second + 1);
        pq.pop();
    }

    return wrap(result);
}

The version we present here uses the Compare template argument of the priority_queue to control the comparison. This way, instead of storing pairs of (value, index) we will only store the indices and implement the comparison between two indices by going back to the data.

For that purpose, we need to define the IndexCompare class, which we make a template to handle different input types (integer vector, numeric vector, character vector).

Then we abstract some of the concept out of main function by providing our own queue template class : IndexQueue which is templated by the type of data that is in the vector.

With these types, we can now implement the top_index template function, which once again is templated on the type of vector we deal with. Finally we have the attributes compatible top_index function that dispatches to the appropriate version using a simple switch

#include <Rcpp.h>
#include <queue>

using namespace Rcpp ;

template <int RTYPE>
class IndexComparator {
public:
    typedef typename Rcpp::traits::storage_type<RTYPE>::type STORAGE ;
    
    IndexComparator( const Vector<RTYPE>& data_ ) : data(data_.begin()){}
    
    inline bool operator()(int i, int j) const {
        return data[i] > data[j] || (data[i] == data[j] && j > i ) ;    
    }

private:
    STORAGE* data ;
} ;

template <>
class IndexComparator<STRSXP> {
public:
    IndexComparator( const CharacterVector& data_ ) : data(data_.begin()){}
    
    inline bool operator()(int i, int j) const {
        return (String)data[i] > (String)data[j] || (data[i] == data[j] && j > i );    
    }

private:
    SEXP* data ;
} ;

template <int RTYPE>
class IndexQueue {
    public:
        typedef std::priority_queue<int, std::vector<int>, IndexComparator<RTYPE> > Queue ;
        
        IndexQueue( const Vector<RTYPE>& data_ ) : comparator(data_), q(comparator), data(data_) {}
        
        inline operator IntegerVector(){
            int n = q.size() ;
            IntegerVector res(n) ;
            for( int i=0; i<n; i++){
                // +1 for 1-based R indexing
                res[i] = q.top() + 1;
                q.pop() ;
            }
            return res ;
        }
        inline void input( int i){ 
            // if( data[ q.top() ] < data[i] ){
            if( comparator(i, q.top() ) ){
                q.pop(); 
                q.push(i) ;    
            }
        }
        inline void pop(){ q.pop() ; }
        inline void push( int i){ q.push(i) ; }
        
    private:
       IndexComparator<RTYPE> comparator ;
       Queue q ;  
       const Vector<RTYPE>& data ;
} ;


template <int RTYPE>
IntegerVector top_index(Vector<RTYPE> v, int n){
    int size = v.size() ;
    
    // not interesting case. Less data than n
    if( size < n){
        return seq( 0, n-1 ) ;
    }
    
    IndexQueue<RTYPE> q( v )  ;
    for( int i=0; i<n; i++) q.push(i) ;
    for( int i=n; i<size; i++) q.input(i) ;   
    return q ;
}

// [[Rcpp::export]]
IntegerVector top_index( SEXP x, int n){
    switch( TYPEOF(x) ){
    case INTSXP: return top_index<INTSXP>( x, n ) ;
    case REALSXP: return top_index<REALSXP>( x, n ) ;
    case STRSXP: return top_index<STRSXP>( x, n ) ;
    default: stop("type not handled") ; 
    }
    return IntegerVector() ; // not used
}

We will use the template implementation above for integer and numeric vectors. The IndexCompare keeps a reference to the internal data of the vector, as a raw pointer for better performance and defines the parenthesis operator using this information.

For character vectors, data is internally stored into an array of SEXP of type CHARSXP, which we can compare thanks to the implementation of the greater operator in the String class. We however need a specific implementation because we need a cast to String to use the operator.

The implementation of the operator takes into account ties. When there is a tie, we want to retain the value that has the smallest index. This is the reason for this part of the operator : || (data[i] == data[j] && j > i )

The IndexQueue template embeds a priority_queue with the right parameters, and implements:

  • input : this first compares the top of the queue and replaces it if necessary
  • pop : delegates to priority_queue
  • push : idem
  • conversion to an IntegerVector for when we want to get the results.

With this, the implementation of top_index is straightforward. We handle the case where there are less than n data points which is not interesting, then we push the first n points into the queue, and finally we input the rest of the data.

Let’s check that we get what we want:

x <- rnorm( 1000 )

res_cpp <- top_index( x, 30L )
res_r   <- tail( order(x), 30L )  
identical( res_cpp, res_r )


[1] TRUE

top_index(letters, 10) 


 [1] 17 18 19 20 21 22 23 24 25 26

And then let’s benchmark:

require(microbenchmark)


Loading required package: microbenchmark

x <- rnorm(1e5)
microbenchmark( 
   R_order = top(x, 100),
   cpp1    = top_i_pq( x, 100), 
   cpp2    = top_index( x, 100 )
)


Unit: microseconds
    expr   min      lq  median      uq     max neval
 R_order 25108 25279.2 25466.8 26177.6 58612.3   100
    cpp1   632   632.6   635.5   638.8   731.3   100
    cpp2   239   239.8   242.0   243.7   287.7   100

To leave a comment for the author, please follow the link and comment on their blog: Rcpp Gallery.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)