Handling Strings with Rcpp

[This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers.]

This is a quick example of how you might use Rcpp to pass R ‘strings’ (character vectors) to C++ and return them to R. We’ll demonstrate this with a few operations.

Sort a String with R

Note that we can sort the characters within each string in R in a fairly fast way:

my_strings <- c("apples", "and", "cranberries")
R_str_sort <- function(strings) {
  sapply( strings, USE.NAMES=FALSE, function(x) {
    intToUtf8( sort( utf8ToInt( x ) ) )
    })
  }
R_str_sort( my_strings )


[1] "aelpps"      "adn"         "abceeinrrrs"

Sort a String with C++/Rcpp

Let’s see if we can re-create the output with Rcpp.

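#include <Rcpp.h>
#include <algorithm>  // std::sort
using namespace Rcpp;
 
// [[Rcpp::export]]
std::vector< std::string > cpp_str_sort( std::vector< std::string > strings ) {
  
  // 'strings' is a C++ copy of the R character vector, so we can sort it in place
  std::size_t len = strings.size();
 
  for( std::size_t i=0; i < len; i++ ) {
    // sort the bytes of each string in place
    std::sort( strings[i].begin(), strings[i].end() );
  }
  
  // Rcpp converts the result back to an R character vector
  return strings;
}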

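Note the main things we do here: the function takes and returns a std::vector< std::string >, and Rcpp converts between that type and an R character vector automatically; inside the loop, std::sort reorders the characters of each string in place.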

Now, let’s test it, and let’s benchmark it as well.

cpp_str_sort( my_strings )


[1] "aelpps"      "adn"         "abceeinrrrs"

long_strings <- rep( paste( collapse="", sample( letters, 1E5, replace=TRUE ) ),
                     times=100 )
 
rbenchmark::benchmark( cpp_str_sort(long_strings),
                       R_str_sort(long_strings),
                       replications=3
                       )


                        test replications elapsed relative user.self sys.self user.child sys.child
1 cpp_str_sort(long_strings)            3   0.898    1.000     0.883    0.014          0         0
2   R_str_sort(long_strings)            3   2.356    2.624     2.350    0.007          0         0

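Note that the C++ implementation is quite a bit faster (about 2.6 times, on my machine). However, std::sort here operates on the individual bytes of each std::string, so it will not correctly handle strings that contain multibyte UTF-8 characters; the R version, which works on code points via utf8ToInt, does.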
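If you did need UTF-8 support on the C++ side, one possible approach (a rough sketch, not part of the original example) is to decode each string into code points, sort those, and encode the result again, which mirrors what utf8ToInt / intToUtf8 do in the R version. The helper names utf8_decode, utf8_encode and utf8_sort below are made up for illustration, the code uses only the standard library, and it assumes the input is already valid UTF-8 (no validation is attempted).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 string into Unicode code points.
// Assumes the input is valid UTF-8; malformed input is not detected.
std::vector<std::uint32_t> utf8_decode(const std::string& s) {
  std::vector<std::uint32_t> out;
  std::size_t i = 0;
  while (i < s.size()) {
    unsigned char c = static_cast<unsigned char>(s[i]);
    std::uint32_t cp;
    std::size_t extra;  // number of continuation bytes that follow
    if (c < 0x80)      { cp = c;        extra = 0; }
    else if (c < 0xE0) { cp = c & 0x1F; extra = 1; }
    else if (c < 0xF0) { cp = c & 0x0F; extra = 2; }
    else               { cp = c & 0x07; extra = 3; }
    ++i;
    for (std::size_t k = 0; k < extra && i < s.size(); ++k, ++i) {
      cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
    }
    out.push_back(cp);
  }
  return out;
}

// Hypothetical helper: encode code points back into a UTF-8 string.
std::string utf8_encode(const std::vector<std::uint32_t>& cps) {
  std::string out;
  for (std::uint32_t cp : cps) {
    if (cp < 0x80) {
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}

// Sort the characters of a UTF-8 string by code point rather than by byte.
std::string utf8_sort(const std::string& s) {
  std::vector<std::uint32_t> cps = utf8_decode(s);
  std::sort(cps.begin(), cps.end());
  return utf8_encode(cps);
}
```

For plain ASCII input this gives the same answer as before, so utf8_sort("apples") is "aelpps". The extra decoding and encoding passes cost time, so expect some of the speed advantage over R to be lost.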

Now, let’s do something crazy – let’s see if we can use Rcpp to perform an operation that takes a vector of strings, and returns a list of vectors of strings. (Or, in R parlance, a list of vectors of type character).

We’ll do a simple ‘split’, such that each string is split every n indices.

Split a String Every n Indices

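#include <Rcpp.h>
using namespace Rcpp;
 
// [[Rcpp::export]]
List cpp_str_split( std::vector< std::string > strings, int n ) {
  
  // guard against division by zero (and negative n) below
  if( n <= 0 ) stop( "'n' must be a positive integer" );
  
  int num_strings = strings.size();
  
  // one list element per input string
  List out(num_strings);
  
  for( int i=0; i < num_strings; i++ ) {
    
    // integer division: trailing characters that do not fill a full piece are dropped
    int num_substr = strings[i].length() / n;
    std::vector< std::string > tmp;
    
    for( int j=0; j < num_substr; j++ ) {
      tmp.push_back( strings[i].substr( j*n, n ) );
    }
    
    // Rcpp converts the std::vector< std::string > to an R character vector
    out[i] = tmp;
    
  }
  
  return out;
}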

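Main things to notice: we build a std::vector< std::string > for each input string and assign it to an element of the Rcpp List, where it is converted to an R character vector; the number of substrings comes from integer division of the string length by n.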

cpp_str_split( c("abcd", "efgh", "ijkl"), 2 )


[[1]]
[1] "ab" "cd"

[[2]]
[1] "ef" "gh"

[[3]]
[1] "ij" "kl"

cpp_str_split( c("abc", "de"), 2 )


[[1]]
[1] "ab"

[[2]]
[1] "de"

My solution is perhaps a bit deficient (bug or feature?) in that it drops the trailing characters of any string whose length is not a multiple of n (the "c" in "abc" above); ideally, we’d either improve the C++ code or write an appropriate wrapper to the function in R (and warn the user if truncation might occur).
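As one way the C++ side could be improved, here is a small sketch of a splitter that keeps the shorter final piece instead of dropping it. It is written against the standard library only so that it stands on its own; split_every is a hypothetical name, not something from Rcpp or from the code above, and lengths are counted in bytes just as in cpp_str_split. Inside an exported Rcpp function you could call it once per input string and assign each result to a List element as before.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: split 's' into pieces of 'n' bytes,
// keeping a shorter final piece rather than truncating it.
std::vector<std::string> split_every(const std::string& s, int n) {
  if (n <= 0) {
    throw std::invalid_argument("n must be a positive integer");
  }
  const std::size_t step = static_cast<std::size_t>(n);
  std::vector<std::string> pieces;
  for (std::size_t start = 0; start < s.size(); start += step) {
    // substr clamps the count at the end of the string,
    // so the last piece may be shorter than 'step'
    pieces.push_back(s.substr(start, step));
  }
  return pieces;
}
```

With this version, splitting "abc" with n = 2 yields "ab" and "c", while "abcd" still yields "ab" and "cd". Whether keeping the remainder is the right behaviour depends on your use case; warning the user from an R wrapper remains a reasonable alternative.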

Hopefully this gives you a better idea of how you might use Rcpp to perform more extensive string manipulation with R character vectors.
