Quick Hit: Above the Fold; Hard wrapping text at ‘n’ characters
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Despite being on holiday I’m getting in a bit of non-work R coding since the fam has a greater ability to sleep late than I do. Apart from other things I’ve been working on a PR into {lutz}, a package by @andyteucher that turns lat/lng pairs into timezone strings.
The package is super neat and has two modes: “fast” (originally based on a {V8}-backed version of @darkskyapp’s tzlookup javascript module) and “accurate” using R’s amazing spatial ops.
I ported the javascript algorithm to C++/Rcpp and have been tweaking the bit of package helper code that fetches this:
and extracts the embedded string tree and corresponding timezones array and turns both into something C++ can use.
Originally I just made a header file with the same long lines:
but that’s icky and fairly bad form, especially given that C++ will combine adjacent string literals for you.
The stringi::stri_wrap()
function can easily take care of wrapping the time zone array elements for us:
but, I also needed the ability to hard-wrap the encoded string tree at a fixed width. There are lots of ways to do that, here are three of them:
library(Rcpp) library(stringi) library(tidyverse) library(microbenchmark) sourceCpp(code = " #include <Rcpp.h> // [[Rcpp::export]] std::vector< std::string > fold_cpp(const std::string& input, int width) { int sz = input.length() / width; std::vector< std::string > out; out.reserve(sz); // shld make this more efficient for (unsigned long idx=0; idx<sz; idx++) { out.push_back( input.substr(idx*width, width) ); } if (input.length() % width != 0) out.push_back(input.substr(width*sz)); return(out); } ") fold_base <- function(input, width) { vapply( seq(1, nchar(input), width), function(idx) substr(input, idx, idx + width - 1), FUN.VALUE = character(1) ) } fold_tidy <- function(input, width) { map_chr( seq(1, nchar(input), width), ~stri_sub(input, .x, length = width) ) }
(If you know of a package that has this type of function def leave a note in the comments).
Each one does the same thing: move n
sequences of width
characters into a new slot in a character vector. Let’s see what they do with this toy long string example:
(src <- paste0(c(rep("a", 30), rep("b", 30), rep("c", 4)), collapse = "")) ## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccc" for (n in c(1, 7, 30, 40)) { print(fold_base(src, n)) print(fold_tidy(src, n)) print(fold_cpp(src, n)) cat("\n") } ## [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" ## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b" ## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" ## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c" ## [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" ## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b" ## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" ## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c" ## [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" ## [18] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" "b" "b" "b" ## [35] "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" ## [52] "b" "b" "b" "b" "b" "b" "b" "b" "b" "c" "c" "c" "c" ## ## [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb" ## [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c" ## [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb" ## [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c" ## [1] "aaaaaaa" "aaaaaaa" "aaaaaaa" "aaaaaaa" "aabbbbb" "bbbbbbb" ## [7] "bbbbbbb" "bbbbbbb" "bbbbccc" "c" ## ## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb" ## [3] "cccc" ## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb" ## [3] "cccc" ## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb" ## [3] "cccc" ## ## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb" ## [2] "bbbbbbbbbbbbbbbbbbbbcccc" ## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb" ## [2] "bbbbbbbbbbbbbbbbbbbbcccc" ## [1] "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbb" ## [2] "bbbbbbbbbbbbbbbbbbbbcccc"
So, we know they all work, which means we can take a look at which one is faster. Let’s compare folding at various widths:
map_df(c(1, 3, 5, 7, 10, 20, 30, 40, 70), ~{ microbenchmark( base = fold_base(src, .x), tidy = fold_tidy(src, .x), cpp = fold_cpp(src, .x) ) %>% mutate(width = .x) %>% as_tibble() }) %>% mutate( width = factor(width, levels = sort(unique(width)), ordered = TRUE) ) -> bench_df ggplot(bench_df, aes(expr, time)) + ggbeeswarm::geom_quasirandom( aes(group = width, fill = width), groupOnX = TRUE, shape = 21, color = "white", size = 3, stroke = 0.125, alpha = 1/4 ) + scale_y_comma(trans = "log10", position = "right") + coord_flip() + guides( fill = guide_legend(override.aes = list(alpha = 1)) ) + labs( x = NULL, y = "Time (nanoseconds)", fill = "Split width:", title = "Performance comparison between 'fold' implementations" ) + theme_ft_rc(grid="X") + theme(legend.position = "top")
ggplot(bench_df, aes(width, time)) + ggbeeswarm::geom_quasirandom( aes(group = expr, fill = expr), groupOnX = TRUE, shape = 21, color = "white", size = 3, stroke = 0.125, alpha = 1/4 ) + scale_x_discrete( labels = c(1, 3, 5, 7, 10, 20, 30, 40, "Split/fold width: 70") ) + scale_y_comma(trans = "log10", position = "right") + scale_fill_ft() + coord_flip() + guides( fill = guide_legend(override.aes = list(alpha = 1)) ) + labs( x = NULL, y = "Time (nanoseconds)", fill = NULL, title = "Performance comparison between 'fold' implementations" ) + theme_ft_rc(grid="X") + theme(legend.position = "top")
The Rcpp version is both faster and more consistent than the other two implementations (though they get faster as the number of string subsetting operations decrease); but, they’re all pretty fast. For an infrequently run process, it might be better to use the base R version purely for simplicity. Despite that fact, I used the Rcpp version to turn the string tree long line into:
FIN
If you have need to “fold” like this how do you currently implement your solution? Found a bug or better way after looking at the code? Drop a note in the comments so you can help others find an optimal solution to their own ‘fold’ing problems.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.