{charcuterie} – What if Strings Were Iterable in R?

[This article was first published on rstats on Irregularly Scheduled Programming, and kindly contributed to R-bloggers].

I’ve been using a lot of programming languages recently and they all have their quirks, differentiating features, and unique qualities, but one thing most of them have in common is that they handle strings as a collection of characters. R doesn’t: it has a “character” type which holds 0 or more characters, and that’s what we call a “string”. But what if it did have iterable strings?

For comparison, here’s some Python code

for i in "string":
    print(i)

s
t
r
i
n
g

and some Haskell

upperA = map (\c -> if c == 'a' then 'A' else c)
upperA "banana"
"bAnAnA"

and some Julia

[x+1 for x in "HAL"]
3-element Vector{Char}:
 'I': ASCII/Unicode U+0049 (category Lu: Letter, uppercase)
 'B': ASCII/Unicode U+0042 (category Lu: Letter, uppercase)
 'M': ASCII/Unicode U+004D (category Lu: Letter, uppercase)

In each of these cases the string is treated as a collection of individual characters. Many languages make this distinction, going so far as using different quotes to distinguish them; e.g. double quotes for strings "string" and single quotes for individual characters 's'. This makes even more sense when the language supports types, in that a string has a String type that is composed of 0 or more Char types.

R is dynamically typed, so we don’t strictly enforce type signatures, and is an array language, so it has natural support for arrays (vectors, lists, matrices). So why are strings not collections of characters?

My guess is that for the majority of use-cases, it wasn’t necessary – a lot of the time when we read in text data we want the entirety of the string and don’t want to worry about dealing with a collection on top of the collection of strings themselves. Plus, if you really need the individual characters you can split the text up with strsplit(x, "").
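To make the shape of that result concrete, here is a small base R sketch (the variable names are only for illustration). strsplit() is vectorised over its input, so it returns a list with one character vector per string, even when given a single string.

```r
# strsplit() returns a list: one character vector per input string
words <- c("butter", "fly")
pieces <- strsplit(words, "")
pieces[[1]]   # "b" "u" "t" "t" "e" "r"
pieces[[2]]   # "f" "l" "y"

# With a single string the characters still sit inside a list of length 1,
# which is why [[1]] is needed to get at them
single <- strsplit("fly", "")
class(single)   # "list"
single[[1]]     # "f" "l" "y"
```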

But if you do want to work with individual characters, calling strsplit(x, "")[[1]] throughout your code gets ugly. I solved the Exercism problem ‘Anagram’ in R and really didn’t like how it looked

anagram <- function(subject, candidates) {
  # remove any same words and inconsistent lengths
  nonsames <- candidates[tolower(candidates) != tolower(subject) & 
                           nchar(subject) == nchar(candidates)]
  if (!length(nonsames)) return(c()) # no remaining candidates
  s_letters <- sort(tolower(strsplit(subject, "")[[1]]))
  c_letters <- sapply(sapply(nonsames, \(x) strsplit(x, "")), sort, simplify = FALSE)
  # find all cases where the letters are all the same
  anagrams <- nonsames[sapply(c_letters, \(x) all(s_letters == tolower(x)))]
  # if none found, return NULL
  if(!length(anagrams)) NULL else anagrams
}

Two calls to strsplit, then needing to sapply over that collection to sort it… not pretty at all. Here’s a Haskell solution from someone very knowledgeable in our local functional programming Meetup group

import Data.List (sort)
import Data.Char (toLower)
anagramsFor :: String -> [String] -> [String]
anagramsFor xs = filter (isAnagram xs' . map toLower)
  where xs' = map toLower xs
isAnagram :: String -> String -> Bool
isAnagram a b
  | a == b = False
  | otherwise = sort a == sort b

which, excluding the type declarations and the fact that it needs to deal with the edge case that it has to be a rearrangement, could nearly be a one-liner

import Data.List (sort)
import Data.Char (toLower)
isAnagram a b = sort (map toLower a) == sort (map toLower b)

Wouldn’t it be nice if we could do things like this in R?

The world if R had iterable strings

I don’t expect it would ever happen (maaaybe via some special string handling like the raw strings r"(this doesn't need escaping)" but unlikely). I couldn’t find a package that did this (by all means, let me know if there is one) so I decided to build it myself and see how it could work.

Introducing {charcuterie} - named partly because it looks like “cut” “char”, and partly because of charcuterie boards involving lots of little bits of appetizers.

image by Google Gemini

library(charcuterie)

At its core, this is just defining chars(x) as strsplit(x, "")[[1]] and slapping a new class on the output, but big improvements don’t immediately come from moonshots, they come from incremental improvements. Once I had this, I wanted to do things with it like sort the individual characters. There is of course a sort method for vectors (but not for individual strings) so

sort("string")
## [1] "string"
sort(c("s", "t", "r", "i", "n", "g"))
## [1] "g" "i" "n" "r" "s" "t"
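The "strsplit plus a new class" idea can be sketched roughly as follows. This is a guess at the general shape, not the package's actual code; the names my_chars and the class "my_chars" are made up here so they don't clash with the real chars().

```r
# Hypothetical stand-in for chars(): split one string into its characters
# and tag the result with an S3 class
my_chars <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  structure(strsplit(x, "")[[1]], class = c("my_chars", "character"))
}

s <- my_chars("string")
length(s)                 # 6: one element per character
unclass(s)                # "s" "t" "r" "i" "n" "g"
inherits(s, "my_chars")   # TRUE
```

With only this much in place the object still prints as a plain vector of letters plus a class attribute; making it look like a string takes dedicated methods.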

One aspect of treating strings as collections of characters is that they should always look like strings, so I needed to modify the sort method to return an object of this new class, and make this class display collections of characters as a string. That just involves pasting the characters back together for printing, so now I can have this

s <- chars("string")
s
## [1] "string"
sort(s)
## [1] "ginrst"

It looks like a string, but it behaves like a collection of characters!
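One way this could be wired up with S3 methods is sketched below. It assumes a made-up class called "my_chars" rather than the package's real internals: the sort method re-applies the class to the sorted characters, and the format/print methods paste them back together for display.

```r
# Hypothetical constructor (illustrative only, not the package's code)
my_chars <- function(x) {
  structure(strsplit(x, "")[[1]], class = "my_chars")
}

# Sort the underlying characters, then restore the class so the result
# is still a my_chars object
sort.my_chars <- function(x, decreasing = FALSE, ...) {
  sorted <- sort(unclass(x), decreasing = decreasing, ...)
  structure(sorted, class = "my_chars")
}

# Display the characters pasted back together so it looks like a string
format.my_chars <- function(x, ...) paste(unclass(x), collapse = "")
print.my_chars <- function(x, ...) {
  print(format(x), ...)
  invisible(x)
}

s <- my_chars("string")
print(sort(s))    # prints [1] "ginrst"
length(sort(s))   # 6: still a collection of characters underneath
```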

When you do things right, people won’t know you’ve done anything at all

I thought about what other operations I might want to do and now I have methods to

  • sort with sort
  • reverse with rev
  • index with [
  • concatenate with c
  • print with format and print
  • slice with head and tail
  • set operations with setdiff, union, intersect, and a new except
  • leverage existing vectorised operations like unique, toupper, and tolower
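As a hint at why a separate except might be useful alongside the set operations, here is a base R sketch. The helper except_chars is hypothetical, and the assumption that an except-style operation keeps order and duplicates is inferred from the palindrome example further down, not from the package's source.

```r
# Hypothetical helper: drop every occurrence of the characters in `y`,
# keeping the order and duplicates of everything else
except_chars <- function(x, y) x[!x %in% y]

letters_in <- strsplit("never odd", "")[[1]]

# base setdiff() treats its inputs as sets, so repeated characters collapse
setdiff(letters_in, " ")        # "n" "e" "v" "r" "o" "d"

# the except-style helper keeps the repeated characters
except_chars(letters_in, " ")   # "n" "e" "v" "e" "r" "o" "d" "d"
```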

I suspect the concatenation will be the one that raises the most eyebrows… I’ve dealt with the way that other languages join together strings before and I’m certainly open to what this version should do, but I think it makes sense to add the collections as

c(chars("butter"), chars("fly"))
## [1] "butterfly"

If you need more than one chars at a time, you’re asking for a vector of vectors, which R doesn’t support - it supports a list of them, though

x <- lapply(c("butter", "fly"), chars)
x
## [[1]]
## [1] "butter"
## 
## [[2]]
## [1] "fly"
unclass(x[[2]])
## [1] "f" "l" "y"

This still sounds simple, and it is - the point is that it feels a lot more ergonomic to use this inside a function compared to strsplit(x, "")[[1]] and working with the collection manually.
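To illustrate the ergonomics, here is a sketch of how the earlier anagram function might shrink once splitting is hidden behind a helper. It uses a hypothetical base R stand-in, split_chars, so it runs without the package; the function names are made up, and unlike the original it returns character(0) rather than NULL when nothing matches.

```r
# Hypothetical helper standing in for chars(): one string -> its characters
split_chars <- function(x) strsplit(x, "")[[1]]

# Two words are anagrams if they differ (ignoring case) but their sorted,
# lower-cased characters are identical
is_anagram <- function(a, b) {
  a <- tolower(a)
  b <- tolower(b)
  a != b && identical(sort(split_chars(a)), sort(split_chars(b)))
}

# Keep only the candidates that are anagrams of the subject
anagrams_for <- function(subject, candidates) {
  keep <- vapply(candidates, \(cand) is_anagram(subject, cand), logical(1))
  unname(candidates[keep])
}

anagrams_for("listen", c("enlists", "google", "inlets", "banana"))  # "inlets"
anagrams_for("listen", c("Listen", "Silent"))                       # "Silent"
```

This mirrors the structure of the Haskell version: a small predicate on two words, then a filter over the candidates.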

I added an entire vignette of examples to the package, including identifying vowels

vowels <- function(word) {
  ch <- chars(word)
  setNames(ch %in% chars("aeiou"), ch)
}
vowels("string")
##     s     t     r     i     n     g 
## FALSE FALSE FALSE  TRUE FALSE FALSE
vowels("banana")
##     b     a     n     a     n     a 
## FALSE  TRUE FALSE  TRUE FALSE  TRUE

palindromes

palindrome <- function(a, ignore_spaces = FALSE) {
  a <- chars(a)
  if (ignore_spaces) a <- except(a, " ")
  all(rev(a) == a)
}
palindrome("palindrome")
## [1] FALSE
palindrome("racecar")
## [1] TRUE
palindrome("never odd or even", ignore_spaces = TRUE)
## [1] TRUE

and performing character-level substitutions

spongebob <- function(phrase) {
  x <- chars(phrase)
  odds <- seq(1, length(x), 2)
  x[odds] <- toupper(x[odds])
  string(x)
}
spongebob("you can't do anything useful with this package")
## [1] "YoU CaN'T Do aNyThInG UsEfUl wItH ThIs pAcKaGe"
YoU CaN’T Do aNyThInG UsEfUl wItH ThIs pAcKaGe

On top of all that, I felt it was worthwhile stretching my R package building muscles, so I’ve added tests with 100% coverage, and ensured it fully passes check().

I don’t expect this would be used on huge text sources, but it’s useful to me for silly little projects. If you have any suggestions for functionality that could extend this then by all means let me know either in GitHub Issues, the comment section below, or Mastodon.


devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.3 (2024-02-29)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2024-08-03
##  pandoc   3.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version    date (UTC) lib source
##  blogdown      1.18       2023-06-19 [1] CRAN (R 4.3.2)
##  bookdown      0.36       2023-10-16 [1] CRAN (R 4.3.2)
##  bslib         0.6.1      2023-11-28 [3] CRAN (R 4.3.2)
##  cachem        1.0.8      2023-05-01 [3] CRAN (R 4.3.0)
##  callr         3.7.3      2022-11-02 [3] CRAN (R 4.2.2)
##  charcuterie * 0.0.0.9000 2024-08-03 [1] local
##  cli           3.6.1      2023-03-23 [1] CRAN (R 4.3.3)
##  crayon        1.5.2      2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5      2022-10-11 [1] CRAN (R 4.3.2)
##  digest        0.6.34     2024-01-11 [3] CRAN (R 4.3.2)
##  ellipsis      0.3.2      2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.23       2023-11-01 [3] CRAN (R 4.3.2)
##  fastmap       1.1.1      2023-02-24 [3] CRAN (R 4.2.2)
##  fs            1.6.3      2023-07-20 [3] CRAN (R 4.3.1)
##  glue          1.7.0      2024-01-09 [1] CRAN (R 4.3.3)
##  htmltools     0.5.7      2023-11-03 [3] CRAN (R 4.3.2)
##  htmlwidgets   1.6.2      2023-03-17 [1] CRAN (R 4.3.2)
##  httpuv        1.6.12     2023-10-23 [1] CRAN (R 4.3.2)
##  icecream      0.2.1      2023-09-27 [1] CRAN (R 4.3.2)
##  jquerylib     0.1.4      2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.8      2023-12-04 [3] CRAN (R 4.3.2)
##  knitr         1.45       2023-10-30 [3] CRAN (R 4.3.2)
##  later         1.3.1      2023-05-02 [1] CRAN (R 4.3.2)
##  lifecycle     1.0.4      2023-11-07 [1] CRAN (R 4.3.3)
##  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.3.3)
##  memoise       2.0.1      2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12       2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1    2018-05-18 [1] CRAN (R 4.3.2)
##  pkgbuild      1.4.2      2023-06-26 [1] CRAN (R 4.3.2)
##  pkgload       1.3.3      2023-09-22 [1] CRAN (R 4.3.2)
##  prettyunits   1.2.0      2023-09-24 [3] CRAN (R 4.3.1)
##  processx      3.8.3      2023-12-10 [3] CRAN (R 4.3.2)
##  profvis       0.3.8      2023-05-02 [1] CRAN (R 4.3.2)
##  promises      1.2.1      2023-08-10 [1] CRAN (R 4.3.2)
##  ps            1.7.6      2024-01-18 [3] CRAN (R 4.3.2)
##  purrr         1.0.2      2023-08-10 [3] CRAN (R 4.3.1)
##  R6            2.5.1      2021-08-19 [1] CRAN (R 4.3.3)
##  Rcpp          1.0.11     2023-07-06 [1] CRAN (R 4.3.2)
##  remotes       2.4.2.1    2023-07-18 [1] CRAN (R 4.3.2)
##  rlang         1.1.4      2024-06-04 [1] CRAN (R 4.3.3)
##  rmarkdown     2.25       2023-09-18 [3] CRAN (R 4.3.1)
##  rstudioapi    0.15.0     2023-07-07 [3] CRAN (R 4.3.1)
##  sass          0.4.8      2023-12-06 [3] CRAN (R 4.3.2)
##  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.3.2)
##  shiny         1.7.5.1    2023-10-14 [1] CRAN (R 4.3.2)
##  stringi       1.8.3      2023-12-11 [3] CRAN (R 4.3.2)
##  stringr       1.5.1      2023-11-14 [3] CRAN (R 4.3.2)
##  urlchecker    1.0.1      2021-11-30 [1] CRAN (R 4.3.2)
##  usethis       3.0.0      2024-07-29 [1] CRAN (R 4.3.3)
##  vctrs         0.6.5      2023-12-01 [1] CRAN (R 4.3.3)
##  xfun          0.41       2023-11-01 [3] CRAN (R 4.3.2)
##  xtable        1.8-4      2019-04-21 [1] CRAN (R 4.3.2)
##  yaml          2.3.8      2023-12-11 [3] CRAN (R 4.3.2)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.3
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

