’tis the season to be coding! As I’m sure you’ve all noticed (cough) I have been rather quiet on the blogging front lately, because my leisure coding time has been consumed by the annual Advent of Code challenge. It would have been fun to use this to become more proficient in Clojure, but I don’t have that much leisure coding time, so base R it is. The main goal, other than plain old fun, is to give myself a good refresher on what base R can do.
All solutions can be found on my GitHub.
Base is rich and versatile
I’ve been surprised to remember how much I actually like base R, given how avid a user I’ve been of data.table, purrr and rlang in recent years. The apply family, array operations and functional programming tools like Reduce() will get you very far indeed on moderately sized data. When it comes to data frames, base R falls well short of the modern champions of data munging in terms of API, but it still does the job.
Inevitably I’ve learned a few new things about base R. Here are some of the highlights.
rle() was brought to my attention by the Day 4 solution of Adam Gruer. This was easily my worst day of Advent of Code in terms of the succinctness and speed of my solution, and it would have helped a lot to know that run-length encoding of repeated values in a vector is available in base R, just like that.
rle(c("a","a","b"))
## Run Length Encoding
##   lengths: int [1:2] 2 1
##   values : chr [1:2] "a" "b"
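As a quick illustration of how handy this is for puzzle-style questions, here is a small made-up example (not from a specific puzzle) that finds the longest run of identical adjacent values:

x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
r <- rle(x)
max(r$lengths)                   # length of the longest run
## [1] 4
r$values[which.max(r$lengths)]   # the value forming that run
## [1] 3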
strsplit() was not a new function to me, but I didn’t know it can take split = "" as an argument to just split out every individual character! Very helpful when many puzzle inputs are given as plain text character sequences.
strsplit("abcde", split = "") ## [[1]] ## [1] "a" "b" "c" "d" "e"
which() is another basic function that I’ve used so often before, without realising it can compute n-D array indices through the arr.ind = TRUE argument.
mat <- matrix(sample(c(FALSE, TRUE), size = 16, replace = TRUE), ncol = 4)
which(mat, arr.ind = TRUE)
##      row col
## [1,]   1   1
## [2,]   3   1
## [3,]   2   2
## [4,]   2   3
## [5,]   3   3
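The row/column matrix that comes back can be used directly for grid arithmetic. For example (a hypothetical use, not tied to a specific puzzle), the Manhattan distance from the first TRUE cell to all the others:

idx <- which(mat, arr.ind = TRUE)
# Manhattan distance from the first TRUE cell to every TRUE cell
abs(idx[, "row"] - idx[1, "row"]) + abs(idx[, "col"] - idx[1, "col"])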
merge() is the natural replacement for dplyr’s join functions in base R, but did you know it can also do a full crossing of a data frame with itself? Just set the by argument explicitly to empty.
df <- data.frame(a = letters[1:3], b = 1:3)
merge(df, df, by = character(0), suffixes = c("_one", "_two"))
##   a_one b_one a_two b_two
## 1     a     1     a     1
## 2     b     2     a     1
## 3     c     3     a     1
## 4     a     1     b     2
## 5     b     2     b     2
## 6     c     3     b     2
## 7     a     1     c     3
## 8     b     2     c     3
## 9     c     3     c     3
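One way this comes in handy (a hypothetical use, not from a specific puzzle day) is to enumerate all pairs and then filter them on some condition:

crossed <- merge(df, df, by = character(0), suffixes = c("_one", "_two"))
subset(crossed, a_one != a_two)   # keep only pairs of distinct rows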
Not all is great, however. Today (Day 11) I really missed being able to key named lists or environments by something other than strings. In Python dicts you can use any immutable type, including tuples, so I could have used (x, y) coordinates directly as a key, instead of having to awkwardly paste and parse them to and from strings (or take a cumbersome detour via data frame indexing). A longstanding shortcoming of R as a general-purpose computing language, in my opinion.
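For the curious, this is roughly what the workaround looks like – a minimal sketch, where the key() helper and the "painted" value are made up for illustration:

grid <- new.env()
key <- function(x, y) paste(x, y, sep = ",")          # made-up helper: serialise coordinates to a string
grid[[key(3, 5)]] <- "painted"                        # store something under the (3, 5) coordinate
xy <- as.integer(strsplit("3,5", split = ",")[[1]])   # ...and parse a key back into numbers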
…not another intcode!
I’ve been particularly fond of the ‘intcode’ challenges. By ‘fond of’ I mean I hated them. But in a good way. Without being too cryptic to non-participants: these are challenges where a sequence of integers is to be processed as if they were read/write/jump/print/… instructions within the sequence, each followed by a variable number of parameters. With every new challenge, more instructions and possible interpretations of the integers are added.
This means these intcode challenges are really challenges about software design and development, where scope is ever creeping and requirements are ever changing. I like this. It forces you to strike a balance between being pragmatic about the challenge of the day, and still making the solution extensible on future days. This to me is far more relevant and realistic as a coding challenge, and far more telling of actual skill and craft as a developer, than coming up with a clever algorithmic one-liner.
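To give non-participants a feel for it, here is a toy dispatcher in that spirit – a deliberately minimal sketch with just an ‘add’ instruction and a ‘halt’ instruction, not my actual solution or the full instruction set:

run <- function(prog) {
  i <- 1
  repeat {
    op <- prog[i]
    if (op == 99) break                   # opcode 99: halt
    if (op == 1) {                        # opcode 1: add values at two positions, store at a third
      # parameters are 0-based positions into the program itself, hence the + 1
      prog[prog[i + 3] + 1] <- prog[prog[i + 1] + 1] + prog[prog[i + 2] + 1]
      i <- i + 4
    } else {
      stop("unknown opcode: ", op)
    }
  }
  prog
}

run(c(1, 0, 0, 3, 99))
## [1]  1  0  0  2 99

Adding a new opcode then means adding a new branch to the dispatcher – which is exactly where the design pressure comes from as the requirements grow.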
A favourite
If I could pick a favourite so far, it would have to be Day 6: Universal Orbit Map. A list of objects is given such that each object orbits exactly one other object. The challenge is to compute how many orbits there are in total, both direct and indirect. Many people were immediately triggered into thinking ‘graph!’, and in truth, so was I. But being restricted to base R, I looked for a less exhausting solution, and quickly realised that there was really no need for a full graph: we only need to trace each individual object back to the common center of gravity and add up all the steps.
Here is the full solution, using this input:
inp <- read.delim("aoc_input6.txt", sep = ")", header = FALSE,
                  col.names = c("orbitee", "orbiter"), stringsAsFactors = FALSE)

memo_env <- new.env()

count_orbits <- function(obj) {
  if (!is.null(memo_env[[obj]])) {
    memo_env[[obj]]
  } else if (obj %in% inp$orbiter) {
    memo_env[[obj]] <- 1 + count_orbits(inp[inp$orbiter == obj, ]$orbitee)
  } else {
    0
  }
}

sum(sapply(unique(inp$orbiter), count_orbits))
## [1] 142915
I enjoyed this because there is so much really powerful stuff going on in so few lines of code. First we read the data just like a two-column CSV, but using ) as the delimiter. The data frame structure comes very naturally to R, and often offers an intuitive insight into a problem. This goes in general for many of the Advent of Code challenges – when using a data frame you always have the current state of the data clearly visible in front of you, and can work incrementally towards a solution using all of R’s great data frame manipulation tools. In this case the data frame is mainly useful as an explicit lookup table.
Then there is recursion going on – the count_orbits() function calls itself to find the next orbited object, until it reaches the center. Every recursion adds 1 to the counter. Because many of these paths are partially shared, the recursive function can be sped up considerably by using memoisation – we really don’t need to retrace a path we’ve been down already. Instead, we save the solution in a dedicated environment (a named list or vector would work as well, but environments scale better for larger problems) and retrieve it as needed.
Finally, the sapply() call shows that you don’t need for-loops much, even in base R, if you can condense repeated independent operations on vector elements into a unary function. The resulting vector is immediately aggregated into the puzzle solution on the same line. A recipe that is succinct and clear, and oh so widely applicable. However much I love purrr for its explicit consistency, 90%+ of its practical use cases are really already covered in base R – it’s easy to forget when you’re hooked on the %>%.
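For example (a trivial made-up case), these two are essentially interchangeable:

sapply(1:5, function(x) x^2)
## [1]  1  4  9 16 25
# purrr::map_dbl(1:5, ~ .x^2) gives the same values as a double vector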
All right then
Back into Santa’s service I go!