Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Previously in this series:
- The “lost boarding pass” puzzle
- The “deadly board game” puzzle
- The “knight on an infinite chessboard” puzzle
- The “largest stock profit or loss” puzzle
- The “birthday paradox” puzzle
I love 538’s Riddler column, and the January 3 puzzle is a fun one. I’ll quote:
The New York Times recently launched some new word puzzles, one of which is Spelling Bee. In this game, seven letters are arranged in a honeycomb lattice, with one letter in the center. Here’s the lattice from December 24, 2019:
The goal is to identify as many words that meet the following criteria:
- The word must be at least four letters long.
- The word must include the central letter.
- The word cannot include any letter beyond the seven given letters.
Note that letters can be repeated. For example, the words GAME and AMALGAM are both acceptable words. Four-letter words are worth 1 point each, while five-letter words are worth 5 points, six-letter words are worth 6 points, seven-letter words are worth 7 points, etc. Words that use all of the seven letters in the honeycomb are known as “pangrams” and earn 7 bonus points (in addition to the points for the length of the word). So in the above example, MEGAPLEX is worth 15 points.
Which seven-letter honeycomb results in the highest possible game score? To be a valid choice of seven letters, no letter can be repeated, it must not contain the letter S (that would be too easy) and there must be at least one pangram.
Solving this puzzle in R is interesting enough, but it’s particularly challenging to do so in a computationally efficient way. As much as I love the tidyverse, this, like the “lost boarding pass” puzzle and Emily Robinson’s evaluation of the best Pokémon team, this serves as a great example of using R’s matrix operations to work efficiently with data.
I’ve done a lot of puzzles recently, and I realized that showing the end result isn’t a representation of my thought process. I don’t show all the dead ends and bugs, or explain why I ended up choosing a particular path. So in the same spirit as my Tidy Tuesday screencasts, I recorded myself solving this puzzle (though not the process of turning it into a blog post).
Setup: Processing the word list
Our first step is to process the word list for words that could appear in a honeycomb puzzle. Based on the above rules, these will have at least four letters, have no more than 7 unique letters, and never contain the letter S. We’ll do this processing with tidyverse functions.
There are 44585 that are eligible to appear in a puzzle. This data gives us a way to solve the December 24th Honeycomb puzzle that comes with the Riddler column.
Looks like the score is 153, and the pangram that uses all seven letters is AMALGAM.
Could we use this get_words()
function to brute force every possible honeycomb? Only if we had a lot of time on our hands. Since S is eliminated, there are 25 possible choices for the center letter, and $24\choose 6$ (“24 choose 6”) possible combinations of outer letters, making 25 * choose(24, 6)
= 3.36 million possible honeycombs. It would take about 8 days to test every one this way. We can do a lot better.
Heuristics: What letters appear in the highest-scoring words?
Would you expect the winning honeycomb to have letters like X, Z, or Q? Neither would I. The winning honeycomb is likely filled with common letters like E that appear in lots of words.
Let’s quantify this, by looking at how many words each letter appears in, and how many total points they’d earn.
What letters have the most and least points?
Of course, any given honeycomb with one of these letters will get only a tiny fraction of the points available to the letter. There could be interactions between those terms (maybe two relatively rare letters go well together). But it gives us a sense of which letters are likely to be included (almost certainly not ones like X, Z, Q, and J), which may help us narrow down our search space in the next step.
Using matrices to score honeycombs
When you need R to be very efficient, you might want to turn to matrix operations, which when used properly are some of the fastest operations in R. Luckily, it turns out that a lot of this puzzle can be done through linear algebra. I’m presenting this as the finished product, but if you’re interested in my thought process as I came up with it I do recommend running through the screencast recording to show how I got here.
We start by creating a word-by-letter matrix. There are 44K words and (without S) 25 letters. To operate efficiently on these, we’ll want these in a 44K x 25 binary matrix rather than strings. The underrated reshape2::acast
can set this up for us.
A points-per-word vector. Once we have the set of words that are possible with each honeycomb, we’ll need to score them to find the maximum.
At this point, we take a matrix multiplication approach to score all the honeycomb combinations. Since I went through it in the screencast, I won’t walk through each of the steps except in the comments.
In this approach, I hold the center letter constant, and try every combination of 6 within a pool of letters (the possible_letters
argument) for the outer letters.
If we use all 25 available letters, this ends up impractically slow and memory-intensive (we end up with something like a 40,000 by 135,000 matrix). But if we bring it down to 15 letters, it can run in a few seconds per central letter. Here I’ll use the heuristic we came up with above: it’s likely that the winning solution is made up of mostly the ones that appear in lots of high-scoring words, like E, I, and A.
It takes about a minute to try all combinations of the top 15 letters. This makes it clear: the best combination is R in the center, and E, I, R, N, T, G surrounding it, for a total score of 4298.
How good was our heuristic at judging letters?
We took a shortcut, and therefore a bit of risk, in winnowing down the alphabet to just 15 letters. How could we get a sense of whether we still got the right answer?
We could start by looking at how good a predictor our heuristic was of how good a letter is in the center.
We can see that the heuristic isn’t perfect. For instance, G is a surprisingly good center letter (and makes it into the outer letters of our winning honeycomb) considering that overall it doesn’t quite make the top 10 for our heuristic. However, it’s unlikely we’re missing anything with the rarer letters.
Offline, I tried running this code with the top 21 letters (that is, all but Z, X, J, and Q) and while it takes a long time to run it confirms that you can’t beat R/EIRNTG, and that none of the letters after G are particularly strong.
Conclusion
After a decade programming in R, I still love the process of journeying through multiple approaches to a problem, and iterating before I find an efficient and elegant solution. (There’s probably still a lot of optimization I can do! I’m still using a brute force approach, and I don’t know if there’s another algorithmic way to solve it). Just because we end up with a handful of matrix multiplications doesn’t mean it’s easy to get there, especially when you’re out of practice with linear algebra like I am.
Every time I practice linear algebra pic.twitter.com/xDgXkFB8xD
— David Robinson (@drob) March 28, 2016
That’s why I’m glad that I recorded and shared my first attempt at this puzzle, including my dead ends and misunderstandings. (I even originally misunderstood the end goal of the puzzle, far from the first time that’s happened!) I hope it’s useful to you too!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.