
RGolf: NGSL Scrabble

[This article was first published on R snippets, and kindly contributed to R-bloggers.]
This is the last part of RGolf before summer. As R excels at visualization, today's task is to generate a plot.

We will work with NGSL data – a list of 2801 important vocabulary words for students of English as a second language. I have prepared the list as an NGSL101.txt file for download.

Now for the task. Load NGSL101.txt into R. For each word in the list we want to count the words from the list that can be formed from a subset of its letters (as in Scrabble); the word itself is counted too. For example, the word “shoot” has the following subwords: “to”, “so”, “too”, “hot”, “shoot”, “host” and “shot”. The final product is a plot of the relationship between the number of letters in a word and the logarithm of the number of its subwords.
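To make the definition concrete, here is a minimal sketch of the subword test done by counting letters. The helper name is_subword_counts and the two extra candidates ("shoe", "tooth") are my own illustration, not part of the task; the solutions further down use a regular expression instead.

```r
# Hypothetical helper: TRUE if `test` can be built from the letters of `ref`,
# using each letter of `ref` at most once.
is_subword_counts <- function(test, ref) {
  need <- table(strsplit(test, "")[[1]])
  have <- table(strsplit(ref, "")[[1]])
  # every needed letter must occur in ref ...
  if (!all(names(need) %in% names(have))) return(FALSE)
  # ... and at least as many times as it is needed
  all(as.integer(have[names(need)]) >= as.integer(need))
}

# the seven subwords from the example plus two made-up non-subwords
candidates <- c("to", "so", "too", "hot", "shoot", "host", "shot",
                "shoe", "tooth")
keep <- vapply(candidates, is_subword_counts, logical(1), ref = "shoot")
found <- candidates[keep]
print(found)
```

"shoe" is rejected because "shoot" has no "e", and "tooth" because it needs two "t"s while "shoot" has only one.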

The rules of the game are standard:
(1) generate the plot in as few keystrokes as possible;
(2) plot formatting (e.g. x and y axis titles, type of plot) is not important;
(3) one line of code may not be longer than 80 characters;
(4) the solution must be in base R only (no package loading is allowed);
(5) assume that you have NGSL101.txt file in your R working directory.
As always – if you have a nice solution please submit a comment :).

Warning! This time the task takes a bit longer to compute, so it is worth developing and testing the solution on a subset of the NGSL word list.
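The reason for the warning is that every word is compared with every word, so the work grows with the square of the list length. A small sketch of how one might pick a development subset; the vector d below is a made-up stand-in for the real list, and the names n_dev and d_dev are hypothetical.

```r
# Made-up stand-in for the word list; with the real data `d` would be
# read from NGSL101.txt.
d <- c("shoot", "hot", "to", "too", "so", "host", "shot",
       "the", "other", "her")

set.seed(1)                # make the random subset reproducible
n_dev <- 5                 # hypothetical size; a few hundred on the real list
d_dev <- sample(d, n_dev)  # random subset, so short and long words both appear

# number of word pairs that have to be compared
pairs_full <- 2801^2       # full list, assuming 2801 words as stated above
pairs_dev <- length(d_dev)^2
print(c(full = pairs_full, dev = pairs_dev))
```

Keep in mind that counts computed on a subset are lower than on the full list, because the pool of candidate subwords shrinks too.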

Here is my attempt in 169 keystrokes:

d=scan("NGSL101.txt","",skip=1)
a=s((s=sapply)(strsplit(d,""),sort),paste,collapse=".*")
y=log(s(a,function(z)sum(s(a,function(i)grepl(i,z)))))
plot(by(y,nchar(d),mean))
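The trick in the second and third lines may be easier to follow on a toy example. This is only a sketch: the four-word vector is made up, and s is assigned up front here purely for readability.

```r
s <- sapply  # assigned up front in this sketch for readability
d <- c("shoot", "hot", "to", "tooth")  # made-up stand-in for the word list

# sort the letters of each word and join them with ".*"
a <- s(s(strsplit(d, ""), sort), paste, collapse = ".*")
print(a)

# the same strings serve as patterns and as text to search in:
# the letters of "hot" appear, in sorted order, among those of "shoot"
hot_in_shoot <- grepl(a[2], a[1])
# "tooth" needs two t's, but "shoot" has only one
tooth_in_shoot <- grepl(a[4], a[1])

# number of subwords of each word (every word matches itself)
counts <- s(a, function(z) sum(s(a, function(i) grepl(i, z))))
print(unname(counts))
```

Because both the pattern and the searched string have their letters sorted, a regular expression of the form "h.*o.*t" is enough to check that each letter is available at least as many times as needed.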

And the output is the following (a plot of the mean log number of subwords against word length):

As we can see, the number of subwords grows, on average, approximately exponentially with the number of letters in a word.

And here is a verbose version of the solution with comments (warning again – it is slower than the solution given above):

d <- readLines("NGSL101.txt")
d <- d[-1] # remove first line from the dataset as it is a comment

is.subword <- function(test, ref) {
    # we check if test is a subword of ref by applying regular
    # expression on sorted letters contained in both words
    test <- paste(sort(strsplit(test, "")[[1]]), collapse = ".*")
    ref <- paste(sort(strsplit(ref, "")[[1]]), collapse = "")
    # grepl returns TRUE if a match is found
    grepl(test, ref)
}

# traverse all words in d and count number of matches
count.subwords <- function(ref) {
    sum(sapply(d, is.subword, ref = ref))
}

x2 <- nchar(d)
y2 <- log(sapply(d, count.subwords))
y2.means <- tapply(y2, x2, mean)
plot(y2.means)

