Relation of Word Order and Compression Ratio and Degree of Structure

Raffael Vogler

8 years ago

[This article was first published on joy of data » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Having a habit of compulsively wondering approximately every 34.765th day about how zip compression (bzip2 in this case) might be used to measure information contained in data – this time the question popped up in my head of whether or not and if then how permutation of a text’s words would affect its compression ratio. The answer is – it does – and it does so in a very weird and fascinating way.

Lo and behold James Joyce’s “A Portrait of the Artist as a Young Man” and its peculiar distribution of compression ratios …

library(stringr)
library(ggplot2)

compress <- function(str) {
  length(memCompress(str, type="bzip2")) / nchar(str)
}

calc_ratios <- function(title, N, sep=" ") {
  
  f <- paste("/media/Volume/temp/struc/",title,".txt",sep="")
  
  t <- scan(f, what=character())
  y <- unlist(sapply(t, function(l) str_split(l,"[^a-zA-Z]")))
  names(y) <- NULL
  y <- y[nchar(y)>1]
  
  ratios <- rep(NA, N)
  for(i in 1:N) {
    ratios[i] <- compress(paste(y[sample(length(y))],collapse=sep))
  }
  
  results <- list("perms" = ratios, "original" = compress(paste(y,collapse=sep)))
  
  return(results)
}

res <- calc_ratios("portraitartistyoungman", N=10000)

ggplot(data = data.frame(x = res[["perms"]])) + 
  geom_histogram(aes(x = x),binwidth=.00001,stat="bin") +
  geom_vline(xintercept=res[["original"]], color="red") +
  labs(title="compression ratio of not permuted text in red", x="", y="") +
  theme(plot.title = element_text(vjust=1, size=12))

hist(res[["perms"]], breaks=100, 
     main = "A Portrait of the Artist as a Young Man", 
     xlab="distribution of compression ratios for word permutations with median in red", ylab="")
abline(v=median(res[["perms"]]), col = "red")

So obviously the compression ratio is indeed affected by word order and even comes along with a very nice looking distribution. But the best part is that the compression ratio of the original word order is effectively statistically mega-super-über-hyper-significant! The same is true for all the book’s texts I checked out and, I guess, it is true for every natural language text in the world. Intuitively I thought that maybe there is a distribution which is typical for text but from what I see, they are pretty different from content to content. It even seems to me that there is a resemblance of shape per author (“authorship attribution”). Though, Ulysses – the literary troublemaker of last century – is of course not playing along nicely. Twist features a plateau, the other two Joyces a double hill and Dickens a bell-shape.

Compression Ratio as a Measure for Structure

The difference between a normal text and a permutation of it is of course that we kill its structure by doing so. This makes me wonder if it would be possible to measure a texts degree of structure by relating its original compression ratio to its distribution of compression ratios. F.x.:

A low DOS would indicate a high degree of structure because of the relatively high difference to the lower boundary of the distribution of compression ratios for randomly permuted word order. Looking at the DOS column of the table below this interpretation seems reasonable when we compare the degree f.x. for Dickens vs. Joyce.

wc = word count
original = compression ratio of original text (word order)
dec_nth / median = nth decile / median of c.r. dist. for permuted word orderings

               title  wc      original   dec_1st   median    dec_9th   DOS
           Dubliners  65609   0.28288    0.31397   0.31446   0.31511   0.901
             Ulysses  253927  0.32091    0.34607   0.34632   0.34663   0.927
  Great Expectations  176096  0.26178    0.29518   0.29534   0.29556   0.887
   David Copperfield  337291  0.25696    0.29184   0.292     0.29217   0.88
    Huckleberry Finn  105441  0.26252    0.2999    0.30017   0.30056   0.875
 Life on Mississippi  138867  0.27769    0.30453   0.30474   0.30534   0.912

Technicalities

All texts are taken from Project Gutenberg – relieved of sections not belonging to the book itself – reduced to letters a to z and A to Z by replacing anything else with a space – the result is split at white spaces and all “words” of length 1 are discarded. This vector of words or string atoms is then permuted using sample().

The compression applied is of type bzip2 and performed using memCompress() with calculation of compression ratio being implemented as follows:

compress <- function(str) {
  length(memCompress(str, type="bzip2")) / nchar(str)
}

To leave a comment for the author, please follow the link and comment on their blog: joy of data » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.