Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Lo and behold James Joyce’s “A Portrait of the Artist as a Young Man” and its peculiar distribution of compression ratios …
library(stringr) library(ggplot2) compress <- function(str) { length(memCompress(str, type="bzip2")) / nchar(str) } calc_ratios <- function(title, N, sep=" ") { f <- paste("/media/Volume/temp/struc/",title,".txt",sep="") t <- scan(f, what=character()) y <- unlist(sapply(t, function(l) str_split(l,"[^a-zA-Z]"))) names(y) <- NULL y <- y[nchar(y)>1] ratios <- rep(NA, N) for(i in 1:N) { ratios[i] <- compress(paste(y[sample(length(y))],collapse=sep)) } results <- list("perms" = ratios, "original" = compress(paste(y,collapse=sep))) return(results) } res <- calc_ratios("portraitartistyoungman", N=10000) ggplot(data = data.frame(x = res[["perms"]])) + geom_histogram(aes(x = x),binwidth=.00001,stat="bin") + geom_vline(xintercept=res[["original"]], color="red") + labs(title="compression ratio of not permuted text in red", x="", y="") + theme(plot.title = element_text(vjust=1, size=12)) hist(res[["perms"]], breaks=100, main = "A Portrait of the Artist as a Young Man", xlab="distribution of compression ratios for word permutations with median in red", ylab="") abline(v=median(res[["perms"]]), col = "red")
Compression Ratio as a Measure for Structure
The difference between a normal text and a permutation of it is of course that we kill its structure by doing so. This makes me wonder if it would be possible to measure a texts degree of structure by relating its original compression ratio to its distribution of compression ratios. F.x.:
A low DOS would indicate a high degree of structure because of the relatively high difference to the lower boundary of the distribution of compression ratios for randomly permuted word order. Looking at the DOS column of the table below this interpretation seems reasonable when we compare the degree f.x. for Dickens vs. Joyce.
wc = word count original = compression ratio of original text (word order) dec_nth / median = nth decile / median of c.r. dist. for permuted word orderings title wc original dec_1st median dec_9th DOS Dubliners 65609 0.28288 0.31397 0.31446 0.31511 0.901 Ulysses 253927 0.32091 0.34607 0.34632 0.34663 0.927 Great Expectations 176096 0.26178 0.29518 0.29534 0.29556 0.887 David Copperfield 337291 0.25696 0.29184 0.292 0.29217 0.88 Huckleberry Finn 105441 0.26252 0.2999 0.30017 0.30056 0.875 Life on Mississippi 138867 0.27769 0.30453 0.30474 0.30534 0.912
Technicalities
All texts are taken from Project Gutenberg – relieved of sections not belonging to the book itself – reduced to letters a to z and A to Z by replacing anything else with a space – the result is split at white spaces and all “words” of length 1 are discarded. This vector of words or string atoms is then permuted using sample().
The compression applied is of type bzip2 and performed using memCompress() with calculation of compression ratio being implemented as follows:
compress <- function(str) { length(memCompress(str, type="bzip2")) / nchar(str) }
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.