In the previous post in this series we coyly unveiled the tantalising mysteries of the Voynich Manuscript: an early 15th century text written in an unknown alphabet, filled with compelling illustrations of plants, humans, astronomical charts, and less easily-identifiable entities.
Stretching back into the murky history of the Voynich Manuscript, however, is the lurking suspicion that it is a fraud; either a modern fabrication or, perhaps, a hoax by a contemporary scribe.
One of the better-known arguments for the authenticity of the manuscript, beyond its manufacture from period parchment and inks, is that the text appears to follow certain statistical properties associated with human language, properties that were unknown at the time of its creation.
The most famous of these properties is Zipf's Law, which the word frequencies of the Voynich Manuscript have long been claimed to follow: the frequency of a word's occurrence in a text is inversely proportional to its rank in the list of words ordered by frequency.
In this post, we will scrutinise the extent to which the expected statistical properties of natural languages hold for the arcane glyphs presented by the Voynich Manuscript.
Unnatural Laws
Zipf’s Law is an example of a discrete power law probability distribution. Power laws have been found to lurk beneath a sinister variety of ostensibly natural phenomena, from the relative size of human settlements to the diversity of species descended from a particular ancestral freshwater fish.
In its original context of human language, Zipf's Law states that the most common word in a given language is likely to be roughly twice as common as the second most common word, and three times as common as the third most common word. In practice, the law holds for much of a corpus but tends to break down somewhat at both the most-frequent and least-frequent words¹. Despite this, we will focus on the original, simpler Zipfian characterisation in this analysis.
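To make the idealised form concrete, the following minimal sketch (illustrative only, not drawn from the manuscript) computes the relative frequencies implied by a strict Zipfian relationship, in which frequency is proportional to the reciprocal of rank:

# Idealised Zipfian frequencies: the word of rank r occurs with frequency
# proportional to 1/r, so rank 2 is half as common as rank 1, and so on.
library( tidyverse )

tibble( rank = 1:10 ) %>%
   mutate( freq = ( 1 / rank ) / sum( 1 / rank ) )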
The most well-known, if highly flawed, method to determine whether a distribution follows a power law is to plot it with both axes expressed as a log-scale: a so-called log-log plot. A power law, represented in such a way, will appear linear. Unfortunately, a hideous menagerie of other distributions will also appear linear in such a setting.
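As a hedged illustration of that pitfall, the sketch below (synthetic data, arbitrary parameters of my own choosing) bins samples from a heavy-tailed lognormal distribution and plots them on log-log axes; over much of its range the result can look temptingly straight despite not being a power law:

# Synthetic demonstration: a lognormal sample, crudely binned, can trace a
# deceptively straight line on a log-log plot over much of its range.
library( tidyverse )
set.seed( 93 )

tibble( x = rlnorm( 10000, meanlog=0, sdlog=2 ) ) %>%
   count( value = floor( x ) ) %>%
   filter( value > 0 ) %>%
   ggplot( aes( x=log( value ), y=log( n ) ) ) +
   geom_point( colour="#8a0707" ) +
   labs( x="Log Value", y="Log Count" )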
More generally, it is rarely sensible to claim that any natural phenomenon follows a given distribution or model; we can only demonstrate that a distribution presents a useful model for a given set of observations. Indeed, it is possible to fit a power law to any set of observations; the fit may simply be dreadful. Ultimately, we can do little more than demonstrate that a given model is the best simulacrum of observed reality, subject to the uses to which it will be put. Certainly, a more rigorous approach, Bayesian or otherwise, would compare a range of candidate models and demonstrate that the power law describes the data best. All truth, it seems, is relative.
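The poweRlaw package used later in this post provides one such comparison via compare_distributions(), an implementation of Vuong's likelihood ratio test between two fitted models. The sketch below runs it on synthetic counts (generated with rpldis(); the object names are hypothetical) to weigh a discrete power law against a discrete lognormal. It indicates only which of the two models the data prefer, not that either is true:

# Compare a discrete power law against a discrete lognormal on the same
# counts, following the poweRlaw vignette. Both models must share an xmin.
library( poweRlaw )
set.seed( 93 )

counts <- rpldis( 1000, xmin=1, alpha=2 )       # synthetic power-law counts

pl_model <- displ$new( counts )                 # discrete power law
pl_model$setXmin( estimate_xmin( pl_model ) )
pl_model$setPars( estimate_pars( pl_model ) )

ln_model <- dislnorm$new( counts )              # discrete lognormal
ln_model$setXmin( pl_model$getXmin() )
ln_model$setPars( estimate_pars( ln_model ) )

# Vuong's test: a small two-sided p-value means one model is clearly preferred.
comparison <- compare_distributions( pl_model, ln_model )
comparison$p_two_sided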
Faced with the awful statistical horror of the universe, we are reduced to seeking evidence against a phenomenon's adherence to a given distribution. Our first examination, then, is to see whether the basic log-log plot supports or undermines the Voynich Manuscript's claim to natural language.
A crude visual analysis certainly supports the argument that, for much of the upper half of the Voynich corpus, there is a linear relationship on the log-log plot consistent with Zipf’s Law. As mentioned, however, this superficial appeal to our senses leaves a gnawing lack of certainty in the conclusion. We must turn to less fallible tools.
The poweRlaw package for R is designed specifically to exorcise these particular demons. This package attempts to fit a power law distribution to a series of observations, in our case the word frequencies observed in the corpus of Voynich text. With the fitted model, we then attempt to disprove the null hypothesis that the data is drawn from a power law. If this attempt to betray our own model fails, then we attain an inverse enlightenment: there is insufficient evidence that the data is not drawn from a power law.
This is an inversion of the more typical frequentist null hypothesis scenario. In such approaches, we usually hope for a low p-value, below 0.05 or even 0.001, showing that the chance of the observations arising under the null hypothesis is extremely low. For this test, we instead hope that our p-value is insufficiently low to make such a claim, and thus that a power law remains consistent with the data.
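Before returning to the Voynich fit itself, here is a minimal sketch of this inverted decision rule using the same poweRlaw machinery as the full script below (synthetic data, hypothetical names, and the 0.1 threshold suggested by Clauset et al.):

# The inverted decision rule: a *high* bootstrap p-value means we cannot
# rule the power law out; a low one rejects it.
library( poweRlaw )
set.seed( 93 )

counts <- rpldis( 1000, xmin=1, alpha=2 )       # synthetic power-law counts
pl_model <- displ$new( counts )
pl_model$setXmin( estimate_xmin( pl_model ) )

bs <- bootstrap_p( pl_model, no_of_sims=100, threads=2 )
if ( bs$p > 0.1 ) {
   message( "No evidence against the power law: it remains a plausible model." )
} else {
   message( "The power law can be rejected as a model for these observations." )
}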
The diagram above shows a fitted parameterisation of the power law according to the poweRlaw package. In addition to the visually appealing fit of the line, the weirdly inverted logic of the above test provides a p-value of 0.151. We thus have as much confidence as we can have, via this approach, that a power law is a reasonable model for the text in the Voynich corpus.
voynich_powerlaw.r
library( tidyverse )
library( magrittr )
library( ggthemes )
library( showtext )
library( tidytext )
library( drlib )
library( poweRlaw )
library( cowplot )
library( magick )

font_add( "voynich_font", "/usr/shares/TTF/weird/voynich/eva1.ttf" )
font_add( "main_font", "/usr/shares/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )
font_add( "bold_font", "/usr/shares/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )
showtext_auto()

message( "Reading raw Voynich data..." )
voynich_tbl <-
   read_csv( "data/voynich_raw.txt", col_names=FALSE ) %>%
   rename( folio = X1, text = X2 )

# Tokenize
voynich_words <-
   voynich_tbl %>%
   unnest_tokens( word, text )

# Most common words, with relative frequencies
message( "Calculating Voynich language statistics..." )
voynich_common <-
   voynich_words %>%
   count( word, sort=TRUE ) %>%
   mutate( word = reorder( word, n ) ) %>%
   mutate( freq = n / sum( n ) )

# Per-folio word counts
voynich_word_counts <-
   voynich_words %>%
   count( word, folio, sort=TRUE )

# (Following the poweRlaw vignette)
# Create a discrete power law distribution object from the word counts
voynich_powerlaw <-
   voynich_common %>%
   extract2( "n" ) %>%
   displ$new()

# Estimate the lower bound
voynich_powerlaw_xmin <- estimate_xmin( voynich_powerlaw )

# Set the parameters of voynich_powerlaw to the estimated values
voynich_powerlaw$setXmin( voynich_powerlaw_xmin )

# Estimate parameters of the power law distribution
voynich_powerlaw_est <- estimate_pars( voynich_powerlaw )

# Calculate p-value of the power law. See Section 4.2 of "Power-Law
# Distributions in Empirical Data" by Clauset et al. If the p-value is
# _greater_ than 0.1 then we cannot rule out a power-law distribution.
voynich_powerlaw_bootstrap_p <-
   bootstrap_p( voynich_powerlaw, no_of_sims=1000, threads=7 )   # p=0.143: power law cannot be ruled out

# Parameter uncertainty via bootstrapping
voynich_powerlaw_bootstrap <- bootstrap( voynich_powerlaw, no_of_sims=1000, threads=7 )

# Extract data and power law fit for plotting
voynich_powerlaw_plot_data <-
   plot( voynich_powerlaw, draw=FALSE ) %>%
   mutate( log_x = log( x ), log_y = log( y ) )

voynich_powerlaw_fit_data <-
   lines( voynich_powerlaw, col=2, draw=FALSE ) %>%
   mutate( log_x = log( x ), log_y = log( y ) )

# Plot the fitted power law data
gp <-
   ggplot( voynich_powerlaw_plot_data ) +
   geom_point( aes( x=log( x ), y=log( y ) ), colour="#8a0707" ) +
   geom_line( data=voynich_powerlaw_fit_data, aes( x=log( x ), y=log( y ) ), colour="#0b6788" ) +
   labs( x="Log Rank", y="Log Frequency" )

gp <- gp +
   theme(
      panel.background = element_rect( fill="transparent", colour="transparent" ),
      plot.background = element_rect( fill="transparent", colour="transparent" ),
      plot.title = element_text( family="bold_font", colour="#3c3f4a", size=22 ),
      plot.subtitle = element_text( family="bold_font", colour="#3c3f4a", size=12 ),
      axis.text = element_text( family="bold_font", colour="#3c3f4a", size=12 ),
      axis.title.x = element_text( family="bold_font", colour="#3c3f4a", size=12 ),
      axis.title.y = element_text( family="bold_font", colour="#3c3f4a", size=12 ),
      axis.line = element_line( colour="#3c3f4a" ),
      panel.grid.major.x = element_blank(),
      panel.grid.major.y = element_blank(),
      panel.grid.minor.x = element_blank(),
      panel.grid.minor.y = element_blank()
   )

# Base theme for the cowplot composition
theme_set( theme_cowplot( font_size=4, font_family="main_font" ) )

# Cowplot trick for ggtitle
title <-
   ggdraw() +
   draw_label( "Fitted Power Law of Voynich Corpus", family="bold_font", colour="#3c3f4a", size=20, hjust=0, vjust=1, x=0.02, y=0.88 ) +
   draw_label( "http://www.weirddatascience.net | @WeirdDataSci", family="main_font", colour="#3c3f4a", size=12, hjust=0, vjust=1, x=0.02, y=0.40 )

data_label <-
   ggdraw() +
   draw_label( "Data: http://www.voynich.nu", family="main_font", colour="#3c3f4a", size=14, hjust=1, x=0.98 )

# Combine plots
tgp <- plot_grid( title, gp, data_label, ncol=1, rel_heights=c( 0.1, 1, 0.1 ) )

# Add parchment underlay
parchment_plot <-
   ggdraw() +
   draw_image( "img/parchment.jpg", scale=1.4 ) +
   draw_plot( tgp )

save_plot( "output/voynich_power_law.pdf", parchment_plot, base_width=16, base_height=9, base_aspect_ratio=1.78 )
Led further down these twisting paths by an initial taste of success, we can now set the Voynich corpus against other human-language corpora to gain an impression of how similar it is to known languages. The following plot compares the word frequencies of the Voynich Manuscript to those of the twenty most popular languages in Wikipedia, taken from the dataset linked below.
Data:
- Wikipedia Language Corpora: https://www.datos.gov.co/en/Ciencia-Tecnolog-a-e-Innovaci-n/Multilingual-Wikipedia-2015-word-frequencies-32-la/jmxq-gzgh/data
voynich_zipf_wikipedia.r
library( tidyverse )
library( magrittr )
library( ggthemes )
library( showtext )
library( tidytext )
library( drlib )
library( cowplot )
library( magick )

font_add( "voynich_font", "/usr/shares/TTF/weird/voynich/eva1.ttf" )
font_add( "main_font", "/usr/shares/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )
font_add( "bold_font", "/usr/shares/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )
showtext_auto()

message( "Reading raw Voynich data..." )
voynich_tbl <-
   read_csv( "data/voynich_raw.txt", col_names=FALSE ) %>%
   rename( folio = X1, text = X2 )

# Tokenize
# Stemming and stopword removal apparently not so effective anyway,
# according to Schofield et al.: <www.cs.cornell.edu/~xanda/winlp2017.pdf>
voynich_words <-
   voynich_tbl %>%
   unnest_tokens( word, text )

# Most common words, with relative frequencies
message( "Calculating Voynich language statistics..." )
voynich_common <-
   voynich_words %>%
   count( word, sort=TRUE ) %>%
   mutate( word = reorder( word, n ) ) %>%
   mutate( freq = n / sum( n ) )

# Per-folio word counts
voynich_word_counts <-
   voynich_words %>%
   count( word, folio, sort=TRUE )

# Load other languages.
# Select frequency counts, convert to long format, then normalise per-language.
message( "Loading common language statistics..." )
wiki_language <-
   read.csv( "data/Multilingual_Wikipedia_2015_word_frequencies__32_languages_X_5_million_words.csv" ) %>%
   head( 10000 ) %>%
   as_tibble %>%
   select( matches( "_FREQ$" ) ) %>%
   gather( key = "language", value = "count" ) %>%
   mutate( language = str_replace( language, "_FREQ", "" ) ) %>%
   group_by( language ) %>%
   transmute( freq = count / sum( count ) ) %>%
   ungroup

# Combine with Voynich, assigning it the unassigned ISO 3166-1 alpha-2 code "vy"
message( "Combining common and Voynich language statistics..." )
voynich_language <-
   voynich_common %>%
   transmute( language = "vy", freq = freq )

# Combine, then add per-language rank information
message( "Processing common and Voynich language statistics..." )
all_languages <-
   bind_rows( wiki_language, voynich_language ) %>%
   mutate( colour = ifelse( str_detect( language, "vy" ), "red", "grey" ) ) %>%
   group_by( language ) %>%
   transmute( log_rank=log( row_number() ), log_freq=log( freq ), colour ) %>%
   ungroup

# Plot a log-log plot of all language word frequencies.
message( "Plotting common and Voynich language statistics..." )
voynich_wikipedia_plot <-
   all_languages %>%
   ggplot( aes( x=log_rank, y=log_freq, colour=colour ) ) +
   geom_point( alpha=0.4, shape=20 ) +
   scale_color_manual( values=c( "#3c3f4a", "#8a0707" ) ) +
   theme(
      axis.title.y = element_text( angle=90, family="main_font", size=12 ),
      axis.text.y = element_text( colour="#3c3f4a", family="main_font", size=12 ),
      axis.title.x = element_text( colour="#3c3f4a", family="main_font", size=12 ),
      axis.text.x = element_text( colour="#3c3f4a", family="main_font", size=12 ),
      axis.line.x = element_line( color="#3c3f4a" ),
      axis.line.y = element_line( color="#3c3f4a" ),
      plot.title = element_blank(),
      plot.subtitle = element_blank(),
      plot.background = element_rect( fill="transparent" ),
      panel.background = element_rect( fill="transparent" )
   ) +
   guides( colour="none" ) +
   labs( y="Log Frequency", x="Log Rank" )

# Base theme for the cowplot composition
theme_set( theme_cowplot( font_size=4, font_family="main_font" ) )

# Cowplot trick for ggtitle
title <-
   ggdraw() +
   draw_label( "Voynich Manuscript Rank Frequency Distribution against Wikipedia Corpora", family="bold_font", colour="#3c3f4a", size=20, hjust=0, vjust=1, x=0.02, y=0.88 ) +
   draw_label( "http://www.weirddatascience.net | @WeirdDataSci", family="bold_font", colour="#3c3f4a", size=12, hjust=0, vjust=1, x=0.02, y=0.40 )

data_label <-
   ggdraw() +
   draw_label( "Data from: http://www.voynich.nu | http://wikipedia.org", family="bold_font", colour="#3c3f4a", size=12, hjust=1, x=0.98 )

tgp <- plot_grid( title, voynich_wikipedia_plot, data_label, ncol=1, rel_heights=c( 0.1, 1, 0.1 ) )

# Add parchment underlay
voynich_wikipedia_plot <-
   ggdraw() +
   draw_image( "img/parchment.jpg", scale=1.4 ) +
   draw_plot( tgp )

save_plot( "output/voynich_wikipedia_plot.pdf", voynich_wikipedia_plot, base_width=16, base_height=9, base_aspect_ratio=1.78 )
The Voynich text seems consistent with the behaviour of known natural languages from Wikipedia. The most striking difference is the clustering of Voynich word frequencies in the lower half of the diagram, a consequence of the much smaller Voynich corpus: many of its lower-frequency words occur an identical number of times, so the tail of its rank-frequency curve descends in discrete steps rather than a smooth slope.
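A toy example (synthetic data, not the Voynich corpus; all parameters arbitrary) of this small-corpus effect: sampling a few hundred "words" leaves many of them with identical low counts, producing the same stepped tail.

# Toy corpus: many words share identical low counts, so the tail of the
# log-log rank-frequency plot collapses into discrete steps.
library( tidyverse )
set.seed( 93 )

toy_words <- sample( letters, 300, replace=TRUE, prob=1 / seq_along( letters ) )

tibble( word = toy_words ) %>%
   count( word, sort=TRUE ) %>%
   mutate( rank = row_number() ) %>%
   ggplot( aes( x=log( rank ), y=log( n ) ) ) +
   geom_point( colour="#8a0707" ) +
   labs( x="Log Rank", y="Log Frequency" )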
To highlight this phenomenon, we can apply a similar technique to another widely-translated short text: the Universal Declaration of Human Rights (UDHR).
voynich_zipf_udhr.r
library( tidyverse )
library( magrittr )
library( ggthemes )
library( showtext )
library( tidytext )
library( cowplot )
library( magick )

font_add( "voynich_font", "/usr/shares/TTF/weird/voynich/eva1.ttf" )
font_add( "main_font", "/usr/shares/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )
font_add( "bold_font", "/usr/shares/TTF/weird/alchemy/1651 Alchemy/1651AlchemyNormal.otf" )
showtext_auto()

message( "Reading raw Voynich data..." )
voynich_tbl <-
   read_csv( "data/voynich_raw.txt", col_names=FALSE ) %>%
   rename( folio = X1, text = X2 )

# Tokenize
# Stemming and stopword removal apparently not so effective anyway,
# according to Schofield et al.: <www.cs.cornell.edu/~xanda/winlp2017.pdf>
voynich_words <-
   voynich_tbl %>%
   unnest_tokens( word, text )

# Most common words, with relative frequencies
message( "Calculating Voynich language statistics..." )
voynich_common <-
   voynich_words %>%
   count( word, sort=TRUE ) %>%
   mutate( word = reorder( word, n ) ) %>%
   mutate( freq = n / sum( n ) )

# Combine with Voynich, assigning it the unassigned ISO 3166-1 alpha-2 code "vy"
message( "Combining common and Voynich language statistics..." )
voynich_language <-
   voynich_common %>%
   transmute( language = "vy", freq = freq )

voynich_word_counts <-
   voynich_words %>%
   count( word, folio, sort=TRUE )

# UDHR corpus comparison (smaller text)
udhr_corpus_files <- list.files( "data/udhr/udhr_txt", pattern="*.txt", full.names=TRUE )

# Helper function to read in a text file and calculate a frequency table
table_frequency_mapper <- function( x ) {

   # Read file and extract language code from filename
   udhr_text <- read_lines( x, skip=6, skip_empty_rows=TRUE )
   language <-
      basename( x ) %>%
      str_replace( "udhr_", "" ) %>%
      str_replace( ".txt", "" )

   # Tokenize and remove punctuation
   udhr_words <-
      udhr_text %>%
      str_flatten %>%
      str_remove_all( "[.,]" ) %>%
      str_split( "\\s+" ) %>%
      extract2( 1 ) %>%
      { tibble( word=. ) }

   # Most common words
   udhr_words %>%
      count( word, sort=TRUE ) %>%
      mutate( word = reorder( word, n ), language )

}

voynich_corpus <-
   voynich_language %>%
   transmute( language, log_rank=log( row_number() ), log_freq=log( freq ), colour="Voynich Text" )

udhr_corpus <-
   udhr_corpus_files %>%
   map( table_frequency_mapper ) %>%
   bind_rows %>%
   group_by( language ) %>%
   transmute( log_rank=log( row_number() ), log_freq=log( n / sum( n ) ), colour="Known UDHR Language" ) %>%
   ungroup

voynich_udhr_corpus <- bind_rows( udhr_corpus, voynich_corpus )

voynich_udhr_frequency_plot <-
   voynich_udhr_corpus %>%
   ggplot( aes( x=log_rank, y=log_freq, colour=colour ) ) +
   geom_point( alpha=0.4, shape=19 ) +
   scale_color_manual( values=c( "Known UDHR Language" = "#3c3f4a", "Voynich Text" = "#8a0707" ) ) +
   theme(
      axis.title.y = element_text( angle=90, family="main_font", size=12 ),
      axis.text.y = element_text( colour="#3c3f4a", family="main_font", size=12 ),
      axis.title.x = element_text( colour="#3c3f4a", family="main_font", size=12 ),
      axis.text.x = element_text( colour="#3c3f4a", family="main_font", size=12 ),
      axis.line.x = element_line( color="#3c3f4a" ),
      axis.line.y = element_line( color="#3c3f4a" ),
      plot.title = element_blank(),
      plot.subtitle = element_blank(),
      plot.background = element_rect( fill="transparent" ),
      panel.background = element_rect( fill="transparent" ),
      panel.grid.major.x = element_blank(),
      panel.grid.major.y = element_blank(),
      panel.grid.minor.x = element_blank(),
      panel.grid.minor.y = element_blank(),
      legend.text = element_text( family="bold_font", colour="#3c3f4a", size=10 ),
      legend.title = element_blank(),
      legend.key.height = unit( 1.2, "lines" ),
      legend.position = c( 0.85, 0.5 )
   ) +
   labs( y="Log Frequency", x="Log Rank" )

# Base theme for the cowplot composition
theme_set( theme_cowplot( font_size=4, font_family="main_font" ) )

# Cowplot trick for ggtitle
title <-
   ggdraw() +
   draw_label( "Voynich Manuscript Rank Frequency Distribution against UDHR Translations", family="bold_font", colour="#3c3f4a", size=20, hjust=0, vjust=1, x=0.02, y=0.88 ) +
   draw_label( "http://www.weirddatascience.net | @WeirdDataSci", family="bold_font", colour="#3c3f4a", size=12, hjust=0, vjust=1, x=0.02, y=0.40 )

data_label <-
   ggdraw() +
   draw_label( "Data from: http://www.voynich.nu | http://unicode.org/udhr/", family="bold_font", colour="#3c3f4a", size=12, hjust=1, x=0.98 )

tgp <- plot_grid( title, voynich_udhr_frequency_plot, data_label, ncol=1, rel_heights=c( 0.1, 1, 0.1 ) )

# Add parchment underlay
voynich_udhr_plot <-
   ggdraw() +
   draw_image( "img/parchment.jpg", scale=1.4 ) +
   draw_plot( tgp )

save_plot( "output/voynich_udhr_plot.pdf", voynich_udhr_plot, base_width=16, base_height=9, base_aspect_ratio=1.78 )
A Refined Randomness
The above arguments might at first appear compelling. The surface incomprehensibility of the Voynich Manuscript succumbs to the deep currents of statistical laws, and reveals an underlying pattern amongst the chaos of the text.
Sadly, however, as with all too many arguments in the literature regarding power law distributions arising in nature, there is a complication to this argument that again highlights the difference between proof and the failure to disprove. Certainly, if a power law had proved incompatible with the Voynich Manuscript then we would have doubted its authenticity. With its apparent adherence to such a distribution, however, we have taken only one hesitant step towards confidence.
Rugg has argued that certain random mechanisms can produce text that adheres to Zipf's Law, and has demonstrated a simple mechanical procedure for doing so. A more compelling argument is presented, without reference to the Voynich Manuscript, by Li (1992).
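Li's argument can be sketched in a few lines: generate "text" by typing uniformly random letters with a fixed chance of hitting the space bar, and the rank-frequency distribution of the resulting meaningless "words" is already close to Zipfian. The simulation below is a hedged illustration of that idea (arbitrary parameters; all names my own), not a reconstruction of Li's analysis:

# Random typing after Li (1992): uniform keystrokes over a few letters plus
# a space produce "words" whose rank-frequency plot is near-Zipfian.
library( tidyverse )
library( magrittr )
set.seed( 93 )

keys <- c( letters[1:5], " " )                  # five letters and a space bar
keystrokes <- sample( keys, 1e5, replace=TRUE )

monkey_words <-
   paste( keystrokes, collapse="" ) %>%
   str_split( "\\s+" ) %>%
   extract2( 1 ) %>%
   discard( ~ .x == "" )

tibble( word = monkey_words ) %>%
   count( word, sort=TRUE ) %>%
   mutate( rank = row_number() ) %>%
   ggplot( aes( x=log( rank ), y=log( n ) ) ) +
   geom_point( colour="#8a0707" ) +
   labs( x="Log Rank", y="Log Frequency" )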