Inventing New Words. Tribute to Umberto Eco
[1] [2] [3] In the mid-1980s, while I was a first-year physics student at the University of Bologna (Italy), a few friends of mine dragged me to the School of Drama, Arts and Music Studies (DAMS) to attend a couple of Umberto Eco’s lectures. I still remember how happy and impressed I was. I did not merely listen to him; I literally drank in his words, fascinated by the man, his style, his erudition, his rigor, at once austere and brilliant.
A couple of years later, during my Belgian university studies in quantitative social sciences, I had the opportunity to study Eco’s academic contributions enthusiastically for my exams in linguistics, semiotics, social sciences and philosophy. Well before the novelist Umberto Eco, I liked, and still like, the investigator: meticulous to a mind-boggling degree, yet always ready to push his thinking well beyond the limits of academic knowledge, into popular culture, in search of meaning. His ability to play with words, or to use them as tools or as historical markers, was astonishing. As I still feel the emotion of his death, I would like to propose, as a kind of tribute and language game, a method for inventing words; actually, it is more an algorithm than a method. It will focus on the Italian language, which was Eco’s mother tongue (and my second mother tongue as well).
The easiest and most immediate way to invent a new word is to sample letters at random. Proceeding that way to create words of, say, 7 letters, I got: nevltur, capbowqa, grnsohy, tfgzuoq, birymdo. [4] Nothing wrong with that, but the least one can say is that these words hardly “sound” Italian. How can I assert that? Perhaps because some sounds in these words are so rare in Italian (if not entirely non-existent) that they do not sound familiar to an Italian ear by any means. Something in the back of our minds tells us that consonant clusters such as nvl, cpbq, grns or tfgz, or sequences like zuoq or rymdo, are not spontaneous in Italian; as a matter of fact, they do not occur in any Italian word. So the words we use in each language are not a product of chance: they obey rules which, in turn, they also contribute to set.
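For reference, here is a minimal R sketch of that purely random approach (see also note [4]); the seed and the word length are arbitrary choices of mine:

    # Purely random words: sample letters uniformly, with replacement
    set.seed(1)  # arbitrary seed, only for reproducibility
    random_word <- function(len = 7) {
      paste(sample(letters, len, replace = TRUE), collapse = "")
    }
    replicate(5, random_word())  # five random 7-letter strings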
Therefore, if we want to invent new words that sound like Italian words, we must first identify the rules governing the formation of phonemes in Italian (or in any other language). A statistical approach can be of great help here.
First of all, I built an Italian dictionary containing about 323,000 words collected from existing dictionaries and a few hundred free eBooks in Italian (novels, essays, etc.). [5] Analysing the composition of these words statistically, it appears quite clearly that each letter has a specific probability of being followed by another given letter. Likewise, each letter has a certain probability of being the first or the last letter of a word.
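As a rough sketch of how such frequencies might be computed, assuming the dictionary is stored as one lowercase word per line in a hypothetical file dizionario_it.txt (the file name and format are my assumptions, not the author’s actual dataset):

    # Load the word list (hypothetical file: one word per line, UTF-8)
    words <- tolower(readLines("dizionario_it.txt", encoding = "UTF-8"))

    # Empirical probability of each letter opening or closing a word
    first_letters <- substr(words, 1, 1)
    last_letters  <- substr(words, nchar(words), nchar(words))
    prop.table(table(first_letters))
    prop.table(table(last_letters))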
For example, let’s look at the letter p.
In Italian, the probability that the letter p is followed by the letter a is about 18%, about 14.5% that it is followed by r or e, and so on, down to a 0.2% probability of being followed by n (as in the word apnea) or a 0.01% probability of being followed by z (for example, in opzione); it is never followed by g or b. Perhaps this is the very reason why, intuitively, a word created by completely random sampling, like capbowqa above, does not seem Italian at all.
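Here is a hedged sketch of how one could check such figures, reusing the words vector from the previous snippet; the exact percentages will of course depend on the dictionary used:

    # Split each word into consecutive letter pairs (current letter, next letter)
    letter_pairs <- function(w) {
      chars <- strsplit(w, "")[[1]]
      if (length(chars) < 2) return(NULL)
      cbind(head(chars, -1), tail(chars, -1))
    }
    pairs <- do.call(rbind, lapply(words, letter_pairs))

    # Distribution of the letters that follow "p", in percent
    after_p <- pairs[pairs[, 1] == "p", 2]
    round(100 * sort(prop.table(table(after_p)), decreasing = TRUE), 2)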
So, if we treat each letter as a statistical event, the sequence of events in a word follows a probability chain similar to what statisticians call a Markov chain. [6]
Applying the same rationale to all the letters of the alphabet (including accented characters), we can build a matrix, specific to each language, that details the probabilities of transition from any letter to any other letter in the following position. For the Italian language, the transition matrix looks like the chart below, and it reads like this: choose a letter on the vertical axis; the colour of each cell in that row indicates the probability that this letter is followed by the corresponding letter on the horizontal axis. Grey indicates a probability equal to zero (i.e. the combination of letters does not exist in Italian), while the scale runs from light blue for a very low probability up to dark red for a high probability of transition. Those who wonder in which Italian words the letter é precedes the letters b or e… well, these are words of French origin commonly used in Italian, such as tournée or débâcle.
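Here is a rough sketch of how such a transition matrix and heat map could be built, reusing the pairs object from the snippet above; the explicit end-of-word marker "#" and the ggplot2 colour choices are my own assumptions, not necessarily those behind the original chart:

    library(ggplot2)

    # Add an explicit end-of-word marker so "followed by nothing" is counted too
    end_pairs <- cbind(substr(words, nchar(words), nchar(words)), "#")
    all_pairs <- rbind(pairs, end_pairs)

    # Row-normalised transition matrix: P(next letter | current letter)
    trans_counts <- table(from = all_pairs[, 1], to = all_pairs[, 2])
    trans_prob   <- prop.table(trans_counts, margin = 1)

    # Heat map: grey for impossible transitions, light to dark for rising probability
    df <- as.data.frame(trans_prob)
    df$Freq[df$Freq == 0] <- NA
    ggplot(df, aes(to, from, fill = Freq)) +
      geom_tile() +
      scale_fill_gradient(low = "lightblue", high = "darkred", na.value = "grey85") +
      labs(x = "following letter", y = "current letter", fill = "probability")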
We can thus see that in Italian the letter q is almost always followed by u; occasionally it is followed by another q (soqquadro, soqquadrare) or by nothing at all (i.e. it is the last letter of the word, as in Iraq). The letters ù and à are only ever followed by nothing, so they always end a word and never appear in the middle of one. The letter v is predominantly followed by vowels. The letter a, on the contrary, accepts almost any other letter in its vicinity, except accented vowels.
At this point, it only takes feeding an algorithm with this matrix of transition probabilities from one letter to the next to generate new words that sound Italian despite their total absence of meaning (a sketch of this generating step is given after the story below). I can now close this brief article with a personal touch, halfway between serious and facetious, by inventing a story made of many invented words that sound like Italian to Italian ears. A kind of… ear candy for Italian language lovers.
Ieri, passeggiando lungo la riva del flumattico, ho visto sul ramo di un fregirio solitario un sidri occupato a cinotolare il suo zantaro che roromava di piacere. Ma il gotriolo che feci avvicinandomi lo spinesò e scappò via, rapido e iasto, verso la cima pravata della collina. Scomparve presto dalla mia vista e mi rimase solo il faniaco di un untiolo raro nonché anche il dolinori al pensiero che questo ospruto gollitello fosse diventato così cutro nelle nostre senioli campagne. [7]
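As promised, here is a minimal sketch of one way that generating step could work, reusing trans_prob, words and the "#" end-of-word marker from the earlier snippets; it is an assumption of mine, not necessarily the exact algorithm in the downloadable code below:

    # Generate one pseudo-Italian word by walking the transition matrix:
    # start from a letter drawn from the first-letter distribution, then
    # repeatedly sample the next letter until the end-of-word marker appears
    generate_word <- function(trans_prob, start_probs, max_len = 12) {
      current <- sample(names(start_probs), 1, prob = as.numeric(start_probs))
      word <- current
      while (nchar(word) < max_len) {
        nxt <- sample(colnames(trans_prob), 1, prob = as.numeric(trans_prob[current, ]))
        if (nxt == "#") break  # end-of-word marker reached
        word <- paste0(word, nxt)
        current <- nxt
      }
      word
    }

    start_probs <- prop.table(table(substr(words, 1, 1)))
    replicate(10, generate_word(trans_prob, start_probs))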
To download the R code and the dataset, click here (850 KB).
Notes
[1] This document is the result of an analysis conducted by the author and exclusively reflects the author’s positions. It therefore does not involve in any way, directly or indirectly, any of the author’s employers, past or present.
[2] eMail: salvino [dot] salvaggio [at] gmail [dot] com
[3] This work was inspired by the blog of David Louapre, Science Etonnante – http://is.gd/zVTGEH
[4] In R: paste(sample(letters, 7, replace=TRUE), collapse="")
[5] Novels and essays sometimes contain words in other languages, but in such a small proportion that it does not really affect the overall figures. In any case, I manually cleaned the dictionary to some extent, removing the foreign words I spotted while flicking through it.
[6] To be precise and accurate, the Markov property states that the future event depends only on the current state of the system and not on the previous states. In linguistics this is not entirely true, since the probability of finding a letter at a certain position in a word does not depend only on the previous letter but also, although to a lesser extent, on the letters before it; that is why a doubled consonant is never tripled. More on Markov chains in R: http://is.gd/nyoFXN
[7] For the sake of transparency, it should be noted that with this basic version of the algorithm I had to generate hundreds of words in order to have a large enough pool from which to select those that seemed the most convincing.