International Talk Like A Pirate Day
Q: What has 8 legs, 8 arms and 8 eyes?
A: 8 pirates.
Avast, ye scurvy scum! Today be September 19, the International Talk Like A Pirate Day. While it’s a silly “holiday”, it’s a great chance to switch your Facebook to Pirate English (“Arr, this be pleasing to me eye!”). Also, it’s as good a reason as any to go sailing the Internet Main, scouring for booty (data)!
Well, any excuse, y’know…
What is known today (more or less) as “Pirate English” (the bit with all the “arrr!”, “ahoy, ye matey!”, etc.) is in fact the accent employed by Robert Newton, the actor who played Long John Silver in the 1950 film Treasure Island. He speaks with a Bristol accent, and Bristol was in fact the main trading port with the West Indies. So the accent we associate with pirates today could well have been historical!
More useless information on pirates (for instance, there is no historical evidence of a pirate ever owning a parrot as a pet; again, this stereotype evolved from the Treasure Island movie) can be found in that ever-wonderful treasure trove of superfluous information, the QI page on pirates.
Data
First, let’s get some data on pirates. I have hazy memories of seeing data on historical shipping and piracy somewhere a while ago, but I cannot find it any more. I must be getting older. So, I’ll do what any self-respecting young whippersnapper would do and grab data from Wikipedia. There is a list of famous pirates; it’s biased, of course. I see relatively few entries on Barbary pirates, for instance. No matter, the list covers most of the Golden Age of Piracy in the Caribbean, and I’ll use it to generate a dataset on these pirates.
We’ll grab the wiki page, linking to the static ID of the page for reproducibility.
We then use purrr::map_dfr to grab all tables in the HTML, push them into a single dataframe, and do some basic cleaning. Man, I hate Windows; utf-8 and Windows still in 2017 is a major pain in the stern.
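The original code chunk is not reproduced here, so what follows is only a minimal sketch of the pattern, assuming the rvest and purrr packages (map_dfr also needs dplyr installed). It parses a small made-up HTML string instead of the real Wikipedia page so that it runs offline; the table contents and the helper to_character() are illustrative, not taken from the post.

```r
library(rvest)
library(purrr)

# Made-up stand-in for the Wikipedia page: two tables with the same columns.
html <- '<table>
  <tr><th>Name</th><th>Years active</th></tr>
  <tr><td>Pirate A</td><td>1716-1718</td></tr>
  <tr><td>Pirate B</td><td>1720</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Years active</th></tr>
  <tr><td>Pirate C</td><td>1807</td></tr>
</table>'

page <- read_html(html)

# html_table() on a set of <table> nodes returns a list, one tibble per table.
tables <- html_table(html_elements(page, "table"))

# Hypothetical helper: coerce every column to character, so tables whose
# columns were parsed as different types can still be bound together.
to_character <- function(tbl) {
  tbl[] <- lapply(tbl, as.character)
  tbl
}

# map_dfr() applies the helper to each table and binds the results by row.
pirates_raw <- map_dfr(tables, to_character)
pirates_raw
```

Coercing to character first is a deliberate choice: the second table holds a lone year that would otherwise be parsed as an integer and clash with the "yyyy-yyyy" strings of the first.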
With the raw data pulled, we now need to clean it a bit. Our goal is to generate a data set of the “pirating times” of each individual pirate. Wiki lists the active times, the “flourished” times, and/or the life times. We’ll split the year columns (usually yyyy-yyyy) into two, providing some padding if only one of the dates is available. After this, we’ll convert the columns to integers, ignoring the rows which did not provide accurate enough year data (e.g. “17th century”). This is a fun project, not a scientific one.
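As a rough illustration of this cleaning step, here is a base R sketch. The input strings are made up, and it assumes that the “padding” means reusing a lone year as both start and end; the post's actual rule may differ.

```r
# Hypothetical raw values as they might appear in a "years active" column.
years_raw <- c("1716-1718", "1720", "17th century", "c. 1650-1655")

# Pull out every four-digit year in each string.
year_matches <- regmatches(years_raw, gregexpr("[0-9]{4}", years_raw))

# First year found is the start, last year found is the end. A single year
# fills both (assumed padding rule); no year at all gives NA.
first_or_na <- function(y) if (length(y)) as.integer(y[1]) else NA_integer_
last_or_na <- function(y) if (length(y)) as.integer(y[length(y)]) else NA_integer_

pirate_years <- data.frame(
  raw = years_raw,
  start = vapply(year_matches, first_or_na, integer(1)),
  end = vapply(year_matches, last_or_na, integer(1))
)

# Drop rows without a usable year, such as "17th century".
pirate_years <- pirate_years[!is.na(pirate_years$start), ]
pirate_years
```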
With the dataset now sufficiently cleaned up, we will reduce the dates to decades only (so 1643 becomes 1640) and graph our data by decade. We’ll use some rowwise() %>% do() magic to generate one row per pirate per activity decade (so two rows if he was active from 1648 to 1651). We then add up the pirates active in each decade to get our visualisation of pirate activity. Note that some pirates were active for far longer than others!
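The same expansion can be sketched in base R. This is not the rowwise() %>% do() code from the post, just an equivalent on made-up data; the helper name to_decade() is hypothetical.

```r
# Made-up pirates with start and end years of activity.
pirates <- data.frame(
  name = c("Pirate A", "Pirate B", "Pirate C"),
  start = c(1648L, 1643L, 1650L),
  end = c(1651L, 1643L, 1672L)
)

# Round a year down to its decade, e.g. 1643 -> 1640.
to_decade <- function(year) (year %/% 10L) * 10L

# One row per pirate per decade of activity.
per_decade <- do.call(rbind, lapply(seq_len(nrow(pirates)), function(i) {
  data.frame(
    name = pirates$name[i],
    decade = seq(to_decade(pirates$start[i]), to_decade(pirates$end[i]), by = 10L)
  )
}))

# Number of pirates active in each decade.
active <- table(per_decade$decade)
active
```

Pirate A (1648 to 1651) contributes two rows, one for the 1640s and one for the 1650s, matching the example in the text.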
With the data now set up, we can set out to graph the pirate activity, and then also to build a random pirate generator!
Pirates in Action
Let’s take a look at the countries these famous pirates came from:
Country.of.origin | n |
---|---|
England | 108 |
Netherlands | 47 |
France | 39 |
Unknown | 25 |
Colonial America | 22 |
United States | 19 |
Spain | 10 |
China | 9 |
Venezuela | 8 |
Germany | 7 |
Ireland | 7 |
Wales | 7 |
Scotland | 6 |
Ottoman Empire | 4 |
Portugal | 4 |
England is over-represented by quite a large margin, especially if you group Wales/Scotland/Ireland, and potentially Colonial America. The Netherlands and France come in second and third place, respectively. No big surprises here.
We’ll pull the data on pirate activity to build a graph of the overall pirate activity over the decades. This is a rather straightforward bar chart:
One can nicely see the increase in pirate activity during the Golden Age (roughly 1560-1700). It drops briefly around 1700 and builds up again shortly thereafter. The middle of the 18th century is more or less free from pirates (at least those infamous enough to end up on Wikipedia), until piracy picks up again around 1800 with the English-French wars. The Somali pirates of the 2000s are the last on the graph!
This leads us to the next question: Somalia had its heyday of piracy in the 2000s, but we cannot reasonably expect the same of the 1600s. When was each country’s individual golden age of pirating?
To answer this, we’ll build a ridgeplot (formerly also known as a joyplot, from the Joy Division band t-shirt; the name is now discouraged for obvious reasons), showcasing the density plots of pirate activity per country.
The “usual suspects” show up nicely: England, the Netherlands, France. It’s interesting to see Germany show up so early (around 1350-1400). These are mostly the pirates of the Hansa. Venezuela also had a “late blooming” pirate life, mostly in the 19th century. Interesting, I had not known about that!
A Pirate’s Life For Me!
With the historical data on pirates and their activity times, we can also try to generate a random pirate story! For this, we will restrict the data to the golden age of piracy, dropping all pirates who started their pirate life before 1500 or after 1750. We want to create a random pirate with a statistically “correct” starting year of piracy, a statistically “correct” piracy tenure, and varnish him (or her) with a random pirate name!
Golden Age Pirate: starting piracy year
Let’s start with the pirating starting activity year.
Unfortunately, it’s a heavily left-skewed distribution, and after some hours of looking into distributions I couldn’t find a good fit, so we’ll have to resort to pulling from the empirical distribution (i.e., the data itself) to generate a random start year for a pirate:
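Drawing from the empirical distribution simply means resampling the observed values with replacement. A minimal sketch, with made-up start years standing in for the real data and a hypothetical helper name:

```r
# Made-up start years standing in for the cleaned Wikipedia data.
start_years <- c(1603L, 1610L, 1650L, 1657L, 1657L, 1690L, 1716L, 1718L, 1718L, 1720L)

# Hypothetical helper: draw n start years by resampling the observed values.
# Sampling indices (rather than the values themselves) avoids the surprise
# that sample(x, ...) treats a single number x as 1:x.
random_start_year <- function(n = 1, observed = start_years) {
  observed[sample.int(length(observed), size = n, replace = TRUE)]
}

set.seed(19)
drawn <- random_start_year(5)
drawn
```

Years that occur more often in the data are drawn more often, which is exactly the “statistically correct” behaviour we are after.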
Golden Age Pirate: piracy tenure
We have better luck with the tenure of a pirate: the distribution follows a nice count-data distribution, I’m guessing a negative binomial.
As it’s discrete data, we’ll use the negative binomial distribution to estimate the theoretical distribution of a pirate’s tenure using library(fitdistrplus).
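A sketch of what such a fit can look like, assuming the fitdistrplus package is installed. The tenures here are simulated with arbitrary parameters, so the estimates will not match the ones from the real data.

```r
library(fitdistrplus)

# Simulated tenures (in years) standing in for the real data; the parameters
# used for the simulation are arbitrary.
set.seed(2017)
tenure <- rnbinom(500, size = 0.4, mu = 6)

# Fit a negative binomial by maximum likelihood.
fit <- fitdist(tenure, "nbinom")

# Named vector of estimates with elements "size" and "mu".
fit$estimate
```

Calling plot(fit) afterwards compares the empirical and the fitted distribution visually.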
The results are excellent: we can clearly see the oversampling of “round” years (10, in particular), which comes from the fact that with uncertain dates, Wikipedia noted a pirate’s activity in decades, rather than actual years. But still, our distribution of rnbinom(x, size = 0.3876683, mu = 6.395578) works nicely.
We can thus draw a random tenure for a pirate by using this function:
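A minimal version of such a function, using the parameters quoted above; the function name random_tenure() is an assumption, as the original chunk is not shown.

```r
# Parameters of the fitted negative binomial, as quoted in the text.
nb_size <- 0.3876683
nb_mu <- 6.395578

# Hypothetical helper: draw n tenures (in years) from the fitted distribution.
random_tenure <- function(n = 1) {
  rnbinom(n, size = nb_size, mu = nb_mu)
}

set.seed(19)
tenures <- random_tenure(5)
tenures
```

A tenure of 0 is a perfectly valid draw; it yields a pirate whose career starts and ends in the same year.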
Golden Age Pirate: silly name generator
All that is missing now is a random pirate name generator. Fantasy Name Generators has a nice collection of name generators, and a pirate one is included! The site allows the use of the names “to name anything in any of your own works”, so let’s use them here!
We source the names into R and create a simple function pulling a random name, allowing you to specify male or female.
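The function could look roughly like this. The name pools below are placeholders copied from the sample output at the end of this post, not the full lists, and the function names are assumptions.

```r
# Placeholder name pools; the real lists are much longer.
first_names_male <- c("Gerard", "Orton", "Ascot")
first_names_female <- c("Ethyl", "Christine", "Huldah")
nicknames <- c("The Fierce", "Grommet", "Shadow")
last_names <- c("Eastoft", "Drachen", "Artemis")

# Pick a single random element of a vector.
pick_one <- function(x) x[sample.int(length(x), 1)]

# Hypothetical helper: build "First 'Nickname' Last" for a male or female pirate.
random_pirate_name <- function(female = FALSE) {
  first <- if (female) pick_one(first_names_female) else pick_one(first_names_male)
  sprintf("%s '%s' %s", first, pick_one(nicknames), pick_one(last_names))
}

set.seed(19)
random_pirate_name(female = TRUE)
```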
Golden Age Pirate: a pirate’s story!
With all the pieces set up, we can now generate a random pirate:
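Here is a self-contained sketch of how the pieces fit together. The start years and name pools are placeholders (the names are copied from the sample output), the function names are assumptions, and only the negative binomial parameters come from the fit above.

```r
# Placeholder inputs; in the post these come from the Wikipedia data and the
# sourced name lists.
start_years <- c(1603L, 1610L, 1650L, 1657L, 1718L)
first_names <- list(
  male = c("Gerard", "Orton", "Ascot"),
  female = c("Ethyl", "Christine", "Huldah")
)
nicknames <- c("The Fierce", "Grommet", "Shadow")
last_names <- c("Eastoft", "Drachen", "Artemis")

# Pick a single random element of a vector.
pick_one <- function(x) x[sample.int(length(x), 1)]

# Hypothetical generator: empirical start year, negative binomial tenure,
# random name, all glued into a one-row data frame.
random_pirate <- function(female = FALSE) {
  start <- pick_one(start_years)
  end <- start + rnbinom(1, size = 0.3876683, mu = 6.395578)
  first <- pick_one(first_names[[if (female) "female" else "male"]])
  story <- sprintf(
    "Your name is %s '%s' %s and you roamed the Caribbean from %d to %d. Arrr! \u2620",
    first, pick_one(nicknames), pick_one(last_names), start, end
  )
  data.frame(female = female, story = story)
}

set.seed(19)
stories <- do.call(rbind, lapply(c(FALSE, FALSE, TRUE), random_pirate))
stories
```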
\u2620 is the Unicode escape for the skull-and-crossbones symbol. Calling the function several times then results in:
female | story |
---|---|
FALSE | Your name is Gerard ‘The Fierce’ Eastoft and you roamed the Caribbean from 1603 to 1605. Arrr! ☠ |
FALSE | Your name is Orton ‘Grommet’ Drachen and you roamed the Caribbean from 1657 to 1657. Arrr! ☠ |
FALSE | Your name is Ascot ‘Shadow’ Artemis and you roamed the Caribbean from 1650 to 1652. Arrr! ☠ |
TRUE | Your name is Ethyl ‘Temptress’ Mitchell and you roamed the Caribbean from 1610 to 1625. Arrr! ☠ |
TRUE | Your name is Christine ‘Daffy’ Ward and you roamed the Caribbean from 1657 to 1657. Arrr! ☠ |
TRUE | Your name is Huldah ‘Twitching’ Vome and you roamed the Caribbean from 1718 to 1718. Arrr! ☠ |
Concluding remarks
The code is available on github, as always.
I wish I could find those historical shipping and piracy data again – any hints or pointers are very much appreciated!
Talk like a pirate day 2017 was originally published by Kirill Pomogajko at Opiate for the masses on September 19, 2017.