Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
So I’ve been holidaying in Sri Lanka for a while now, and thought I would do a brief blog post using some public tourism data. Tourism has grown rapidly in Sri Lanka since the civil war finished in 2009. Travelling around, it’s easy to see why. Sri Lanka excels in the range of attractions it has to offer visitors. These include beaches, history and ruins (from ancient times right down to facilities from the civil war), forests, wildlife, and an interesting contemporary cultural and religious mix around the island.
Sri Lanka has eight UNESCO world heritage sites, mostly relating to history and culture, but with some interesting biosphere / environmental locations too.
Arrivals data basics
The Sri Lanka Tourism Development Authority publishes a range of statistical and research products.
As with other national tourism statistical systems in a controlled-border situation, the base of the tourism data comes from counting visitors crossing the border. All people entering Sri Lanka complete a physical arrival card with basic information. Aggregate results are published promptly in monthly reports.
Past years’ detailed data are available only in awkward PDF documents that I wasn’t able to efficiently scrape in the time available (I’m on holiday, ok?). However, detailed data from January 2018 is being published in Excel, and from 2014 to 2017 the summary data is easy enough to copy into a spreadsheet.
Data on “days spent in the country” (combining arrivals and departures, sometimes known as “stay days” or “total nights”) would be a better measure but are harder to deliver. They aren’t readily available.
Here’s what the main story looks like up to August 2018, including a forecast out for another 18 months:
That forecast is done with auto.arima
from Hyndman’s forecast
R package, using a logarithm transform because the variance increases roughly proportional to the average number of visitors. The code for today’s post, which is all unremarkable, is in one chunk at the bottom of the post.
Here’s a different look at the seasonality and growth, putting month of the year on the X axis and giving a separate line to each year’s arrivals figures:
This visualization is better at showing the two peak periods of tourism per year. These two seasons are partly driven by the weather – Sri Lanka experiences two monsoons per year which impact differently on different parts of the island – and partly by holidays and other seasonal features in the origin market countries.
These images also suggest (to me) a slowing in arrivals growth since 2015, but without a longer time series (at least back to the end of the war) it’s difficult to know how substantive that would be.
Seasonality and country of origin
I was interested to see which were the main market countries for Sri Lankan tourism, and if seasonality differed by country of residence. For this, the only data convenient for me was the Excel workbook with just the first eight months of 2018’s arrival numbers by country. The missing last four months of the year create an obvious limitation, but we’ve at least got a starting point.
Here’s a chart showing the top 15 countries:
Unsurprisingly, neighbouring India is both the largest single source of visitors (which includes purposes of visit other than holiday making, such as visiting family, medical or religious purposes), and has a slightly different seasonal mix from other markets. Indian visitors spiked in May 2018 for a reason I don’t know but would doubtless be clear to someone more familiar with the situation. However, May is a low period for other key countries such as the UK and Australia, as well as the “other” category that holds all unlisted countries.
Here’s a line chart that converts each country’s monthly arrivals into a percentage of the total arrivals from that country in the first eight months of 2018:
The story is the same, but perhaps a little clearer particularly for the countries towards the bottom of the chart. We see very different seasonality patterns for visitors from Saudi Arabia compared to Russia and Ukraine, for example. A good reminder that climate in the country of origin can matter as much as in the destination country!
Finally, here’s a more complex graphic that tries to show the results of principal component analysis, seeking to summarise similarities or differences between the various source countries. The principal components analysis behind this biplot uses as source data the proportions calculated in the step above as the basis for a matrix with country of residence as rows, and months as columns.
This chart shows, for example, that (based on just eight months’ data) visitor arrivals from Saudi Arabia and Australia have very different seasonal patterns, whereas Ukraine and Russia are very similar. This is useful for analytical and exploratory purposes, but my track record in being able to explain such images to non-statistical audiences (or anyone, in fact) is challenged so I wouldn’t necessarily recommend it for broad use.
R code
Here’s the code behind today’s post:
Finally, here’s the view from the guesthouse where I’m writing this, in the hills above Kandy in central Sri Lanka:
Nice.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.