Anzacathon: the benefits of Open Data
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
25 April is Anzac Day in Australia, New Zealand and many communities around the world where Anzacs have served. Given the shift to online collaboration in 2020, Anzacathon has been set up to help people engage online.
One of the key themes of Anzacathon is data: finding new ways to use the data and also demonstrating the benefits of community engagement with open data.
With that in mind, I’m providing some tutorials for users of PostgreSQL and R to access the data and some simple examples to use it.
Data access over IPFS
The raw data files are shared over IPFS. They are in SQLite format, as this provides a convenient mechanism to query the data with SQL commands directly over IPFS.
The pgloader tool provides a convenient way to load SQLite databases directly from IPFS into a PostgreSQL schema. There is no need to download the SQLite files, we can simply mount IPFS like a filesystem. A sample configuration for PostgreSQL users is provided.
For R users, there is an RSQLite module that allows R to access SQLite data. Once again, there is no need to download the data, any file in IPFS can be opened directly using fuse.
R users who want to use this method first need to follow the first part of the PostgreSQL setup instructions, up to the point where you start the IPFS daemon process.
Finding the lone Anzacs
Every year, there are huge gatherings at Gallipoli and other sites where large numbers of Anzacs died. Due to Coronavirus, those gatherings won’t take place in 2020.
Therefore, we have the opportunity to discover lesser known places where Anzacs have died. In one place I discovered in the French alps, there is a single Australian airman buried alongside his British colleagues. After a quick search of the National Archives and a hike up the mountain with my camera, I created a visual story about their mystery.
How many more cemeteries in France and other places contain a single Anzac like this? We can’t ask that question through any of the web sites but it is very easy with SQLite or PostgreSQL:
SELECT cemeterymemorial, country, COUNT(cemeterymemorial) FROM cwgc_casualty GROUP BY cemeterymemorial, country HAVING COUNT(cemeterymemorial) = 1;
The query lists 1,143 cemeteries around the world having a single Anzac.
We could further refine that to focus on a single country, such as France:
SELECT cemeterymemorial, country, COUNT(cemeterymemorial) FROM cwgc_casualty WHERE country = 'France' GROUP BY cemeterymemorial, country HAVING COUNT(cemeterymemorial) = 1;
Of 1,143 cemeteries around the world, 192 are in France.
Using a subquery, we can see the names, services numbers and dates when those Anzacs died:
SELECT cemeterymemorial, country, forename, surname, TRIM(servicenumberexport, '''') as servicenumber, date_of_death, date_of_death2 FROM cwgc_casualty WHERE cemeterymemorial IN ( SELECT cemeterymemorial FROM cwgc_casualty GROUP BY cemeterymemorial HAVING COUNT(cemeterymemorial) = 1);
For convenience, I’m providing that list as a spreadsheet that you can download and explore.
With that list, it is fairly easy to get the name and service number of the relevant Anzac and look up the scanned copy of his paper file at the National Archives. Given the unique circumstances of these casualties, their files often contain something notable. For example, some of them were on particularly dangerous missions deep behind enemy lines during the time France was occupied.
Anzacs within a given radius (PostgreSQL / PostGIS)
The CWGC web site allows people to search for Anzacs by specificying regions, such as the department within France. There are 95 departments, far more than the 6 states in Australia. It can be a lot more convenient to search by distance/radius from of a point where you live or plan to travel.
Fortunately, we have the PostGIS extension. The PostGIS FAQ includes a specific example, the best way to find all objects within a radius of another object, using the ST_DWithin function.
The PostgreSQL setup documentation includes the necessary code to enable PostGIS and add extra columns to the tables containing the PostGIS objects encapsulating latitude ang longitude values.
This is a basic example of how to use ST_DWithin to obtain sites within a specified radius, returning the distances in kilometers:
SELECT cemetery_desc, ROUND(ST_Distance(location_gis, ('SRID=4326;POINT(46.5535 6.6523)')::geography)/1000) AS d FROM iwmcemeteries WHERE ST_DWithin(location_gis,('SRID=4326;POINT(46.5535 6.6523)')::geography, 100000) ORDER BY d;
The schema also includes a view, anzac_sites, that uses the SQL UNION mechanism to concatenate both the CWGC and TracesOfWar tables. We can access results from both tables using a single query like this:
SELECT source, description, ROUND(ST_Distance(location_gis, ('SRID=4326;POINT(46.5535 6.6523)')::geography)/1000) AS d FROM anzac_sites WHERE ST_DWithin(location_gis,('SRID=4326;POINT(46.5535 6.6523)')::geography, 100000) ORDER BY d;
Anzac family names (using R)
To begin, it is necessary to install the R modules RSQLite and Plyr. On a Debian system, that can be done with:
apt install r-cran-rsqlite r-cran-plyr
and on any other type of system:
install.packages("RSQLite") install.packages("plyr")
As discussed above, make sure that the IPFS daemon is running. See the first part of the PostgreSQL setup instructions.
Start the R command line and from there, it is possible to verify you have access to the data over IPFS:
library(DBI) cwgc_cemeteries <- dbConnect(RSQLite::SQLite(), "/ipfs/QmRgD8xHJXKGwE1S1YySUKQGt25EK8R2W1HTCCEA7sKDLy") dbListTables(cwgc_cemeteries) cwgc_casualty <- dbConnect(RSQLite::SQLite(), "/ipfs/QmPVkHJrSYoeig71EzrxU45zTMbFefQgwdRDSHv7fpjChA") dbListTables(cwgc_casualty)
To verify this, we can load the entire content of one table into a data frame count the frequency of surnames:
data <- dbReadTable(cwgc_casualty, "cwgc_casualty") summary(data) library(plyr) fd = count(data, 'surname') index <- with(fd, order(freq, surname)) tail(fd[index,])
The results will look something like this:
surname freq 18130 TAYLOR 467 19939 WILSON 549 9558 JONES 590 2303 BROWN 628 19901 WILLIAMS 635 17182 SMITH 1376
While this may be a trivial example, it demonstrates that we can use datasets from the IPFS cloud directly in R.
Making data personal
A British police chief suggested on 30 March that messages about flattening the curve were not getting through to many people. Five days later, the British PM was admitted to hospital with Coronavirus. Most of Britain spent the next week intensely following the news as he went in and out of intensive care. This example demonstrates the power of personal stories and examples over statistics and charts.
Using the tools described above, we can examine this data set at scale and also hone in on personal stories and acute examples of tragedy. Reading through the National Archives, there are plenty of personal letters from parents, spouses and children of missing Anzacs. It is particularly important to be respectful with a data set like this.
Why not attack Coronavirus this weekend?
The EUvsVirus hackathon takes place the same weekend as Anzac Day. Some people asked me why I don't put energy into that instead.
There are many answers to that question.
One that is on the top of my mind is that I like to finish things. Together with a dedicated group of volunteers in Kosovo, we had started looking at this data in 2019 and I feel this is a great way to take what I've learnt and hand it off to the crowd.
There have already been a number of dedicated events like the Bio-hackathon. For people unfamiliar with the science, it can be difficult to simultaneously learn about data science and bioinformatics. The Anzac data set provides an opportunity to boost skills with data science while working with an important data set that many people can already understand.
Innovations from this hackathon, such as the use of IPFS to share data sets between participants are directly transferrable to bio-hackathons and many other use cases.
Initial directions for studying the data
Here are some thoughts that come to mind:
- Which of these casualties were buried at a time and place behind enemy lines? There are stories of French citizens taking great risks to give allied soldiers a proper funeral.
- The notes often refer to their place of birth or enlistment. Can we build a report by place of birth / enlistment?
- How can we systematically gather links to third party documents and news reports and link them to the relevant diggers? This could be a Natural Language Processing (NLP) problem.
- Using the keywords extracted from Traces of War, how can we identify monuments relevant to Anzacs or in close proximity to Anzac graves?
- A classification problem: examining the descriptions in the Traces of War data set to identify whether they correspond to infrastructure (such as a fort) or to a monument
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.