Counting the Dead in Syria
by Joseph Rickert
This past June, the Human Rights Data Analysis Group (HRDAG), a San Francisco-based non-profit organization, released its report: “Updated Statistical Analysis of Documentation of Killings in the Syrian Arab Republic”. The report is grim reading, but it represents the necessary, rational work that needs to be done if the world is to understand the extent of the human suffering caused by the Syrian civil war. Based on a statistical analysis comparing eight different data sources, HRDAG estimates that there were 92,901 unique records of conflict-related deaths, “killings” in the language of the report, for the period between March 2011 and April 2013. The 92,901 figure is not an estimate of the total number of conflict-related deaths for the time period considered. It is only an estimate of the unique documented deaths, obtained after a sophisticated effort to understand and verify the raw source data.
The analysis was built on what must have been a labor-intensive effort that included verifying the integrity of the data, performing ordinary data cleaning, rationalizing the different data sets into a common format, translating the data into English, and then having a native speaker of Syrian Arabic who is also fluent in English manually classify 14,160 pairs of recorded deaths as either referring to the same killing or not. Thereafter, a supervised learning algorithm was used both to de-dupe (eliminate duplicates in each of the individual source data sets) and to perform record linkage (identify records that refer to the same killing across the different data sets) for the rest of the data.
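To make the last step concrete: once every candidate pair of records has been labeled "same killing" or "different", the unique deaths are the groups of records connected by those matches. The following is a minimal sketch in Python, not HRDAG's code. The record IDs and matched pairs are invented for illustration, and union-find clustering is an assumption here (one common way to merge pairwise decisions), not something the report describes.

```python
def count_unique(record_ids, matched_pairs):
    """Count groups of records linked by pairwise 'same killing' decisions."""
    # Every record starts out as its own group.
    parent = {r: r for r in record_ids}

    def find(r):
        # Walk up to the group representative, shortening the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    # Merge the groups of any two records classified as the same killing.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    return len({find(r) for r in record_ids})


# Hypothetical records: the letter stands for the source, the digit for a row.
records = ["A1", "A2", "B1", "B2", "C1"]
# A1/A2 is a duplicate within source A (de-duping);
# A2/B1 is a match across sources A and B (record linkage).
matches = [("A1", "A2"), ("A2", "B1")]

unique_deaths = count_unique(records, matches)
print(unique_deaths)
```

Treating matches as transitive (A1 matches A2, A2 matches B1, so all three are one death) is a design choice of this sketch; a real pipeline would have to decide how to handle chains of weak matches.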
For the data not included in the training set, HRDAG statisticians used the Alternating Decision Tree algorithm to classify pairs of records as either being the same or not. (There does not appear to be a package with an R implementation of this algorithm. However, on the togaware site, Graham Williams provides some code to access the WEKA implementation through the RWeka package.)
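For readers who have not met alternating decision trees, the sketch below shows only how an already-fitted tree scores one pair of records; it does not show how the tree is learned by boosting. It is written in Python for brevity. The feature names (`name_similarity`, `same_governorate`, `days_apart`), the thresholds, and all prediction values are hypothetical and do not come from the HRDAG model. The defining feature is that a record pair follows every splitter attached to a prediction node it reaches, and its score is the sum of all prediction values along the way; the sign of the sum gives the class.

```python
def adt_score(node, x):
    """Sum the prediction values of every prediction node that x reaches."""
    total = node["value"]
    # Unlike an ordinary decision tree, ALL splitters under a node are followed.
    for splitter in node.get("splitters", []):
        branch = "yes" if splitter["test"](x) else "no"
        total += adt_score(splitter[branch], x)
    return total


# A hand-built, purely illustrative tree (all numbers are invented).
tree = {
    "value": -0.5,  # root prediction: most candidate pairs are not matches
    "splitters": [
        {
            "test": lambda x: x["name_similarity"] > 0.9,
            "yes": {
                "value": 1.25,
                "splitters": [
                    {
                        "test": lambda x: x["same_governorate"],
                        "yes": {"value": 0.5},
                        "no": {"value": -0.25},
                    }
                ],
            },
            "no": {"value": -1.0},
        },
        {
            "test": lambda x: x["days_apart"] <= 2,
            "yes": {"value": 0.75},
            "no": {"value": -0.75},
        },
    ],
}

likely_match = {"name_similarity": 0.95, "same_governorate": True, "days_apart": 1}
unlikely_match = {"name_similarity": 0.40, "same_governorate": False, "days_apart": 30}

for pair in (likely_match, unlikely_match):
    score = adt_score(tree, pair)
    label = "same killing" if score > 0 else "different killings"
    print(score, label)
```

The magnitude of the score can be read as a rough confidence, which is one reason this kind of model suits record linkage: borderline pairs can be flagged for human review.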
The report stops short of the next obvious step, estimating the number of undocumented killings. However, it does offer a possible approach to the problem. Figure 1 from the report (see below) shows, for each month of the time period considered, the number of killings that were documented by 1 to 5 sources.
The colors on each bar show the proportions contributed by each of these five data sources. (For example, from August 2012 onward a large proportion of the killings were documented by either 1 or 5 sources, and for most of the time period considered, there appears to be quite a bit of duplication.) The authors observe that in any particular month, some killings are documented by all 5 sources, others by 4 sources, and so on. So, how many killings were documented by zero groups? Whether or not answering this question leads to a viable estimate of the undocumented killings, it does represent a brilliant insight, turning the messy duplication problem into a possible approach to a solution.
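The simplest textbook version of this idea is the two-list capture-recapture (Lincoln-Petersen) estimator: if two sources document deaths independently, the size of their overlap tells you how much both are missing. The Python sketch below is an illustration only, not a method taken from the report. All counts are invented, and the independence assumption is a strong one that is unlikely to hold for documentation of killings.

```python
def lincoln_petersen(n_a, n_b, n_both):
    """Estimate total population size from two overlapping lists.

    n_a    -- deaths documented by source A
    n_b    -- deaths documented by source B
    n_both -- deaths documented by both A and B
    Assumes the two lists are independent samples of the same population.
    """
    if n_both == 0:
        raise ValueError("no overlap between lists: estimate is undefined")
    return n_a * n_b / n_both


# Hypothetical counts for a single month.
n_a, n_b, n_both = 600, 500, 300

documented = n_a + n_b - n_both                # seen by at least one source
estimated_total = lincoln_petersen(n_a, n_b, n_both)
undocumented = estimated_total - documented    # seen by zero sources

print(documented, estimated_total, undocumented)
```

With more than two sources, and with sources that are more likely to record the same kinds of victims, the two-list formula no longer applies directly and more elaborate models are needed.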
There are multiple levels of challenges associated with this project: the dangers involved in collecting data, intractable problems with the data itself, reporting biases, and the futility of trying to establish accurate counts while the conflict continues. Regarding these, the authors comment:
The enumeration provided in this report, 92,901, is the most accurate accounting available based on identifiable victims reported by these eight groups. However, many victims are not yet included in these databases, and the excluded victims may be systematically different from the victims who are recorded. Well-known individuals who are victims of very public acts of violence, and victims who are killed in large groups tend to attract public attention, and they are therefore likely to be reported to one or more of these sources. By contrast, single individuals killed quietly in a remote corner of the country tend to be overlooked by media and documentation projects.
And yet, it seems to me, this HRDAG project is important. It is a cliche to remark that because of a tragic event someone has become “just a statistic”. The unrecorded dead, however, are not even afforded this status. We are all diminished when the dead go uncounted.