Identifying data gaps with R


Summary: In this post, Giles outlines a simple approach that can be used to identify potential ‘data gaps’ using R. Data gaps are assumed to denote the difference between the data required to achieve an outcome and the data currently available. To demonstrate how the methodology might be applied, the availability of data for Tanzania is compared against a set of geographic and economic peers.

Packages: wbstats, tidyverse and kableExtra.

Data: Data used in this post have been sourced using the wbstats R package.

Introduction

I first came across the idea of ‘data gap mapping’ when a client engaged our team to develop a composite index to improve how they monitored national economic, political and social risks.

In most cases, a ‘data gap mapping’ tries to compare what data is needed with what’s available. This can reveal not only where additional data is needed, but also the shortcomings of the data available to us – whether in quality, quantity, frequency, or granularity. Aside from highlighting potential blind spots that might emerge in our analysis, it can also show where extra effort might be needed to fill any data gaps that are identified.
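
As a toy sketch of the idea (the indicator names below are invented purely for illustration), the simplest possible gap mapping is just a set difference between what’s needed and what’s on hand:

#a toy example: hypothetical lists of required vs available indicators
required  <- c("gdp_growth", "inflation", "press_freedom", "protest_counts")
available <- c("gdp_growth", "inflation")

#the 'data gap': indicators we still need to source
setdiff(required, available)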

Setting aside the controversies surrounding the use (and abuse)1 of composite indices, for us the gaps analysis required comparing the characteristics of the data available to us2 with what’s needed to build the client’s ‘ideal’ index. Unfortunately, because we needed to complete the analysis before knowing what the index would look like, we had to find a reasonable substitute.

An Outline of the Approach

Because we had a sense of the themes the index should cover, we could determine the data available to us by reviewing national and international sources relevant to each theme. For instance, knowing that measures of political and democratic stability were likely to be important, we reviewed every national and international data source for indicators that might be suitable for the index. We then repeated this across each theme to produce a categorized database of indicators and corresponding metadata.

To proxy what data we needed to build the index, we decided to compare the availability of data for the country we were building the index on with its peers. This approach had the advantage of highlighting themes where data may be particularly scarce when designing and building the index, even when taking account of the statistical capacity of the focus country.

If the availability of data for our focus country is lower than its peers, it might indicate that there are limited options available for covering the theme in an index, or that more time will be needed to find data sources to fill these gaps. Knowing where data is scarce would also be helpful during stakeholder consultations, so requests for input can be focused on themes where data might be hard to come by.

To illustrate the approach, we’ll use Tanzania as our focus country.3

Data We Have

To determine the data available to us, we manually searched for databases focusing on indicators relevant to measuring and/or proxying political, economic and social risks. For instance, the International Monetary Fund’s International Financial Statistics database might be used as a source for economic and financial statistics, whereas UNdata might be used as a source for social statistics.

For each data source, we then reviewed the availability of indicators for the focus country and its peers. To get a sense of the quality and quantity of each indicator we also collected metadata, such as the name of the indicator, its publication frequency, the period over which data is available and its primary source.
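
For World Bank sources, much of this metadata is already exposed by wbstats. As a minimal sketch (assuming the columns returned by wb_indicators(); the period over which data is available still has to be derived from the data itself):

#inspect the metadata wbstats exposes for each indicator
library(wbstats)
library(dplyr)

wb_indicators() |>
  select(indicator_id, indicator, indicator_desc, source_org, source) |>
  head(3)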

Although our full assessment covered 250+ databases, we’ll keep the example simple by focusing on data that can be accessed via the World Bank’s API (via the wbstats package). Because the wb_data() function can sometimes fail when an indicator isn’t available, we will limit the analysis to the World Development Indicators (source_id=2).
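
If you do need to query indicators beyond the World Development Indicators, one way to hedge against these failures is to download each indicator separately inside tryCatch() and skip anything that errors. A sketch of the idea (safe_wb_data() is a hypothetical helper, not part of wbstats):

#illustrative helper: download indicators one at a time, skipping failures
safe_wb_data <- function(ids, countries) {
  purrr::map(ids, \(id) {
    tryCatch(
      wb_data(country = countries, indicator = id, return_wide = FALSE),
      error = function(e) NULL  #an indicator that can't be retrieved returns NULL
    )
  }) |>
    purrr::compact() |>  #drop the failed (NULL) downloads
    dplyr::bind_rows()
}

With that caveat noted, we can load the packages and draw a sample of indicators: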

Loading packages and selecting sample data

#load relevant packages
library(wbstats)
library(tidyverse)

#get indicator metadata
#(We've focused on the World Development Indicators as they're more reliably available)
ref_indicators <- wb_indicators() |>
  filter(source_id == 2)

#create a list of indicator IDs by topic
#(Note: Some indicators are assigned to multiple topics)
ref_dta_topics <- ref_indicators |>
  select(indicator_id, topics) |>
  unnest(topics) |>
  rename(topic = value)

#select a random sample of indicators for each topic
#(set seed for reproducibility)
set.seed(321)

#select sample of indicators across topics
ref_selected_indicators <- ref_dta_topics |>
  group_by(topic) |>
  slice_sample(n = 3)
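
A quick sanity check confirms the sample contains (up to) three indicators per topic:

#count the sampled indicators by topic
ref_selected_indicators |>
  count(topic)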

Data We Need

As we didn’t know what the index should look like, we decided to proxy this based on the availability of data for a group of peers. Although our selection of peers was quite involved in the full project, we’ll keep it simple here and naively assume the statistical peers for Tanzania are countries in the same regional and income groupings:

Specifying peers and downloading data

#specify focus country iso3c code
ref_fcs_iso <- "TZA"

#select countries
ref_peer_isos <- wb_countries() |>
  filter(admin_region_iso3c == "SSA",
         income_level_iso3c == "LMC")

#download data for selected indicators
dta_wb <- wb_data(country = ref_peer_isos$iso3c,
                  indicator = ref_selected_indicators$indicator_id,
                  return_wide = FALSE)

#view peers
kableExtra::kable(ref_peer_isos |> select(country))

Selected Peers:

country
Angola
Benin
Cote d’Ivoire
Cameroon
Congo, Rep.
Comoros
Cabo Verde
Ghana
Guinea
Kenya
Lesotho
Mauritania
Nigeria
Senegal
Sao Tome and Principe
Eswatini
Tanzania
Zambia
Zimbabwe

Putting aside whether the countries listed can be considered ‘peers’ for Tanzania4, the next question is how to sensibly compare the availability of data across countries. For instance:

  • Should we compare Tanzania against the best-performing peer or the median, or is the average good enough?
  • Should the frequency of publication be considered?
  • Should more recent data be prioritized?
  • etc…

As you might have guessed, I’m not going to give a simple answer here because there isn’t one. But because our assessment covered 14,000+ indicators across 250+ databases, we took a relatively simple approach: dropping all indicators without values for recent periods and comparing the availability of data for our focus country with the peer average.
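
The ‘recent periods’ filter isn’t shown in the snippets that follow, but a minimal sketch might look like this (the 2015 cutoff is an assumption for illustration):

#keep only indicators with at least one value in recent years
#(the 2015 cutoff is illustrative)
recent_ids <- dta_wb |>
  filter(as.integer(date) >= 2015, !is.na(value)) |>
  distinct(indicator_id) |>
  pull(indicator_id)

dta_wb_recent <- dta_wb |>
  filter(indicator_id %in% recent_ids)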

The code below illustrates the basic idea by comparing the percentage of missing values for Tanzania with its peers for each indicator, ignoring the recency of the observations available:

Estimating the comparative proportion of missing values

#label peers vs focus country; transform date to integer
dta_wb <- dta_wb |>
  mutate(date = as.integer(date),
         iso_group = if_else(iso3c == ref_fcs_iso,
                             "Focus", "Peer"))

#Calculate the % of missing values
#Add topic (noting some indicators are assigned multiple topics)
sum_na_benchmarking <- left_join(dta_wb, ref_dta_topics) |>
  group_by(iso_group, topic) |>
  summarize(pct_na = sum(is.na(value)) / n()) |>
  pivot_wider(values_from = pct_na, names_from = iso_group) |>
  mutate(na_data_gap = round(Focus - Peer, 2))

#present largest data gaps using the missing value approach
sum_na_benchmarking |>
  select(topic, na_data_gap) |>
  arrange(na_data_gap) |>
  head(5) |>
  kableExtra::kable()

Potential Data Gaps (missing values approach):

topic                 na_data_gap
External Debt         -0.11
Climate Change        -0.10
Infrastructure        -0.10
Urban Development     -0.09
Energy & Mining       -0.06

Data We’re Missing

From the looks of it, the average availability of data on external debt, climate change and infrastructure is lower for Tanzania than for its peers, suggesting it might be relatively harder to find data on these topics.

Of course, because the primary sources for many of the indicators are the statistical agencies of each country, a country-level comparison makes more sense. It’s also likely to be more informative to compare the availability of data at the indicator level before summarizing data gaps by topic, to avoid our results being skewed by the number of indicators available for each country.

Another potential issue is that, in many cases, missing values exist before data for an indicator was first collected by Tanzania or its peers. A good example of this is the median lead time to import in days (ID: LP.IMP.DURS.MD), which for Tanzania is available only after 2007 and on an irregular basis. This makes a data gap based on the percentage of missing values potentially misleading, as newer data series will have missing values in periods before data was first collected for an indicator.
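
You can check this for yourself by pulling the indicator directly and seeing which years actually contain values (a quick diagnostic rather than part of the main pipeline):

#quick check: years with values for median lead time to import (Tanzania)
wb_data(country = "TZA",
        indicator = "LP.IMP.DURS.MD",
        return_wide = FALSE) |>
  filter(!is.na(value)) |>
  pull(date)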

To address this, the code below calculates the proportion of values available after the first value is recorded for an indicator. This follows a similar logic to the percentage of missing values, except that it only penalizes a country’s score once there is evidence data has been collected for an indicator. It also gives a better sense of how reliably data has been collected after the first observation:

Estimating Data Gaps at the Indicator Level:

#create a set of summary stats for indicators
sum_dta_benchmarking_by_indic <- dta_wb |>
  filter(!is.na(value)) |>
  group_by(indicator_id, indicator, iso_group, iso3c, country) |>
  summarize(n_obs = n(),
            start_year = min(date),
            end_year = max(date) + 1) |>
  mutate(max_n_obs = end_year - start_year,
         pct_obs = n_obs / max_n_obs) |>
  #aggregate to iso_group level
  group_by(indicator_id, indicator, iso_group) |>
  summarize(avg_pct_obs = mean(pct_obs, na.rm = TRUE)) |>
  #reshape to compare peers
  pivot_wider(names_from = iso_group,
              values_from = avg_pct_obs,
              values_fill = 0) |>
  mutate(data_gap = round(Focus - Peer, 2))

#present indicators where the focus country's coverage is
#at least 10 percentage points below the peer average
sum_dta_benchmarking_by_indic |>
  filter(data_gap <= -0.1) |>
  arrange(data_gap) |>
  ungroup() |>
  select(indicator, data_gap) |>
  head(5) |>
  kableExtra::kable()

Potential Data Gaps (% Values Since First Year at Indicator Level):

indicator                                                                                        data_gap
Industrial design applications, resident, by count                                                  -0.82
Agricultural irrigated land (% of total agricultural land)                                          -0.71
Market capitalization of listed domestic companies (current US$)                                    -0.54
Patent applications, residents                                                                      -0.53
Children in employment, study and work, female (% of female children in employment, ages 7-14)      -0.38

At the indicator level, around one-fifth of indicators have a ‘data gap’ of more than ten percentage points. If this seems like an arbitrary cutoff, it is. The main point is really to highlight the most ‘unusual’ data gaps, which are likely to say something about the availability of data. For example, in our full assessment we defined a ‘data gap’ as coverage two standard deviations below the peer average.
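
As a sketch of how that two-standard-deviation rule could be applied here (re-deriving country-level coverage from dta_wb; the object and column names follow the earlier blocks, and the rule itself is the only new piece):

#sketch: flag indicators where the focus country's coverage sits more than
#two standard deviations below the peer average
coverage_by_country <- dta_wb |>
  filter(!is.na(value)) |>
  group_by(indicator_id, iso_group, iso3c) |>
  summarize(n_obs = n(),
            start_year = min(date),
            end_year = max(date) + 1,
            .groups = "drop") |>
  mutate(pct_obs = n_obs / (end_year - start_year))

sd_gaps <- coverage_by_country |>
  filter(iso_group == "Peer") |>
  group_by(indicator_id) |>
  summarize(peer_mean = mean(pct_obs),
            peer_sd = sd(pct_obs),
            .groups = "drop") |>
  left_join(coverage_by_country |>
              filter(iso_group == "Focus") |>
              select(indicator_id, focus_pct_obs = pct_obs),
            by = "indicator_id") |>
  mutate(focus_pct_obs = replace_na(focus_pct_obs, 0),
         flag_2sd_gap = focus_pct_obs < peer_mean - 2 * peer_sd)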

It’s also important to keep in mind that there is a range of reasons a ‘data gap’ might emerge in this assessment. For instance, some indicators might be less relevant for Tanzania, so they aren’t collected; or their collection and reporting methodology may not meet the requirements of the database being surveyed (in this case, the World Bank’s). The assessment also says nothing about the accuracy, timeliness, sensitivity or granularity of the data available to us.

Reflecting this, we chose to report the data gaps at the topic level, making it clearer that any ‘data gaps’ are meant to be indicative of where data may be difficult to come by when building the index. The code below demonstrates the basic approach using an indicator’s topic:

Estimating Data Gaps at the Topic Level:

#estimate the average data gap by topic
sum_dta_benchmarking_by_topic <- sum_dta_benchmarking_by_indic |>
  left_join(ref_dta_topics) |>
  group_by(topic) |>
  summarize(avg_data_gap = mean(data_gap) |> round(2))

#merge with data gaps estimated via missing values
sum_dta_benchmarking_by_topic <- sum_dta_benchmarking_by_topic |>
  left_join(sum_na_benchmarking |> select(topic, na_data_gap))

#view the five largest 'data gaps' by topic
sum_dta_benchmarking_by_topic |>
  arrange(avg_data_gap) |>
  head(5) |>
  kableExtra::kable()

Potential Data Gaps (% Values Since First Year at Topic Level):

topic                              avg_data_gap   na_data_gap
Infrastructure                            -0.25         -0.10
Science & Technology                      -0.24          0.01
Agriculture & Rural Development           -0.17          0.00
Financial Sector                          -0.16         -0.05
Social Development                        -0.12         -0.04

Based on the topic-level assessment by country and indicator (avg_data_gap), it appears that data gaps are likely to exist for infrastructure, science and technology, and agriculture and rural development.5 This suggests that extra effort may be needed to source data in these areas, or to find suitable proxy variables, to avoid them becoming blind spots in the final index.

In our case, this is exactly how we interpreted the results: we devoted significantly more time to identifying additional data sources to help fill these gaps and requested direct input from stakeholders on where to look. The results were also useful beyond this: the identified gaps fed into data advocacy discussions held with policymakers, as the analysis indicated where the focus country should concentrate its efforts to catch up with its peers.

Which is more or less why this blog exists: identifying data gaps, and lobbying for them to be filled, has formed the backdrop of almost every analytics project I’ve been involved in. So next time you find yourself staring at a project that would benefit from more data, consider taking our back-of-the-envelope methodology out for a spin (and letting us know how you improved it along the way!).

  1. For example, see: Ravallion, M., 2012. Mashup indices of development. The World Bank Research Observer, 27(1), pp.1-32.
  2. Such as the accuracy, sensitivity, granularity and timeliness of the data. See: Joint Research Centre, 2008. Handbook on constructing composite indicators: Methodology and user guide. OECD Publishing, https://www.oecd-ilibrary.org/economics/handbook-on-constructing-composite-indicators-methodology-and-user-guide_9789264043466-en, p.44.
  3. Tanzania was not our focus during the full assessment; its selection here is for illustrative purposes.
  4. In the actual assessment, the peers were selected based on input received during consultations and analysis.
  5. The results also confirm what we might have expected: the aggregate data gaps suggested by the % of missing values will not necessarily accord with country-level analysis by indicator (the na_data_gap column).

A note on how AI was used: The majority of this post and its code were produced by the author. AI tools were used to critique the methodology and suggest alternative approaches for communicating some of the ideas presented in the post.
