
GDELT, Missiles, and Image Collection

[This article was first published on R Bloggers on syknapptic, and kindly contributed to R-bloggers].

The Global Database of Events, Language, and Tone, or GDELT, is “a realtime network diagram and database of global human society for open research”.

A firehose stream of global data offers tantalizing possibilities for research, but concrete examples of work that goes beyond simple evaluations of the database’s capabilities are notably absent.


Let’s see how well we can scoop up a set of ballistic missile images using a combination of packages:

library(gdeltr2)       # devtools::install_github("abresler/gdeltr2")
library(tidyverse)     # install.packages("tidyverse")
library(knitr)         # install.packages("knitr")
library(imager)        # install.packages("imager")

Before we start extracting any data, let’s refine our search as much as possible by assigning a handful of variables that we can use in the API call.

GDELT uses a set of codebooks that can be referenced with get_gdelt_codebook_ft_api().
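Each codebook comes back as a data frame with a value column of usable search terms; as a quick illustration, we can look over the structure of the languages codebook that we query in the next section:

get_gdelt_codebook_ft_api(code_book = "languages") %>%
  glimpse()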

Query Variables

Languages

gdeltr2 will only query English articles by default, but we don’t really care about the article’s language, or even its text, since it’s the images we’re after.

We’ll pull() all the languages from code_book = "languages" into a vector variable that we can use as a search argument like so:

langs <- get_gdelt_codebook_ft_api(code_book = "languages") %>%
  pull(value)

langs
##  [1] "Afrikaans"        "Albanian"         "Arabic"          
##  [4] "Armenian"         "Azerbaijani"      "Bengali"         
##  [7] "Bosnian"          "Bulgarian"        "Catalan"         
## [10] "Chinese"          "Croatian"         "Czech"           
## [13] "Danish"           "Dutch"            "Estonian"        
## [16] "Finnish"          "French"           "Galician"        
## [19] "Georgian"         "German"           "Greek"           
## [22] "Gujarati"         "Hebrew"           "Hindi"           
## [25] "Hungarian"        "Icelandic"        "Indonesian"      
## [28] "Italian"          "Japanese"         "Kannada"         
## [31] "Kazakh"           "Korean"           "Latvian"         
## [34] "Lithuanian"       "Macedonian"       "Malay"           
## [37] "Malayalam"        "Marathi"          "Mongolian"       
## [40] "Nepali"           "Norwegian"        "NorwegianNynorsk"
## [43] "Persian"          "Polish"           "Portuguese"      
## [46] "Punjabi"          "Romanian"         "Russian"         
## [49] "Serbian"          "Sinhalese"        "Slovak"          
## [52] "Slovenian"        "Somali"           "Spanish"         
## [55] "Swahili"          "Swedish"          "Tamil"           
## [58] "Telugu"           "Thai"             "Tibetan"         
## [61] "Turkish"          "Ukrainian"        "Urdu"            
## [64] "Vietnamese"

Image Tags

Since we’re looking specifically for imagery, we can query the relevant codebooks with code_book = "imagetags" and code_book = "imageweb" like so:

get_gdelt_codebook_ft_api(code_book = "imagetags") %>%
  head() %>%
  kable()
nameCodebook   idImageTag   value
IMAGETAGS      person       65791693
IMAGETAGS      profession   33553949
IMAGETAGS      vehicle      25342998
IMAGETAGS      sports       17180851
IMAGETAGS      speech       16976988
IMAGETAGS      people       13358317
get_gdelt_codebook_ft_api(code_book = "imageweb") %>%
  head() %>%
  kable()
nameCodebook   idImageWeb                 value
IMAGEWEB       Image                      2198300
IMAGEWEB       News                       2136894
IMAGEWEB       Photograph                 1027341
IMAGEWEB       United States of America   659847
IMAGEWEB       Speech                     649292
IMAGEWEB       Car                        621304

We’ll filter() the tags to retain only those that explicitly reference “missile” with a regex.

We also want to handle a bug in gdeltr2’s query functions, where a large amount of incorrect information sometimes makes it into the tag lists. Fortunately, we can omit it by excluding results that contain blocks of multiple digits.

tag_regex <- "\\b[Mm]issile\\b"

bind_rows(
  get_gdelt_codebook_ft_api(code_book = "imagetags") %>%
    filter(str_detect(idImageTag, tag_regex),
           !str_detect(idImageTag, "\\d{2,}")),
  
  get_gdelt_codebook_ft_api(code_book = "imageweb") %>%
    filter(str_detect(idImageWeb, tag_regex),
           !str_detect(idImageWeb, "\\d{2,}"))
  ) %>%
  head() %>%
  kable()
nameCodebook   idImageTag                    value    idImageWeb
IMAGETAGS      missile                       247486   NA
IMAGETAGS      guided missile destroyer      194660   NA
IMAGETAGS      missile boat                  147549   NA
IMAGETAGS      ballistic missile submarine   55996    NA
IMAGETAGS      cruise missile submarine      11508    NA
IMAGEWEB       NA                            77637    Missile
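Note the word boundaries in tag_regex: they restrict the match to the whole word. A quick, illustrative check (the test strings here are just examples):

str_detect(c("missile", "Missile launch facility", "missiles"), tag_regex)
## [1]  TRUE  TRUE FALSE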

We’ll refine our results by excluding some of the tags that have a tendency to return less relevant images.

  • vehicle terms tend to emphasize the vehicle itself, rather than weapon systems
    • "boat"
    • "submarine"
    • "tank"
    • "destroyer"
  • Missile "defense" emphasizes politics over hardware
  • specific "system" tags are all in reference to surface-to-air platforms
    • S-300 missile system
    • S-400 missile system
    • Buk missile system
  • the generalized surface-to-“air” tags never seem to reference ballistic missiles

We’ll use another regex to omit those tags, again appending the multiple-digit pattern that excludes the buggy data that may leak into our results.

Junk Tag Filtering

junk_tag_regex <- c("boat", "[Ss]ubmarine", "tank", "destroyer",
                    "defense",
                    "system",
                    "air") %>%
  paste0("\\b", ., "\\b") %>%
  str_c(collapse = "|") %>%
  paste0("|\\d{2,}")

junk_tag_regex
## [1] "\\bboat\\b|\\b[Ss]ubmarine\\b|\\btank\\b|\\bdestroyer\\b|\\bdefense\\b|\\bsystem\\b|\\bair\\b|\\d{2,}"

With some parameters in mind and filtering variables assigned, let’s pull() the desired tags from each codebook into a pair of variables which we will use to query GDELT’s API.

image_tags <- get_gdelt_codebook_ft_api(code_book = "imagetags") %>%
  filter(str_detect(idImageTag, tag_regex),
         !str_detect(idImageTag, junk_tag_regex)) %>%
  pull(idImageTag)

imageweb_tags <- get_gdelt_codebook_ft_api(code_book = "imageweb") %>%
  filter(str_detect(idImageWeb, tag_regex),
         !str_detect(idImageWeb, junk_tag_regex)) %>%
  pull(idImageWeb)

combine(image_tags, imageweb_tags)
##  [1] "missile"                           
##  [2] "Missile"                           
##  [3] "Ballistic missile"                 
##  [4] "Cruise missile"                    
##  [5] "Intercontinental ballistic missile"
##  [6] "Anti-ballistic missile"            
##  [7] "Missile launch facility"           
##  [8] "Medium-range ballistic missile"    
##  [9] "Land-attack missile"               
## [10] "Short-range ballistic missile"     
## [11] "Surface-to-surface missile"

Dates

We’ll specify a time period using gdeltr2::generate_dates(). For this example, we’ll select September 22-23 of 2017 to see if we can capture coverage of an Iranian military parade.

target_dates <- generate_dates(start_date = "2017-09-22",
                               end_date = "2017-09-23")

API Call

With all of our query variables prepared, we’ll call GDELT’s API using get_data_ft_v2_api(). Since duplicate articles are commonly published across many venues, we’ll reduce the results to distinct() titleArticles.

articles_df <- get_data_ft_v2_api(images_tag = image_tags,
                                  images_web_tag = imageweb_tags,
                                  search_language = langs,
                                  dates = target_dates, 
                                  visualize_results = FALSE) %>%
  distinct(titleArticle, .keep_all = TRUE)
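A quick count of the de-duplicated articles, which matches the Observations count in the glimpse() that follows:

nrow(articles_df)
## [1] 393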

Query Results

Here’s a summary of what we get back.

articles_df %>% 
  glimpse()
## Observations: 393
## Variables: 16
## $ modeSearch          <chr> "ArtList", "ArtList", "ArtList", "ArtList"...
## $ imagewebtagSearch   <chr> "Missile", "Missile", "Missile", "Missile"...
## $ datetimeStartSearch <chr> "2017-09-22 12:00:00", "2017-09-22 12:00:0...
## $ datetimeEndSearch   <chr> "2017-09-23 11:59:59", "2017-09-23 11:59:5...
## $ imagetagSearch      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ isOR                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ countMaximumRecords <dbl> 250, 250, 250, 250, 250, 250, 250, 250, 25...
## $ urlGDELTV2FTAPI     <chr> "https://api.gdeltproject.org/api/v2/doc/d...
## $ urlArticle          <chr> "http://www.iranherald.com/news/254804227/...
## $ urlArticleMobile    <chr> NA, "http://m.alarab.com/Article/825920", ...
## $ titleArticle        <chr> "Iran Herald", "<U+0647><U+0632><U+0629> <U+0623><U+0631><U+0636><U+064A><U+0629> <U+062B><U+0627><U+0646><U+064A><U+0629> <U+0641><U+064A> <U+0643><U+0648><U+064A><U+0627> <U+0627><U+0644>...
## $ datetimeArticle     <dttm> 2017-09-23 05:00:00, 2017-09-23 05:00:00,...
## $ urlImage            <chr> "http://cdn.bignewsnetwork.com/voa15061647...
## $ domainArticle       <chr> "iranherald.com", "alarab.com", "baomoi.co...
## $ languageArticle     <chr> "English", "Arabic", "Vietnamese", "Nepali...
## $ countryArticle      <chr> "Iran", "Israel", "Vietnam", "Tuvalu", "Un...

Extracting Images

Now that we have a data frame of articles that includes a column of image URLs, we can download the data.
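As a quick peek at the column we care about (urlImage, seen in the glimpse() above):

articles_df %>%
  select(urlImage) %>%
  head(3)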

Directory

Let’s assign a variable for our directory.

dir <- "data/missile_images/"

Then we’ll actually create the directory.

dir.create(dir)
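One caveat: dir.create() only creates the final path component, so this fails if data/ doesn’t exist yet. A more defensive call creates parent directories and stays quiet if the folder is already there:

dir.create(dir, recursive = TRUE, showWarnings = FALSE)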

Workflow

We’re going to take advantage of the magic of the purrr package in several ways to stabilize our workflow.

Error Handling

The Internet is littered with broken links and webpages, which becomes more likely the further back in time we go. We’ll use one of purrr’s adverbs, safely(), to handle the inevitable download errors that will occur by creating a new function called safe_download().

safe_download <- safely(download.file)
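safely() wraps a function so that, instead of throwing, it always returns a two-element list with result and error components, exactly one of which is NULL. A minimal illustration (the URL here is deliberately fake):

res <- safe_download("http://example.com/no-such-image.jpg", tempfile())
res$result  # NULL, since the download failed
res$error   # the captured error condition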

We’re also going to create safe versions of functions we’ll use for loading and plotting images. Although most of the valid URLs will link to clean images, it’s not uncommon for otherwise successful downloads to actually come from already corrupted sources.

To handle this, we’ll create safe_image() and safe_plot().

safe_image <- safely(load.image)
safe_plot <- safely(plot)

Download Images

  1. filter() the images with a regex that confirms a .jpg or .png extension and, at the same time, captures a URL fragment we can reuse as each image’s eventual file path.
  2. select a sample of 100 random rows with sample_n(100)
  3. pull() the urlImage column into a vector
  4. iterate through each item of the vector with walk()
    • safe_download() each image as a binary file (mode = "wb")
      • writing it to dir under the filename matched by valid_path_regex

valid_path_regex <- "/[A-z0-9-_]+\\.(jpg|png)$"

articles_df %>%
  filter(str_detect(urlImage, valid_path_regex)) %>%
  sample_n(100) %>%
  pull(urlImage) %>%
  walk(~
         safe_download(.x,
                       paste0(dir, 
                              str_extract(.x, valid_path_regex)),
                       mode = "wb")
         )
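Since safe_download() swallows failures, it’s worth counting what actually landed on disk (the exact number will vary from run to run):

length(list.files(dir, pattern = "\\.(jpg|png)$"))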

Inspect Images

Let’s inspect a sample of the downloaded images.

Clearly, the results are not perfect: there are images without anything resembling a missile, as well as several duplicate or near-duplicate images. That said, manually renaming files makes it straightforward to filter out the useless ones.

This is a quick proof of concept that sets us up well for enhancing data sets established through other methods.

More importantly, it demonstrates a basic workflow for bulk image processing that can easily be expanded to iteratively prepare a large dataset for many kinds of analysis.

We can take a look at our results with the following:

  1. list.files() the full paths of all the files in dir
  2. iterate through the resulting vector, reading each file with safe_image() and map()ping the results to a list
  3. remove a layer of the list hierarchy by flatten()ing it
  4. omit any resulting NULL values by compact()ing the list
  5. subset a sample() of half the images, just for demonstration
  6. walk() through the list, plotting each image

par(mfrow = c(5, 2))

list.files(dir, full.names = TRUE) %>%
  map(safe_image) %>%
  flatten() %>%
  compact() %>%
  sample(50) %>%
  walk(~ 
         safe_plot(.x, 
                   axes = FALSE, ann = FALSE)
       )
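Since par() settings persist for the rest of the session, we’ll restore the default single-panel layout once we’re done:

par(mfrow = c(1, 1))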

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2       imager_0.41.1        magrittr_1.5        
##  [4] knitr_1.20.8         forcats_0.3.0        stringr_1.3.1       
##  [7] dplyr_0.7.6          purrr_0.2.5          readr_1.1.1         
## [10] tidyr_0.8.1          tibble_1.4.2.9004    ggplot2_3.0.0.9000  
## [13] tidyverse_1.2.1.9000 gdeltr2_0.3.11702   
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137            xts_0.10-2             
##  [3] lubridate_1.7.4         webshot_0.5.0          
##  [5] progress_1.2.0          httr_1.3.1             
##  [7] tools_3.5.1             backports_1.1.2        
##  [9] utf8_1.1.4              R6_2.2.2               
## [11] lazyeval_0.2.1          colorspace_1.3-2       
## [13] withr_2.1.2             readbitmap_0.1.5       
## [15] tidyselect_0.2.4        prettyunits_1.0.2      
## [17] mnormt_1.5-5            curl_3.2               
## [19] compiler_3.5.1          cli_1.0.0              
## [21] rvest_0.3.2             xml2_1.2.0             
## [23] bookdown_0.7            triebeard_0.3.0        
## [25] scales_0.5.0.9000       checkmate_1.8.5        
## [27] psych_1.8.4             RApiDatetime_0.0.3     
## [29] trelliscopejs_0.1.13    digest_0.6.15          
## [31] tiff_0.1-5              foreign_0.8-70         
## [33] rmarkdown_1.10.7        base64enc_0.1-3        
## [35] jpeg_0.1-8              pkgconfig_2.0.1        
## [37] htmltools_0.3.6         highcharter_0.5.0      
## [39] highr_0.7               htmlwidgets_1.2        
## [41] rlang_0.2.1             ggthemes_3.5.0         
## [43] readxl_1.1.0            TTR_0.23-3             
## [45] htmldeps_0.1.0          rstudioapi_0.7         
## [47] quantmod_0.4-13         bindr_0.1.1            
## [49] zoo_1.8-2               jsonlite_1.5           
## [51] mclust_5.4.1            rlist_0.4.6.1          
## [53] fansi_0.2.3             Rcpp_0.12.17           
## [55] munsell_0.5.0           purrrlyr_0.0.3         
## [57] stringi_1.2.3           yaml_2.1.19            
## [59] plyr_1.8.4              grid_3.5.1             
## [61] parallel_3.5.1          crayon_1.3.4           
## [63] lattice_0.20-35         haven_1.1.2            
## [65] hms_0.4.2               anytime_0.3.1          
## [67] pillar_1.3.0.9000       igraph_1.2.1           
## [69] reshape2_1.4.3          codetools_0.2-15       
## [71] glue_1.2.0              evaluate_0.10.1        
## [73] blogdown_0.7.1          DistributionUtils_0.5-1
## [75] bmp_0.3                 data.table_1.11.5      
## [77] modelr_0.1.2            png_0.1-7              
## [79] urltools_1.7.0          cellranger_1.1.0       
## [81] gtable_0.2.0            assertthat_0.2.0       
## [83] xfun_0.3                broom_0.4.5            
## [85] autocogs_0.0.1          wordcloud2_0.2.1
