The Global Database of Events, Language, and Tone, or GDELT, is “a realtime network diagram and database of global human society for open research”.
A firehose stream of global data holds tantalizing possibilities for research, but concrete examples of work that go beyond simple evaluations of the database's capabilities are notably absent…
See also:
- My hasty walkthrough from November 2017, Evaluating GDELT: Syrian Conflict
Let's see how well we can scoop up a set of ballistic missile images using a combination of packages:

- `gdeltr2`: a package that is decidedly buggy, but works for these purposes
- the `tidyverse` suite:
  - `dplyr` for data carpentry
  - `stringr` for string processing
  - `purrr` for functional enhancements and simplified error-handling
- `knitr` for nicely rendered tabular data
- `imager` for slick image processing tools
```r
library(gdeltr2)   # devtools::install_github("abresler/gdeltr2")
library(tidyverse) # install.packages("tidyverse")
library(knitr)
library(imager)    # install.packages("imager")
```
Before we start extracting any data, let’s refine our search as much as possible by assigning a handful of variables that we can use in the API call.
GDELT uses a set of codebooks that can be referenced with `get_gdelt_codebook_ft_api()`.
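For example, we can peek at a codebook's structure before building a query around it (a minimal illustration; `glimpse()` comes from the already-loaded `tidyverse`):

```r
# Inspect the columns of the "languages" codebook
get_gdelt_codebook_ft_api(code_book = "languages") %>%
  glimpse()
```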
Query Variables
Languages
`gdeltr2` will only query English articles by default, but we don't really care about the language of the article or even the article text. We'll `pull()` all the languages from `code_book = "languages"` into a vector variable that we can use as a search argument like so:
```r
langs <- get_gdelt_codebook_ft_api(code_book = "languages") %>%
  pull(value)

langs
##  [1] "Afrikaans"        "Albanian"         "Arabic"
##  [4] "Armenian"         "Azerbaijani"      "Bengali"
##  [7] "Bosnian"          "Bulgarian"        "Catalan"
## [10] "Chinese"          "Croatian"         "Czech"
## [13] "Danish"           "Dutch"            "Estonian"
## [16] "Finnish"          "French"           "Galician"
## [19] "Georgian"         "German"           "Greek"
## [22] "Gujarati"         "Hebrew"           "Hindi"
## [25] "Hungarian"        "Icelandic"        "Indonesian"
## [28] "Italian"          "Japanese"         "Kannada"
## [31] "Kazakh"           "Korean"           "Latvian"
## [34] "Lithuanian"       "Macedonian"       "Malay"
## [37] "Malayalam"        "Marathi"          "Mongolian"
## [40] "Nepali"           "Norwegian"        "NorwegianNynorsk"
## [43] "Persian"          "Polish"           "Portuguese"
## [46] "Punjabi"          "Romanian"         "Russian"
## [49] "Serbian"          "Sinhalese"        "Slovak"
## [52] "Slovenian"        "Somali"           "Spanish"
## [55] "Swahili"          "Swedish"          "Tamil"
## [58] "Telugu"           "Thai"             "Tibetan"
## [61] "Turkish"          "Ukrainian"        "Urdu"
## [64] "Vietnamese"
```
Image Tags
Since we're looking specifically for imagery, we can query the relevant codebooks with `code_book = "imagetags"` and `code_book = "imageweb"` like so:
```r
get_gdelt_codebook_ft_api(code_book = "imagetags") %>%
  head() %>%
  kable()
```
nameCodebook | idImageTag | value |
---|---|---|
IMAGETAGS | person | 65791693 |
IMAGETAGS | profession | 33553949 |
IMAGETAGS | vehicle | 25342998 |
IMAGETAGS | sports | 17180851 |
IMAGETAGS | speech | 16976988 |
IMAGETAGS | people | 13358317 |
```r
get_gdelt_codebook_ft_api(code_book = "imageweb") %>%
  head() %>%
  kable()
```
nameCodebook | idImageWeb | value |
---|---|---|
IMAGEWEB | Image | 2198300 |
IMAGEWEB | News | 2136894 |
IMAGEWEB | Photograph | 1027341 |
IMAGEWEB | United States of America | 659847 |
IMAGEWEB | Speech | 649292 |
IMAGEWEB | Car | 621304 |
We'll `filter()` the tags to retain only those that explicitly reference "missile" with a regex. We also want to handle a bug in `gdeltr2`'s query functions where a large amount of incorrect information sometimes makes it into tag lists. Fortunately, we can omit that junk by excluding results containing blocks of multiple digits.
```r
tag_regex <- "\\b[Mm]issile\\b"

bind_rows(
  get_gdelt_codebook_ft_api(code_book = "imagetags") %>%
    filter(str_detect(idImageTag, tag_regex),
           !str_detect(idImageTag, "\\d{2,}")),
  get_gdelt_codebook_ft_api(code_book = "imageweb") %>%
    filter(str_detect(idImageWeb, tag_regex),
           !str_detect(idImageWeb, "\\d{2,}"))
) %>%
  head() %>%
  kable()
```
nameCodebook | idImageTag | value | idImageWeb |
---|---|---|---|
IMAGETAGS | missile | 247486 | NA |
IMAGETAGS | guided missile destroyer | 194660 | NA |
IMAGETAGS | missile boat | 147549 | NA |
IMAGETAGS | ballistic missile submarine | 55996 | NA |
IMAGETAGS | cruise missile submarine | 11508 | NA |
IMAGEWEB | NA | 77637 | Missile |
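As a quick sanity check, `tag_regex`'s word boundaries behave as intended on a few made-up tag strings (this snippet is illustrative and not part of the original query):

```r
# "missiles" fails: no word boundary between "missile" and the trailing "s".
# The hyphen in "anti-missile" is a non-word character, so it still matches.
str_detect(c("missile", "Missile launch facility", "missiles", "anti-missile"),
           tag_regex)
## [1]  TRUE  TRUE FALSE  TRUE
```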
We'll refine our results by excluding some of the tags that have a tendency to return less relevant images:

- vehicle terms tend to emphasize the vehicle itself, rather than weapon systems:
  - `"boat"`
  - `"submarine"`
  - `"tank"`
  - `"destroyer"`
- missile `"defense"` emphasizes politics over hardware
- specific `"system"` tags are all in reference to surface-to-air platforms:
  - S-300 missile system
  - S-400 missile system
  - Buk missile system
- generalized surface-to-`"air"` doesn't seem fuzzy enough to ever reference ballistic missiles
We'll use another regex to omit those tags, folding in the multiple-digit pattern from earlier to exclude the buggy data that may leak into our results.
Junk Tag Filtering
```r
junk_tag_regex <- c("boat", "[Ss]ubmarine", "tank", "destroyer",
                    "defense", "system", "air") %>%
  paste0("\\b", ., "\\b") %>%
  str_c(collapse = "|") %>%
  paste0("|\\d{2,}")

junk_tag_regex
## [1] "\\bboat\\b|\\b[Ss]ubmarine\\b|\\btank\\b|\\bdestroyer\\b|\\bdefense\\b|\\bsystem\\b|\\bair\\b|\\d{2,}"
```
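To see the junk filter in action on a few more invented tag strings, the pattern catches the platform terms while leaving the tags we actually want:

```r
# "S-300 missile system" trips both \bsystem\b and \d{2,};
# "missile boat" trips \bboat\b; "Ballistic missile" survives
str_detect(c("S-300 missile system", "missile boat", "Ballistic missile"),
           junk_tag_regex)
## [1]  TRUE  TRUE FALSE
```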
With some parameters in mind and filtering variables assigned, let's `pull()` the desired tags from each codebook into a pair of variables which we will use to query GDELT's API.
```r
image_tags <- get_gdelt_codebook_ft_api(code_book = "imagetags") %>%
  filter(str_detect(idImageTag, tag_regex),
         !str_detect(idImageTag, junk_tag_regex)) %>%
  pull(idImageTag)

imageweb_tags <- get_gdelt_codebook_ft_api(code_book = "imageweb") %>%
  filter(str_detect(idImageWeb, tag_regex),
         !str_detect(idImageWeb, junk_tag_regex)) %>%
  pull(idImageWeb)

combine(image_tags, imageweb_tags)
##  [1] "missile"
##  [2] "Missile"
##  [3] "Ballistic missile"
##  [4] "Cruise missile"
##  [5] "Intercontinental ballistic missile"
##  [6] "Anti-ballistic missile"
##  [7] "Missile launch facility"
##  [8] "Medium-range ballistic missile"
##  [9] "Land-attack missile"
## [10] "Short-range ballistic missile"
## [11] "Surface-to-surface missile"
```
Dates
We'll specify a time period using `gdeltr2::generate_dates()`. For this example, we'll select September 22-23 of 2017 to see if we can capture coverage of an Iranian military parade.
```r
target_dates <- generate_dates(start_date = "2017-09-22",
                               end_date = "2017-09-23")
```
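Since `gdeltr2` can be flaky, note that the same date vector is easy to build by hand. This is a stand-in sketch, assuming the API accepts plain `"YYYY-MM-DD"` strings like those `generate_dates()` appears to produce:

```r
# Hand-rolled equivalent of generate_dates() (assumption: "YYYY-MM-DD" format)
target_dates_manual <- format(seq(as.Date("2017-09-22"),
                                  as.Date("2017-09-23"),
                                  by = "day"))
```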
API Call
With all of our query variables prepared, we'll call GDELT's API using `get_data_ft_v2_api()`. Since duplicate articles are commonly published across many venues, we'll reduce the results to `distinct()` `titleArticle`s.
```r
articles_df <- get_data_ft_v2_api(images_tag = image_tags,
                                  images_web_tag = imageweb_tags,
                                  search_language = langs,
                                  dates = target_dates,
                                  visualize_results = FALSE) %>%
  distinct(titleArticle, .keep_all = TRUE)
```
Query Results
Here’s a summary of what we get back.
```r
articles_df %>%
  glimpse()
## Observations: 393
## Variables: 16
## $ modeSearch          <chr> "ArtList", "ArtList", "ArtList", "ArtList"...
## $ imagewebtagSearch   <chr> "Missile", "Missile", "Missile", "Missile"...
## $ datetimeStartSearch <chr> "2017-09-22 12:00:00", "2017-09-22 12:00:0...
## $ datetimeEndSearch   <chr> "2017-09-23 11:59:59", "2017-09-23 11:59:5...
## $ imagetagSearch      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ isOR                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ countMaximumRecords <dbl> 250, 250, 250, 250, 250, 250, 250, 250, 25...
## $ urlGDELTV2FTAPI     <chr> "https://api.gdeltproject.org/api/v2/doc/d...
## $ urlArticle          <chr> "http://www.iranherald.com/news/254804227/...
## $ urlArticleMobile    <chr> NA, "http://m.alarab.com/Article/825920", ...
## $ titleArticle        <chr> "Iran Herald", "هزة أرضية ثانية في كويا ال...
## $ datetimeArticle     <dttm> 2017-09-23 05:00:00, 2017-09-23 05:00:00,...
## $ urlImage            <chr> "http://cdn.bignewsnetwork.com/voa15061647...
## $ domainArticle       <chr> "iranherald.com", "alarab.com", "baomoi.co...
## $ languageArticle     <chr> "English", "Arabic", "Vietnamese", "Nepali...
## $ countryArticle      <chr> "Iran", "Israel", "Vietnam", "Tuvalu", "Un...
```
Extracting Images
Now that we have a data frame of articles that includes a column of image URLs, we can download the data.
Directory
Let's assign a variable for our `dir`ectory.
```r
dir <- "data/missile_images/"
```
Then we'll actually create the `dir`ectory.
```r
# recursive = TRUE also creates the parent "data/" folder if it doesn't exist
dir.create(dir, recursive = TRUE)
```
Workflow
We're going to take advantage of the magic of the `purrr` package in several ways to stabilize our workflow.
Error Handling
The Internet is littered with broken links and dead webpages, which become more likely the further back in time we go. We'll use one of `purrr`'s adverbs, `safely()`, to handle the inevitable download errors by creating a new function called `safe_download()`.
```r
safe_download <- safely(download.file)
```
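`safely()` wraps a function so that it never throws: every call returns a two-element list in which `result` holds the normal return value (or `NULL` on failure) and `error` holds the captured condition (or `NULL` on success). A minimal sketch, using a made-up URL that is sure to fail:

```r
# The wrapped call captures the failure instead of halting the pipeline
res <- safe_download("http://example.com/no-such-image.jpg",
                     destfile = tempfile(fileext = ".jpg"), mode = "wb")
res$result # NULL, since the download failed
res$error  # the captured error condition
```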
We're also going to create safe versions of the functions we'll use for loading and plotting images. Although most of the valid URLs will link to clean images, it's not uncommon for otherwise successful downloads to actually come from already-corrupted sources. To handle this, we'll create `safe_image()` and `safe_plot()`.
```r
safe_image <- safely(load.image)
safe_plot  <- safely(plot)
```
Download Images
We'll:

- `filter()` images using a regex that confirms either a .jpg or .png extension and simultaneously validates a URL sequence that we can use for each image's eventual file path
- select a sample of 100 random rows with `sample_n(100)`
- `pull()` the `urlImage` column into a vector
- iterate through each item of the vector with `walk()`:
  - `safe_download()` each image in its binary format (`mode = "wb"`)
  - and write it to `dir` using its match to `valid_path_regex`
```r
# match a trailing "/filename.jpg" or "/filename.png" in each image URL
valid_path_regex <- "/[A-Za-z0-9_-]+\\.(jpg|png)$"

articles_df %>%
  filter(str_detect(urlImage, valid_path_regex)) %>%
  sample_n(100) %>%
  pull(urlImage) %>%
  walk(~ safe_download(.x,
                       paste0(dir, str_extract(.x, valid_path_regex)),
                       mode = "wb"))
```
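As a quick check that isn't in the original walkthrough, we can count how many images actually landed on disk, since failed downloads simply never produce a file:

```r
# Count the downloaded .jpg/.png files
length(list.files(dir, pattern = "\\.(jpg|png)$"))
```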
Inspect Images
Let's inspect a sample of the downloaded images.
Clearly the results are not perfect. There are images without anything resembling a missile, as well as several duplicate or near-duplicate images. That said, manually renaming files makes it easy to filter out the useless images.
This is a quick proof of concept that sets us up well for enhancing data sets established through other methods.
More importantly, it demonstrates a basic workflow for bulk image processing that can be easily expanded to iteratively prepare a large dataset for many kinds of analysis.
We can take a look at our results with the following:

- `list.files()` the full paths of all the files in `dir`
- iterate through the resulting vector, reading each file with `safe_image()` and `map()`ping the results to a `list`
- remove a layer of the `list` hierarchy by `flatten()`ing it
- omit any resulting `NULL` values by `compact()`ing the list
- subset a `sample()` of half the images, just for demonstration
- `walk()` through the list, plotting each image
```r
par(mfrow = c(5, 2))

list.files(dir, full.names = TRUE) %>%
  map(safe_image) %>%
  flatten() %>%
  compact() %>%
  sample(50) %>%
  walk(~ safe_plot(.x, axes = FALSE, ann = FALSE))
```
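One small housekeeping note: `par(mfrow = c(5, 2))` changes a global graphics setting, so it's worth restoring the default once we're done plotting:

```r
# Restore the default single-panel plot layout
par(mfrow = c(1, 1))
```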
```r
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] bindrcpp_0.2.2       imager_0.41.1        magrittr_1.5
##  [4] knitr_1.20.8         forcats_0.3.0        stringr_1.3.1
##  [7] dplyr_0.7.6          purrr_0.2.5          readr_1.1.1
## [10] tidyr_0.8.1          tibble_1.4.2.9004    ggplot2_3.0.0.9000
## [13] tidyverse_1.2.1.9000 gdeltr2_0.3.11702
##
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137            xts_0.10-2
##  [3] lubridate_1.7.4         webshot_0.5.0
##  [5] progress_1.2.0          httr_1.3.1
##  [7] tools_3.5.1             backports_1.1.2
##  [9] utf8_1.1.4              R6_2.2.2
## [11] lazyeval_0.2.1          colorspace_1.3-2
## [13] withr_2.1.2             readbitmap_0.1.5
## [15] tidyselect_0.2.4        prettyunits_1.0.2
## [17] mnormt_1.5-5            curl_3.2
## [19] compiler_3.5.1          cli_1.0.0
## [21] rvest_0.3.2             xml2_1.2.0
## [23] bookdown_0.7            triebeard_0.3.0
## [25] scales_0.5.0.9000       checkmate_1.8.5
## [27] psych_1.8.4             RApiDatetime_0.0.3
## [29] trelliscopejs_0.1.13    digest_0.6.15
## [31] tiff_0.1-5              foreign_0.8-70
## [33] rmarkdown_1.10.7        base64enc_0.1-3
## [35] jpeg_0.1-8              pkgconfig_2.0.1
## [37] htmltools_0.3.6         highcharter_0.5.0
## [39] highr_0.7               htmlwidgets_1.2
## [41] rlang_0.2.1             ggthemes_3.5.0
## [43] readxl_1.1.0            TTR_0.23-3
## [45] htmldeps_0.1.0          rstudioapi_0.7
## [47] quantmod_0.4-13         bindr_0.1.1
## [49] zoo_1.8-2               jsonlite_1.5
## [51] mclust_5.4.1            rlist_0.4.6.1
## [53] fansi_0.2.3             Rcpp_0.12.17
## [55] munsell_0.5.0           purrrlyr_0.0.3
## [57] stringi_1.2.3           yaml_2.1.19
## [59] plyr_1.8.4              grid_3.5.1
## [61] parallel_3.5.1          crayon_1.3.4
## [63] lattice_0.20-35         haven_1.1.2
## [65] hms_0.4.2               anytime_0.3.1
## [67] pillar_1.3.0.9000       igraph_1.2.1
## [69] reshape2_1.4.3          codetools_0.2-15
## [71] glue_1.2.0              evaluate_0.10.1
## [73] blogdown_0.7.1          DistributionUtils_0.5-1
## [75] bmp_0.3                 data.table_1.11.5
## [77] modelr_0.1.2            png_0.1-7
## [79] urltools_1.7.0          cellranger_1.1.0
## [81] gtable_0.2.0            assertthat_0.2.0
## [83] xfun_0.3                broom_0.4.5
## [85] autocogs_0.0.1          wordcloud2_0.2.1
```