The Global Database of Events, Language, and Tone, or GDELT, is “a realtime network diagram and database of global human society for open research”.
A firehose stream of global data holds tantalizing possibilities for research, but concrete examples of work that go beyond simple evaluations of the database's capabilities are notably absent…
See also:
- My hasty walkthrough from November 2017, Evaluating GDELT: Syrian Conflict
Let's see how well we can scoop up a set of ballistic missile images using a combination of packages:

- `gdeltr2`: a package that is decidedly buggy, but works for these purposes
- the `tidyverse` suite:
  - `dplyr` for data carpentry
  - `stringr` for string processing
  - `purrr` for functional enhancements and simplified error-handling
- `knitr` for nicely rendered tabular data
- `imager` for slick image processing tools
```r
library(gdeltr2)   # devtools::install_github("abresler/gdeltr2")
library(tidyverse) # install.packages("tidyverse")
library(knitr)
library(imager)    # install.packages("imager")
```
Before we start extracting any data, let’s refine our search as much as possible by assigning a handful of variables that we can use in the API call.
GDELT uses a set of codebooks that can be referenced with `get_gdelt_codebook_ft_api()`.
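For example, we can peek at a codebook's structure before building a query around it (a minimal illustration; `glimpse()` comes from the already-loaded `tidyverse`):

```r
# Inspect the columns of the "languages" codebook
get_gdelt_codebook_ft_api(code_book = "languages") %>%
  glimpse()
```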
Query Variables
Languages
`gdeltr2` will only query English articles by default, but we don't really care about the language of the article or even the article text. We'll `pull()` all the languages from `code_book = "languages"` into a vector variable that we can use as a search argument like so:
```r
langs <- get_gdelt_codebook_ft_api(code_book = "languages") %>%
  pull(value)

langs
##  [1] "Afrikaans"        "Albanian"         "Arabic"
##  [4] "Armenian"         "Azerbaijani"      "Bengali"
##  [7] "Bosnian"          "Bulgarian"        "Catalan"
## [10] "Chinese"          "Croatian"         "Czech"
## [13] "Danish"           "Dutch"            "Estonian"
## [16] "Finnish"          "French"           "Galician"
## [19] "Georgian"         "German"           "Greek"
## [22] "Gujarati"         "Hebrew"           "Hindi"
## [25] "Hungarian"        "Icelandic"        "Indonesian"
## [28] "Italian"          "Japanese"         "Kannada"
## [31] "Kazakh"           "Korean"           "Latvian"
## [34] "Lithuanian"       "Macedonian"       "Malay"
## [37] "Malayalam"        "Marathi"          "Mongolian"
## [40] "Nepali"           "Norwegian"        "NorwegianNynorsk"
## [43] "Persian"          "Polish"           "Portuguese"
## [46] "Punjabi"          "Romanian"         "Russian"
## [49] "Serbian"          "Sinhalese"        "Slovak"
## [52] "Slovenian"        "Somali"           "Spanish"
## [55] "Swahili"          "Swedish"          "Tamil"
## [58] "Telugu"           "Thai"             "Tibetan"
## [61] "Turkish"          "Ukrainian"        "Urdu"
## [64] "Vietnamese"
```
Image Tags
Since we're looking specifically for imagery, we can query the relevant codebooks with `code_book = "imagetags"` and `code_book = "imageweb"` like so:
```r
get_gdelt_codebook_ft_api(code_book = "imagetags") %>%
  head() %>%
  kable()
```
nameCodebook | idImageTag | value |
---|---|---|
IMAGETAGS | person | 65791693 |
IMAGETAGS | profession | 33553949 |
IMAGETAGS | vehicle | 25342998 |
IMAGETAGS | sports | 17180851 |
IMAGETAGS | speech | 16976988 |
IMAGETAGS | people | 13358317 |
```r
get_gdelt_codebook_ft_api(code_book = "imageweb") %>%
  head() %>%
  kable()
```
nameCodebook | idImageWeb | value |
---|---|---|
IMAGEWEB | Image | 2198300 |
IMAGEWEB | News | 2136894 |
IMAGEWEB | Photograph | 1027341 |
IMAGEWEB | United States of America | 659847 |
IMAGEWEB | Speech | 649292 |
IMAGEWEB | Car | 621304 |
We'll `filter()` the tags to retain only those that explicitly reference "missile" with a regex. We also want to handle a bug in `gdeltr2`'s query functions where a large amount of incorrect information sometimes makes it into tag lists. Fortunately, we can omit that junk by excluding results containing blocks of multiple digits.
```r
tag_regex <- "\\b[Mm]issile\\b"

bind_rows(
  get_gdelt_codebook_ft_api(code_book = "imagetags") %>%
    filter(str_detect(idImageTag, tag_regex),
           !str_detect(idImageTag, "\\d{2,}")),
  get_gdelt_codebook_ft_api(code_book = "imageweb") %>%
    filter(str_detect(idImageWeb, tag_regex),
           !str_detect(idImageWeb, "\\d{2,}"))
) %>%
  head() %>%
  kable()
```
nameCodebook | idImageTag | value | idImageWeb |
---|---|---|---|
IMAGETAGS | missile | 247486 | NA |
IMAGETAGS | guided missile destroyer | 194660 | NA |
IMAGETAGS | missile boat | 147549 | NA |
IMAGETAGS | ballistic missile submarine | 55996 | NA |
IMAGETAGS | cruise missile submarine | 11508 | NA |
IMAGEWEB | NA | 77637 | Missile |
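As a quick sanity check, `tag_regex`'s word boundaries behave as intended on a few made-up tag strings (this snippet is illustrative and not part of the original query):

```r
# "missiles" fails: no word boundary between "missile" and the trailing "s".
# The hyphen in "anti-missile" is a non-word character, so it still matches.
str_detect(c("missile", "Missile launch facility", "missiles", "anti-missile"),
           tag_regex)
## [1]  TRUE  TRUE FALSE  TRUE
```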
We'll refine our results by excluding some of the tags that have a tendency to return less relevant images:

- vehicle terms tend to emphasize the vehicle itself, rather than weapon systems:
  - `"boat"`
  - `"submarine"`
  - `"tank"`
  - `"destroyer"`
- missile `"defense"` emphasizes politics over hardware
- specific `"system"` tags are all in reference to surface-to-air platforms:
  - S-300 missile system
  - S-400 missile system
  - Buk missile system
- generalized surface-to-`"air"` doesn't seem fuzzy enough to ever reference ballistic missiles
We'll use another regex to omit those tags, folding in the multiple-digit pattern from earlier to exclude the buggy data that may leak into our results.
Junk Tag Filtering
```r
junk_tag_regex <- c("boat", "[Ss]ubmarine", "tank", "destroyer",
                    "defense", "system", "air") %>%
  paste0("\\b", ., "\\b") %>%
  str_c(collapse = "|") %>%
  paste0("|\\d{2,}")

junk_tag_regex
## [1] "\\bboat\\b|\\b[Ss]ubmarine\\b|\\btank\\b|\\bdestroyer\\b|\\bdefense\\b|\\bsystem\\b|\\bair\\b|\\d{2,}"
```
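To see the junk filter in action on a few more invented tag strings, the pattern catches the platform terms while leaving the tags we actually want:

```r
# "S-300 missile system" trips both \bsystem\b and \d{2,};
# "missile boat" trips \bboat\b; "Ballistic missile" survives
str_detect(c("S-300 missile system", "missile boat", "Ballistic missile"),
           junk_tag_regex)
## [1]  TRUE  TRUE FALSE
```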
With some parameters in mind and filtering variables assigned, let's `pull()` the desired tags from each codebook into a pair of variables which we will use to query GDELT's API.
```r
image_tags <- get_gdelt_codebook_ft_api(code_book = "imagetags") %>%
  filter(str_detect(idImageTag, tag_regex),
         !str_detect(idImageTag, junk_tag_regex)) %>%
  pull(idImageTag)

imageweb_tags <- get_gdelt_codebook_ft_api(code_book = "imageweb") %>%
  filter(str_detect(idImageWeb, tag_regex),
         !str_detect(idImageWeb, junk_tag_regex)) %>%
  pull(idImageWeb)

combine(image_tags, imageweb_tags)
##  [1] "missile"
##  [2] "Missile"
##  [3] "Ballistic missile"
##  [4] "Cruise missile"
##  [5] "Intercontinental ballistic missile"
##  [6] "Anti-ballistic missile"
##  [7] "Missile launch facility"
##  [8] "Medium-range ballistic missile"
##  [9] "Land-attack missile"
## [10] "Short-range ballistic missile"
## [11] "Surface-to-surface missile"
```
Dates
We'll specify a time period using `gdeltr2::generate_dates()`. For this example, we'll select September 22-23 of 2017 to see if we can capture coverage of an Iranian military parade.
```r
target_dates <- generate_dates(start_date = "2017-09-22",
                               end_date = "2017-09-23")
```
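Since `gdeltr2` can be flaky, note that the same date vector is easy to build by hand. This is a stand-in sketch, assuming the API accepts plain `"YYYY-MM-DD"` strings like those `generate_dates()` appears to produce:

```r
# Hand-rolled equivalent of generate_dates() (assumption: "YYYY-MM-DD" format)
target_dates_manual <- format(seq(as.Date("2017-09-22"),
                                  as.Date("2017-09-23"),
                                  by = "day"))
```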
API Call
With all of our query variables prepared, we'll call GDELT's API using `get_data_ft_v2_api()`. Since duplicate articles are commonly published across many venues, we'll reduce the results to `distinct()` `titleArticle`s.
```r
articles_df <- get_data_ft_v2_api(images_tag = image_tags,
                                  images_web_tag = imageweb_tags,
                                  search_language = langs,
                                  dates = target_dates,
                                  visualize_results = FALSE) %>%
  distinct(titleArticle, .keep_all = TRUE)
```
Query Results
Here’s a summary of what we get back.
```r
articles_df %>%
  glimpse()
## Observations: 393
## Variables: 16
## $ modeSearch          <chr> "ArtList", "ArtList", "ArtList", "ArtList"...
## $ imagewebtagSearch   <chr> "Missile", "Missile", "Missile", "Missile"...
## $ datetimeStartSearch <chr> "2017-09-22 12:00:00", "2017-09-22 12:00:0...
## $ datetimeEndSearch   <chr> "2017-09-23 11:59:59", "2017-09-23 11:59:5...
## $ imagetagSearch      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ isOR                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ countMaximumRecords <dbl> 250, 250, 250, 250, 250, 250, 250, 250, 25...
## $ urlGDELTV2FTAPI     <chr> "https://api.gdeltproject.org/api/v2/doc/d...
## $ urlArticle          <chr> "http://www.iranherald.com/news/254804227/...
## $ urlArticleMobile    <chr> NA, "http://m.alarab.com/Article/825920", ...
## $ titleArticle        <chr> "Iran Herald", "هزة أرضية ثانية في كويا ال...
## $ datetimeArticle     <dttm> 2017-09-23 05:00:00, 2017-09-23 05:00:00,...
## $ urlImage            <chr> "http://cdn.bignewsnetwork.com/voa15061647...
## $ domainArticle       <chr> "iranherald.com", "alarab.com", "baomoi.co...
## $ languageArticle     <chr> "English", "Arabic", "Vietnamese", "Nepali...
## $ countryArticle      <chr> "Iran", "Israel", "Vietnam", "Tuvalu", "Un...
```
Extracting Images
Now that we have a data frame of articles that includes a column of image URLs, we can download the data.
Directory
Let's assign a variable for our `dir`ectory.
```r
dir <- "data/missile_images/"
```
Then we'll actually create the `dir`ectory.
```r
# recursive = TRUE also creates the parent "data/" folder if it doesn't exist
dir.create(dir, recursive = TRUE)
```
Workflow
We're going to take advantage of the magic of the `purrr` package in several ways to stabilize our workflow.
Error Handling
The Internet is littered with broken links and dead webpages, which become more likely the further back in time we go. We'll use one of `purrr`'s adverbs, `safely()`, to handle the inevitable download errors by creating a new function called `safe_download()`.
```r
safe_download <- safely(download.file)
```
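`safely()` wraps a function so that it never throws: every call returns a two-element list in which `result` holds the normal return value (or `NULL` on failure) and `error` holds the captured condition (or `NULL` on success). A minimal sketch, using a made-up URL that is sure to fail:

```r
# The wrapped call captures the failure instead of halting the pipeline
res <- safe_download("http://example.com/no-such-image.jpg",
                     destfile = tempfile(fileext = ".jpg"), mode = "wb")
res$result # NULL, since the download failed
res$error  # the captured error condition
```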
We're also going to create safe versions of the functions we'll use for loading and plotting images. Although most of the valid URLs will link to clean images, it's not uncommon for otherwise successful downloads to actually come from already-corrupted sources. To handle this, we'll create `safe_image()` and `safe_plot()`.
```r
safe_image <- safely(load.image)
safe_plot  <- safely(plot)
```
Download Images
We'll:

- `filter()` images using a regex that confirms either a .jpg or .png extension and simultaneously validates a URL sequence that we can use for each image's eventual file path
- select a sample of 100 random rows with `sample_n(100)`
- `pull()` the `urlImage` column into a vector
- iterate through each item of the vector with `walk()`:
  - `safe_download()` each image in its binary format (`mode = "wb"`)
  - and write it to `dir` using its match to `valid_path_regex`
```r
# match a trailing "/filename.jpg" or "/filename.png" in each image URL
valid_path_regex <- "/[A-Za-z0-9_-]+\\.(jpg|png)$"

articles_df %>%
  filter(str_detect(urlImage, valid_path_regex)) %>%
  sample_n(100) %>%
  pull(urlImage) %>%
  walk(~ safe_download(.x,
                       paste0(dir, str_extract(.x, valid_path_regex)),
                       mode = "wb"))
```
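As a quick check that isn't in the original walkthrough, we can count how many images actually landed on disk, since failed downloads simply never produce a file:

```r
# Count the downloaded .jpg/.png files
length(list.files(dir, pattern = "\\.(jpg|png)$"))
```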
Inspect Images
Let's inspect a sample of the downloaded images.
Clearly the results are not perfect. There are images without anything resembling a missile, as well as several duplicate or near-duplicate images. That said, manually renaming files makes it easy to filter out the useless images.
This is a quick proof of concept that sets us up well for enhancing data sets established through other methods.
More importantly, it demonstrates a basic workflow for bulk image processing that can be easily expanded to iteratively prepare a large dataset for many kinds of analysis.
We can take a look at our results with the following:

- `list.files()` the full paths of all the files in `dir`
- iterate through the resulting vector, reading each file with `safe_image()` and `map()`ping the results to a `list`
- remove a layer of the `list` hierarchy by `flatten()`ing it
- omit any resulting `NULL` values by `compact()`ing the list
- subset a `sample()` of half the images, just for demonstration
- `walk()` through the list, plotting each image
```r
par(mfrow = c(5, 2))

list.files(dir, full.names = TRUE) %>%
  map(safe_image) %>%
  flatten() %>%
  compact() %>%
  sample(50) %>%
  walk(~ safe_plot(.x, axes = FALSE, ann = FALSE))
```
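One small housekeeping note: `par(mfrow = c(5, 2))` changes a global graphics setting, so it's worth restoring the default once we're done plotting:

```r
# Restore the default single-panel plot layout
par(mfrow = c(1, 1))
```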
```r
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] bindrcpp_0.2.2       imager_0.41.1        magrittr_1.5
##  [4] knitr_1.20.8         forcats_0.3.0        stringr_1.3.1
##  [7] dplyr_0.7.6          purrr_0.2.5          readr_1.1.1
## [10] tidyr_0.8.1          tibble_1.4.2.9004    ggplot2_3.0.0.9000
## [13] tidyverse_1.2.1.9000 gdeltr2_0.3.11702
##
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137            xts_0.10-2
##  [3] lubridate_1.7.4         webshot_0.5.0
##  [5] progress_1.2.0          httr_1.3.1
##  [7] tools_3.5.1             backports_1.1.2
##  [9] utf8_1.1.4              R6_2.2.2
## [11] lazyeval_0.2.1          colorspace_1.3-2
## [13] withr_2.1.2             readbitmap_0.1.5
## [15] tidyselect_0.2.4        prettyunits_1.0.2
## [17] mnormt_1.5-5            curl_3.2
## [19] compiler_3.5.1          cli_1.0.0
## [21] rvest_0.3.2             xml2_1.2.0
## [23] bookdown_0.7            triebeard_0.3.0
## [25] scales_0.5.0.9000       checkmate_1.8.5
## [27] psych_1.8.4             RApiDatetime_0.0.3
## [29] trelliscopejs_0.1.13    digest_0.6.15
## [31] tiff_0.1-5              foreign_0.8-70
## [33] rmarkdown_1.10.7        base64enc_0.1-3
## [35] jpeg_0.1-8              pkgconfig_2.0.1
## [37] htmltools_0.3.6         highcharter_0.5.0
## [39] highr_0.7               htmlwidgets_1.2
## [41] rlang_0.2.1             ggthemes_3.5.0
## [43] readxl_1.1.0            TTR_0.23-3
## [45] htmldeps_0.1.0          rstudioapi_0.7
## [47] quantmod_0.4-13         bindr_0.1.1
## [49] zoo_1.8-2               jsonlite_1.5
## [51] mclust_5.4.1            rlist_0.4.6.1
## [53] fansi_0.2.3             Rcpp_0.12.17
## [55] munsell_0.5.0           purrrlyr_0.0.3
## [57] stringi_1.2.3           yaml_2.1.19
## [59] plyr_1.8.4              grid_3.5.1
## [61] parallel_3.5.1          crayon_1.3.4
## [63] lattice_0.20-35         haven_1.1.2
## [65] hms_0.4.2               anytime_0.3.1
## [67] pillar_1.3.0.9000       igraph_1.2.1
## [69] reshape2_1.4.3          codetools_0.2-15
## [71] glue_1.2.0              evaluate_0.10.1
## [73] blogdown_0.7.1          DistributionUtils_0.5-1
## [75] bmp_0.3                 data.table_1.11.5
## [77] modelr_0.1.2            png_0.1-7
## [79] urltools_1.7.0          cellranger_1.1.0
## [81] gtable_0.2.0            assertthat_0.2.0
## [83] xfun_0.3                broom_0.4.5
## [85] autocogs_0.0.1          wordcloud2_0.2.1
```