Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this post I show how to select relevant bits of the GDELT data in R and present some introductory ideas about how to visualise it as a network map. I've included all the code used to generate the illustrations. Because of this, if you here for the shiny visualisations, you'll have to scroll way down
The Guardian recently published an article linking to a database of 250 million events. Sounds too good to be true, but as I'm writing a PhD on recent Russian memory events, I was excited to try it out. I downloaded the data, generously made available by Kalev Leetaru of the University of Illinois, and got going. It's a large 650mb zip file (4.6gb uncompressed!), and this is apparently the abbreviated version. Consequently this early stage of the analysis was dominated by eager anticipation, as the Cambridge University internet did its thing.
Meanwhile I headed over to David Masad's writeup on getting started with GDELT in python
25 minutes later the data was down, and unpacked. It came equipped with a few python and R scripts, which I quickly set about modifying. Although hampered by my lack of python knowledge I got the extract script running easily enough, and soon was pulling out all events associated with Russia. It was still a bit unclear what these 'events' were.
According to David Masad, there are four types of event: 1. Material Cooperation 2. Verbal Cooperation 3. Verbal Conflict 4. Material Conflict
I was still not too sure what these exactly represented, but by now the extract script was ticking over into 2007, so I was more interested in preparing the visualisation scripts in R. On the subject of the extract script - the events recorded are very obviously are skewed towards the present, so any time-series conclusions should be taken with a pinch of salt, or at the very least adjusted somehow. Thankfully the data is not quite up to date, and the output data finally clocked in at 226mb.
I was a bit worried about how R would handle a file that size, so I tried reading in a few lines at a time. The first line looked like this. Clearly the data is tab delimited. The first column is the date, the second and third actor codes, then three columns I later discovered were 'event code, quad category, and Godlstein Scale'. Much excitement to be had figuring out what this all means. The last six columns are longitude and latitude, representing the location of the actors and the event.
By the way, for anyone preferring to do this in Python, which frankly makes a lot of sense, here is a great tutorial for finding exactly the data you need.
[1] "19790101\tATH\tRUS\t120\t3\t-4.0\t52.1256\t140.381\t52.1256\t140.381\t52.1256\t140.381"
Now for loading the data in. I was hopeful this could work, but not much beyond that. Windows task manager suggested I was running out of memory fast. But then it was all ok, and the data beautifully formatted. This was also when I realised just how massive this data set is: more than 3 million events featuring Russia. Phew.
First I renamed the columns, fixed the dates, and saved the file in R's native Rdata format (this got the data down to a positively miniscule 37mb) (I realised yet again that Lubridate is amazing. I was worried this would take ages, but, thankfully, no)
library(lubridate) t <- read.table("GDELTproject/data/select.outfile.txt", sep = "\t") colnames(t) <- c("Day", "Actor1Code", "Actor2Code", "EventCode", "QuadCategory", "GoldsteinScale", "Actor1Geo_Lat", "Actor1Geo_Long", "Actor2Geo_Lat", "Actor2Geo_Long", "ActionGeo_Lat", "ActionGeo_Long") t$Day <- ymd(t$Day) save(t, file = "gdeltRus.Rdata")
So now for some exploratory analysis - who is involved, and when did these events take place?
print(head(t)) Day Actor1Code Actor2Code EventCode QuadCategory GoldsteinScale 1 1979-01-01 ATH RUS 120 3 -4.0 2 1979-01-01 MOS RUS 40 2 1.0 3 1979-01-01 MOS RUS 43 2 2.8 4 1979-01-01 MOS RUS 51 2 3.4 5 1979-01-01 RUS GUYGOV 841 1 7.0 6 1979-01-01 RUS IRN 100 3 -5.0 Actor1Geo_Lat Actor1Geo_Long Actor2Geo_Lat Actor2Geo_Long ActionGeo_Lat 1 52.13 140.38 52.13 140.38 52.13 2 55.75 37.62 55.75 37.62 55.75 3 52.13 140.38 52.13 140.38 52.13 4 55.75 37.62 55.75 37.62 55.75 5 6.80 -58.15 38.00 -97.00 6.80 6 55.75 37.62 55.75 37.62 55.75 ActionGeo_Long 1 140.38 2 37.62 3 140.38 4 37.62 5 -58.15 6 37.62 tail(sort(table(t$Actor1Code)), 20) IRN CVL DEU JPN CHE USAGOV GBR BUS LEG 21413 21801 22956 24430 25539 25544 25775 28223 31419 COP CHN UKR RUSMED RUSMIL MIL MED USA GOV 31773 35332 35857 35904 36555 40957 50677 89275 128319 RUSGOV RUS 283648 1422931
The list of actors looks promising: Russia, Russian government, a generic 'GOV' (what's that about?), USA, MED and MIL which I imagine are medical and military respectively, followed by the Russian military etc etc.
The actor2 columns looked quite similar, but featured some intriguing entries, such as IGOWSTNAT. Ideas, anyone? [Turns out this IGO refers to international or regional inter-governmental organisations ]
As for date distribution, it is reasonably sparse until the mid 90s, since which there has been a general rise, with sharp increases in events during the crises of the late 90s and late 2000s. This is promising!
library(lubridate) library(ggplot2) x <- data.frame(table(year(t$Day))) ggplot(x, aes(x = Var1, y = Freq)) + geom_bar(stat = "identity") + theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1))
Now, what about plotting some on a map? For this initial plot i ignored actor and date data, and just used the final two columns of event location, grouping them by count. This left 30000 distinct geo locations in which events with Russian involvement have been recorded.
On to the plotting:
library(RgoogleMaps) library(ggmap) library(mapproj) library(plyr) # Reshape the data so we get the counts of events for each location t$count <- 1 df <- ddply(t, .(ActionGeo_Lat, ActionGeo_Long), summarize, count = sum(count)) # lets define the scope of our map lat <- c(40, 70) lon <- c(20, 100) center = c(lat = mean(lat), lon = mean(lon)) zoom <- min(MaxZoom(range(lat), range(lon))) # And download a map file! map <- get_map(location = c(lon = mean(lon), lat = mean(lat)), zoom = 3, maptype = "terrain", source = "google") print(ggmap(map) + geom_point(data = df, aes(x = ActionGeo_Long, y = ActionGeo_Lat, size = count), alpha = 0.3))
Clearly there is a lot going on here. Maybe focusing on one type of activity would be more productive. This is where it really gets interesting. The data set uses the CAMEO coding scheme , which includes quite detailed information about event type. Let's isolate examples of civil disobedience: protests, rallies, hunger strikes, strikes, boycotts, obstructions, violent protest, etc.
This reduces the data to 25 000 events. Still a lot.
But the CAMEO specifications include information about actors, e.g. the NGOs involved. Hence the mysterious REB, SPY, JUD, etc.
By keeping only entries where at least one agent is a representative of so-called civil society, we are left with 6443 entries.
t$count <- 1 # selecting the events we are interested in t2 <- t[t$EventCode > 139 & t$EventCode < 146, ] # Keep only rows with agents that interest us. Combined these t3a <- t2[grep("OPP|EDU|REL|CVL|ELI|LAB|IGO|NGO|NGM", t2$Actor1Code), ] t3b <- t2[grep("OPP|EDU|REL|CVL|ELI|LAB|IGO|NGO|NGM", t2$Actor2Code), ] t3 <- rbind(t3a, t3b) # agents: OPP, EDU, REL, CVL, ELI, LAB, IGO, NGO, NGM df2 <- ddply(t3, .(ActionGeo_Lat, ActionGeo_Long), summarize, count = sum(count)) map <- get_map(location = c(lon = mean(lon), lat = mean(lat)), zoom = 3, maptype = "terrain", source = "google", color = "bw") print(ggmap(map) + geom_point(data = df2, aes(x = ActionGeo_Long, y = ActionGeo_Lat, size = count), alpha = 0.8, pch = 21, colour = "black", fill = "red2") + scale_size(range = c(2, 7)))
Let's zoom in on Moscow during the recent Russian protests and see what detail we can see.
t3$recent <- 0 t3$Day <- as.Date(t3$Day) t3$recent[t3$Day > ("2011-07-01")] <- 1 tryCatch(detach("package:Hmisc"), error = function(e) NULL) df2 <- ddply(t3, .(ActionGeo_Lat, ActionGeo_Long, recent), summarize, count = sum(count)) mos <- get_map(location = "moscow", zoom = 10, maptype = "terrain", source = "google") print(ggmap(mos) + geom_point(data = df2, aes(x = ActionGeo_Long, y = ActionGeo_Lat, size = count, fill = recent), alpha = 0.8, pch = 21, colour = "black") + scale_size(range = c(4, 20)) + facet_wrap(~recent))
On the left is all civil society protest activity from 1979 until July 2011, and on the right in the following 12 months. I've not analysed the plot in great detail, but to the naked eye it looks like there were half the number of (reports of) protests and acts of civil disobedience in Moscow in 2011-12 as for the preceding 30 years, which you may or may not find surprising.
On the subject of protest movements, Alex Hanna has already looked at what GDELT has to say about the Arab spring.
Back to Russia: a more interesting plot can be achieved by drawing lines between the geo location of actors. For the plot below I have only included events between July 2011 and mid 2012, when the dataset ends.
(most of the code for the visualisation is plundered from here)
#Limit the data to the last year recorded t3$recent <- 0 t3$Day <- as.Date(t3$Day) t3$recent[t3$Day>("2011-07-01")] <- 1 t3 <- t3[t3$recent==1,] #to avoid conflicts between plyr and Hmisc. If anyone knows a better way of doing this, please let me know! tryCatch(detach("package:Hmisc"),error=function(e) NULL) df2 <- ddply(t3,.(Actor1Geo_Lat,Actor1Geo_Long,Actor2Geo_Lat,Actor2Geo_Long,ActionGeo_Lat,ActionGeo_Long),summarize, count=sum(count)) df2 <- df2[complete.cases(df2),] #remove links with America and southern hemisphere df2 <- df2[df2$Actor1Geo_Lat>0&df2$Actor2Geo_Lat>0&df2$Actor1Geo_Long>0&df2$Actor2Geo_Long>0,] #remove Generic Russia df2 <- df2[df2$Actor2Geo_Lat!=60&df2$Actor2Geo_Long!=100,] df2 <- df2[df2$Actor1Geo_Lat!=60&df2$Actor1Geo_Long!=100,] #place points and edges in separate data frames pointsDf <- df2[,5:7] colnames(pointsDf)[3] <- "count2" edgesDf <- df2[,c(1:4,7)] a <- paste0(edgesDf[,1],edgesDf[,2])#remove points were start and end are the same b <- paste0(edgesDf[,3],edgesDf[,4]) edgesDf <- edgesDf[!a==b,] library(sna) library(Hmisc) edgeMaker <- function(whichRow, len = 1, curved = TRUE){ fromC <- c(edgesDf[whichRow,2],edgesDf[whichRow,1]) # Origin toC <- c(edgesDf[whichRow,4],edgesDf[whichRow,3]) # Terminus weight <- edgesDf[whichRow, 5] # Terminus # Add curve: graphCenter <- c(50,50)#colMeans(edgesDf[,1:2]) # Center of the overall graph bezierMid <- c(fromC[1], toC[2]) # A midpoint, for bended edges distance1 <- sum((graphCenter - bezierMid)^2) if(distance1 < sum((graphCenter - c(toC[1], fromC[2]))^2)){ bezierMid <- c(toC[1], fromC[2]) } # To select the best Bezier midpoint bezierMid <- (fromC + toC + bezierMid) / 3 # Moderate the Bezier midpoint if(curved == FALSE){bezierMid <- (fromC + toC) / 2} # Remove the curve edge <- data.frame(bezier(c(fromC[1], bezierMid[1], toC[1]), # Generate c(fromC[2], bezierMid[2], toC[2]), # X & y evaluation = len)) # Bezier path coordinates edge$Sequence <- 1:len # For size and colour weighting in plot edge$weight <- weight edge$Group <- whichRow return(edge) } allEdges <- lapply(1:nrow(edgesDf), edgeMaker, len = 100, curved = TRUE) allEdges <- do.call(rbind, allEdges) # a fine-grained path ^, with bend ^ # col <- rep(gray(8/10),length(labels)) # col2 <- rep(gray(1/10),length(labels)) # col[labels==term] <- "red3" # col2[labels==term] <- "red4" new_theme_empty <- theme_bw() new_theme_empty$line <- element_blank() new_theme_empty$rect <- element_blank() new_theme_empty$strip.text <- element_blank() new_theme_empty$axis.text <- element_blank() new_theme_empty$axis.title <- element_blank() new_theme_empty$legend <- element_blank() new_theme_empty$plot.margin <- structure(c(0, 0, 0, -1), unit = "lines", valid.unit = 3L, class = "unit") map <-get_map(location=c(lon=mean(lon), lat=mean(lat)),zoom=3,maptype="terrain",color="bw") p <- ggmap(map) #p <- ggplot(allEdges,environment=environment()) p <- p + geom_path(data=allEdges, aes(x = x, y = y,group = Group, # Edges with gradient size=(weight-1),colour=Sequence),alpha=.6) # and taper p <- p + geom_point(data = data.frame(pointsDf), # Add points aes(x = ActionGeo_Long, y = ActionGeo_Lat,size=log(count2)), pch = 21, colour = "black", fill = "red2") # Customize gradient p <- p + scale_colour_gradient(low = "red3", high = "white", guide = "none") p <- p + scale_size(range = c(.6, 5), guide = "none") # Customize taper p <- p + new_theme_empty # Clean up plot p <- p + xlim(min(allEdges$x-1),max(allEdges$x+1)) p <- p + ylim(20,65) print(p) ggsave("latestImage.png", p, h = 4, w = 8, type = "cairo-png")
In the plot above the circles represent the number of events occurring in a given location, while the shaded lines represent events involving two actors in different locations. The red end of the edge is the origin, the white the destination. I removed all links to the USA or the southern hemisphere, as these obstruct the map. I was interested to note how few events link Russia and the former East European Satellite states (Poland, Hungary, etc.), while noting that there seems to be an extraordinary amount of activity linking Russia and Israel, and possibly also Syria. Finally, it seems that events taking place in the Russian regions, especially the Caucasus, very rarely elicit an international response:
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.