UNHCR Refugee Data Visualized

[This article was first published on R – Incidental Ideas, and kindly contributed to R-bloggers.]

I was working on expanding the disease transmission model from my first post, but with current events I felt compelled to work on something different. At the end of last week, the new United States presidential administration drafted an executive order that, if issued, will suspend the visas of internationals in the United States (whether for school or work) from select countries, prevent the entrance of Syrian refugees indefinitely, and "readjust" the Refugee and Immigration Programs. Because of this disruptive change in immigration – especially for the most vulnerable populations, including refugees, asylum seekers and displaced people fleeing conflict, instability or persecution – I wanted to create a post on this topic.

Where’s the Data?

The data I’m using is taken from the United Nations High Commissioner for Refugees (UNHCR) website – the UN Refugee Agency. You can read more about what they do and why they exist at the link above. When I first visited, only the mid-year statistics for 2015 could be downloaded, and the statistics filters returned the message: “Error: Statistics filters are not available at this time. Please try again later.” I just checked again, though, and the data is accessible once more!

In this post I am not going to share my views on the current political landscape in the United States (that could be a completely separate blog on its own). As in my previous post, I’m going to walk through the data cleaning, ask a couple of questions and visualize the data in some meaningful way. If you want to skip the details of the code, just scroll down to the figures.

Working in R

As usual, all the code is in my GitHub repository for anyone who wants to download and use it. For this post, I’m going to use the following packages: ggplot2, grid, scales, reshape2 and wordcloud. While reading some R manuals, I discovered a “new” way to inspect your working directory: the list.files() function, which lists the files or folders in your current working directory. You can also specify a path name, which saves you from constantly changing the path in the setwd() function (something I have been foolishly doing for a while).

setwd("~/Documents/UNHCR Data/") # ~ acts as base for home directory

list.files(full.names=TRUE) # gives full path for files or folders
files.all <- list.files(path="All_Data/") #assigns name to file names in the All_Data folder
length(files.all) #checks how many objects there are in the /All_Data folder
files.all

Since the file is comma-delimited (.csv), we use the read.csv() function to read it in and assign it to “ref.d”, building the file path with paste0(). We set the skip argument to 2 because, by visual inspection of the file, the first two rows are devoted to the file title (we don’t want to read that into R). I also used the na.strings argument to specify that any blanks (“”), dashes (“-“) and asterisks (“*”) should be treated as missing data, NA. The asterisks mark redacted information, according to the UNHCR website.

ref.d <- read.csv(paste0("All_Data/",files.all), #insert file path and name as the 1st argument
	header=T, #read the headers from the 3rd row
	skip=2, #skips the first two rows (metadata in .csv file)
	na.strings=c("","-","*") #convert all blank, "-" and "*" cells into missing type NA
	)

The first thing I notice is that the column names are long and would be annoying to retype, so we’ll replace them using the names() function.

new.names <- c("Year", "Country", "Country_Origin", "Refugees", "Asylum_Seekers", "Returned_Refugees", "IDPs", "Returned_IDPs", "Stateless_People", "Others_of_Concern","Total")
names(ref.d) <- new.names 

Using summary() and str() on the dataset, we can see that the data spans 1951 to 2014 and contains the country where refugees are situated, their country of origin, counts for each population of concern, and total counts.

 > summary(ref.d)
      Year        Country          Country_Origin        Refugees
 Min.   :1951   Length:103746      Length:103746      Min.   :      1
 1st Qu.:2000   Class :character   Class :character   1st Qu.:      3
 Median :2006   Mode  :character   Mode  :character   Median :     16
 Mean   :2004                                         Mean   :   5992
 3rd Qu.:2010                                         3rd Qu.:    167
 Max.   :2014                                         Max.   :3272290
                                                      NA's   :19353
 Asylum_Seekers     Returned_Refugees      IDPs         Returned_IDPs
 Min.   :     0.0   Min.   :      1   Min.   :    470   Min.   :     23
 1st Qu.:     1.0   1st Qu.:      2   1st Qu.:  90746   1st Qu.:   5000
 Median :     5.0   Median :     16   Median : 261704   Median :  27284
 Mean   :   255.9   Mean   :   6737   Mean   : 540579   Mean   : 111224
 3rd Qu.:    34.0   3rd Qu.:    247   3rd Qu.: 594443   3rd Qu.: 104230
 Max.   :358056.0   Max.   :9799410   Max.   :7632500   Max.   :1186889
 NA's   :46690      NA's   :97327     NA's   :103330    NA's   :103544
 Stateless_People  Others_of_Concern      Total
 Min.   :      1   Min.   :     1.0   Min.   :      1
 1st Qu.:    205   1st Qu.:    14.8   1st Qu.:      3
 Median :   1720   Median :   444.5   Median :     16
 Mean   :  66803   Mean   : 24672.5   Mean   :   8605
 3rd Qu.:  11462   3rd Qu.:  6000.0   3rd Qu.:    166
 Max.   :3500000   Max.   :957000.0   Max.   :9799410
 NA's   :103103    NA's   :103018     NA's   :2433

It’s always good practice to identify missing data as well (especially since we set na.strings in the read.csv() call above). A simple function combining apply() and is.na() counts the missing values (NAs) in each column of your data.

apply(ref.d,2, function(x) sum(is.na(x))) # count NAs in each column

When I used str() on the data, I saw that the country names were stored as factors and the population-of-concern categories as integers. I wrote a short for loop to convert these to character and numeric types respectively.

for(i in 2:length(names(ref.d))){ # "2"-ignores the first column (we want to keep Year as an integer)
	if (class(ref.d[,i])=="factor"){
		ref.d[,i] <- as.character(ref.d[,i])}
	if (class(ref.d[,i])=="integer"){
		ref.d[,i] <- as.numeric(ref.d[,i])}
}

One more nuance: I wanted to change some of the country names (they were either very long or carried extra information). I first identified the names I wanted to change and then replaced them with a new set of names, again using a for loop.

old.countries <- c("Bolivia (Plurinational State of)",
	"China, Hong Kong SAR",
	"China, Macao SAR",
	"Iran (Islamic Rep. of)",
	"Micronesia (Federated States of)",
	"Serbia and Kosovo (S/RES/1244 (1999))",
	"Venezuela (Bolivarian Republic of)",
	"Various/Unknown")
# replacement names
new.countries <- c("Bolivia","Hong Kong","Macao","Iran","Micronesia","Serbia & Kosovo","Venezuela","Unknown")
for (k in 1:length(old.countries)){
	ref.d$Country_Origin[ref.d$Country_Origin==old.countries[k]]<-new.countries[k]
	ref.d$Country[ref.d$Country==old.countries[k]]<-new.countries[k]
}

If anyone has alternative ways to achieve the above (i.e. using the apply family), comment below! Just a short disclaimer on for loops in R: there has been a lot of debate about the effectiveness of for loops compared to the apply function family. A quick Google search shows many opinions on this issue, based on computing speed/power, simplicity, elegance, etc. Advanced R by Hadley Wickham discusses this, and I’ve generally used it as a guideline on whether to use a for loop or an apply function.
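As one possibility, both loops above can be collapsed into vectorized one-liners with lapply() and match(). This is only a sketch: the two-row ref.d, old.countries and new.countries below are toy stand-ins for the real objects, just so the snippet runs on its own.

```r
# Toy stand-ins for the real objects (hypothetical rows, for illustration only)
ref.d <- data.frame(
  Year           = c(2013L, 2014L),
  Country        = factor(c("Canada", "Iran (Islamic Rep. of)")),
  Country_Origin = factor(c("Various/Unknown", "Syrian Arab Rep.")),
  Refugees       = c(5L, 7L)
)
old.countries <- c("Iran (Islamic Rep. of)", "Various/Unknown")
new.countries <- c("Iran", "Unknown")

# Type conversion without a loop: apply one function over every non-Year column
ref.d[-1] <- lapply(ref.d[-1], function(x) {
  if (is.factor(x)) as.character(x) else if (is.integer(x)) as.numeric(x) else x
})

# Renaming without a loop: match() gives each value's position in
# old.countries (NA when absent), so only the matching entries are overwritten
idx <- match(ref.d$Country, old.countries)
ref.d$Country[!is.na(idx)] <- new.countries[idx[!is.na(idx)]]
idx <- match(ref.d$Country_Origin, old.countries)
ref.d$Country_Origin[!is.na(idx)] <- new.countries[idx[!is.na(idx)]]
```

Because lapply() runs over ref.d[-1], the Year column keeps its integer type, matching what the original loop does.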

Some Descriptives and North Korea

Just to get an idea of the data, we can create lists of the countries and countries of origin, using code similar to how I identified countries with certain MRSA strains in my previous post.

clist<-sort(unique(ref.d$Country)) #alphabetical
clist
or.clist<-sort(unique(ref.d$Country_Origin)) #alphabetical
or.clist

We can then compare the two lists for any differences, using either matching operators or the setdiff() function. First this direction…

clist[!clist %in% or.clist] # or
setdiff(clist,or.clist)

[1] "Bonaire"                   "Montserrat"
[3] "Sint Maarten (Dutch part)" "State of Palestine"

… from which we can infer that these countries either haven’t produced refugees or have no such data in the UNHCR database. If we reverse the comparison…

or.clist[!or.clist %in% clist] # or
setdiff(or.clist,clist)

 [1] "Andorra"                     "Anguilla"
 [3] "Bermuda"                     "Cook Islands"
 [5] "Dem. People's Rep. of Korea" "Dominica"
 [7] "French Polynesia"            "Gibraltar"
 [9] "Guadeloupe"                  "Holy See (the)"
[11] "Kiribati"                    "Maldives"
[13] "Marshall Islands"            "Martinique"
[15] "New Caledonia"               "Niue"
[17] "Norfolk Island"              "Palestinian"
[19] "Puerto Rico"                 "Samoa"
[21] "San Marino"                  "Sao Tome and Principe"
[23] "Seychelles"                  "Stateless"
[25] "Tibetan"                     "Tuvalu"
[27] "Wallis and Futuna Islands "  "Western Sahara"

… we get a list of countries that have only produced refugees (and not taken any in), or for which data is missing from the UNHCR database. Being Korean-Canadian myself, I noticed North Korea (the Democratic People’s Republic of Korea) on this list, and wanted to ask: which countries have the largest numbers of North Korean refugees, based on the UNHCR data? What are the top 10?

NK <- ref.d[ref.d$Country_Origin=="Dem. People's Rep. of Korea",] # subset rows originating from North Korea
NK.tot <- aggregate(cbind(Total)~Country,data=NK,FUN=sum) # total per host country
NK.tot[order(-NK.tot[,2]),][1:10,] # sort descending by Total and take the top 10
                    Country Total
31           United Kingdom  4808
5                    Canada  2954
10                  Germany  2845
18              Netherlands   487
3                   Belgium   435
23       Russian Federation   357
32 United States of America   346
1                 Australia   318
20                   Norway   262
9                    France   228

We find that the UK has the highest number of North Korean refugees, followed by Canada and Germany with similar counts. Learned something new today.

Word Clouds in R

If you hadn’t guessed from the packages I loaded at the beginning, a word cloud was inevitable. Making word clouds in R is pretty simple thanks to the wordcloud package. Remember that word clouds are an aesthetically decent way to view your data qualitatively. There are a bunch of pros and cons to using word clouds; generally they are appropriate if you just want to see the relative frequency of words in your data. Drawing quantitative conclusions from one would be erroneous.

First we’ll aggregate the counts for the country of origin.

or.count.tot <- aggregate(cbind(Total)~Country_Origin,data=ref.d,FUN=sum)
or.count.tot
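Since the word cloud is only qualitative, a numeric top-N list (or bar chart) makes a useful companion check. Here is a sketch with toy counts standing in for the real or.count.tot aggregate, so the numbers below are hypothetical:

```r
# Toy counts standing in for the real or.count.tot (hypothetical numbers)
or.count.tot <- data.frame(
  Country_Origin   = c("Iraq", "Afghanistan", "Unknown", "Sudan", "Somalia"),
  Total            = c(3.1e7, 9.0e7, 6.5e7, 2.8e7, 2.2e7),
  stringsAsFactors = FALSE
)

# Sort descending by Total and keep the top entries
topn <- or.count.tot[order(-or.count.tot$Total), ][1:3, ]
topn

# The same numbers as a horizontal bar chart (base graphics)
par(mar = c(4, 9, 2, 1)) # widen the left margin to fit country names
barplot(rev(topn$Total) / 1e6, names.arg = rev(topn$Country_Origin),
        horiz = TRUE, las = 1, col = "#274c56",
        xlab = "Total populations of concern (millions)")
```

This mirrors the order()-and-slice approach used for the North Korea question earlier, just applied to countries of origin.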

Then we set the color palette using HEX codes (many thanks to I Want Hue).

pal3 <- c("#274c56",
"#664c47",
"#4e5c48",
"#595668",
"#395e68",
"#516964",
"#6b6454",
"#58737f",
"#846b6b",
"#807288",
"#758997",
"#7e9283",
"#a79486",
"#aa95a2",
"#8ba7b4")

Then let’s make a word cloud of the countries of origin for the populations of concern in the data and save it as a .png file. I’ve commented the code line by line. We can see that Afghanistan has the highest relative count, with unknown countries of origin second. Other high-conflict countries are visible too, like the DRC, Iraq, Sudan, Syria, Somalia, etc.

# COUNTRY OF ORIGIN
png("wordcloud_count_or.png",width=600,height=850,res=200)
wordcloud(or.count.tot[,1], #list of words
	or.count.tot[,2], #frequencies for words
	scale=c(3,.5), #scale of size range
	min.freq=100, #minimum frequency
	max.words=100, #maximum number of words show (others dropped)
	family="Garamond", font=2, #text edit (The font "face" (1=plain, 2=bold, 3=italic, 4=bold-italic))
	random.order=F, #F-plotted in decreasing frequency
	colors=rev(pal3)) #colours from least to most frequent (reverse order)
dev.off()

[Figure: wordcloud_count_or.png – word cloud of countries of origin]

Historical Data Visualized

The UNHCR data contains information not only by country but also by year, from 1951 to 2014, and I wanted to see the trends over time for each of the populations of concern. There are seven types of populations in the dataset; their definitions can be found here on the UNHCR website. To visualize the data using ggplot2, it’s good practice to convert the data from wide to long form (the benefits of long form can be read here and here). To accomplish this, we’re going to use the reshape2 package and the melt() function. This tutorial is helpful in understanding what’s going on.

I first subset to remove the country, country of origin and total columns.

no.count<-(ref.d[c(-2,-3,-11)]) #removes Country, Country of Origin and Total
lno.count<-melt(no.count,id=c("Year"))
head(lno.count)
  Year variable  value
1 1951 Refugees 180000
2 1951 Refugees 282000
3 1951 Refugees  55000
4 1951 Refugees 168511
5 1951 Refugees  10000
6 1951 Refugees 265000

Then I set the ggplot2 theme and the color palette I want to use.

blank_t <- theme_minimal()+
  theme(
  panel.border = element_blank(), #removes border
  panel.background = element_rect(fill = "#d0d1cf",colour=NA),
  panel.grid = element_line(colour = "#ffffff"),
  plot.title=element_text(size=20, face="bold",hjust=0,vjust=2), #set the plot title
  plot.subtitle=element_text(size=15,face="bold.italic",hjust=0.01), #set subtitle ("bold.italic" is the valid combined face)
  legend.title=element_text(size=10,face="bold",vjust=0.5), #specify legend title
  axis.text.x = element_text(size=9,angle = 0, hjust = 0.5,vjust=1,margin=margin(r=10)),
  axis.text.y = element_text(size=10,margin = margin(r=5)),
  legend.background = element_rect(fill="gray90", size=.5),
  legend.position=c(0.15,0.65),
  plot.margin=unit(c(t=1,r=1.2,b=1.2,l=1),"cm")
  )
pal5 <- c("#B53C26", #Refugees & TOTAL
"#79542D", #Asylum Seekers
"#DA634D", #Returned Refugees
"#0B4659", #IDPs
"#4B8699", #Returned IDPs
"#D38F47", #Stateless People
"#09692A") #Others of Concern

Then we use geom_bar() with the fill argument to make a stacked bar graph over time, specifying the title and axis labels. In scale_fill_manual() I used the gsub() function to remove the underscores from the names (I was too lazy to write the names out in a vector and assign it to the labels).

gg1 <-ggplot(lno.count,aes(Year,value)) +
	geom_bar(aes(fill=variable),stat="identity") +
	labs(title="UNHCR Population Statistics Database",
	subtitle="(1951 - 2014)",
	x="Year",y="Number of People (Millions)") + blank_t +
	scale_fill_manual(guide_legend(title="Populations of Concern"),labels=gsub("_"," ",names(ref.d)[c(-1,-2,-3,-11)]),values=pal5)

Next, while modifying the axes, I wanted to rescale the y-axis to show millions rather than raw counts with six trailing zeros. I set the labels argument in scale_y_continuous() to a short function that divides the values by one million. The plot is below, and you can click on it to see the full-sized version.

mil.func <- function(x) x/1e6 # convert people counts to millions for the axis labels
gg2 <- gg1 +
	scale_y_continuous(limits=c(0,6e7),
		breaks=pretty(0:6e7,n=5),
		labels=mil.func,
		expand=c(0.025,0)) +
	scale_x_continuous(limits=c(1950,2015),
		breaks=seq(1950,2015,by=5),
		expand=c(0.01,0))
ggsave(plot=gg2,filename="UNHCR_Totals_Yr.png", width=9.5, height=5.5, dpi=200)

[Figure: UNHCR_Totals_Yr.png – stacked bar graph of populations of concern by year, 1951–2014]

Based on this, we can see a large increase in the total populations of concern over the past 30 years. Internally Displaced People (IDPs) have also increased dramatically in the past 20 years. In the most recent data, the totals reach 65.3 million! (That’s above the y-limit in the bar graph.)

Let’s Be Real

(Not R related.) From this exercise, I learned that the UK hosts the highest number of North Korean refugees in the world, that Afghanistan has historically produced the largest number of refugees, and that there is an upward trend (a qualitative conclusion) in the total number of displaced people in recent history. Remember that these vulnerable populations are mostly undocumented, so the numbers in this dataset are likely underestimates of the actual numbers of people who fall under the population-of-concern definitions. In the midst of the news and social media firestorm over all the unfortunate events happening in the world, I think it’s easy to dissociate and just live our own lives. I believe that those of us with access to resources should be generous with how we use them; a simple act of kindness, like the Samaritan’s, can go a long way. Some simple steps I’ve taken are to educate myself on these issues, advocate where I am able, donate financially and engage where I can. There are a plethora of organizations committed to relieving the burden on these populations – one that I support is the International Rescue Committee.

So I hope this post was informative and demonstrates how you can pull some data off the internet, tinker around with R and discover some new things about the world. Any feedback on the code, or alternative approaches to what I did, is always welcome!

 

