Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
R and Shiny Training: If you find this blog to be interesting, please note that I offer personalized and group-based training sessions that may be reserved through Buy me a Coffee. Additionally, I provide training services in the Spanish language and am available to discuss means by which I may contribute to your Shiny project.
< section id="motivation" class="level1">Motivation
I needed to create a few plots by using the Civil Liberties and Political Rights scores from the Freedom Index. However, the data provided in Excel format was not in an easy to use presentation. This blog post shows how I reshaped the data to make it easier to work with.
There is an excellent post, Cleaning Freedom House indicators, by Marta Kolczynska that I used as reference. The post is four years old, so here I am using some updated functions and I tried to make some steps more general.
< section id="download-and-read-the-data" class="level1">Download and read the data
Required packages:
# install_github("pachadotdev/tintin") library(dplyr) library(tidyr) library(readxl) library(countrycode) library(ggplot2) library(stringr) library(tintin)
Download the data:
url <- "https://freedomhouse.org/sites/default/files/2023-02/Country_and_Territory_Ratings_and_Statuses_FIW_1973-2023%20.xlsx" raw_xlsx <- gsub("%20", "", gsub(".*/", "", url)) if (!file.exists(raw_xlsx)) { download.file(url, raw_xlsx) }
After opening the Excel file with Libre Office I realized I need to read the data from the second sheet and recode “-” to missing:
freedom_house <- read_excel(raw_xlsx, sheet = 2, na = "-") freedom_house
# A tibble: 207 × 151 `Survey Edition` `Jan.-Feb. 1973` ...3 ...4 `Jan.-Feb. 1974` ...6 ...7 <chr> <chr> <chr> <chr> <chr> <chr> <chr> 1 Year(s) Under Revi… 1972 <NA> <NA> 1973 <NA> <NA> 2 <NA> PR CL Stat… PR CL Stat… 3 Afghanistan 4 5 PF 7 6 NF 4 Albania 7 7 NF 7 7 NF 5 Algeria 6 6 NF 6 6 NF 6 Andorra 4 3 PF 4 4 PF 7 Angola <NA> <NA> <NA> <NA> <NA> <NA> 8 Antigua and Barbuda <NA> <NA> <NA> <NA> <NA> <NA> 9 Argentina 6 3 PF 2 2 F 10 Armenia <NA> <NA> <NA> <NA> <NA> <NA> # ℹ 197 more rows # ℹ 144 more variables: `Jan.-Feb. 1975` <chr>, ...9 <chr>, ...10 <chr>, # `Jan.-Feb. 1976` <chr>, ...12 <chr>, ...13 <chr>, `Jan.-Feb. 1977` <chr>, # ...15 <chr>, ...16 <chr>, `1978` <chr>, ...18 <chr>, ...19 <chr>, # `1979` <chr>, ...21 <chr>, ...22 <chr>, `1980` <chr>, ...24 <chr>, # ...25 <chr>, `1981` <chr>, ...27 <chr>, ...28 <chr>, `1982` <chr>, # ...30 <chr>, ...31 <chr>, `1983-84` <chr>, ...33 <chr>, ...34 <chr>, …
Tidy the data
Fix the first column name:
names(freedom_house)[1] <- "country" freedom_house
# A tibble: 207 × 151 country `Jan.-Feb. 1973` ...3 ...4 `Jan.-Feb. 1974` ...6 ...7 <chr> <chr> <chr> <chr> <chr> <chr> <chr> 1 Year(s) Under Revi… 1972 <NA> <NA> 1973 <NA> <NA> 2 <NA> PR CL Stat… PR CL Stat… 3 Afghanistan 4 5 PF 7 6 NF 4 Albania 7 7 NF 7 7 NF 5 Algeria 6 6 NF 6 6 NF 6 Andorra 4 3 PF 4 4 PF 7 Angola <NA> <NA> <NA> <NA> <NA> <NA> 8 Antigua and Barbuda <NA> <NA> <NA> <NA> <NA> <NA> 9 Argentina 6 3 PF 2 2 F 10 Armenia <NA> <NA> <NA> <NA> <NA> <NA> # ℹ 197 more rows # ℹ 144 more variables: `Jan.-Feb. 1975` <chr>, ...9 <chr>, ...10 <chr>, # `Jan.-Feb. 1976` <chr>, ...12 <chr>, ...13 <chr>, `Jan.-Feb. 1977` <chr>, # ...15 <chr>, ...16 <chr>, `1978` <chr>, ...18 <chr>, ...19 <chr>, # `1979` <chr>, ...21 <chr>, ...22 <chr>, `1980` <chr>, ...24 <chr>, # ...25 <chr>, `1981` <chr>, ...27 <chr>, ...28 <chr>, `1982` <chr>, # ...30 <chr>, ...31 <chr>, `1983-84` <chr>, ...33 <chr>, ...34 <chr>, …
Remove the first and second row of data:
freedom_house <- freedom_house %>% filter( country != "Year(s) Under Review", !is.na(country) ) freedom_house
# A tibble: 205 × 151 country `Jan.-Feb. 1973` ...3 ...4 `Jan.-Feb. 1974` ...6 ...7 <chr> <chr> <chr> <chr> <chr> <chr> <chr> 1 Afghanistan 4 5 PF 7 6 NF 2 Albania 7 7 NF 7 7 NF 3 Algeria 6 6 NF 6 6 NF 4 Andorra 4 3 PF 4 4 PF 5 Angola <NA> <NA> <NA> <NA> <NA> <NA> 6 Antigua and Barbuda <NA> <NA> <NA> <NA> <NA> <NA> 7 Argentina 6 3 PF 2 2 F 8 Armenia <NA> <NA> <NA> <NA> <NA> <NA> 9 Australia 1 1 F 1 1 F 10 Austria 1 1 F 1 1 F # ℹ 195 more rows # ℹ 144 more variables: `Jan.-Feb. 1975` <chr>, ...9 <chr>, ...10 <chr>, # `Jan.-Feb. 1976` <chr>, ...12 <chr>, ...13 <chr>, `Jan.-Feb. 1977` <chr>, # ...15 <chr>, ...16 <chr>, `1978` <chr>, ...18 <chr>, ...19 <chr>, # `1979` <chr>, ...21 <chr>, ...22 <chr>, `1980` <chr>, ...24 <chr>, # ...25 <chr>, `1981` <chr>, ...27 <chr>, ...28 <chr>, `1982` <chr>, # ...30 <chr>, ...31 <chr>, `1983-84` <chr>, ...33 <chr>, ...34 <chr>, …
Convert the dataset to long format:
freedom_house <- freedom_house %>% pivot_longer(cols = -country, names_to = "year", values_to = "value") freedom_house
# A tibble: 30,750 × 3 country year value <chr> <chr> <chr> 1 Afghanistan Jan.-Feb. 1973 4 2 Afghanistan ...3 5 3 Afghanistan ...4 PF 4 Afghanistan Jan.-Feb. 1974 7 5 Afghanistan ...6 6 6 Afghanistan ...7 NF 7 Afghanistan Jan.-Feb. 1975 7 8 Afghanistan ...9 6 9 Afghanistan ...10 NF 10 Afghanistan Jan.-Feb. 1976 7 # ℹ 30,740 more rows
Fix the year by removing all non-numeric characters in the year column (i.e., the 2023 survey reflects the scores for 2022):
freedom_house <- freedom_house %>% mutate( year = as.integer(gsub("[^0-9]", "", year)) - 1L, year = case_when( year < 1972 ~ NA_integer_, TRUE ~ year ) ) freedom_house
# A tibble: 30,750 × 3 country year value <chr> <int> <chr> 1 Afghanistan 1972 4 2 Afghanistan NA 5 3 Afghanistan NA PF 4 Afghanistan 1973 7 5 Afghanistan NA 6 6 Afghanistan NA NF 7 Afghanistan 1974 7 8 Afghanistan NA 6 9 Afghanistan NA NF 10 Afghanistan 1975 7 # ℹ 30,740 more rows
Replace blank years:
freedom_house <- freedom_house %>% fill(year) freedom_house
# A tibble: 30,750 × 3 country year value <chr> <int> <chr> 1 Afghanistan 1972 4 2 Afghanistan 1972 5 3 Afghanistan 1972 PF 4 Afghanistan 1973 7 5 Afghanistan 1973 6 6 Afghanistan 1973 NF 7 Afghanistan 1974 7 8 Afghanistan 1974 6 9 Afghanistan 1974 NF 10 Afghanistan 1975 7 # ℹ 30,740 more rows
Add measurement categories:
freedom_house <- freedom_house %>% group_by(country, year) %>% mutate( n = row_number(), category = case_when( n == 1 ~ "political_rights", n == 2 ~ "civil_liberties", n == 3 ~ "status" ) ) %>% ungroup() freedom_house
# A tibble: 30,750 × 5 country year value n category <chr> <int> <chr> <int> <chr> 1 Afghanistan 1972 4 1 political_rights 2 Afghanistan 1972 5 2 civil_liberties 3 Afghanistan 1972 PF 3 status 4 Afghanistan 1973 7 1 political_rights 5 Afghanistan 1973 6 2 civil_liberties 6 Afghanistan 1973 NF 3 status 7 Afghanistan 1974 7 1 political_rights 8 Afghanistan 1974 6 2 civil_liberties 9 Afghanistan 1974 NF 3 status 10 Afghanistan 1975 7 1 political_rights # ℹ 30,740 more rows
Convert the data to wide format because the value column is not tidy (i.e., it contains different units in the same column):
freedom_house <- freedom_house %>% select(-n) %>% pivot_wider(names_from = category, values_from = value) freedom_house
# A tibble: 10,250 × 5 country year political_rights civil_liberties status <chr> <int> <chr> <chr> <chr> 1 Afghanistan 1972 4 5 PF 2 Afghanistan 1973 7 6 NF 3 Afghanistan 1974 7 6 NF 4 Afghanistan 1975 7 6 NF 5 Afghanistan 1976 7 6 NF 6 Afghanistan 1977 6 6 NF 7 Afghanistan 1978 7 7 NF 8 Afghanistan 1979 7 7 NF 9 Afghanistan 1980 7 7 NF 10 Afghanistan 1981 7 7 NF # ℹ 10,240 more rows
Convert political rights and civil liberties to integer (i.e., for South Africa in 1973 I considered the worst score, which is inside parenthesis, because the index was separated into White and Black populations):
remove_parenthesis <- function(x) { y <- str_extract(x, "\\((.*?)\\)") y <- str_replace_all(y, "\\(|\\)", "") return(y) } freedom_house <- freedom_house %>% mutate( political_rights = case_when( country == "South Africa" & year == 1973 ~ remove_parenthesis(political_rights), TRUE ~ political_rights ), civil_liberties = case_when( country == "South Africa" & year == 1973 ~ remove_parenthesis(civil_liberties), TRUE ~ civil_liberties ), status = case_when( country == "South Africa" & year == 1973 ~ remove_parenthesis(status), TRUE ~ status ), political_rights = as.integer(political_rights), civil_liberties = as.integer(civil_liberties) )
Warning: There were 2 warnings in `mutate()`. The first warning was: ℹ In argument: `political_rights = as.integer(political_rights)`. Caused by warning: ! NAs introduced by coercion ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
freedom_house %>% filter(country == "South Africa", year == 1973)
# A tibble: 1 × 5 country year political_rights civil_liberties status <chr> <int> <int> <int> <chr> 1 South Africa 1973 NA NA <NA>
freedom_house
# A tibble: 10,250 × 5 country year political_rights civil_liberties status <chr> <int> <int> <int> <chr> 1 Afghanistan 1972 4 5 PF 2 Afghanistan 1973 7 6 NF 3 Afghanistan 1974 7 6 NF 4 Afghanistan 1975 7 6 NF 5 Afghanistan 1976 7 6 NF 6 Afghanistan 1977 6 6 NF 7 Afghanistan 1978 7 7 NF 8 Afghanistan 1979 7 7 NF 9 Afghanistan 1980 7 7 NF 10 Afghanistan 1981 7 7 NF # ℹ 10,240 more rows
Recode the status to make it more readable:
freedom_house <- freedom_house %>% mutate( status = factor( case_when( status == "F" ~ "Free", status == "PF" ~ "Partially Free", status == "NF" ~ "Not Free" ), levels = c("Free", "Partially Free", "Not Free"), ) ) freedom_house
# A tibble: 10,250 × 5 country year political_rights civil_liberties status <chr> <int> <int> <int> <fct> 1 Afghanistan 1972 4 5 Partially Free 2 Afghanistan 1973 7 6 Not Free 3 Afghanistan 1974 7 6 Not Free 4 Afghanistan 1975 7 6 Not Free 5 Afghanistan 1976 7 6 Not Free 6 Afghanistan 1977 6 6 Not Free 7 Afghanistan 1978 7 7 Not Free 8 Afghanistan 1979 7 7 Not Free 9 Afghanistan 1980 7 7 Not Free 10 Afghanistan 1981 7 7 Not Free # ℹ 10,240 more rows
Do some verifications:
freedom_house %>% filter(is.na(country))
# A tibble: 0 × 5 # ℹ 5 variables: country <chr>, year <int>, political_rights <int>, # civil_liberties <int>, status <fct>
freedom_house %>% filter(is.na(year))
# A tibble: 0 × 5 # ℹ 5 variables: country <chr>, year <int>, political_rights <int>, # civil_liberties <int>, status <fct>
freedom_house %>% filter(is.na(political_rights))
# A tibble: 1,207 × 5 country year political_rights civil_liberties status <chr> <int> <int> <int> <fct> 1 Andorra 1977 NA NA <NA> 2 Andorra 1978 NA NA <NA> 3 Andorra 1979 NA NA <NA> 4 Andorra 1980 NA NA <NA> 5 Andorra 1981 NA NA <NA> 6 Andorra 198383 NA NA <NA> 7 Andorra 198484 NA NA <NA> 8 Andorra 198585 NA NA <NA> 9 Andorra 198686 NA NA <NA> 10 Andorra 198787 NA NA <NA> # ℹ 1,197 more rows
Fix the year variable because some years were formatted as “198384” (i.e, 1983-84):
freedom_house <- freedom_house %>% mutate( year = case_when( nchar(year) > 4 ~ as.integer(substr(year, 1, 4)), TRUE ~ year ) ) freedom_house %>% filter(is.na(political_rights))
# A tibble: 1,207 × 5 country year political_rights civil_liberties status <chr> <int> <int> <int> <fct> 1 Andorra 1977 NA NA <NA> 2 Andorra 1978 NA NA <NA> 3 Andorra 1979 NA NA <NA> 4 Andorra 1980 NA NA <NA> 5 Andorra 1981 NA NA <NA> 6 Andorra 1983 NA NA <NA> 7 Andorra 1984 NA NA <NA> 8 Andorra 1985 NA NA <NA> 9 Andorra 1986 NA NA <NA> 10 Andorra 1987 NA NA <NA> # ℹ 1,197 more rows
Discard country-year combinations with missing values:
freedom_house <- freedom_house %>% drop_na(political_rights, civil_liberties, status) freedom_house
# A tibble: 9,043 × 5 country year political_rights civil_liberties status <chr> <int> <int> <int> <fct> 1 Afghanistan 1972 4 5 Partially Free 2 Afghanistan 1973 7 6 Not Free 3 Afghanistan 1974 7 6 Not Free 4 Afghanistan 1975 7 6 Not Free 5 Afghanistan 1976 7 6 Not Free 6 Afghanistan 1977 6 6 Not Free 7 Afghanistan 1978 7 7 Not Free 8 Afghanistan 1979 7 7 Not Free 9 Afghanistan 1980 7 7 Not Free 10 Afghanistan 1981 7 7 Not Free # ℹ 9,033 more rows
Additional steps
The data is now tidy, but adding ISO-2 and ISO-3 codes is highly convenient, so I create a table to join it with the data later:
countries <- freedom_house %>% ungroup() %>% distinct(country) %>% mutate( iso2c = countrycode( country, origin = "country.name", destination = "iso2c" ), iso3c = countrycode( country, origin = "country.name", destination = "iso3c" ) )
Warning: There were 2 warnings in `mutate()`. The first warning was: ℹ In argument: `iso2c = countrycode(country, origin = "country.name", destination = "iso2c")`. Caused by warning in `countrycode_convert()`: ! Some values were not matched unambiguously: Czechoslovakia, Kosovo, Micronesia, Serbia and Montenegro, Yugoslavia ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
countries
# A tibble: 205 × 3 country iso2c iso3c <chr> <chr> <chr> 1 Afghanistan AF AFG 2 Albania AL ALB 3 Algeria DZ DZA 4 Andorra AD AND 5 Angola AO AGO 6 Antigua and Barbuda AG ATG 7 Argentina AR ARG 8 Armenia AM ARM 9 Australia AU AUS 10 Austria AT AUT # ℹ 195 more rows
Add continent:
countries <- countries %>% mutate( continent = countrycode( iso3c, origin = "iso3c", destination = "continent" ) ) countries
# A tibble: 205 × 4 country iso2c iso3c continent <chr> <chr> <chr> <chr> 1 Afghanistan AF AFG Asia 2 Albania AL ALB Europe 3 Algeria DZ DZA Africa 4 Andorra AD AND Europe 5 Angola AO AGO Africa 6 Antigua and Barbuda AG ATG Americas 7 Argentina AR ARG Americas 8 Armenia AM ARM Asia 9 Australia AU AUS Oceania 10 Austria AT AUT Europe # ℹ 195 more rows
Because some ISO codes could not be matched unambiguously, for some countries I need to add the continent manually:
countries %>% filter(is.na(continent))
# A tibble: 5 × 4 country iso2c iso3c continent <chr> <chr> <chr> <chr> 1 Czechoslovakia <NA> <NA> <NA> 2 Kosovo <NA> <NA> <NA> 3 Micronesia <NA> <NA> <NA> 4 Serbia and Montenegro <NA> <NA> <NA> 5 Yugoslavia <NA> <NA> <NA>
countries <- countries %>% mutate( continent = case_when( country == "Czechoslovakia" ~ "Europe", country == "Kosovo" ~ "Europe", country == "Micronesia" ~ "Oceania", country == "Serbia and Montenegro" ~ "Europe", country == "Yugoslavia" ~ "Europe", TRUE ~ continent ) )
Join the two data sets:
freedom_house <- freedom_house %>% left_join(countries, by = "country") %>% select(year, country, iso2c, iso3c, continent, political_rights, civil_liberties, status) freedom_house
# A tibble: 9,043 × 8 year country iso2c iso3c continent political_rights civil_liberties status <int> <chr> <chr> <chr> <chr> <int> <int> <fct> 1 1972 Afghanis… AF AFG Asia 4 5 Parti… 2 1973 Afghanis… AF AFG Asia 7 6 Not F… 3 1974 Afghanis… AF AFG Asia 7 6 Not F… 4 1975 Afghanis… AF AFG Asia 7 6 Not F… 5 1976 Afghanis… AF AFG Asia 7 6 Not F… 6 1977 Afghanis… AF AFG Asia 6 6 Not F… 7 1978 Afghanis… AF AFG Asia 7 7 Not F… 8 1979 Afghanis… AF AFG Asia 7 7 Not F… 9 1980 Afghanis… AF AFG Asia 7 7 Not F… 10 1981 Afghanis… AF AFG Asia 7 7 Not F… # ℹ 9,033 more rows
Add colours for the status (i.e., here I avoided colours such as red and green because they are linked to positive and negative associations):
freedom_house <- freedom_house %>% mutate( color = case_when( status == "Free" ~ "#549f95", status == "Partially Free" ~ "#a1aafc", status == "Not Free" ~ "#7454a6" ) )
Final result
The final result is a tidy dataset that can be used to create plots and perform analysis:
freedom_house
# A tibble: 9,043 × 9 year country iso2c iso3c continent political_rights civil_liberties status <int> <chr> <chr> <chr> <chr> <int> <int> <fct> 1 1972 Afghanis… AF AFG Asia 4 5 Parti… 2 1973 Afghanis… AF AFG Asia 7 6 Not F… 3 1974 Afghanis… AF AFG Asia 7 6 Not F… 4 1975 Afghanis… AF AFG Asia 7 6 Not F… 5 1976 Afghanis… AF AFG Asia 7 6 Not F… 6 1977 Afghanis… AF AFG Asia 6 6 Not F… 7 1978 Afghanis… AF AFG Asia 7 7 Not F… 8 1979 Afghanis… AF AFG Asia 7 7 Not F… 9 1980 Afghanis… AF AFG Asia 7 7 Not F… 10 1981 Afghanis… AF AFG Asia 7 7 Not F… # ℹ 9,033 more rows # ℹ 1 more variable: color <chr>
As an example, I can create a plot of the political rights and civil liberties scores in my country, Chile, where between 1973 and 1990 the country was under a dictatorship and then transitioned to democracy:
dplot <- freedom_house %>% filter(country == "Chile", year >= 1973) %>% mutate( political_rights = abs(political_rights - 8), civil_liberties = abs(civil_liberties - 8) ) %>% pivot_longer(cols = c("political_rights", "civil_liberties"), names_to = "category", values_to = "value") %>% mutate(category = str_to_title(str_replace_all(category, "_", " "))) ggplot(dplot) + geom_line(aes(x = year, y = value, color = category), linewidth = 1.2) + facet_wrap(~category, nrow = 2) + theme_minimal(base_size = 13) + theme(legend.position = "none") + labs( title = "Political Rights and Civil Liberties in Chile", subtitle = "Source: Freedom House Index.\nNote: Scores reversed for visual clarity (1 = least free, 7 = most free).", x = "Year", y = "Score" ) + scale_colour_tintin_d(option = "tintin_in_the_land_of_the_soviets")
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.