Scraping data from the local elections
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
One of my journalist friend was looking at the result of the local election in Luxembourg and he was dissatisfied because he was unable to compare the results of all the communes. In fact, he wanted to compare the number of women that were candidates in each commune. So I asked him to hold on and I came back one hour later with this script that enables him to collect results of all communes in one table.
At the beginning, it was private code but I thought that it could be another great scraping example after the excellent post written by my colleague Bruno Rodrigues about scraping data from STATEC public tables.
So let’s get started. First, let’s load some packages:
#we have to load the packages if they are not installed on your computer, #begin with the commented following lines: #install.packages( "rvest" ) #install.packages( "tidyverse" ) #install.packages( "stringr" ) library(rvest) #to scrap library(dplyr) #to manipulate data library(stringr) #to manipulate string
We want to collect results of all communes in one data frame. We go to http://www.elections.public.lu and we collect data that is seen in the GIF bellow:
After clicking on different communes, we notice that the URLs have the same format. They have 3 parts :
1: “http://www.elections.public.lu/fr/elections-communales/2017/resultats/communes/”, the first part of the URL.
2: “communes_names” the name of the city is the second part of the URL.
3: “.html” is the last part.
For example, the complete URL for the commune of Luxembourg will be: http://www.elections.public.lu/fr/elections-communales/2017/resultats/communes/luxembourg.html
There are 103 communes, so we have to put all of them in a list. We scrape the 103 communes in one list via the script bellow :
url = "http://www.elections.public.lu/fr/elections-communales/2017/resultats/communes/bech.html" communes = read_html(url) %>% html_nodes("#communes #communes-az li") %>% html_text()
We verify that we have a list of 103 vectors and then we check the 5 first rows.
length(communes) ## [1] 103 head(communes,5) ## [1] "\r\n Beaufort\r\n a rendu l'ensemble de ses résultats\r\n " ## [2] "\r\n Bech\r\n a rendu l'ensemble de ses résultats\r\n " ## [3] "\r\n Beckerich\r\n a rendu l'ensemble de ses résultats\r\n " ## [4] "\r\n Berdorf\r\n a rendu l'ensemble de ses résultats\r\n " ## [5] "\r\n Bertrange\r\n a rendu l'ensemble de ses résultats\r\n "
It seems that the 103 communes are present but we still have to clean the data.
#Data is not cleaned and there are some useless characters that need to be removed. #We need to clean data. communes = gsub("a rendu l'ensemble de ses résultats"," ", communes ) communes = trimws(gsub("\r\n","",communes )) communes = gsub("/Attert","-sur-attert", communes ) communes = gsub(" - ", "-",communes ) communes = gsub(" ", "-",communes ) communes = gsub("'", "-",communes ) communes = gsub("é", "e",communes ) communes = gsub("û", "u",communes ) communes = gsub("ä", "a",communes ) #Lower case communes = tolower(communes) head(communes, 5) ## [1] "beaufort" "bech" "beckerich" "berdorf" "bertrange"
Now that we have the list of the 103 communes, we will write a function that will enable us to collect data that we want to display in our data frame.
#Function to have the result for one commune. result = function(x){ #scrapping data vote = read_html(paste("http://www.elections.public.lu/fr/elections-communales/2017/resultats/communes/",x,".html", sep="")) %>% html_nodes("#lux-number .lux-number ul li") %>% html_text()%>%.[-1] #Conditions need to be added to have clean data. #Here we add a trick, the vector 14 and 15 are the only ones that haven't the string "\r\n". #So we add "\r\n" to these vectors. if(nchar(vote[14]) > 21){ vote[14] = gsub("ble","ble\r\n", vote[14],perl = FALSE) } if(nchar(vote[15]) > 21){ vote[15] = gsub("mé","mé\r\n", vote[15],perl = FALSE) } #We split vectors to dissociate the results (numbers) and the titles (letters). vote = unlist(str_split(vote, "\r\n")) vote = trimws(vote) vote = vote[vote != ""] #Here we have similar title so we change them to not be confused. #Candidat Lux means Luxemburgish Candidates & Electeur Lux means Luxemburgish voters. vote[7] = gsub("Lux", "Candidat Lux", vote[7] ) vote[9] = gsub("Non Lu", "Candidat Non Lu", vote[9] ) vote[13] = gsub("Lux", "Electeur Lux", vote[13] ) vote[15] = gsub("Non Lu", "Electeur Non Lu", vote[15] ) #We create the data frame. #Vector with pair indice value are the results (the column val). pair = (1:15)*2 #Vectors with odd index value are the titles (the column title). impair = (1:15)*2 - 1 res = data.frame(communes = rep(x,15), title = vote[impair], val = vote[pair]) return(res) }
Then we use the lapply() function to apply this function to the 103 communes and bind them in one data frame:
#We use the result function on all the communes. res = lapply(communes, result) #We bind the rows to have a complete data frame with all the results from all communes #then we bind all the result. df = do.call(rbind, res)
To see the 5 first rows of our new data frame:
#Now the data that we have look like this: head(df,5) ## communes title val ## 1 beaufort Total 10 ## 2 beaufort Femmes 5 ## 3 beaufort Hommes 5 ## 4 beaufort Candidat Luxembourgeois 10 ## 5 beaufort Candidat Non Luxembourgeois 0
As you can see, it’s looking good but we are not completely satisfied because we would like to transpose data to have one result in one column. To this end, we use the tidyr
package.
library(tidyr) #Transposing data will enables us to make analysis faster. #We transform val in a numeric variable then we transpose data. tdf = df %>% mutate(val = as.numeric(gsub(" ", "", val))) %>% spread(title, val)
To see the 5 first rows of our new data frame:
#Now the data that we have look like this: head(tdf,5) ## communes Blancs Candidat Luxembourgeois Candidat Non Luxembourgeois ## 1 beaufort 78 10 0 ## 2 bech 0 8 0 ## 3 beckerich 67 10 1 ## 4 berdorf 23 20 0 ## 5 bertrange 75 45 7 ## Dans l'urne Electeur Luxembourgeois Electeur Non Luxembourgeois Femmes ## 1 1144 1170 114 5 ## 2 0 699 95 1 ## 3 1396 1393 137 2 ## 4 857 850 79 7 ## 5 2935 2972 478 22 ## Grand total exprimé Grand total possible Hommes Inscrits Nuls Total ## 1 4021 9234 5 1284 40 10 ## 2 0 0 7 794 0 8 ## 3 5792 11637 9 1530 36 11 ## 4 4929 7371 13 929 15 20 ## 5 32880 35620 30 3450 120 52 ## Valables Votes par correspondance ## 1 1026 71 ## 2 0 0 ## 3 1293 109 ## 4 819 57 ## 5 2740 269
Now you can export the table in excel or play with your data in R!
readr::write_excel_csv(tdf,"election_lux.csv") #To know where your file is saved, we use the following function: #setwd()
Don’t hesitate to follow us on twitter @rdata_lu and to subscribe to our youtube channel.
You can also contact us if you have any comments or suggestions. See you for the next post!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.