Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This is the second part about my project that deals with the Twitter network of members of the Bundestag. After getting the necessary data, which was explained in part 1, we will now focus on creating a network graph with links between the representatives’ Twitter accounts for exploratory network analysis.
Note that again I will not reproduce full code examples but rather focus on some excerpts. If you want to have a look at the full code, please refer to the GitHub repository for this project, especially friends_network.R
. Also, this post only represents a starting point for exploratory network analysis and suggests some packages and techniques for that purpose. It is not an in-depth article about network analysis.
Preparation
We collected two datasets in part I: First, a dataset that lists all deputies with their Twitter account names (also known as Twitter handles) and some additional information, e.g. their party or their electoral district. This data was collected from Abgeordnetenwatch.de. I will call this dataset dep_twitter
. Second, for each deputy Twitter account we have collected data on their “friends” using the Twitter API. Remember that “friends” is the Twitter terminology for the list of users that someone follows (i.e. the users that appear in the “following” list of a certain Twitter account). I will call this dataset friends
. Let’s have a look at a sample from both datasets first:
A random sample from dep_twitter
:
twitter_name personal.first_name personal.last_name party johannesvogel Johannes Vogel FDP baerbelbas Bärbel Bas SPD schickgerhard Gerhard Schick DIE GRÜNEN gruenebeate Beate Müller-Gemmeke DIE GRÜNEN houbenreinhard Reinhard Arnold Houben FDP
A sample from friends
, where user
refers to a deputy Twitter handle from dep_twitter
, screen_name
is a friend’s Twitter handle and name
a friend’s name as stated on Twitter.
user screen_name name baerbelbas OlafLies Minister Olaf Lies baerbelbas COntraPipeline COntra-Pipeline baerbelbas Earthsmell Arturo de la Vega johannesvogel Peter_Schaar Peter Schaar baerbelbas BILD_Ruhrgebiet BILD Ruhrgebiet johannesvogel sophiespelsberg Sophie Spelsberg schickgerhard ulle_schauws Ulle Schauws johannesvogel alias_ccm Christa C. Müller johannesvogel marcusmeurer95 Marcus Meurer johannesvogel andreaslandwehr Andreas Landwehr
There are much more variables in both datasets, but for our purpose the selected columns are just fine. We’re interested only in the links between deputy accounts on Twitter, this means we can omit all observations in friends
, where screen_name
doesn’t refer to a deputy Twitter handle. Let’s do this now:
library(dplyr) # a few NAs for "screen_name"; remove those observations friends <- filter(friends, !is.na(screen_name)) dep_accounts <- unique(friends$user) # Twitter handles of deputies # only retain "friends" that are deputies dep_friends <- filter(friends, screen_name %in% dep_accounts)
The dataset dep_friends
now only contains the connections between deputies on Twitter. Connections to Twitter accounts that are not accounts of deputy colleagues, which might also be interesting for further analysis, are removed. This reduces the dataset from ~340,000 observations to ~8,500 observations.
Friends / followers share between parties
At first, I want to focus on aggregate data at party level: In a set of Twitter accounts associated with party A, how many of those follow an account from party B?
The first step to answer this question is to create a dataset that doesn’t only include which deputy follows which colleague on Twitter (dep_friends
already contains this information), but also their respective party affiliation. For this, we can make two joins between dep_friends
and dep_accounts_parties
, the latter simply mapping deputy Twitter handles to their party:
# deputy Twitter handles and their party dep_accounts_parties <- select(dep_twitter, twitter_name, party) # make two joins to create a data frame with edges defined by # "from_account", "from_party" and "to_account", "to_party" edges_parties <- select(dep_friends, from_account = user, to_account = screen_name) %>% left_join(dep_accounts_parties, by = c('from_account' = 'twitter_name')) %>% rename(from_party = party) %>% left_join(dep_accounts_parties, by = c('to_account' = 'twitter_name')) %>% rename(to_party = party)
The new dataset edges_parties
now contains the connections between deputies. These connections are also called the edges of a graph. They are specified in the columns from_account
and to_account
, plus the respective party affiliations as seen here in this random sample:
from_account to_account from_party to_party christianduerr bstrasser FDP FDP jenskoeppen kaiwegner CDU CDU berlinliebich c_bernstiel DIE LINKE CDU stefangelbhaar julia_verlinden DIE GRÜNEN DIE GRÜNEN gydej nicolabeerfdp FDP FDP
We can now count the connections at party level using group_by()
and count()
:
# count how often each "from_party" -> "to_party" edge occurs counts_p2p <- group_by(edges_parties, from_party, to_party) %>% count() %>% ungroup() head(counts_p2p, 10) ## from_party to_party n ## AfD AfD 180 ## AfD CDU 42 ## AfD CSU 2 ## AfD DIE GRÜNEN 25 ## AfD DIE LINKE 27 ## AfD FDP 51 ## AfD SPD 50 ## CDU AfD 5 ## CDU CDU 621 ## CDU CSU 130
Of course, the size of the factions in the Bundestag differ and the number of Twitter users per faction do too. Hence absolute numbers are not very useful and we will add a column prop
with the respective proportions:
# count the absolute number of edges per "from_party" # this is required to calculate the proportions counts_party_edges <- group_by(counts_p2p, from_party) %>% summarise(n_edges = sum(n)) # add a column "prop" for the "from_party" -> "to_party" edges proportions counts_p2p <- left_join(counts_p2p, counts_party_edges, by = 'from_party') %>% mutate(prop = n/n_edges) %>% select(-n_edges) head(counts_p2p, 3) ## from_party to_party n prop ## AfD AfD 180 0.47745358 ## AfD CDU 42 0.11140584 ## AfD CSU 2 0.00530504
This data can be visualized as a heatmap with ggplot2 and geom_raster()
. This displays the proportions of friends / followers between parties as a matrix, where the intensity of the color in the cells depends on the value in the cell. On the y-axis, i.e. the rows in the matrix, we put the party A that follows a party B which is listed on the x-axis. We also convert the proportions to percent (new column perc
) and display a rounded percentage (perc_label
) inside the cells. The fill color’s scale uses the viridis color map.
library(ggplot2) ggplot(counts_p2p, aes(x = to_party, y = from_party, fill = perc)) + geom_raster() + geom_text(aes(label = perc_label), color = 'white') + scale_fill_viridis_c(guide = guide_legend( title = 'Followers / following\nshare in percent')) + labs(x = 'party in column is followed by party in row', y = 'party in row follows party in column', title = 'Proportion of followings / followers between parties') + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
The following shows a heatmap with data collected on December 5 2018.
We see a strong diagonal, indicating that most parties’ deputies follow party colleagues on Twitter. The only exception here is the CSU, whose deputies twice as often follow CDU colleagues than colleagues from their own party. This doesn’t surprise much though, since both parties form an alliance in the Bundestag and the CSU is the smaller partner.
The SPD has the highest share of intra-party followings. Almost three quarters of their “followings” are towards other SPD colleagues. About 10% of these connections are towards the Green Party and almost 8% towards CDU. At the same time, SPD members are very frequently followed by members of other parties, as you can see at the high values between 10% and 17% across all parties in the SPD column.
The far-right party AfD has the fewest followers from other parties, as we can see in the (due to alphabetic ordering) left-most column. Their members mostly follow FDP and SPD accounts as visible in the bottom row.
This hasn’t changed much in newer data that I collected on July 2 of this year. However, the share of intra-party connections dropped from 58% to 48% for the AfD.
Similar things could be done on deputy level, too. However, I will continue with creating and visualizing a graph of the connections at deputy level.
Creating and visualizing the graph of Twitter friends with igraph
I will construct a graph of the deputy Twitter friends connections with the package igraph. The graph should display the connections between the individual deputies and also indicate their party membership. There are several ways to construct such a graph. I will use the graph_from_data_frame()
function and pass it a data frame of connections (called edge list) and a data frame that describes the nodes (vertices) with their attributes. In our case, the latter means all deputy Twitter accounts with the respective party affiliation.
We essentially already defined the edge list in the edges_parties
dataset that we created for the heatmap. It specifies the edges with the from_account
and to_account
columns. Additionally, it contains the respective party memberships in the columns from_party
and to_party
. We also already have the data frame that describes the nodes: dep_twitter
. However, if we directly use this dataset to create the graph, we will also include a few stray accounts that don’t connect to any of the other accounts. This is because a few deputies don’t follow any of their colleagues. We will create a nodes dataset without these accounts first:
accounts_connected <- unique(c(edges_parties$from_account, edges_parties$to_account)) accounts_not_connected <- dep_twitter$twitter_name[!(dep_twitter$twitter_name %in% accounts_connected)] # these accounts are used as vertices (aka nodes): dep_twitter_connected <- filter(dep_twitter, twitter_name %in% accounts_connected)
With this, we can create an igraph graph object now:
library(igraph) g <- graph_from_data_frame(edges_parties, vertices = dep_twitter_connected) g ## IGRAPH 17b614f DN-- 359 8416 -- ## + attr: name (v/c), personal.first_name (v/c), personal.last_name ## | (v/c), personal.gender (v/c), personal.birthyear (v/n), ## | personal.location.state (v/c), personal.location.city (v/c), ## | party (v/c), from_party (e/c), to_party (e/c) ## + edges from 17b614f (vertex names): ## [1] martinschulz->katarinabarley martinschulz->oezoguz ## [3] martinschulz->kahrs martinschulz->schneidercar ## [5] martinschulz->sigmargabriel martinschulz->fbrantner ## [7] fabiodemasi ->victorperli fabiodemasi ->f_schaeffler ## [9] fabiodemasi ->lgbeutin fabiodemasi ->pascalmeiser ## + ... omitted several edges
The output from the igraph object seems cryptic at first, because it is very condensed: 359 8416
refers to the number of vertices and edges respectively. Then follows a list of attributes after “+ attr”. After each attribute is a specifier in parentheses that denotes the scope of the attribute and its type. So for example name (v/c)
means that “name” is a vertex attribute of type character, personal.birthyear (v/n)
is a vertex attribute of type numeric and from_party (e/c)
is an edge attribute of type character. In the “+ edges” section a sample of edges is displayed.
With this igraph object we can calculate several graph centrality measures which allows us to identify the most important nodes in a graph. Let’s have a look at two measures: First, the degree which is the number of incoming and outgoing edges of a node. Second, the betweenness that roughly speaking quantifies the number of shortest paths that pass through a node. Let’s calculate both measures (we use the total degree per node, counting both incoming and outcoming edges):
degree_score <- degree(g, mode = 'total') betw_score <- betweenness(g) head(degree_score, 3) ## martinschulz fabiodemasi anked ## 6 13 107 head(betw_score, 3) ## martinschulz fabiodemasi anked ## 0.0000 0.0000 235.4485
We can combine this with the deputy data and order per score (see full script on GitHub) to get a top ten. The first table is ordered by degree, the second by betweenness score.
twitter_name full_name degr_score betw_score party 1 peteraltmaier Peter Altmaier 263 927.0231 CDU 2 kahrs Johannes Kahrs 251 3482.5345 SPD 3 sigmargabriel Sigmar Gabriel 248 1184.6123 SPD 4 katarinabarley Katarina Barley 224 1135.4914 SPD 5 c_lindner Christian Lindner 222 1422.2743 FDP 6 hubertus_heil Hubertus Heil 220 1410.6307 SPD 7 larsklingbeil Lars Klingbeil 216 1129.5727 SPD 8 petertauber Peter Tauber 192 588.7581 CDU 9 sven_kindler Sven-Christian Kindler 182 917.6740 DIE GRÜNEN 10 berlinliebich Stefan Liebich 179 2478.9492 DIE LINKE twitter_name full_name degr_score betw_score party 1 kahrs Johannes Kahrs 251 3482.534 SPD 2 mvabercron Michael von Abercron 149 2543.222 CDU 3 berlinliebich Stefan Liebich 179 2478.949 DIE LINKE 4 f_schaeffler Frank Schäffler 139 1902.670 FDP 5 c_lindner Christian Lindner 222 1422.274 FDP 6 hubertus_heil Hubertus Heil 220 1410.631 SPD 7 ulschzi Ulrike Schielke-Ziesing 45 1353.456 AfD 8 tobiaslindner Tobias Lindner 175 1265.906 DIE GRÜNEN 9 sigmargabriel Sigmar Gabriel 248 1184.612 SPD 10 katarinabarley Katarina Barley 224 1135.491 SPD
We can also visualize our graph with the igraph package. Our graph is quite large and will be better to comprehend when we use different colors for each party. Hence each deputy’s node and outgoing edges should be colored according to her or his party membership. We define a named character vector first with HTML color hex-codes for the nodes and also add a semi-transparent version which we will use for the edges:
party_colors <- c( 'SPD' = '#CC0000', 'CDU' = '#000000', 'DIE GRÜNEN' = '#33D633', 'DIE LINKE' = '#800080', 'FDP' = '#EEEE00', 'AfD' = '#0000ED', 'CSU' = '#ADD8E6' ) # add transparency as hex code (25% transparency) party_colors_semitransp <- paste0(party_colors, '40') names(party_colors_semitransp) <- names(party_colors)
We can assign a color to nodes and edges, by setting a color
attribute for both. The functions V(g)
and E(g)
give access to the vertex (i.e. node) and edge objects of a graph g
. We make the node color dependent on the party
attribute of each node. This attribute came from the dataset dep_twitter_connected
that we passed as vertices
argument to graph_from_data_frame()
when we constructed our graph. We also passed the edge list edges_parties
there, from which the from_party
attribute of each edge comes. We make the edge color dependent on that attribute:
V(g)$color <- party_colors[V(g)$party] E(g)$color <- party_colors_semitransp[E(g)$from_party]
Creating a layout for visualizing a complex graph is not an easy task. You usually want the edges to overlap and cross each other as little as possible. igraph contains several layout generation algorithms for that purpose, which are implemented in functions prefixed by layout_with_
. I tried out the classic Fruchterman-Reingold algorithm (layout_with_fr()
), Kamada-Kawai (layout_with_kk()
) and finally found that Distributed Recursive Layout (Shawn Martin et al., layout_with_drl()
) provided the best result, as it seems clusters the parties very well (because of the high amount of intra-party connections):
lay <- layout_with_drl(g, options=list(simmer.attraction=0))
We’re now ready to visualize the graph with the computed layout. This can be done with the base R plot()
function to which we pass the graph object g
, the layout lay
and several visual adjustments. We also set a title and a legend.
plot(g, layout = lay, vertex.size = 2, vertex.label.cex = 0.7, vertex.label.color = 'black', vertex.label.family = 'arial', vertex.label.dist = 0.5, vertex.frame.color = 'white', edge.arrow.size = 0.2, edge.curved = TRUE) title('Twitter network of members of the German Bundestag', cex = 1.2, line = -0.5) legend('topright', legend = names(party_colors), col = party_colors, pch = 15, bty = "n", pt.cex = 1.25, cex = 0.8, text.col = "black", horiz = FALSE)
You can see the plots for the data from December 2018 and July 2019 below. Make sure to click on the thumbnail because a graph of this size can only be visualized properly on a large image.
Making an interactive network visualization with visNetwork
Such a static image is fine for smaller graphs but we see that it gets quite crowded and hard to grasp in our scenario. One solution is to generate an interactive graph which allows us to zoom in and out, select specific deputies or parties and display additional information when hovering over certain nodes. The R package visNetwork can be used for that purpose. Switching from igraph to visNetwork is straight forward, as we can convert our igraph object to a visNetwork object via toVisNetworkData()
:
library(visNetwork) vis_nw_data <- toVisNetworkData(g)
vis_nw_data
contains two data frames: vis_nw_data$nodes
and vis_nw_data$edges
. They’re essentially a tabular form of V(g)
and E(g)
which means their columns (like color
or party
) represent the attributes from the igraph nodes and edges.
Setting a title
column for the nodes data frame will show this title in the interactive plot when you move the pointer over the node. Here, we set the title to a string of the format “@twitterhandle (Firstname Lastname)”:
vis_nw_data$nodes$title <- sprintf('@%s (%s %s)', vis_nw_data$nodes$id, vis_nw_data$nodes$personal.first_name, vis_nw_data$nodes$personal.last_name)
We strip the transparency channel from the color hex-codes for the edges (the last two characters), because visNetwork can’t display it properly:
vis_nw_data$edges$color <- substr(vis_nw_data$edges$color, 0, 7)
Finally, we create a data frame that defines the legend:
vis_legend_data <- data.frame(label = names(party_colors), color = unname(party_colors), shape = 'square')
Constructing the interactive network graph works by passing the nodes and edges data frames to visNetwork()
and then adjusting the appearance and behavior by concatenating other vis*()
functions with %>%
pipe operators. I’ve added comments below that describe the effect of each line:
visNetwork(nodes = vis_nw_data$nodes, edges = vis_nw_data$edges, height = '700px', width = '90%') %>% # use same layout as before visIgraphLayout(layout = 'layout_with_drl', options=list(simmer.attraction=0)) %>% # and same transparency visEdges(color = list(opacity = 0.25), arrows = 'to') %>% # set node highlighting visNodes(labelHighlightBold = TRUE, borderWidth = 1, borderWidthSelected = 12) %>% # add legend visLegend(addNodes = vis_legend_data, useGroups = FALSE, zoom = FALSE, width = 0.2) %>% # show drop down menus and highlight nearest edges visOptions(nodesIdSelection = TRUE, highlightNearest = TRUE, selectedBy = 'party') %>% # disable dragging of nodes visInteraction(dragNodes = FALSE)
If you run that directly in RStudio, it will show up in the viewer pane. You can also assign the result of the above code to an object like vis_nw
and then store it to an HTML file, which you can share and open in a web browser:
visSave(vis_nw, file = 'dep_visnetwork.html')
I’ve uploaded the results here:
Conclusion
We’ve seen that the first obstacle for creating and analyzing the Twitter network of members of the Bundestag is getting the data. This can be done with a combination of web scraping and querying the Twitter API as I’ve shown in part 1.
After some data preparation, we can already calculate some descriptive aggregate statistics like the friends / followers shares per party. Generating a graph with the igraph package opens the door for numerous network analysis tools. The nodes (vertices) of such a graph are deputy Twitter accounts and links (edges) between them represent who follows who on Twitter. Each node and edge can have additional attributes (data) such as name, weight or color. Several functions from igraph allow to compute centrality measures that help to identify important nodes in a network. A static plot of the graph can be generated using one of the layout algorithms that ship with igraph. Interactive plots, which can be created with the package visNetwork, give better insight when dealing with large graphs.
Of course there are a lot more things worth looking at. For example, we could also take into account the full friends network of deputies, i.e. not only concentrate on the links between deputies but also links to other Twitter accounts that are not members of the Bundestag. We’ve also not taken into account many variables from the Abgeordnetenwatch.de data. The tweets of the deputies were also not considered. We would still have to collect them (using the Twitter API), but at least we already have the Twitter handles of the deputies.
Besides the R code, the full data is also available in the GitHub repository for this project so this can act as a starting point for further analysis.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.