Site icon R-bloggers

Which world leaders are twitter bots?

[This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Set-up

Given that I do quite like twitter, I thought it would be a good idea to right about R’s interface to the twitter API; rtweet. As usual, we can grab the package in the usual way. We’re also going to need the tidyverse for the analysis, rvest for some initial webscraping of twitter names, lubridate for some date manipulation and stringr for some minor text mining.

install.packages(c("rtweet", "tidyverse", "rvest", "lubridate"))
library("rtweet")
library("tidyverse")
library("rvest")
library("lubridate")

Getting the tweets

So, I could just write the names of twitter’s 10 most followed world leaders, but what would be the fun in that? We’re going to scrape them from twiplomacy using rvest and a chrome extension called selector gadget:

world_leaders = read_html("https://twiplomacy.com/ranking/the-50-most-followed-world-leaders-in-2017/")
lead_r = world_leaders %>% 
  html_nodes(".ranking-entry .ranking-user-name") %>%
  html_text() %>%
  str_replace_all("\\t|\\n|@", "")
head(lead_r)
## [1] "realdonaldtrump" "pontifex"        "narendramodi"    "pmoindia"       
## [5] "potus"           "whitehouse"

The string inside html_nodes() is gathered using selector gadget. See this great tutorial on rvest and for more on selector gadget read vignette("selectorgadget"). Tabs (\t) and linebreaks (\n) are removed with str_replace_all() from the stringr package.

Now we can collect the twitter data using rtweet. We can use the function lookup_users() to grab basic user info such as number of tweets, friends, favourites and followers. Obviously analysing all 50 leaders at once would be a pain. So we’re only going to take the top 10 (WARNING: this could take a while)

lead_r_info = lookup_users(lead_r[1:10])
lead_r_info
## # A tibble: 10 x 20
##    user_id  name   screen_name  location description       url   protected
##    <chr>    <chr>  <chr>        <chr>    <chr>             <chr> <lgl>    
##  1 25073877 Donal… realDonaldT… Washing… 45th President o… http… FALSE    
##  2 5007043… Pope … Pontifex     Vatican… Welcome to the o… http… FALSE    
##  3 18839785 Naren… narendramodi India    Prime Minister o… http… FALSE    
##  4 4717417… PMO I… PMOIndia     "India " Office of the Pr… http… FALSE    
##  5 8222156… Presi… POTUS        Washing… 45th President o… http… FALSE    
##  6 8222156… The W… WhiteHouse   Washing… Welcome to @Whit… http… FALSE    
##  7 68034431 Recep… RT_Erdogan   Ankara,… Türkiye Cumhurba… <NA>  FALSE    
##  8 2196174… Sushm… SushmaSwaraj New Del… Minister of Exte… <NA>  FALSE    
##  9 3669871… Joko … jokowi       Jakarta  Akun resmi Joko … <NA>  FALSE    
## 10 44335525 HH Sh… HHShkMohd    Dubai, … Official Tweets … http… FALSE    
## # ... with 13 more variables: followers_count <int>, friends_count <int>,
## #   listed_count <int>, statuses_count <int>, favourites_count <int>,
## #   account_created_at <dttm>, verified <lgl>, profile_url <chr>,
## #   profile_expanded_url <chr>, account_lang <chr>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>

We only want the columns of interest (name, followers_count, friends_count, statuses_count and favourites_count) and then we want the data in long format. To do this we’re going to use select() and gather()

(lead_r_info = lead_r_info %>%  
  select(name, followers_count, friends_count, statuses_count, favourites_count, screen_name) %>% 
  gather(type, value, -name, -screen_name))
## # A tibble: 40 x 4
##    name                 screen_name     type               value
##    <chr>                <chr>           <chr>              <int>
##  1 Donald J. Trump      realDonaldTrump followers_count 48426576
##  2 Pope Francis         Pontifex        followers_count 16858642
##  3 Narendra Modi        narendramodi    followers_count 40710139
##  4 PMO India            PMOIndia        followers_count 25156203
##  5 President Trump      POTUS           followers_count 22324998
##  6 The White House      WhiteHouse      followers_count 16713369
##  7 Recep Tayyip Erdoğan RT_Erdogan      followers_count 12513987
##  8 Sushma Swaraj        SushmaSwaraj    followers_count 11461412
##  9 Joko Widodo          jokowi          followers_count  9693534
## 10 HH Sheikh Mohammed   HHShkMohd       followers_count  8764455
## # ... with 30 more rows

Now we can use the fantastic ggplot to plot the respective counts for each world leader

ggplot(data = lead_r_info, 
       aes(x = reorder(name, value), y = value,fill = type, colour = type)) +
  geom_col() +
  facet_wrap(~type, scales = "free") + 
  theme_minimal() + 
  theme(
    strip.background = element_blank(),
    strip.text = element_blank(),
    title = element_blank(),
    axis.text.x = element_blank()
  ) + 
  coord_flip() +
  geom_text(aes(y = value, label = value), colour = "black", hjust = "inward")

Notice Donald trumps everyone in the followers and status area (from what I here he’s quite a prevalent tweeter), however Sushma Swaraj and Narendra Modi trump everyone when it comes to favourites and friends respectively.

Now, we’re going to use the function get_timelines() to retrieve the last 2000 tweets by each leader. Again this may take a while!

lead_r_tl = get_timelines(lead_r, n = 2000)

Unfortunately get_timelines() only gives us their twitter handle and doesn’t return their actual name. So I’m going to use select() and left_join() to add the column of names to make for easier reading on the upcoming graphs

names = select(lead_r_info, name, screen_name)
lead_r_twitt = left_join(lead_r_tl, names, by = "screen_name")

get_timelines() gives us the source of a persons tweet, i.e. iPhone, iPad, Android etc. So, what is the post popular tweet source among world leaders?

lead_r_twitt %>% 
  count(source) %>% 
  ggplot(aes(x = reorder(source, n), y = n)) + 
  geom_col(fill = "cornflowerblue") +
  theme_minimal() + 
  theme(
    strip.background = element_blank(),
    axis.text.x = element_blank()
    ) + 
  labs(
    x = NULL,
    y = NULL,
    title = "Tweet sources for world leaders"
  ) + 
  coord_flip() + 
  geom_text(aes(y = n, label = n), hjust = "inward")

Either world leaders really love iPhones or their social media / security teams do. Probably the latter. I can hear you all begging the question, using which source is more likely to give a world leader more retweets and favourites? To do this we’re going to summarise each source by it’s mean number of retweets and favourites and then gather the data into a long format for plotting

lead_r_twitt %>% 
  group_by(source) %>% 
  summarise(Retweet = mean(retweet_count), Favourite = mean(favorite_count)) %>% 
  gather(type,value,-source) %>%  
  ggplot(aes(x = reorder(source, value), y = value, fill = type)) + 
  geom_col() + 
  facet_wrap(~type, scales = "free") +
  theme_minimal() +  
  labs(
    x = "Source",
    y = NULL,
    title = "Which source is more likely to get more retweets and favourites?", 
    subtitle = "Values are the mean in each group"
  ) + 
  theme(
    legend.position = "none",
    axis.text.x = element_blank()
  ) + 
  geom_text(aes(y = value, label = round(value, 0)), colour = "black", hjust = "inward") + 
  coord_flip()

Naturally this leads me to the question of which leader, over their previous 2000 tweets, has the most overall retweets and favourites, and who has the highest average number of retweets and favourites?

lead_r_twitt %>% 
  group_by(name) %>% 
  summarise(rt_total = sum(retweet_count), fav_total = sum(favorite_count),
            rt_mean = mean(retweet_count), fav_mean = mean(favorite_count)) %>% 
  gather(type, value, -name) %>% 
  ggplot(aes(x = reorder(name, value), y = value, fill = type)) +
  geom_col() + 
  labs(
    x = NULL,
    y = NULL,
    title= "Mean and total retweets/favourites for each world leader"
  ) + 
  coord_flip() + 
  facet_wrap(~type, scales = "free") + 
  theme_minimal()

What about the mean retweets and favourites per month? ts_plot() provides us with a quick way to turn the data into a time series plot. However this wouldn’t work for me so I’m doing it the dplyr way. I’m going to a monthly time series so first we need to aggregate our data into months. The function rollback(), from lubridate, is fantastic for this. It will roll a date back to the first day of that month whilst also getting rid of the time information.

lead_r_twit2 = lead_r_twitt %>% 
  mutate(year_month = rollback(created_at, roll_to_first = TRUE, preserve_hms = FALSE)) %>% 
  group_by(name, year_month) %>% 
  mutate(fav_mean = mean(favorite_count), rt_mean  = mean(retweet_count))

We now have two columns, fav_mean and rt_mean, that have in them the mean number of retweets and favourites for each leader in each month. We can use select() and gather() to select the variables we want then turn this into long data for plotting

lead_r_twit2 = lead_r_twit2 %>% 
  select(name, year_month, fav_mean, rt_mean) %>% 
  gather(type, value, -name, -year_month)

Now we plot

lead_r_twit2 %>% 
  ggplot(aes(x = year_month, y = value, colour = name))  + 
  geom_line() +
  facet_wrap(~type, scales = "free", nrow = 2) + 
  labs(
    x = NULL,
    y = NULL,
    title = "Mean number of favourites/month for world leaders"
  ) +
  theme_minimal() 

Are world leaders actually bots?

botrnot is a fantastic package that uses machine learning to calculate the probability that a twitter user is a bot. So the obvious next question is, are our world leaders a bot or not?

We need to install the development package from GitHub and we also need to install the GitHub version of rtweet

devtools::install_github("mkearney/botrnot")
devtools::install_github("mkearney/rtweet")
library("botrnot")
library("rtweet")

The only function, botornot(), works on either given user names, or the output of the get_timelines() function from rtweet. To keep the inline with the rest of the blog, we’re going to use the output we’ve already created from get_timelines(), stored in lead_r_tl

bot = botornot(lead_r_tl) %>%
  arrange(prob_bot)

For a clearer look at the probabilities I’m going to plot them with their actual names instead of the screen names

bot %>% 
  rename(screen_name =  user) %>% 
  inner_join(distinct(names), by = "screen_name") %>% 
  select(name, prob_bot) %>% 
  arrange(prob_bot) %>% 
  ggplot()  + 
  geom_col(aes(x = reorder(name, -prob_bot), y = prob_bot), fill = "cornflowerblue") + 
  coord_flip() + 
  labs(y = "Probability of being a bot", 
       x = "World leader", 
       title = "Probability of world leaders being a bot") + 
  theme_minimal()

So apparently we are almost certain Donald J. Trump isn’t a bot and very very nearly certain the Pope is a bot!

That’s all for this time, thanks for reading!


To leave a comment for the author, please follow the link and comment on their blog: R on The Jumping Rivers Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.