Collection, Management, and Analysis of Twitter Data
As Twitter is a highly relevant platform for political and social online interaction, researchers increasingly analyze Twitter data. In January 2021, Twitter renewed its API, which now includes access to the full history of tweets for academic usage. In this Methods Bites Tutorial, Andreas Küpfer (Technical University of Darmstadt & MZES) presents a walkthrough of the collection, management, and analysis of Twitter data.
After reading this blog post and engaging with the applied exercises, readers will be able to:
- complete the academic research track application process for the Twitter API.
- crawl tweets using customized queries based on the R package academictwitteR (Barrie and Ho 2021).
- apply a selection of pre-processing steps to these tweets.
- make decisions that minimize reproducibility issues with Twitter data and comply with Twitter's policies.
Note: This blog post provides a summary of Andreas’ workshop in the MZES Social Science Data Lab. The original workshop materials, including slides and scripts, are available from our GitHub. A live recording of the workshop is available on our YouTube channel.
Overview
Academic research track application process
As Application Programming Interfaces (APIs) are powerful tools which allow access to vast databases full of information, companies offering them are increasingly careful about who is allowed to use them. While the previous version of the Twitter API provided access without a dedicated application (for a detailed description, see this Methods Bites tutorial), the new version requires you to go through an application process in which you share several details with Twitter. This information covers both yourself and the research project in which you intend to work with Twitter data.
Prerequisites
Before getting access, you have to fulfill several formal prerequisites to be eligible for application:
- You are either a master’s student, a doctoral candidate, a post-doc, a faculty member, or a research-focused employee at an academic institution or university.
- You have a clearly defined research objective, and you have specific plans for how you intend to use, analyze, and share Twitter data from your research.
- You will use this access for non-commercial purposes.4
Furthermore, you need a Twitter account, which is also used to log in to the Twitter Developer Platform after a successful application. This portal lets you configure your API projects, keep an eye on your monthly tweet cap5, and more. A more detailed explanation of the prerequisites can be found on the Twitter API academic research track website.
Application
The whole process can be initiated by clicking Apply on the official Twitter API academic research track website. You'll be asked to log in with your personal Twitter account.
Source: Twitter API Application Process
The figure above visualizes the steps you have to complete before your application can finally be submitted for Twitter’s internal review:
- Basic Info: such as phone number verification and country selection
- Academic Profile: such as link to an official profile (department website or similar) and academic role
- Project Details: such as a description of the project itself, how the API will be used there (e.g. methodologies), and how the outcomes and findings will be shared
- Review: provides an overview of the previous steps
- Terms: developer agreement and policy
Before starting, it is recommended to carefully read which career levels, project types, and data practices are not allowed to use the API and are thus likely to have their application rejected. To give an example, if you plan to share the content of tweets publicly, you most probably won't get access to the API, as this would violate the Twitter rules. Again, more detailed information can be found in the Twitter API academic research track and Developer Terms information guides.
Step one requests generic information about your Twitter account, while in step two you have to provide information about your academic profile. This includes a link to a publicly available record on an official department website and information regarding the academic institution you are working at. The third step is the most sophisticated one: your research project. It asks for short paragraphs about the project in general, what Twitter data is used and how, and how the outcome of your work is shared with the public. The last two steps, review and terms, do not require any user-specific input but provide an overview of all filled-in information as well as the chance to read the developer agreement and policy.
After submitting your application
After submitting your application, you usually receive a decision via the e-mail address connected with your Twitter account within a few days. However, according to Twitter, this process can take up to two weeks.
Your application may be rejected for two common reasons: first, the information you provided indicates that you would violate the policy at some point, or second, you do not meet the requirements (as described above). Further explanation of possible next steps after a rejection can be found in the Developer Account Support FAQ.
As of writing this blog post (May 2022), submitting a reapplication for access using the same account is not possible.
Using the API
After your successful application, the Twitter Developer Portal lets you manage projects and environments (which belong to a project), generate API keys ("credentials" for API access), get an overview of your real-time monthly tweet cap usage, check the available API endpoints and their specifics, and more.
After the creation of a project, an environment can be added and API keys generated.
Source: Twitter API Application Guide
The following keys are generated automatically and used depending on the API interface (e.g. the R package) at hand:
- API key \(\approx\) username (also called consumer key)
- API key secret \(\approx\) password (also called consumer secret)
- Bearer token \(\approx\) special access token (also called an authentication token)
It is crucial to keep them private and not push them to GitHub or similar! Otherwise, someone else could gain access to your API account. Instead, store them somewhere locally or directly within an environment variable. The package we cover in detail later in this blog post guides you safely through this process.
However, in case you’re plan to use them in other applications, you can store your keys in different ways. The most common way in R is to add them to the .Renviron
file. To do this with comfort, install the R package usethis
and call its method usethis::edit_r_environ()
which lets you edit the .Renviron
in the home directory of your computer. In the following you can add tokens (or anything else you want to keep stored locally) using this format:
Key1=value1
Key2=value2
# ...
After saving the file, you can access the values by calling Sys.getenv("Key1") within your R application. More best practices on managing your secrets can be found on the website APIs for Social Scientists.
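For example, a minimal sketch of this workflow for the Twitter Bearer Token (TWITTER_BEARER is the variable name expected by academictwitteR, as shown further below; the token value is a placeholder):

## In ~/.Renviron, opened via usethis::edit_r_environ():
## TWITTER_BEARER=YOURTOKENHERE

## After restarting R, read the token in a script without hard-coding it:
bearer <- Sys.getenv("TWITTER_BEARER")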
Postman-as-a-playground
Postman is an easily accessible application to try out different queries, tokens, and more. Without any programming knowledge, you get the API results immediately. Here you can find an official tutorial to use Postman with the Twitter API.
However, there are several reasons why Postman cannot replace a dedicated package and your own code:
- Building flexible queries (e.g., a list of users to retrieve tweets from)
- Handling large responses that arrive split up during pagination
- Handling rate limit restrictions
- Transforming responses into manageable data structures (e.g., data frames or comma-separated values)
All of these tasks can be handled by a suitable package in your favorite programming language.
Which package should I choose?
It has to be noted that there are dozens of packages out there, but only some of them have already integrated the academic research track of the Twitter API. A selection of packages is listed below:
- academictwitteR (R): The package offers customizable functions for all common v2 API endpoints. Additionally, it smoothly guides the developer through all critical steps (e.g. authentication or data processing) of the API interaction.
- RTwitterV2 (R): Although RTwitterV2 as of now covers fewer API endpoints than academictwitteR, it is still a valuable alternative which covers all basic functionalities.
- rtweet (R): rtweet does not support the academic research track yet; however, it offers much basic functionality by using the previous API version. A dedicated Methods Bites blog post introducing rtweet in detail can be found here.
- searchtweets-v2 (Python): This is the official package developed and maintained by Twitter, available for Python. The library offers flexible functions which handle even very specialized requests, but one has to dive deeper into the technical aspects of the API.
- tweepy (Python): tweepy is the most common package for Python and is backed by a large developer community. As a bonus, it includes many examples of how to use the various features offered by the package.
Which package you pick should depend on your preferred programming language as well as whether the feature list of a package fits your research purpose.
academictwitteR: a code walkthrough using R
In this blog post, academictwitteR (Barrie and Ho 2021) (available for R) is used to demonstrate a simple scenario of retrieving tweets from German members of parliament. The name academictwitteR is derived from the Twitter API academic research track for which it was developed.
We will start by first loading all the needed R packages for the walkthrough:
Code: R packages used in this tutorial
## Save package names as a vector of strings
pkgs <- c("dplyr", "academictwitteR", "quanteda", "purrr")

## Install uninstalled packages
lapply(pkgs[!(pkgs %in% installed.packages())], install.packages)

## Load all packages to library and adjust options
lapply(pkgs, library, character.only = TRUE)
After loading the packages, we need to share our API Bearer Token with academictwitteR. The following code will guide you through the process of storing the key in an R-specific environment file (.Renviron), which we introduced earlier in this blog post:
academictwitteR::set_bearer()
## Instructions:
## ℹ 1. Add line: TWITTER_BEARER=YOURTOKENHERE to .Renviron on new line,
##      replacing YOURTOKENHERE with actual bearer token
## ℹ 2. Restart R
After restarting R, everything is initialized and we can load a table of Twitter user IDs from German MPs into R:
german_mps <- read.csv("data/MP_de_twitter_uid.csv",
                       colClasses = c("user_id" = "character"))
head(german_mps)
##              user_id                   name party
## 1           44608858       Marc Henrichmann   CDU
## 2 819914159915667456      Stephan Pilsinger   CSU
## 3         1391875208   Markus Alexander Uhl   CDU
## 4          569832889 Sigmar Hartmut Gabriel   SPD
## ...
To prevent replication issues with your work, it is recommended to use the Twitter user ID (e.g. 819914159915667456) instead of the user handle (e.g. @StephPilsinger), as the user handle can be changed by the user over time. This would result in no longer being able to recrawl the tweets of these users. In case you only have access to the handle, there is a v2 API endpoint to retrieve a user object from a handle: /2/users/by/username/:username
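For illustration, a minimal sketch of calling this endpoint directly via the httr package (httr is not among the packages loaded above, the handle is just an example from the table, and the bearer token is assumed to be stored as described earlier):

library(httr)

## Look up the user object (including the stable user ID) for a given handle
handle <- "StephPilsinger"
resp <- GET(
  url = paste0("https://api.twitter.com/2/users/by/username/", handle),
  add_headers(Authorization = paste("Bearer", Sys.getenv("TWITTER_BEARER")))
)
content(resp)$data$id  # stable user ID, e.g. "819914159915667456"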
Databases and lists of Twitter users can be retrieved from the following sources:
- The Twitter Parliamentarian Database (Vliet, Törnberg, and Uitermark 2020)
- Public Twitter lists (e.g. https://twitter.com/i/lists/912241909002833921)6
- legislatoR R package (Göbel and Munzert 2021); a small sketch follows below
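As an illustration of the last source, a hedged sketch using the legislatoR package (the function get_social(), the legislature code "deu", and the twitter column are assumptions based on the package documentation and may differ across versions; legislatoR is not loaded above):

library(legislatoR)

## Assumed interface: social media accounts of German (Bundestag) legislators
deu_social <- get_social(legislature = "deu")

## Keep only legislators with a Twitter handle on record
deu_handles <- deu_social[!is.na(deu_social$twitter), "twitter"]
head(deu_handles)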
Afterward, we are ready to crawl our first tweets using a simple wrapper function (get_tweets_from_user()) asking for a single user_id. get_all_tweets(), which is called inside this function, is the heart of our code. It manages the generation of queries for the API, handles rate limits, and stores the data in JSON files (which can be transformed later).
In case you are looking for specific content, tweet types, or even topics, you can add another parameter to the package function: query. It allows you to narrow down your search by using specific strings. To give an example, one could look for English tweets containing the keywords putin or selenskyj, excluding retweets, and having a geo-location attached. This can be achieved by simply assigning the following string to the query parameter:
(putin OR selenskyj) -is:retweet lang:en has:geo
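A hedged sketch of what such a query-based call could look like (the dates, n, and data_path are illustrative placeholders; the function and parameters are the same as in the user-based example below):

# Illustrative only: retrieve up to 100 matching tweets for the query above
academictwitteR::get_all_tweets(
  query = "(putin OR selenskyj) -is:retweet lang:en has:geo",
  start_tweets = "2022-02-24T00:00:00Z",
  end_tweets = "2022-03-01T00:00:00Z",
  data_path = "data/raw_query/",
  n = 100)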
Beyond that, there exist many more parameters to individualize the crawling method. All of them are documented in the official CRAN documentation of the academictwitteR package.
However, in this tutorial I only restrict my search to a Twitter user ID as well as a start and end date for the tweets we are interested in:
# function to retrieve tweets in a specific time period of a single user
# (list of user IDs would be possible but one should keep
# the max. query string of 1024 characters in mind)
get_tweets_from_user <- function(user_id) {
  # Another option is to add "query" parameter
  academictwitteR::get_all_tweets(
    users = user_id,
    start_tweets = "2021-01-01T00:00:00Z",
    end_tweets = "2021-09-30T00:00:00Z",
    data_path = "data/raw/",
    n = 100)
}
The function is then called for each user_id in the dataframe by using walk() from the purrr package (purrr allows you to work with functions and vectors):
purrr::walk(german_mps[["user_id"]], get_tweets_from_user)
To import the tweets into a workable format, call bind_tweets() from academictwitteR. It consolidates all available files in the given data_path and organizes them into the requested format (in our case, tidy). In addition, only a relevant fraction of columns is selected in the code below by using select() from the dplyr package.
# concatenate all retrieved tweets into one dataframe and select which columns
# should be kept
# Another option: set parameter "user" to TRUE to retrieve user information
tweets_df <- academictwitteR::bind_tweets(data_path = "data/raw/",
                                          output_format = "tidy") %>%
  dplyr::select(
    tweet_id, text, author_id, user_username, created_at,
    sourcetweet_type, sourcetweet_text, lang
  )
Finally, I store the tweets in a single .csv-file:
write.csv(tweets_df, "data/raw/tweets_german_mp.csv", row.names = FALSE)
Congratulations! You successfully applied to the academic research track, got admitted, and crawled a selection of tweets using the R package academictwitteR.
Preparing for methods: working with textual data
You are now ready to move on! The usual steps applied to textual data (lowercasing, stopword removal, stemming, ...), depending on the method at hand, can be used for pre-processing your tweets. Additional fine-tuning of these steps could involve the removal of, e.g., party IDs, URLs, user mentions, or similar. Such steps can easily be done by using regular expressions (regex). Regular expressions are used to extract patterns in texts, which can then be used for further analysis, removal, or replacement. Many tutorials available on the web (e.g. the RegexOne interactive tutorial) make it straightforward to learn how to bring such expressions into action within your domain.
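As a minimal illustration (the example tweet and the patterns are simplified and may need refinement for your use case, e.g. for shortened URLs or handles containing underscores):

# Hypothetical example tweet text
tweet <- "Danke @StephPilsinger! Mehr dazu hier: https://example.org/bericht #Bundestag"

# Remove URLs and user mentions with regular expressions
tweet <- gsub("https?://\\S+", "", tweet)  # strip URLs
tweet <- gsub("@\\w+", "", tweet)          # strip user mentions
tweet
## "Danke ! Mehr dazu hier:  #Bundestag"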
The following code provides a first starting point for applying pre-processing steps. It relies on quanteda, an R package that is widely used for working with text data. If you want to dive deeper into text mining and text analysis, Methods Bites has more blog posts on these topics.
tweet_corpus <- quanteda::corpus(tweets_df[["text"]], docnames = tweets_df[["tweet_id"]])
The code first transforms the dataframe of tweets into another data format, a corpus, keeping the tweet_id as an identifier attached to each tweet text.
Having the tweets in the corpus format makes it easy to apply pre-processing steps after tokenizing. The following list shows a selection of common methods. However, it is important to note that the decision on which methods to apply heavily depends on the text processing approach that follows:
- remove_punct: removes all punctuation
- remove_numbers: removes all numbers
- dfm_tolower(): applies lowercasing
- dfm_remove(stopwords("german")): removes German stopwords, which occur very frequently
- dfm_wordstem(language = "german"): applies German stemming (e.g., wurden \(\rightarrow\) wurd)
# "2020 wurden in Berlin ca. 18.800 Miet- # in Eigentumswohnungen umgewandelt. #Umwandlungsverbot" dfm <- quanteda::dfm(tweet_corpus %>% quanteda::tokens( remove_punct = TRUE, remove_numbers = TRUE)) %>% quanteda::dfm_tolower() %>% # removes capitialization quanteda::dfm_remove( stopwords("german")) %>% # removes German stopwords quanteda::dfm_wordstem( language = "german") # transforms words to their German wordstems # "wurd berlin ca miet- eigentumswohn umgewandelt #umwandlungsverbot"
The function dfm() (called above) returns a sparse document-feature matrix, which can be a fruitful starting point for a first word-frequency analysis:
head(dfm)
## Document-feature matrix of: 6 documents, 87 features (79.77% sparse)
## and 0 docvars.
##                      features
## docs                  leb plotzlich mehr schablon gut bos pass 😉 #esk #miet
##   44608858              1         1    1        1   1   1    1  1    1     1
##   819914159915667456    0         0    0        0   0   0    0  0    0     0
##   1391875208            0         0    0        0   0   0    0  0    0     0
##   569832889             0         0    1        0   0   0    0  0    0     1
## ...
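For instance, a quick look at the most frequent features could serve as such a starting point (the number of features shown is arbitrary):

# Show the ten most frequent features across all tweets
quanteda::topfeatures(dfm, n = 10)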
You have finally reached the step of applying further methods to tackle your research question and gain deeper insights into your crawled tweets. There is much more to explore: you can find further text-as-data tutorials on our blog.
Reproducibility of research based on Twitter data
As reproducible results are one of the major requirements of research projects, it has to be discussed how this could affect your work with Twitter data. The Twitter developer agreement includes a clear statement of what researchers are allowed to publish along with their work:
“Academic researchers are permitted to distribute an unlimited number of Tweet IDs and/or User IDs if they are doing so on behalf of an academic institution and for the sole purpose of non-commercial research. For example, you are permitted to share an unlimited number of Tweet IDs for the purpose of enabling peer review or validation of your research.”7
This means that the content of tweets must not be shared publicly. As tweets can be deleted or accounts can be suspended, this certainly poses an issue for subsequent researchers attempting to replicate the findings, as they won't be able to recrawl such tweets via the API. However, there are also platforms like polititweet.org, which track public figures and, on that basis, justify the publication even of deleted tweets:
Source: polititweet.org landing page
To conclude, this does not make the decision of how to share which kind of data any easier. Still, one has to choose the best available option for sharing data without violating the Twitter rules, which, as of today, means at least sharing tweet IDs with the community.
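As a minimal sketch of this practice with the data from the walkthrough above (the file paths are placeholders, and hydrate_tweets() is assumed to be available in your version of academictwitteR):

# Share only the tweet IDs alongside your publication
write.csv(tweets_df["tweet_id"], "data/shared/tweet_ids.csv", row.names = FALSE)

# Other researchers can later "rehydrate" full tweet objects from these IDs,
# provided the tweets still exist and they hold their own API access
shared_ids <- read.csv("data/shared/tweet_ids.csv", colClasses = "character")
rehydrated <- academictwitteR::hydrate_tweets(shared_ids$tweet_id)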
Conclusion
This blog post provides a first glimpse into the academic research track of the Twitter API and the information richness of Twitter data. As there will certainly be further updates and changes to the API in the future, it is reassuring that plenty of easy-to-use packages build on active user communities that keep them updated in line with the current Twitter API version. While there are many powerful packages to tackle the data-gathering step, researchers still need to think carefully about how to further process the crawled information depending on their research question and method, as well as how to make their research accessible to the community in an open science approach.
References
Barberá, Pablo. 2015. “Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data.” Political Analysis 23 (1): 76–91. https://doi.org/10.1093/pan/mpu011.
Barrie, Christopher, and Justin Chun-ting Ho. 2021. "academictwitteR: An R Package to Access the Twitter Academic Research Product Track v2 API Endpoint." Journal of Open Source Software 6 (62): 3272. https://doi.org/10.21105/joss.03272.
Göbel, Sascha, and Simon Munzert. 2021. “The Comparative Legislators Database.” British Journal of Political Science, 1–11. https://doi.org/10.1017/S0007123420000897.
Nguyen, Thu T., Shaniece Criss, Eli K. Michaels, Rebekah I. Cross, Jackson S. Michaels, Pallavi Dwivedi, Dina Huang, et al. 2021. "Progress and Push-Back: How the Killings of Ahmaud Arbery, Breonna Taylor, and George Floyd Impacted Public Discourse on Race and Racism on Twitter." SSM - Population Health 15: 100922. https://doi.org/10.1016/j.ssmph.2021.100922.
Sältzer, Marius. 2022. “Finding the Bird’s Wings: Dimensions of Factional Conflict on Twitter.” Party Politics 28 (1): 61–70. https://doi.org/10.1177/1354068820957960.
Valle-Cruz, David, Vanessa Fernandez, Asdrubal Lopez-Chau, and Rodrigo Sandoval Almazan. 2022. “Does Twitter Affect Stock Market Decisions? Financial Sentiment Analysis During Pandemics: A Comparative Study of the H1n1 and the Covid‐19 Periods.” Cognitive Computation 14 (January). https://doi.org/10.1007/s12559-021-09819-8.
Vliet, Livia van, Petter Törnberg, and Justus Uitermark. 2020. “The Twitter Parliamentarian Database: Analyzing Twitter Politics Across 26 Countries.” PLoS ONE 15.
Meta data serves as explanatory information, such as topical indicators or the language of the tweet, which explains and enriches the actual tweet, image, or main object retrieved from the API↩
Queries are filter operators to narrow down the number of tweets which should be retrieved↩
API stands for Application Programming Interface and, simply speaking, allows communication between software applications.↩
There is a maximum number of tweets which can be retrieved via the API, which gets reset once a month.↩
Use such lists with caution as they may not come from verified sources.↩
You can find a detailed description of the content redistribution of Twitter data in the official developer policies.↩