Introduction
In Part 2 of this series we learned that summing the valence measurements of the words in the Afrisenti data set did not agree well with the sentiments already provided. While the sentiments we calculated made sense in general, they were measured without any knowledge of how the original sentiments were determined. In this post we will apply machine learning techniques to try to reverse engineer the thinking that went into the sentiment assignments.
In our previous analysis we agreed with the original data only about 50% of the time. Can our trained models do any better? If we pass the tweets through Google Translate first to convert them all to English, will we improve or worsen the accuracy of the model? Let's see.
We will take the usual machine learning approach of splitting the data into test and training sets, then run a classifier model on the training set and finally validate it against the test set.
suppressPackageStartupMessages({
  library(tidyverse)
  library(tidytext)
  library(quanteda)
  library(tictoc)
  library(xgboost)
})

# set up some chart defaults
tan1 <- "#FDECCD"
yellow1 <- "#FFBF00"
green1 <- "#007000"

theme_afri <- function(...){
  # making a function allows passing further theme elements
  ggplot2::theme(
    plot.background = element_rect(fill = tan1, color = NA),
    panel.background = element_blank(),
    legend.background = element_blank(),
    legend.box.background = element_blank(),
    legend.key = element_blank(),
    ...
  )
}

# the previously translated tweets
afrisenti_translated <- readr::read_csv('https://raw.githubusercontent.com/apsteinmetz/tidytuesday/master/2023-02-28_african_language/data/afrisenti_translated.csv',
                                        show_col_types = FALSE) |>
  mutate(lang = as.factor(assigned_long)) |>
  mutate(sentiment = as.factor(label)) |>
  mutate(intended_use = as.factor(intended_use)) |>
  select(lang, tweet_num, sentiment, translatedText, tweet, intended_use)

afrisenti_translated
# A tibble: 111,720 × 6
   lang    tweet_num sentiment translatedText                 tweet intended_use
   <fct>       <dbl> <fct>     <chr>                          <chr> <fct>
 1 Amharic         1 negative  Amhara region needs moderatio… አማራ … dev
 2 Amharic         2 negative  Having a mood in a way that a… ሰውን … dev
 3 Amharic         3 negative  Domestic violence - without h… የቤት … dev
 4 Amharic         4 negative  Ethiopia, what was your role … Ethi… dev
 5 Amharic         5 negative  No matter how Oromo learns, t… ኦሮሞ … dev
 6 Amharic         6 negative  Tish, dark powder, are you st… ቲሽ ጨ… dev
 7 Amharic         7 negative  Local residents said that the… በምዕራ… dev
 8 Amharic         8 negative  What happened to Schimmels is… ያየሰው… dev
 9 Amharic         9 negative  An unstable country does not … ያልተረ… dev
10 Amharic        10 negative  The idiot and the bubble, the… ደደቡ … dev
# ℹ 111,710 more rows
Explore the Data Set
summary(afrisenti_translated)
      lang         tweet_num         sentiment     translatedText
 Hausa  :22152   Min.   :     1   negative:36564   Length:111720
 Igbo   :15715   1st Qu.: 27931   neutral :39151   Class :character
 Yoruba :15127   Median : 55861   positive:36005   Mode  :character
 Arabic :12061   Mean   : 55861
 English:10556   3rd Qu.: 83790
 Amharic: 9480   Max.   :111720
 (Other):26629
    tweet            intended_use
 Length:111720      dev  :13728
 Class :character   test :34307
 Mode  :character   train:63685
The data set is already tagged into training, test and "dev" sets. The training set is about twice the size of the test set. What is "dev?" I don't know, nor do I know whether the split is random, but we do care whether the profile of the training set is similar to that of the test set. Let's split the data according to the tags.
tweet_train <- afrisenti_translated |>
  filter(intended_use == "train") |>
  select(tweet_num, sentiment, lang, tweet, translatedText)

tweet_test <- afrisenti_translated |>
  filter(intended_use == "test") |>
  select(tweet_num, sentiment, lang, tweet, translatedText)

tweet_dev <- afrisenti_translated |>
  filter(intended_use == "dev") |>
  select(tweet_num, sentiment, lang, tweet, translatedText)
Now we’ll see if the training set is representative of the test set. Do the languages align?
afrisenti_translated |>
  ggplot(aes(lang, group = intended_use)) +
  geom_bar(aes(y = after_stat(prop)), fill = yellow1) +
  scale_y_continuous(labels = scales::percent) +
  theme_afri(axis.text.x = element_text()) +
  coord_flip() +
  facet_grid(~intended_use) +
  labs(title = "Splits Are Reasonably Aligned by Language",
       y = "Proportion", x = "Language")
Looks okay.
Do the sentiments align?
afrisenti_translated |>
  ggplot(aes(sentiment, group = intended_use)) +
  geom_bar(aes(y = after_stat(prop)), fill = yellow1) +
  scale_y_continuous(labels = scales::percent) +
  theme_afri(axis.text.x = element_text()) +
  facet_grid(~intended_use) +
  coord_flip() +
  labs(title = "Splits Are Balanced by Sentiment",
       y = "Proportion", x = "Sentiment")
The splits are well balanced.
Approach to the Problem
The structure of our model is basically a regression with one dependent variable and thousands of independent variables, namely all of the words ("tokens") in all the tweets. This is arranged as a "document-feature matrix" (DFM). What goes in the cells of the matrix? The simplest approach would be to code each word by its presence or absence in each tweet. A more nuanced approach is to score each word in each tweet by how important it is to that tweet. The term frequency-inverse document frequency score, "tf-idf," is essentially a uniqueness measure: it rewards words that are frequent in one tweet but rare across the corpus. This has the added benefit of down-ranking stop words that appear frequently everywhere, even if we have no stop-word lexicon for a certain language.
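To make the tf-idf idea concrete, here is a minimal sketch on three made-up "tweets" (not the Afrisenti data) using tidytext::bind_tf_idf(), the same function we lean on later in make_dfm():

library(dplyr)
library(tidytext)

toy <- tibble(doc  = c(1, 1, 2, 2, 3, 3, 3),
              word = c("the", "lovely", "the", "awful", "the", "match", "lovely"))

toy |>
  count(doc, word) |>
  bind_tf_idf(word, doc, n)
# "the" appears in every document, so its idf (and tf-idf) is zero;
# words unique to a single document get the highest tf-idf scores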
There are several machine learning models we might try. Two R packages are suitable for classifiers with more than two categories that also work with sparse matrices (see below): ranger and xgboost. In preview, we will use xgboost here because the results are about the same but xgboost is much faster. Here is a short comparison of different machine learning approaches to sentiment analysis (Saifullah, Fauziah, and Aribowo 2021).
I set out to use RStudio's suite of machine learning workflow packages, tidymodels, in this project, but quickly became frustrated. I could not get the wrapper around xgboost to give a sensible result (it's probably my fault) and there is a bug in the wrapper around ranger that prevents making predictions with the model. So we'll do it the old-fashioned way.
Pre-Processing
There are some things we can do to boost our chances of a satisfying outcome. As noted above, we're going to build a matrix with the tweet numbers (an arbitrary index that uniquely identifies each tweet, or "document") as the rows, and each word in every tweet (each "feature") as the columns. This will create a "document-feature matrix."
Before we create the matrix, we first split all the tweets into individual words ("tokens") and refine that list to make it more manageable. Let's create a few functions to help us with this.
What do we do when a negation flips the sentiment of a tweet? "I love" is positive but "I do not love" is negative. In our previous post we tried sentence-level analysis to handle negation. Here we are training at the word level. We will address this by creating new tokens where any instance of, say, "not love" is replaced by "not_love," an entirely new word for our analysis. This is very crude and only covers English (9% of the tweets), but it's something. It's actually simpler to use str_replace_all(tweet, "not |no ", "not_") to do this before we tokenize the tweets.
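A quick illustration on made-up strings (not the real tweets) shows what the substitution does before tokenization:

library(stringr)

toy_tweets <- c("I do not love this", "no good at all", "I love this")
str_replace_all(toy_tweets, "not |no ", "not_")
# [1] "I do not_love this" "not_good at all"    "I love this"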
When we measured the tweets using a sentiment lexicon, only the words in the lexicon contributed to the sentiment measurement. Everything else was neutral. With machine learning every word counts, and the more words we have, the bigger the model and the longer it takes to train. It is common in text analysis to drop low-information words, or "stop words." In English, we drop words like "the" and "that." We want to build a list of stop words relevant to our data set. On Kaggle I found a list of stop words in various African languages. It doesn't cover every language in our data set but it will reduce the matrix size a bit. We'll add it to the standard lexicon of English stop words and a custom lexicon built from a quick inspection of the data set. In practice, the tf-idf score is the biggest indicator of a low-information word, irrespective of language.
stop_words_af <- readr::read_csv('https://raw.githubusercontent.com/apsteinmetz/tidytuesday/master/data/stopwords_af.csv',
                                 show_col_types = FALSE)

# add my stop words to defaults
my_stop_words = tibble(word = c("http", "https", "dey", "de", "al", "url", "na",
                                "t.co", "rt", "user", "users", "wey", "don",
                                as.character(1:100)))

# make a stopword list of any 1-character words
# this is a somewhat arbitrary rubric for African language stop words
stop_words_1char <- afrisenti_translated |>
  unnest_tokens(word, tweet) |>
  select(word) |>
  filter(str_length(word) < 2) |>
  unique()

full_stop_words <- c(
  tidytext::stop_words$word,
  my_stop_words$word,
  stop_words_af$word,
  stop_words_1char$word
) |>
  enframe(value = "word")

remove_stopwords <- function(tokens) {
  tokens |>
    # remove stop words
    anti_join(full_stop_words)
}
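As a quick sanity check, here is what remove_stopwords() does to a small, made-up token tibble (the column must be named word so the anti_join can match on it):

toy_tokens <- tibble(word = c("the", "http", "na", "election", "protest"))
remove_stopwords(toy_tokens)
# only "election" and "protest" should survive; the rest are stop words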
Did we say every word? Well, not EVERY word. Roughly 260,000 unique words is more than we can handle, so let's create a helper function to prune the data set to only the highest-frequency words.
only_top_words <- function(tokens, word_count = 2000) {
  chosen_words <- tokens |>
    ungroup() |>
    select(word) |>
    count(word) |>
    slice_max(order_by = n, n = word_count) |>
    select(-n)
  return(inner_join(tokens, chosen_words))
}
Even after pruning there are over 2,000 unique words in this set of tweets: 2,000 variables and over 110,000 tweets. That's a pretty big matrix, well over 120 million elements, but the vast majority of those elements are zero. We can make the memory footprint of this monster manageable by using a "sparse matrix," which records only the non-zero elements instead of populating every cell. Fortunately, both ranger and xgboost understand sparse matrices.
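To see why this matters, here is a small sketch (with made-up dimensions, smaller than our real DFM) comparing the memory footprint of a dense matrix with its sparse equivalent from the Matrix package, which is what quanteda builds its dfm objects on:

library(Matrix)

set.seed(1)
dense <- matrix(0, nrow = 10000, ncol = 2000)        # 20 million cells
dense[sample(length(dense), 20000)] <- runif(20000)  # only ~0.1% are non-zero
sparse <- Matrix(dense, sparse = TRUE)               # stored as a dgCMatrix

format(object.size(dense), units = "Mb")   # roughly 150 Mb
format(object.size(sparse), units = "Mb")  # well under 1 Mb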
The function make_dfm is our workhorse. It takes the raw tweet data and turns it into a sparse document-feature matrix after applying our pre-processing steps. Note that the matrix does not contain our dependent variable, "sentiment." That is a separate vector we attach to the matrix as a docvar, part of a quanteda::dfm object.
Note that the order of our pre-processing steps matters. First we create negation tokens, then we prune stop words, then we compute the tf-idf and finally we take the top words. By trial and error I learned that computing each word's tf-idf against the whole data set before choosing the top words yields a better result. Spoiler alert: around 2,000 words is the sweet spot.
# make sparse document-feature matrix
make_dfm <- function(tweet_data, translated = FALSE, num_words = 1000) {
  if (translated) {
    tweet_tokens <- tweet_data |>
      mutate(translatedText = str_replace_all(translatedText, "no |not ", "not_")) |>
      select(tweet_num, translatedText) |>
      unnest_tokens(word, translatedText)
  } else {
    tweet_tokens <- tweet_data |>
      mutate(tweet = str_replace_all(tweet, "no |not ", "not_")) |>
      select(tweet_num, tweet) |>
      unnest_tokens(word, tweet)
  }
  tweet_tokens <- tweet_tokens |>
    remove_stopwords() |>
    count(tweet_num, word) |>
    bind_tf_idf(word, tweet_num, n) |>
    only_top_words(num_words) |>
    select(tweet_num, word, tf_idf)

  sentiment_subset <- tweet_tokens |>
    slice_head(n = 1, by = tweet_num) |>
    left_join(tweet_data) |>
    pull(sentiment)

  tweet_dfm <- cast_dfm(tweet_tokens, tweet_num, word, tf_idf)
  # add dependent variable to sparse matrix
  docvars(tweet_dfm, "sentiment") <- sentiment_subset
  return(tweet_dfm)
}
Train on African Language Tweets
We’ll establish a baseline by training a model on the African-language tweets.
Note that we don’t care what language the token is. It could be any language or no language. It could be an emoji, as long as it is associated with a sentiment. There is a risk that the same word could convey the opposite sentiment in two different languages but I assume it is rare enough to ignore.
# build the DFMs from the native-language tweets
translated = FALSE
tweet_train_dfm <- make_dfm(tweet_train, translated = translated, num_words = 2000)
tweet_test_dfm  <- make_dfm(tweet_test,  translated = translated, num_words = 2000)
How sparse is the training DFM? 99.8% of the entries are zero.
After creating DFMs for both training and testing we see that the DFMs have about 3/4 of the words in common so there is a good bit of information for prediction out-of-sample.
# how close are the word lists?
inner_join(enframe(dimnames(tweet_train_dfm)[[2]]),
           enframe(dimnames(tweet_test_dfm)[[2]]),
           by = "value") |>
  nrow() |>
  paste("Words in both train and test sets")
[1] "1508 Words in both train and test sets"
Running predictions on a test set requires the feature lists of the training and test sets to be the same. Three quarters, but not all, of the tokens overlap in our DFMs. The dfm_match function will ensure the test set features are congruent with the training set.
# make sure test set has all variables in both train and test sets
tweet_test_dfm <- dfm_match(tweet_test_dfm, features = featnames(tweet_train_dfm))
We will use the gradient-boosted tree approach to train our model. An excellent introduction to the theory is contained in the documentation for the xgboost package, which is available for many languages, by the way.

This is a supervised model, meaning we know all the possible predictions ahead of time: in this case, "negative," "neutral" and "positive." A slight "gotcha" is that xgboost requires numeric class labels, with the first one as "0." As such, we convert our dependent variable, which is a factor, to numeric and then convert it back after predicting.
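Here is a tiny sketch, on a made-up factor rather than the real data, of the factor-to-label round trip that the training and prediction code below performs:

sent <- factor(c("negative", "neutral", "positive", "neutral"))
lbl  <- as.numeric(sent) - 1                # 0, 1, 2, 1 -- what xgboost wants
back <- factor(lbl, labels = levels(sent))  # and back again after predicting
identical(back, sent)
# [1] TRUE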
# run the models
tic()
xg_fit <- xgboost(
  data = tweet_train_dfm,
  max.depth = 100,
  nrounds = 100,
  objective = "multi:softmax",
  num_class = 3,
  label = as.numeric(tweet_train_dfm$sentiment) - 1,
  print_every_n = 10
)
[1]	train-mlogloss:0.984030
[11]	train-mlogloss:0.664378
[21]	train-mlogloss:0.579607
[31]	train-mlogloss:0.530083
[41]	train-mlogloss:0.493999
[51]	train-mlogloss:0.465732
[61]	train-mlogloss:0.441981
[71]	train-mlogloss:0.422400
[81]	train-mlogloss:0.404571
[91]	train-mlogloss:0.389673
[100]	train-mlogloss:0.377797
toc()
61.94 sec elapsed
Results With Native Language Tweets
In the interest of space, we won’t go over tuning the model or cross validation, both of which are used to optimize performance. I played around with the parameters of the model and settled on the ones used here as roughly optimal. If we plot the loss function below we see the diminishing marginal return to additional training rounds. Adding more rounds continues to improve the fit to the training set (in theory we would ultimately achieve a perfect fit) but the fit to the test set doesn’t improve at all.
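For readers who do want to try it, here is a rough sketch, not run in this post, of how one might use xgboost's built-in xgb.cv() to pick the number of rounds by k-fold cross-validation with the same parameters we trained with above; the fold count is an assumption, not something tuned here:

# a sketch only -- 5-fold cross-validation over the same parameters used above
cv_fit <- xgb.cv(
  data = tweet_train_dfm,
  label = as.numeric(tweet_train_dfm$sentiment) - 1,
  nfold = 5,
  nrounds = 100,
  max.depth = 100,
  objective = "multi:softmax",
  num_class = 3,
  print_every_n = 10
)
# cv_fit$evaluation_log holds both train and test mlogloss by round,
# which makes the point of diminishing returns easy to spot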
xg_fit$evaluation_log |>
  ggplot(aes(iter, train_mlogloss)) +
  geom_line() +
  theme_afri()
Alas, the accuracy of our machine learning model is not much better than our simple valence measurements from the last post.
# predict and convert classes back to factors
predicted <- predict(xg_fit, tweet_test_dfm) |>
  as.factor()
levels(predicted) <- levels(tweet_test$sentiment)
predicted_for_table <- tibble(actual = tweet_test_dfm$sentiment, predicted)
caret::confusionMatrix(table(predicted_for_table))
Confusion Matrix and Statistics

          predicted
actual     negative neutral positive
  negative     5524    4598     1261
  neutral      2489    7011     1182
  positive     2100    3287     4981

Overall Statistics

               Accuracy : 0.5401
                 95% CI : (0.5346, 0.5455)
    No Information Rate : 0.4593
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.3095

 Mcnemar's Test P-Value : < 2.2e-16

Statistics by Class:

                     Class: negative Class: neutral Class: positive
Sensitivity                   0.5462         0.4707          0.6709
Specificity                   0.7375         0.7907          0.7846
Pos Pred Value                0.4853         0.6563          0.4804
Neg Pred Value                0.7820         0.6375          0.8893
Prevalence                    0.3118         0.4593          0.2289
Detection Rate                0.1703         0.2162          0.1536
Detection Prevalence          0.3510         0.3294          0.3197
Balanced Accuracy             0.6419         0.6307          0.7278
Train on English Translated Tweets
# more words in common in the translated word list
translated = TRUE
tweet_train_dfm <- make_dfm(tweet_train, translated = translated, num_words = 2000)
tweet_test_dfm  <- make_dfm(tweet_test,  translated = translated, num_words = 2000)
Now how many words are shared between the training and test sets? If there is a case to be made for a better outcome with the translated tweets, this is it: by converting 13 languages into one, we have fewer unique words, more information in each word, and more words in common across the training and test sets. In practice this doesn't make much of a difference. There are fewer than 200 additional words in common. Surprising.
# how close are the word lists?
# more words in common in the translated word list
inner_join(enframe(dimnames(tweet_train_dfm)[[2]]),
           enframe(dimnames(tweet_test_dfm)[[2]]),
           by = "value") |>
  nrow() |>
  paste("Words in both train and test sets")
[1] "1678 Words in both train and test sets"
Once again we match the features.
# make sure test set has all variables in both train and test sets
tweet_test_dfm <- dfm_match(tweet_test_dfm, features = featnames(tweet_train_dfm))
We will use the same model parameters as we did with the native tweets.
# run the models
tic()
xg_fit <- xgboost(
  data = tweet_train_dfm,
  max.depth = 100,
  nrounds = 100,
  objective = "multi:softmax",
  num_class = 3,
  label = as.numeric(tweet_train_dfm$sentiment) - 1,
  print_every_n = 10
)
[1]	train-mlogloss:1.000106
[11]	train-mlogloss:0.721196
[21]	train-mlogloss:0.647356
[31]	train-mlogloss:0.604540
[41]	train-mlogloss:0.573208
[51]	train-mlogloss:0.548874
[61]	train-mlogloss:0.528251
[71]	train-mlogloss:0.511289
[81]	train-mlogloss:0.495794
[91]	train-mlogloss:0.482234
[100]	train-mlogloss:0.470904
toc()
58.2 sec elapsed
The loss function looks much the same.
xg_fit$evaluation_log |>
  ggplot(aes(iter, train_mlogloss)) +
  geom_line() +
  theme_afri()
Results With English Language Tweets
# predict and convert classes back to factors
predicted <- predict(xg_fit, tweet_test_dfm) |>
  as.factor()
levels(predicted) <- levels(tweet_test$sentiment)
predicted_for_table <- tibble(actual = tweet_test_dfm$sentiment, predicted)
caret::confusionMatrix(table(predicted_for_table))
Confusion Matrix and Statistics

          predicted
actual     negative neutral positive
  negative     5678    4146     1432
  neutral      2233    6607     1483
  positive     1905    3245     5178

Overall Statistics

               Accuracy : 0.5473
                 95% CI : (0.5418, 0.5528)
    No Information Rate : 0.4387
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.3217

 Mcnemar's Test P-Value : < 2.2e-16

Statistics by Class:

                     Class: negative Class: neutral Class: positive
Sensitivity                   0.5784         0.4720          0.6398
Specificity                   0.7475         0.7925          0.7837
Pos Pred Value                0.5044         0.6400          0.5014
Neg Pred Value                0.7996         0.6576          0.8649
Prevalence                    0.3076         0.4387          0.2536
Detection Rate                0.1780         0.2071          0.1623
Detection Prevalence          0.3528         0.3235          0.3237
Balanced Accuracy             0.6630         0.6323          0.7118
Conclusion
Again, we see a disappointing result. Machine learning did not significantly increase the accuracy of our sentiment measurement. It is interesting that translating the tweets to English DOES improve the model accuracy, but by a puny amount, and I'm not sure that conclusion would survive cross-validation. Here's the summary of all the methods we've tried.
|                 | Valence by Word | Valence by Sentence | XGBoost – Native | XGBoost – English |
|-----------------|-----------------|---------------------|------------------|-------------------|
| Accuracy        | 51%             | 51%                 | 54%              | 55%               |
| Kappa Statistic | 26%             | 27%                 | 31%              | 32%               |
Obviously, we’re doing “toy” models. Serious researchers on this subject have achieved better results. In particular, a model called “AfroXLMR-Large” cited in the original paper (Muhammad et al. 2023) achieved an average accuracy across all languages of 67.2% on this data set. This is a pre-trained model with over 270 million variables.
Hopefully this was useful as an introduction to some of the techniques used in sentiment analysis and the R tools we can use for the task.
In the final post in this series we’ll look at something of more general interest. There has been much talk about how the “Global South” feels neglected by the rich countries in the northern hemisphere. Can we use the sentiments expressed in the Afrisenti data set to get feelings about specific regions?
References