Twitter users, demographic inference & reticulate
A simple code-through for using the Python library m3inference in R via reticulate. As described in Wang et al. (2019), "Demographic Inference and Representative Population Estimates from Multilingual Social Media Data", the library facilitates demographic attribute inference of Twitter users, namely gender, age, and organizational status, based on profile images, screen names, names, and biographies. Some notes here on using m3inference via reticulate to infer demographic characteristics of my followers on Twitter, as well as some thoughts on visualizing classifications.
Reticulate & Python
First, we build a conda environment (via the terminal) containing m3inference and pip (and their respective dependencies).
## <TERMINAL>
conda create -n m3demo
source activate m3demo
conda install pip
/home/jtimm/anaconda3/envs/m3demo/bin/pip install m3inference
Then we establish Python and conda environment paths.
## <R-console>
Sys.setenv(RETICULATE_PYTHON = "/home/jtimm/anaconda3/envs/m3demo/bin/python")
library(reticulate)
#reticulate::use_python("/home/jtimm/anaconda3/envs/m3demo/bin/python")
reticulate::use_condaenv(condaenv = "m3demo",
                         conda = "/home/jtimm/anaconda3/bin/conda")
Twitter data
For demonstration purposes, we extract my Twitter followers (and some of their M3-relevant features) using the rtweet package.
## <R-console>
library(tidyverse)
fws <- rtweet::get_followers(user = 'DrJayTimm')
users <- rtweet::lookup_users(fws$user_id) %>%
  select(user_id, name, screen_name, description, profile_image_url)
Below is a simple hack to provide the M3 model with an actual image file for Twitter profiles that lack profile pics.
## <R-console>
# jk: Twitter's default avatar; jk0: stand-in image supplied to M3 instead
jk <- 'http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png'
jk0 <- 'https://twirpz.files.wordpress.com/2015/06/twitter-avi-gender-balanced-figure.png'
dir0 <- tempdir()
users2 <- users %>%
  mutate(profile_image_url = ifelse(profile_image_url == jk, jk0, profile_image_url)) %>%
  rename(id_str = user_id)
Profile pics via M3
Write Twitter user details to a local temp file in (roughly) ndjson format.
## <R-console>
tmp1 <- tempfile()  # user details as ndjson (input to transform_jsonl)
tmp2 <- tempfile()  # reserved for the transformed output
jsonlite::stream_out(users2, file(tmp1), verbose = FALSE)
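For readers unfamiliar with the format, ndjson (or jsonl) simply stores one JSON object per line. The sketch below illustrates that layout in Python; the two records are hypothetical, the helper names `to_ndjson` and `from_ndjson` are made up for this example, and only the field names mirror the columns selected above.

```python
import json

# Hypothetical records; only the field names come from the post.
users = [
    {"id_str": "1", "name": "A. User", "screen_name": "a_user",
     "description": "first bio", "profile_image_url": "https://example.com/a.png"},
    {"id_str": "2", "name": "B. User", "screen_name": "b_user",
     "description": "second bio", "profile_image_url": "https://example.com/b.png"},
]


def to_ndjson(records):
    """Serialize records as newline-delimited JSON: one object per line."""
    return "".join(json.dumps(rec) + "\n" for rec in records)


def from_ndjson(text):
    """Parse newline-delimited JSON back into a list of dicts."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


ndjson_text = to_ndjson(users)
round_trip = from_ndjson(ndjson_text)
```

Because each line is a complete JSON object, the file can be read and written one user at a time.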
In a Python console, we then import the M3Twitter class and set the directory in which Twitter profile pics will be stored. (Note that the directory established in the R chunk above is accessed below via the r. prefix.)
## <PYTHON-console>
from m3inference import M3Twitter
m3twitter = M3Twitter(cache_dir = r.dir0)
## 06/10/2022 21:07:32 - INFO - m3inference.m3inference - Version 1.1.5
## 06/10/2022 21:07:32 - INFO - m3inference.m3inference - Running on cpu.
## 06/10/2022 21:07:32 - INFO - m3inference.m3inference - Will use full M3 model.
## 06/10/2022 21:07:33 - INFO - m3inference.m3inference - Model full_model exists at /home/jtimm/m3/models/full_model.mdl.
## 06/10/2022 21:07:33 - INFO - m3inference.utils - Checking MD5 for model full_model at /home/jtimm/m3/models/full_model.mdl
## 06/10/2022 21:07:34 - INFO - m3inference.utils - MD5s match.
## 06/10/2022 21:07:35 - INFO - m3inference.m3inference - Loaded pretrained weight at /home/jtimm/m3/models/full_model.mdl
Then, via the transform_jsonl function, we restructure the ndjson/jsonl file and download Twitter users’ profile pics to the temp directory. This function also identifies description language. Note: while we can download profile images and identify description language in R, things tend to go much more smoothly (and more quickly) using the functionality included in m3inference.
## <PYTHON-console>
m3twitter.transform_jsonl(input_file = r.tmp1,
                          output_file = r.tmp2,
                          img_path_key = "profile_image_url")#,
                          #lang_key = "lang")
## /home/jtimm/anaconda3/envs/m3demo/lib/python3.10/site-packages/PIL/Image.py:992: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
##   warnings.warn(
Demographic inference via M3
Apply the M3 classification model. Attribute classes:
Gender: male, female;
Age: <=18, 19-29, 30-39, >=40; and
Organization: non-org, is-org.
## <PYTHON-console>
from m3inference import M3Inference
m3 = M3Inference()
pred = m3.infer(r.tmp2)
Accessing classifications
Output/predictions from the Python-based M3 model can be moved into R via the (R-based) reticulate::py object, which exposes variables defined in the Python session.
## <R-console> py_predictions <- reticulate::py$pred
The table below details age-gender-organization inferences by Twitter ID for a small subset of my followers.
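## <R-console>
# melt() flattens the nested list: L1 = user id, L2 = attribute, L3 = class
df <- reshape2::melt(py_predictions)
# keep the highest-probability class per user and attribute
df0 <- data.table::setDT(df)[, .SD[which.max(value)], by = list(L1, L2)]
df1 <- data.table::dcast(df0, L1 ~ L2, value.var = 'L3')
# name the id column as it appears in the table and in the join further down
data.table::setnames(df1, 'L1', 'id')

## <R-console>
df1 %>% sample_n(10) %>% knitr::kable()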
id | age | gender | org |
---|---|---|---|
42796751 | >=40 | male | non-org |
18193994 | 30-39 | male | non-org |
7433942 | >=40 | female | non-org |
856895716479913985 | 30-39 | male | non-org |
1267501055832719362 | 30-39 | female | is-org |
1082060756122660864 | >=40 | male | non-org |
1306702216590438402 | >=40 | male | non-org |
3533767335 | >=40 | male | non-org |
1388056224667770881 | 30-39 | male | non-org |
3001349463 | 19-29 | male | non-org |
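The which.max step above can also be sketched directly on the Python side. This is a minimal illustration, not part of m3inference: it assumes predictions are nested as user id, then attribute, then class probabilities (which is what the L1, L2, L3 columns produced by melt suggest), and the probabilities and the helper name `top_classes` are hypothetical.

```python
# Hypothetical predictions shaped like {user_id: {attribute: {class: probability}}}.
pred = {
    "111": {
        "age": {"<=18": 0.05, "19-29": 0.15, "30-39": 0.20, ">=40": 0.60},
        "gender": {"male": 0.90, "female": 0.10},
        "org": {"non-org": 0.97, "is-org": 0.03},
    },
    "222": {
        "age": {"<=18": 0.10, "19-29": 0.55, "30-39": 0.25, ">=40": 0.10},
        "gender": {"male": 0.30, "female": 0.70},
        "org": {"non-org": 0.20, "is-org": 0.80},
    },
}


def top_classes(predictions):
    """Keep only the highest-probability class for each user and attribute."""
    return {
        user_id: {attr: max(probs, key=probs.get) for attr, probs in attrs.items()}
        for user_id, attrs in predictions.items()
    }


labels = top_classes(pred)
```

Taking the arg-max discards the probabilities themselves, so low-confidence classifications look the same as high-confidence ones in the resulting table.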
Demographic summary
By Organization
table(df1$org)
##
##  is-org non-org
##      16     159
By Age & Gender
(for followers that have not been classified as organizations):
## <R-console>
df2 <- df1 %>%
  mutate(age = factor(age, levels = c('<=18', '19-29', '30-39', '>=40'))) %>%
  filter(org != 'is-org') %>%
  count(gender, age) %>%
  mutate(percent = round(n/sum(n)*100, 1)) %>%
  # negate male percentages so male bars extend in the opposite direction
  mutate(percent = ifelse(gender == "male", percent*-1, percent))

df2 %>% knitr::kable()
gender | age | n | percent |
---|---|---|---|
female | 19-29 | 12 | 7.5 |
female | 30-39 | 5 | 3.1 |
female | >=40 | 23 | 14.5 |
male | <=18 | 6 | -3.8 |
male | 19-29 | 32 | -20.1 |
male | 30-39 | 25 | -15.7 |
male | >=40 | 56 | -35.2 |
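The percent column can be checked by hand. The sketch below recomputes it in Python from the counts in the table above; the helper name `signed_percent` is made up for this example. Male shares are negated so that the two genders sit on opposite sides of zero.

```python
# Counts per (gender, age) cell, copied from the table above.
counts = {
    ("female", "19-29"): 12,
    ("female", "30-39"): 5,
    ("female", ">=40"): 23,
    ("male", "<=18"): 6,
    ("male", "19-29"): 32,
    ("male", "30-39"): 25,
    ("male", ">=40"): 56,
}

total = sum(counts.values())  # followers not classified as organizations


def signed_percent(gender, n, total):
    """Share of all non-org followers, in percent; negative for males."""
    pct = round(n / total * 100, 1)
    return -pct if gender == "male" else pct


pyramid = {key: signed_percent(key[0], n, total) for key, n in counts.items()}
```

Note that each percentage is a share of all non-organization followers, not a share within gender, so the absolute values across both genders sum to roughly 100.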
Age-Gender “pyramid”
## <R-console>
maxs <- max(abs(df2$percent))
df2 %>%
  ggplot(aes(x = age, y = percent, fill = gender)) +
  geom_col(alpha = .75) +
  ylim(-maxs - 1, maxs + 1) +
  coord_flip() +
  ggthemes::scale_fill_stata() +
  # scale_y_continuous(breaks = c(-5, 0, 5),
  #                    labels = c("5%", "0%", "5%")) +
  labs(title = "Inferred age-gender demographics of my followers")
Profile pics & demographic inference
## <R-console>
# Profile pics in dir0 are picked out by the '224x224' in their file names.
# Pairing them with users assumes the sorted file listing lines up with
# users sorted by id_str.
users3 <- users2 %>%
  arrange(id_str) %>%
  mutate(paths = grep('224x224', dir(dir0, full.names = TRUE), value = TRUE)) %>%
  left_join(df1, by = c('id_str' = 'id'))
A simple function for modifying profile pics by (1) “charcoal-ing” photos for user privacy, and (2) labeling photos with predicted age, gender, and organization classes.
## <R-console>
modify_images <- function(paths){
  # labels are read from the global users3, so paths must follow its row order
  for(i in seq_along(paths)){
    y1 <- magick::image_read(paths[i])
    y2 <- magick::image_charcoal(y1)
    y3 <- magick::image_border(y2, 'white', '5x5')
    ll <- paste0(users3$org[i], '\n', users3$gender[i], '\n', users3$age[i])
    y4 <- magick::image_annotate(y3, text = ll, color = "black", size = 26,
                                 weight = 700, location = "+10+10")
    # overwrite the original image file with the annotated version
    magick::image_write(y4, paths[i])
  }
}
Apply function, and build a collage of profile pics with predicted demographics using the photomoe package.
## <R-console>
modify_images(paths = users3$paths)
# devtools::install_github("jaytimm/photomoe")
photomoe::img_build_collage(paths = users3$paths, dimx = 7, dimy = 12)