twitter users, demographic inference & reticulate

[This article was first published on Jason Timm, and kindly contributed to R-bloggers.]

A simple code-through for using the Python library m3inference in R via reticulate. As described in Wang et al. (2019), “Demographic Inference and Representative Population Estimates from Multilingual Social Media Data”, the library facilitates demographic attribute inference of Twitter users (namely gender, age, and organizational status) based on profile images, screen names, names, and biographies. What follows are some notes on using m3inference via reticulate to infer demographic characteristics of my followers on Twitter, as well as some thoughts on visualizing classifications.


Reticulate & Python

First, we build a conda environment (via the terminal) containing pip and m3inference (and their respective dependencies).

## <TERMINAL>
conda create -n m3demo
source activate m3demo
conda install pip 
/home/jtimm/anaconda3/envs/m3demo/bin/pip install m3inference

Then we establish Python and conda environment paths.

## <R-console>
Sys.setenv(RETICULATE_PYTHON = "/home/jtimm/anaconda3/envs/m3demo/bin/python")

library(reticulate)
#reticulate::use_python("/home/jtimm/anaconda3/envs/m3demo/bin/python")
reticulate::use_condaenv(condaenv = "m3demo",
                         conda = "/home/jtimm/anaconda3/bin/conda")
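
Before moving on, it can be worth confirming that the interpreter reticulate picked up can actually see m3inference. A minimal sketch, meant to be run in that environment's Python; the helper name `module_available` is hypothetical and not part of reticulate or m3inference:

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if `name` can be imported by the current interpreter."""
    return importlib.util.find_spec(name) is not None


# Path of the interpreter in use; it should point inside the m3demo environment.
print(sys.executable)
print("m3inference importable:", module_available("m3inference"))
```

If the printed path is not the m3demo one, the RETICULATE_PYTHON setting above is the first thing to re-check.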

Twitter data

For demonstration purposes, we identify/extract my Twitter followers (and some of their M3-relevant features) using the rtweet package.

## <R-console>
library(tidyverse)
fws  <- rtweet::get_followers(user = 'DrJayTimm') 

users <- rtweet::lookup_users(fws$user_id) %>%
  select(user_id, name, screen_name, description, profile_image_url)

Below is a simple hack to provide the M3 model with an actual image file for Twitter profiles that lack profile pics.

## <R-console>
jk <- 'http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png'
jk0 <- 'https://twirpz.files.wordpress.com/2015/06/twitter-avi-gender-balanced-figure.png'

dir0 <- tempdir()

users2 <- users %>%
  mutate(profile_image_url = ifelse(profile_image_url == jk, jk0, profile_image_url)) %>%
  rename(id_str = user_id) 
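
For readers working on the Python side, the same substitution can be sketched with plain dictionaries. This is only an illustration of the logic in the R chunk above; the function name `with_placeholder` and the record layout are assumptions, not anything provided by rtweet or m3inference:

```python
DEFAULT_AVATAR = "http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png"
PLACEHOLDER = "https://twirpz.files.wordpress.com/2015/06/twitter-avi-gender-balanced-figure.png"


def with_placeholder(users):
    """Swap Twitter's default avatar for a placeholder and rename user_id to id_str."""
    out = []
    for user in users:
        rec = dict(user)  # copy, so the input records are left untouched
        if rec.get("profile_image_url") == DEFAULT_AVATAR:
            rec["profile_image_url"] = PLACEHOLDER
        rec["id_str"] = rec.pop("user_id")
        out.append(rec)
    return out


# Hypothetical records, for illustration only.
toy_users = [
    {"user_id": "1", "profile_image_url": DEFAULT_AVATAR},
    {"user_id": "2", "profile_image_url": "http://example.com/me.png"},
]
print(with_placeholder(toy_users))
```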

Profile pics via M3

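Write Twitter user details to a local temp file in ndjson format. A second temp file is reserved for the restructured output.

## <R-console>
tmp1 <- tempfile()  # input: user details as ndjson
tmp2 <- tempfile()  # output: to be written by m3inference
jsonlite::stream_out(users2, file(tmp1), verbose = FALSE)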
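
For reference, ndjson is simply one JSON object per line. A small sketch of writing and reading such a file with the standard library; the helper names and the toy record are hypothetical:

```python
import json
import os
import tempfile


def write_ndjson(records, path):
    """Write each record as a single JSON object on its own line."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_ndjson(path):
    """Read an ndjson file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical record with the same fields selected in the R chunk above.
toy = [{"id_str": "1", "name": "A", "screen_name": "a", "description": "hi",
        "profile_image_url": "http://example.com/a.png"}]

fd, path = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
write_ndjson(toy, path)
print(read_ndjson(path) == toy)
os.remove(path)
```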

In a Python console, we then import the M3Twitter class, and set the directory in which Twitter profile pics will be cached. (Note that R objects, such as the dir0 directory established earlier, are accessed from Python via the r. prefix.)

## <PYTHON-console>
from m3inference import M3Twitter
m3twitter = M3Twitter(cache_dir = r.dir0) 
## 06/10/2022 21:07:32 - INFO - m3inference.m3inference -   Version 1.1.5
## 06/10/2022 21:07:32 - INFO - m3inference.m3inference -   Running on cpu.
## 06/10/2022 21:07:32 - INFO - m3inference.m3inference -   Will use full M3 model.
## 06/10/2022 21:07:33 - INFO - m3inference.m3inference -   Model full_model exists at /home/jtimm/m3/models/full_model.mdl.
## 06/10/2022 21:07:33 - INFO - m3inference.utils -   Checking MD5 for model full_model at /home/jtimm/m3/models/full_model.mdl
## 06/10/2022 21:07:34 - INFO - m3inference.utils -   MD5s match.
## 06/10/2022 21:07:35 - INFO - m3inference.m3inference -   Loaded pretrained weight at /home/jtimm/m3/models/full_model.mdl

Then, via the transform_jsonl function, we restructure the ndjson/jsonl file and download Twitter users’ profile pics to the temp directory. This function also identifies description language. Note: While we can download profile images and identify description language in R, doing so with the functionality included in m3inference tends to be smoother and quicker.

## <PYTHON-console>
m3twitter.transform_jsonl(input_file = r.tmp1, 
                          output_file = r.tmp2, 
                          img_path_key = "profile_image_url")#, 
                          #lang_key = "lang")
## /home/jtimm/anaconda3/envs/m3demo/lib/python3.10/site-packages/PIL/Image.py:992: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
##   warnings.warn(

Demographic inference via M3

Apply M3 classification model. Attribute classes:

  • Gender: male, female;

  • Age: <=18, 19-29, 30-39, >=40; and

  • Organization: non-org, is-org.

## <PYTHON-console>
from m3inference import M3Inference
m3 = M3Inference() 
pred = m3.infer(r.tmp2)
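
The predictions come back as a nested mapping: user id, then attribute, then class probabilities (this layout is assumed here from the m3inference README). A minimal sketch of reducing that structure to the most probable class per attribute; the ids and probabilities in `toy_pred` are made up, and `top_classes` is a hypothetical helper:

```python
def top_classes(pred):
    """For each user and attribute, keep the class with the highest probability."""
    return {
        uid: {attr: max(probs, key=probs.get) for attr, probs in attrs.items()}
        for uid, attrs in pred.items()
    }


# Made-up probabilities, shaped like the assumed output of M3Inference.infer().
toy_pred = {
    "1001": {
        "gender": {"male": 0.9, "female": 0.1},
        "age": {"<=18": 0.05, "19-29": 0.15, "30-39": 0.6, ">=40": 0.2},
        "org": {"non-org": 0.98, "is-org": 0.02},
    },
    "1002": {
        "gender": {"male": 0.3, "female": 0.7},
        "age": {"<=18": 0.1, "19-29": 0.2, "30-39": 0.2, ">=40": 0.5},
        "org": {"non-org": 0.4, "is-org": 0.6},
    },
}

for uid, labels in top_classes(toy_pred).items():
    print(uid, labels)
```

Keeping only the top class discards the probabilities themselves, so borderline cases look as confident as clear ones.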

Accessing classifications

Output/predictions from the Python-based M3 model can be moved into R via the (R-based) reticulate::py object.

## <R-console>
py_predictions <- reticulate::py$pred

The table below details age-gender-organization inferences by Twitter ID for a small subset of my followers.

## <R-console>
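df <- reshape2::melt(py_predictions) 
# Keep the highest-probability class for each user (L1) and attribute (L2)
df0 <- data.table::setDT(df)[, .SD[which.max(value)], by = list(L1, L2)]
df1 <- data.table::dcast(df0, L1  ~ L2, value.var = 'L3')
# Rename the user-id column so it matches the table and the later join
data.table::setnames(df1, 'L1', 'id')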
## <R-console>
df1 %>% sample_n(10) %>% knitr::kable()
|id                  |age   |gender |org     |
|:-------------------|:-----|:------|:-------|
|42796751            |>=40  |male   |non-org |
|18193994            |30-39 |male   |non-org |
|7433942             |>=40  |female |non-org |
|856895716479913985  |30-39 |male   |non-org |
|1267501055832719362 |30-39 |female |is-org  |
|1082060756122660864 |>=40  |male   |non-org |
|1306702216590438402 |>=40  |male   |non-org |
|3533767335          |>=40  |male   |non-org |
|1388056224667770881 |30-39 |male   |non-org |
|3001349463          |19-29 |male   |non-org |

Demographic summary

By Organization

table(df1$org)
## 
##  is-org non-org 
##      16     159

By Age & Gender

(for followers that have not been classified as organizations):

## <R-console>
df2 <- df1 %>%
  mutate(age = factor(age, levels = c('<=18', '19-29', '30-39', '>=40'))) %>%
  filter(org != 'is-org') %>%
  count(gender, age) %>%
  mutate(percent = round(n/sum(n)*100,1)) %>%
  mutate(percent = ifelse(gender == "male", percent*-1, percent))

df2 %>% knitr::kable()
|gender |age   |  n| percent|
|:------|:-----|--:|-------:|
|female |19-29 | 12|     7.5|
|female |30-39 |  5|     3.1|
|female |>=40  | 23|    14.5|
|male   |<=18  |  6|    -3.8|
|male   |19-29 | 32|   -20.1|
|male   |30-39 | 25|   -15.7|
|male   |>=40  | 56|   -35.2|
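
The sign flip is what lets a single bar chart read as a pyramid: male shares extend left of zero, female shares extend right. The arithmetic can be sketched as follows, using the counts from the table above; `signed_percents` is a hypothetical helper name:

```python
def signed_percents(counts):
    """Convert counts to percentages of the total, negating the male values."""
    total = sum(counts.values())
    out = {}
    for (gender, age), n in counts.items():
        pct = round(n / total * 100, 1)
        out[(gender, age)] = -pct if gender == "male" else pct
    return out


# Counts of non-organization followers by inferred gender and age.
counts = {
    ("female", "19-29"): 12,
    ("female", "30-39"): 5,
    ("female", ">=40"): 23,
    ("male", "<=18"): 6,
    ("male", "19-29"): 32,
    ("male", "30-39"): 25,
    ("male", ">=40"): 56,
}

for key, pct in signed_percents(counts).items():
    print(key, pct)
```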

Age-Gender “pyramid”

## <R-console>
maxs <- max(abs(df2$percent))
df2 %>%
  ggplot(aes(x = age, y = percent, fill = gender)) +
  geom_col(alpha = .75) + 
  ylim(-maxs - 1, maxs + 1) +
  coord_flip() +
  ggthemes::scale_fill_stata() +
  # scale_y_continuous(breaks = c(-5, 0, 5),
  #                    labels = c("5%", "0%", "5%")) +
  labs(title="Inferred age-gender demographics of my followers")

Profile pics & demographic inference

## <R-console>
# Note: matching by position assumes exactly one resized image per user and
# that files in dir0 sort in the same order as id_str.
users3 <- users2 %>%
  arrange(id_str) %>%
  mutate(paths = grep('224x224', dir(dir0, full.names = TRUE), value = TRUE)) %>%
  left_join(df1, by = c('id_str' = 'id'))
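
Pairing image files with users by position is fragile if any download failed or if the sort orders differ. A more defensive sketch keys each file on the id embedded in its name. This assumes the cached files are named `<id>_224x224.<ext>`, which is an assumption about the cache layout rather than something shown above; `paths_by_id` is a hypothetical helper:

```python
import os
import re
import tempfile

# Assumed naming scheme for resized images: "<user id>_224x224.<extension>"
RESIZED = re.compile(r"^(?P<id>\d+)_224x224\.\w+$")


def paths_by_id(cache_dir):
    """Map user id to the full path of its resized image, if one exists."""
    mapping = {}
    for fname in os.listdir(cache_dir):
        match = RESIZED.match(fname)
        if match:
            mapping[match.group("id")] = os.path.join(cache_dir, fname)
    return mapping


# Demonstration with empty stand-in files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ("42796751_224x224.jpg", "18193994_224x224.png", "42796751.jpg"):
        open(os.path.join(tmp, fname), "w").close()
    print(sorted(paths_by_id(tmp)))
```

Users without an entry in the mapping can then be dropped or given a placeholder, rather than shifting every later image by one position.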

A simple function for modifying profile pics, including: (1) “charcoal-ing” photos for user privacy, and (2) labeling photos with predicted age, gender, and organization classes.

## <R-console>
modify_images <- function(paths){
  
  for(i in seq_along(paths)){
    y1 <- magick::image_read(paths[i])
    y2 <- magick::image_charcoal(y1)
    y3 <- magick::image_border(y2, 'white', '5x5')
    
    ll <- paste0(users3$org[i], '\n',
                 users3$gender[i], '\n',
                 users3$age[i])
    
    y4 <- magick::image_annotate(y3, 
                                 text = ll, 
                                 color = "black", 
                                 size = 26,
                                 weight = 700,
                                 location = "+10+10")
        
    magick::image_write(y4, paths[i]) 
    }
}

Apply function, and build a collage of profile pics with predicted demographics using the photomoe package.

## <R-console>
modify_images(paths = users3$paths)

# devtools::install_github("jaytimm/photomoe")
photomoe::img_build_collage(paths = users3$paths, 
                            dimx = 7, 
                            dimy = 12)

References

Wang, Zijian, Scott Hale, David Ifeoluwa Adelani, Przemyslaw Grabowicz, Timo Hartman, Fabian Flöck, and David Jurgens. 2019. “Demographic Inference and Representative Population Estimates from Multilingual Social Media Data.” In The World Wide Web Conference, 2056–67.