Site icon R-bloggers

How to collect dataviz from Twitter into your note-taking system

[This article was first published on Albert Rapp, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Intro

It is mid-April and the #30daychartchallenge is well on its way. One glace at the hashtag’s Twitter feed suffices to realize that there are great contributions. That’s a perfect opportunity to collect data viz examples for future inspirations.

Ideally, I can scroll through Twitter and with a few clicks incorporate these contributions straight into my Obsidian or any other Markdown-based note-taking system. Unfortunately, rtweet’s snapshot function does not seem to work anymore. So, let’s build something on our own that gets the job done. The full script can be found on GitHub gist. Here’s what we will need:

Overview

Before we begin, let me summarize what kind of note-taking process I have in mind:

  1. Stroll through Twitter and see great data viz on twitter.

  2. Send tweet link and a few comments via mail to a dummy mail account

  3. A scheduled process accesses the dummy mail account and scans for new mails from authorized senders.

  4. If there is a new mail, R extracts tweet URL and uses Twitter’s API to download the tweet’s pictures and texts.

  5. A template Markdown file is used to create a new note that contains the images and texts.

  6. Markdown file is copied to your note-taking system within your file system.

  7. Ideally, your Markdown template contains tags like #dataviz and #twitter so that your new note can be easily searched for.

  8. Next time you look for inspiration, stroll through your collections or search for comments.

Preparations

Ok, we know what we want to accomplish. Time to get the prelims done. First, we will need a Twitter developer account. Then, we have to mask sensitive information in our code. If you already have a twitter app resp. a bearer token and know the keyring package, feel free to skip this section.

Get Twitter developer account

Let’s create a developer account for Twitter. Unfortunately, there is no way to get such an account without providing Twitter with your phone number. Sadly, if this burden on your privacy is a problem for you, then you cannot proceed. Otherwise, create an account at developer.twitter.com.

In your developer portal, create a project. Within this project create an app. Along the way, you will get a bunch of keys, secrets, IDs and tokens. You will see them only once, so you will have to save them somewhere. I suggest saving them into a password manager like bitwarden.

When you create your app or shortly after, you will need to set the authentication settings. I use OAuth 2.0. This requires

Next, you will likely need to upgrade your project to ‘elevated’ status. This can be done for free on your project’s dashboard. From what I recall, you will have to fill out a form and tell Twitter what you want to do with your app. Just be honest and chances are that your request will immediately be granted. Just be yourself! What could possibly go wrong? Go get the girl elevated status (ahhh, what a perfect opportunity for a Taylor song).

How to embed your bearer token and other sensitive material in your code

Use the keyring package to first save secrets via key_set and then extract them in your session via key_get(). This way, you won’t share your sensitive information by accident when you share your code (like I do). In this post, I do this for my bearer token, my dummy mail, my dummy mail’s password and for the allowed senders (that will be the mail where the tweets come from).

1
2
3
4
bearer_token <- keyring::key_get('twitter-bearer-token')
user_mail <- keyring::key_get('dataviz-mail')
password_mail <- keyring::key_get('dataviz-mail-password')
allowed_senders <- keyring::key_get('allowed_senders')

The allowed_senders limitation is a precaution so that we do not accidentally download some malicious spam mail from God knows who onto our computer. I am no security expert but this feels like a prudent thing to do. If one of you fellow readers knows more about this security business, feel kindly invited to reach out to me with better security strategies.

What to do once we have a URL

Let’s assume for the sake of this section that we already extracted a tweet URL from a mail. Here’s the URL that we will use. In fact, it’s Christian Gebhard’s tweet that inspired me to start this project. From the URL we can extract the tweet’s ID (the bunch of numbers after /status/). Also, we will need the URL of Twitter’s API.

1
2
3
4
5
library(stringr) # for regex matching
library(dplyr) # for binding rows and pipe
tweet_url <- 'https://twitter.com/c_gebhard/status/1510533315262042112'
tweet_id <- tweet_url %>% str_match("status/([0-9]+)") %>% .[, 2]
API_url <- 'https://api.twitter.com/2/tweets'

Use GET() to access Twitter API

Next, we use the GET() function from the httr package to interact with Twitter’s API.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
library(httr) # for API communication

auth <- paste("Bearer", bearer_token) # API needs format "Bearer <my_token>"

# Make request to API
request <- GET(
  API_url, 
  add_headers(Authorization = auth), 
  query = list(
    ids = tweet_id, 
    tweet.fields = 'created_at', # time stamp
    expansions = 'attachments.media_keys,author_id', 
    # necessary expansion fields for img_url
    media.fields = 'url' # img_url
  )
) 
request
## Response [https://api.twitter.com/2/tweets?ids=1510533315262042112&tweet.fields=created_at&expansions=attachments.media_keys%2Cauthor_id&media.fields=url]
##   Date: 2022-04-14 07:32
##   Status: 200
##   Content-Type: application/json; charset=utf-8
##   Size: 690 B

So, how do we know how to use the GET() function? Well, I am no expert on APIs but let me try to explain how I came up with the arguments I used here.

Remember those toys you would play with as a toddler where you try to get a square through a square-shaped hole, a triangle through a triangle-shaped hole and so on? You don’t? Well, neither do I. Who remembers that stuff from very early childhood?

But I hear that starting a sentence with “Remember those…” is good for building a rapport with your audience. So, great! Now that we feel all cozy and connected, I can tell you how I managed to get the API request to work.

And the truth is actually not that far from the toddler “intelligence test”. First, I took a look at a help page from Twitter’s developer page. Then, I hammered at the GET() function until its output contained a URL that looks similar to the example I found. Here’s the example code I was aiming at.

curl --request GET 'https://api.twitter.com/2/tweets?ids=1263145271946551300&
expansions=attachments.media_keys&
media.fields=duration_ms,height,media_key,preview_image_url,public_metrics,type,url,width,alt_text' 
--header 'Authorization: Bearer $BEARER_TOKEN'

This is not really R code but it looks like usually you have to feed a GET request with a really long URL. In fact, it looks like the URL needs to contain everything you want to extract from the API. Specifically, the structure of said URL looks like

Therefore, it is only a matter of figuring out how to make the output of GET() deliver this result. Hints on that came from GET() examples in the docs.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
GET("http://google.com/", path = "search", query = list(q = "ham"))
## Response [http://www.google.com/search?q=ham]
##   Date: 2022-04-14 07:32
##   Status: 200
##   Content-Type: text/html; charset=ISO-8859-1
##   Size: 98.2 kB
## <!doctype html><html lang="de"><head><charset="UTF-8"><content="/im...
## document.documentElement.addEventListener("submit",function(b){var a;if(a=b.t...
## var a=window.performance;window.start=Date.now();a:{var b=window;if(a){var c=...
## var f=this||self;var g,h=null!=(g=f.mei)?g:1,m,n=null!=(m=f.sdo)?m:!0,p=0,q,r...
## e);var l=a.fileName;l&&(b+="&script="+c(l),e&&l===_____&&(e=do...
## var c=[],e=0;window.ping=function(b){-1==b.indexOf("&zx")&&(b+="&zx="+Date.no...
## var k=this||self,l=function(a){var b=typeof a;return"object"==b&&null!=a||"fu...
## b}).join(" "))};function w(){var a=k.navigator;return a&&(a=a.userAgent)?a:""...
## !1}e||(d=null)}}else"mouseover"==b?d=a.fromElement:"mouseout"==b&&(d=a.toElem...
## window.screen&&window.screen.width<=c&&window.screen.height<=c&&document.getE...
## ...
GET("http://httpbin.org/get", add_headers(a = 1, b = 2))
## Response [http://httpbin.org/get]
##   Date: 2022-04-14 07:32
##   Status: 200
##   Content-Type: application/json
##   Size: 397 B
## {
##   "args": {}, 
##   "headers": {
##     "A": "1", 
##     "Accept": "application/json, text/xml, application/xml, */*", 
##     "Accept-Encoding": "deflate, gzip", 
##     "B": "2", 
##     "Host": "httpbin.org", 
##     "User-Agent": "libcurl/7.64.1 r-curl/4.3.2 httr/1.4.2", 
##     "X-Amzn-Trace-Id": "Root=1-6257ce20-7902b3b55883ec5d3c0a1389"
## ...

So, the first example shows how an argument query can be filled with a list that creates the URL we need. The second examples shows us that there is something called add_headers(). Do I know exactly what that is? I mean, from a technical perspective of what is going on behind the scenes? Definitely not. But Twitter’s example request had something called header. Therefore, add_headers() is probably something that does what the Twitter API expects.

And where do the key-value pairs come from? I found these strolling through Twitter’s data dictionary. Thus, a GET() request was born and I could feel like a true Hackerman.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
auth <- paste("Bearer", bearer_token) # API needs format "Bearer <my_token>"

# Make request to API and parse to list
request <- GET(
  API_url, 
  add_headers(Authorization = auth), 
  query = list(
    ids = tweet_id, 
    tweet.fields = 'created_at', # time stamp
    expansions = 'attachments.media_keys,author_id', 
    # necessary expansion fields for img_url
    media.fields = 'url' # img_url
  )
) 

Alright, we successfully requested data. Now, it becomes time to parse it to something useful. The content() function will to that.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
parsed_request <- request %>% content('parsed')
parsed_request
## $data
## $data[[1]]
## $data[[1]]$author_id
## [1] "1070306701"
## 
## $data[[1]]$created_at
## [1] "2022-04-03T08:23:01.000Z"
## 
## $data[[1]]$text
## [1] "#30DayChartChallenge #Day3 - Topic: historical\n\nBack to the Shakespeare data! Hamlet is is longest play, the comedies tend to be shorter.\n\nTool: #rstats\nData: kaggle users LiamLarsen, aodhan\nColor-Scale: MetBrewer\nFonts: Niconne, Noto Sans (+Mono)\nCode: https://t.co/iXAbniQDCb https://t.co/JCNrYH9uP4"
## 
## $data[[1]]$attachments
## $data[[1]]$attachments$media_keys
## $data[[1]]$attachments$media_keys[[1]]
## [1] "3_1510533145334104067"
## 
## 
## 
## $data[[1]]$id
## [1] "1510533315262042112"
## 
## 
## 
## $includes
## $includes$media
## $includes$media[[1]]
## $includes$media[[1]]$media_key
## [1] "3_1510533145334104067"
## 
## $includes$media[[1]]$type
## [1] "photo"
## 
## $includes$media[[1]]$url
## [1] "https://pbs.twimg.com/media/FPZ95H0XwAMHA8q.jpg"
## 
## 
## 
## $includes$users
## $includes$users[[1]]
## $includes$users[[1]]$id
## [1] "1070306701"
## 
## $includes$users[[1]]$name
## [1] "Christian Gebhard"
## 
## $includes$users[[1]]$username
## [1] "c_gebhard"

Extract tweet data from what the API gives us and download images

We have seen that parsed_request is basically a large list that contains everything we requested from the API. Unfortunately, it is a highly nested list, so we have to do some work to extract the parts we actually want. pluck() from the purrr package is our best friend on this one. Here’s all the information we extract from the parsed_request.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
library(purrr) # for pluck and map functions
# Extract necessary information from list-like structure
tweet_text <- parsed_request %>% 
  pluck("data", 1, 'text') 
tweet_text
## [1] "#30DayChartChallenge #Day3 - Topic: historical\n\nBack to the Shakespeare data! Hamlet is is longest play, the comedies tend to be shorter.\n\nTool: #rstats\nData: kaggle users LiamLarsen, aodhan\nColor-Scale: MetBrewer\nFonts: Niconne, Noto Sans (+Mono)\nCode: https://t.co/iXAbniQDCb https://t.co/JCNrYH9uP4"

tweet_user <-  parsed_request %>% 
  pluck("includes", 'users', 1, 'username')
tweet_user
## [1] "c_gebhard"

# We will use the tweet date and time as part of unique file names
# Replace white spaces and colons (:) for proper file names
tweet_date <- parsed_request %>% 
  pluck("data", 1, 'created_at') %>% 
  lubridate::as_datetime() %>% 
  str_replace(' ', '_') %>% 
  str_replace_all(':', '')
tweet_date
## [1] "2022-04-03_082301"

img_urls <- parsed_request %>% 
  pluck("includes", 'media') %>% 
  bind_rows() %>% # bind_rows for multiple pictures, i.e. multiple URLS
  filter(type == 'photo') %>% 
  pull(url)
img_urls
## [1] "https://pbs.twimg.com/media/FPZ95H0XwAMHA8q.jpg"

Next, download all the images via the img_urls and download.file(). We will use walk2() to download all files (in case there are multiple images/URLs) and save the files into PNGs that are named using the unique tweet_date IDs. Remember to set mode = 'wb' in download.file(). I am not really sure why but without it you will save poor quality images.

1
2
3
# Download image - set mode otherwise download is blurry
img_names <- paste('tweet', tweet_user, tweet_date, seq_along(img_urls), sep = "_")
walk2(img_urls, img_names, ~download.file(.x, paste0(.y, '.png'), mode = 'wb'))

So let’s do a quick recap of what we have done so far. We

Let’s cherish this mile stone with a dedicated function.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
request_twitter_data <- function(tweet_url, bearer_token) {
  # Extract tweet id by regex
  tweet_id <- tweet_url %>% str_match("status/([0-9]+)") %>% .[, 2]
  auth <- paste("Bearer", bearer_token) # API needs format "Bearer <my_token>"
  API_url <- 'https://api.twitter.com/2/tweets'
  
  # Make request to API and parse to list
  parsed_request <- GET(
    API_url, 
    add_headers(Authorization = auth), 
    query = list(
      ids = tweet_id, 
      tweet.fields = 'created_at', # time stamp
      expansions='attachments.media_keys,author_id', 
      # necessary expansion fields for img_url
      media.fields = 'url' # img_url
    )
  ) %>% content('parsed')
  
  # Extract necassary information from list-like structure
  tweet_text <- parsed_request %>% 
    pluck("data", 1, 'text') 
  
  tweet_user <-  parsed_request %>% 
    pluck("includes", 'users', 1, 'username')
  
  # Make file name unique through time-date combination
  # Replace white spaces and colons (:) for proper file names
  tweet_date <- parsed_request %>% 
    pluck("data", 1, 'created_at') %>% 
    lubridate::as_datetime() %>% 
    str_replace(' ', '_') %>% 
    str_replace_all(':', '')
  
  img_urls <- parsed_request %>% 
    pluck("includes", 'media') %>% 
    bind_rows() %>% 
    filter(type == 'photo') %>% 
    pull(url)
  
  # Download image - set mode otherwise download is blurry
  img_names <- paste('tweet', tweet_user, tweet_date, seq_along(img_urls), sep = "_")
  walk2(img_urls, img_names, ~download.file(.x, paste0(.y, '.png'), mode = 'wb'))
  
  # Return list with information
  list(
    url = tweet_url,
    text = tweet_text,
    user = tweet_user,
    file_names = paste0(img_names, '.png'),
    file_paths = paste0(getwd(), '/', img_names, '.png')
  )
}

Fill out Markdown template using extracted information and images

We have our images and the original tweet now. Thanks to our previous function, we can save all of the information in a list.

1
request <- request_twitter_data(tweet_url, bearer_token)

So, let’s bring all that information into a Markdown file. Here is the template.md file that I have created for this joyous occasion.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
library(readr) # for reading and writing files from/to disk
cat(read_file('template.md'))
## #dataviz #twitter
## 
## ![[insert_img_name_here]]
## 
## ### Original Tweet
## 
## insert_text_here
## 
## Original: insert_URL_here
## 
## ### Original Mail
## 
## insert_mail_here

As you can see, I started the Markdown template with two tags #dataviz and #twitter. This helps me to search for a specific dataviz faster. Also, I have already written out the Markdown syntax for image imports ![[...]] and added a placeholder insert_img_name_here. This one will be replaced by the file path to the image. Similarly, other placeholders like insert_text_here and insert_mail_here allow me to save the tweet and the mail content into my note taking system too.

To do so, I will need a function that replaces all the placeholders. First, I created a helper function that changes the image import placeholder properly, when there are multiple images.

1
2
3
md_import_strings <- function(file_names) {
  paste0('![[', file_names, ']]', collapse = '\n') 
}

Then, I created a function that takes the request list that we got from calling our own request_twitter_data() function and iteratively uses str_replace_all(). This iteration is done with reduce2() which will replace all placeholders in template.md .

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
library(tibble) # for easier readable tribble creation
# Replace the placeholders in the template
# We change original mail place holder later on
replace_template_placeholder <- function(template_name, request) {
  # Create a dictionary for what to replace in template
  replace_dict <- tribble(
    ~template, ~replacement,
    '\\!\\[\\[insert_img_name_here\\]\\]', md_import_strings(request$file_names),
    'insert_text_here', request$text %>% str_replace_all('#', '(#)'),
    'insert_URL_here', request$url
  )
  
  # Iteratively apply str_replace_all and keep only final result
  reduce2(
    .x = replace_dict$template, 
    .y = replace_dict$replacement,
    .f = str_replace_all,
    .init =  read_lines(template_name) 
  ) %>% 
    paste0(collapse = '\n') # Collaps lines into a single string
}

replace_template_placeholder('template.md', request) %>% cat()
## #dataviz #twitter
## 
## ![[tweet_c_gebhard_2022-04-03_082301_1.png]]
## 
## ### Original Tweet
## 
## (#)30DayChartChallenge (#)Day3 - Topic: historical
## 
## Back to the Shakespeare data! Hamlet is is longest play, the comedies tend to be shorter.
## 
## Tool: (#)rstats
## Data: kaggle users LiamLarsen, aodhan
## Color-Scale: MetBrewer
## Fonts: Niconne, Noto Sans (+Mono)
## Code: https://t.co/iXAbniQDCb https://t.co/JCNrYH9uP4
## 
## Original: https://twitter.com/c_gebhard/status/1510533315262042112
## 
## ### Original Mail
## 
## insert_mail_here

As you can see, my replace_template_placeholder() function also replaces the typical # from Twitter with (#). This is just a precaution to avoid wrong interpretation of these lines as headings in Markdown. Also, the original mail has not been inserted yet because we have no mail yet. But soooon. Finally, we need to write the replaced strings to a file. I got some helpers for that right here.

1
2
3
4
5
6
7
write_replaced_text <- function(replaced_text, request) {
  file_name <- request$file_name[1] %>% str_replace('_1.png', '.md')
  write_lines(replaced_text, file_name)
  paste0(getwd(), '/', file_name) 
}
replaced_template <- replace_template_placeholder('template.md', request) %>%
  write_replaced_text(request)

Shuffle files around on your file system

Awesome! We created new image files and a new Markdown note in our working directory. Now, we have to move them to our Obsidian vault. This is the place where I collect all my Markdown notes for use in Obsidian. In my case, I will need to move the Markdown note to the vault directory and the images to a subdirectory within this vault. This is because I changed settings in Obsidian that makes sure that all attachments, e.g. images, are saved in a separate subdirectory.

Here’s the function I created to get that job done. The function uses the request list again because it contains the file paths of the images. Here, vault_location and attachments_dir are the file paths to my Obsidian vault.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
library(tidyr) # for unnesting
move_files <- function(request, replaced_template, vault_location, attachments_dir) {
  # Create from-to dictionary with file paths in each column
  move_dict <- tribble(
    ~from, ~to,
    request$file_path, paste0(vault_location, '/', attachments_dir),
    replaced_template, vault_location
  ) %>% 
    unnest(cols = 'from')
  # Copy files from current working directory to destination
  move_dict %>% pwalk(file.copy, overwrite = T)
  # Delete files in current working directory
  walk(move_dict$from, file.remove)
}

How to extract URL and other stuff from mail

Let’s take a quick breather and recap. We have written functions that

Not to brag but that is kind of cool. But let’s not rest here. We still have to get some work done. After all, we want our workflow to be email-based. So, let’s access our mails using R. Then, we can extract a Twitter URL and apply our previous functions. Also, this lets us finally replace the insert_mail_here placeholder in our Markdown note.

Postman gives you access

I have created a dummy mail account at gmail. Using the mRpostman package, we can establish a connection to our mail inbox. After the connection is established, we can filter for all new emails that are sent from our list of allowed_senders.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
library(mRpostman) # for email communication
imap_mail <- 'imaps://imap.gmail.com' # mail client
# Establish connection to imap server
con <- configure_imap(
  url = imap_mail,
  user = user_mail,
  password = password_mail
)

# Switch to Inbox
con$select_folder('Inbox') 
## 
## ::mRpostman: "Inbox" selected.

# Extract mails that are from the list of allowed senders
mails <- allowed_senders %>% 
  map(~con$search_string(expr = ., where = 'FROM')) %>% 
  unlist() %>% 
  na.omit() %>% # Remove NAs if no mail from a sender
  as.numeric() # avoids attributes

Grab URLs from mail

If mails is not empty, i.e. if there are new mails, then we need to extract the tweet URLs from them. Unfortunately, depending on where you sent your email from, the mail text can be encoded.

For example, I send most of the tweets via the share button on Twitter using my Android smartphone. And for some reason, my Android mail client encodes the mails in something called base64. But sending a tweet URL from Thunderbird on my computer works without any encoding. Here are two example mails I have sent to my dummy mail account.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
if (!is_empty(mails)) mail_bodys <- mails %>% con$fetch_text()
cat(mail_bodys[[1]])
## ----_com.samsung.android.email_9939060766922910
## Content-Type: text/plain; charset=utf-8
## Content-Transfer-Encoding: base64
## 
## aHR0cHM6Ly90d2l0dGVyLmNvbS9tYXJpbmExMDE3MjAxMy9zdGF0dXMvMTUxNDI4MDkzNjA5NTI0
## MDE5ND9zPTIwJnQ9TFp5TzI3aWhjZk5kbUFPTHlxbTByUUhlcmUgaXMgdGhlIHNhbWUgdHdlZXQg
## YnV0IHNlbnQgZnJvbSBteSBBbmRyb2lkIHBob25lLg==
## 
## ----_com.samsung.android.email_9939060766922910
## Content-Type: text/html; charset=utf-8
## Content-Transfer-Encoding: base64
## 
## PGh0bWw+PGhlYWQ+PG1ldGEgaHR0cC1lcXVpdj0iQ29udGVudC1UeXBlIiBjb250ZW50PSJ0ZXh0
## L2h0bWw7IGNoYXJzZXQ9VVRGLTgiPjwvaGVhZD48Ym9keSBkaXI9ImF1dG8iPmh0dHBzOi8vdHdp
## dHRlci5jb20vbWFyaW5hMTAxNzIwMTMvc3RhdHVzLzE1MTQyODA5MzYwOTUyNDAxOTQ/cz0yMCZh
## bXA7dD1MWnlPMjdpaGNmTmRtQU9MeXFtMHJRPGRpdiBkaXI9ImF1dG8iPjxicj48L2Rpdj48ZGl2
## IGRpcj0iYXV0byI+SGVyZSBpcyB0aGUgc2FtZSB0d2VldCBidXQgc2VudCBmcm9tIG15IEFuZHJv
## aWQgcGhvbmUuPC9kaXY+PC9ib2R5PjwvaHRtbD4=
## 
## ----_com.samsung.android.email_9939060766922910--
## 
cat(mail_bodys[[2]])
## Here is the first data viz that popped up on my Twitter feed.
## 
## https://twitter.com/marina10172013/status/1514280936095240194?s=20&t=1_WWgRGzcDEVmOpRMK6E8w
## 
## This mail was sent from my computer using Thunderbird.
## 
## Ahoy! (Just a random fun thing I wanted to write)
## 

As you can see, the mail sent from my computer is legible but the other one is gibberish. Thankfully, Allan Cameron helped me out on Stackoverflow to decode the mail. To decode the mail, the trick was to extract the parts between base64 and ----.

There are two such texts in the encoded mail. Surprisingly, the first one decoded to a text without line breaks. This is why we take the second encoded part and decode it. However, this will give us an HTML text with all kinds of tags like <div> and what not. Therefore, we use html_read() and html_text2() from the rvest package to handle that. All of this is summarized in this helper function.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
decode_encoded_mails <- function(encoded_mails) {
  # Ressource: https://stackoverflow.com/questions/71772972/translate-encoding-of-android-mail-in-r
  # Find location in each encoded string where actual text starts
  start_encoded <- encoded_mails %>% 
    str_locate_all('base64\r\n\r\n') %>% 
    map(~pluck(., 4) + 1) %>% 
    unlist()
  
  # Find location in each encoded string where actual text starts
  end_encoded <- encoded_mails %>% 
    str_locate_all('----') %>% 
    map(~pluck(., 3) - 1)%>% 
    unlist()
  
  # Use str_sub() to extract encoded text
  encoded_text <- tibble(
    string = unlist(encoded_mails), 
    start = start_encoded, 
    end = end_encoded
  ) %>% 
    pmap(str_sub) 
  
  # Decode: base64 -> raw -> char -> html -> text
  encoded_text %>% 
    map(base64enc::base64decode) %>% 
    map(rawToChar) %>% 
    map(rvest::read_html) %>% 
    map(rvest::html_text2)
}

I feel like this is the most hacky part of this blog post. Unfortunately, your milage may vary here. If your phone or whatever you use encodes the mails differently, then you may have to adjust the function. But I hope that I have explained enough details and concepts for you to manage that if it comes to this.

Recall that I send both plain mails from Thunderbird and encoded mails from Android. Therefore, here is another helper that decoded mails if neccessary from both types in one swoop.

1
2
3
4
5
6
7
8
decode_all_mails <- function(mail_bodys) {
  # Decode in case mail is base64 decoded
  is_encoded <- str_detect(mail_bodys, 'Content-Transfer-Encoding')
  encoded_mails <- mail_bodys[is_encoded]
  plain_mails <- mail_bodys[!is_encoded]
  decoded_mails <- encoded_mails %>% decode_encoded_mails()
  c(decoded_mails, plain_mails)
}

The remaining part of the code should be familiar:

The only new thing is that we use our postman connection to move the processed mails into a new directory (which I called “Processed”) on the email server. This way, the inbox is empty again or filled only with mails from unauthorized senders.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
if (!is_empty(mails)) {
  # Grab mail texts and URLs
  mail_bodys <- mails %>% con$fetch_text() %>% decode_all_mails
  urls <- mail_bodys %>% str_extract('https.*')
  
  # Remove mails from vector in case s.th. goes wrong 
  # and urls cannot be detected
  mail_bodys <- mail_bodys[!is.na(urls)]
  mails <- mails[!is.na(urls)]
  urls <- urls[!is.na(urls)]
  
  # For each url request twitter data
  requests <- map(urls, request_twitter_data, bearer_token = bearer_token)
  
  # Use requested twitter data to insert texts into Markdown template 
  # and write to current working directory
  replaced_templates_wo_mails <- 
    map(requests, replace_template_placeholder, template = 'template.md') 
  
  # Now that we have mails, replace that placeholder too
  replaced_templates <- replaced_templates_wo_mails %>% 
    map2(mail_bodys, ~str_replace(.x, 'insert_mail_here' ,.y)) %>% 
    map2(requests, ~write_replaced_text(.x, .y))
  
  # Move markdown files and extracted pngs to correct place on HDD
  walk2(
    requests, 
    replaced_templates, 
    move_files, 
    vault_location = vault_location, 
    attachments_dir = attachments_dir
  )
  
  # Move emails on imap server to Processed directory
  con$move_msg(mails, to_folder = 'Processed')
}

Last Step: Execute R script automatically

Alright, alright, alright. We made it. We have successfully

The only thing that is left to do is execute this script automatically. Again, if you don’t want to assemble the R script yourself using the code chunks in this blog post, check out this GitHub gist.

On Windows, you can write a VBS script that will execute the R script. Window’s task scheduler is easily set up to run that VBS script regularly, say every hour. For completeness’ sake let me give you an example VBS script. But beware that I have no frikkin clue how VBS scripts work beyond this simple call.

Set wshshell = WScript.CreateObject ("wscript.shell")
wshshell.run """C:\Program Files\R\R-4.0.5\bin\Rscript.exe"" ""D:\Local R Projects\Playground\TwitterTracking\my_twitter_script.R""", 6, True
set wshshell = nothing

The idea of this script is to call Rscript.exe and give it the location of the R script that we want to execute. Of course, you will need to adjust the paths to your file system. Notice that there are super many double quotes in this script. This is somewhat dumb but it’s the only way I could find to make file paths with white spaces work (see StackOverflow).

On Ubuntu (and probably other Unix-based systems), I am sure that every Unix user knows that there is CronTab to schedule regular tasks. On Mac, I am sure there is something. But instead of wandering even further from my expertise, I will refer to your internet search skills.

Mind the possibilities

We made it! We connected to Twitter’s API and our dummy email to get data viz (what’s the plural here? viz, vizz, vizzes, vizzeses?) into our note-taking system. Honestly, I think that was quite an endeavor. But now we can use the same ideas for all kind of other applications! From the top of my head I can think of more scenarios where similar solutions should be manageable. Here are two ideas.

So, enjoy the possibilities! If you liked this blog post, then consider following me on Twitter and/or subscribing to my RSS feed. Until next time!

To leave a comment for the author, please follow the link and comment on their blog: Albert Rapp.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.