After you click “Create File,” it will take a while to compile; you’ll get an email when it’s ready. You’ll need to re-enter your password when you go to download the file.
The result is a Zip file, which contains folders for Posts, Photos, and Videos. Posts includes your own posts (on your and others’ timelines) as well as posts from others on your timeline. And, of course, the file needed a bit of cleaning. Here’s what I did.
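If you’d rather unpack the archive from R than from your file manager, base R’s unzip() handles it; the file name below is just a placeholder, since Facebook names the archive after your account:

# Placeholder file name -- yours will be named after your account
unzip("facebook-yourusername.zip", exdir = "facebook-yourusername")
list.files("facebook-yourusername")  # should show the posts, photos, and videos folders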
Since the post data is a JSON file, I need the jsonlite package to read it.
setwd("C:/Users/slocatelli/Downloads/facebook-saralocatelli35/posts")

library(jsonlite)
FBposts <- fromJSON("your_posts.json")
This creates a large list object, with my data in a data frame. So as I did with the Taylor Swift albums, I can pull out that data frame.
myposts <- FBposts$status_updates
The resulting data frame has 5 columns:

- timestamp: when the post was made, in UNIX format
- attachments: any photos, videos, URLs, or Facebook events attached to the post
- title: always starts with the author of the post (you or your friend who posted on your timeline), followed by the type of post
- data: the text of the post
- tags: the people you tagged in the post
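For a quick sanity check on that structure, base R is enough:

# Confirm the column names and peek at the raw UNIX timestamps
names(myposts)
head(myposts$timestamp)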
First, I converted the timestamp to datetime, using the anytime package.
library(anytime)
myposts$timestamp <- anytime(myposts$timestamp)
Next, I wanted to pull out post author, so that I could easily filter the data frame to only use my own posts.
library(tidyverse)

# The first two words of title are the post author's first and last name
myposts$author <- word(string = myposts$title, start = 1, end = 2, sep = fixed(" "))
Finally, I was interested in extracting URLs I shared (mostly from YouTube or my own blog) and the text of my posts, which I did with some regular expression functions and some help from Stack Overflow (here and here).
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
myposts$links <- str_extract(myposts$attachments, url_pattern)

library(qdapRegex)

# The post text sits between double quotes inside the data column
myposts$posttext <- myposts$data %>%
  rm_between('"', '"', extract = TRUE)
There’s more cleaning I could do, but this gets me a data frame I could use for some text analysis. Let’s look at my most frequent words.
myposts$posttext <- as.character(myposts$posttext)

library(tidytext)

mypost_text <- myposts %>%
  unnest_tokens(word, posttext) %>%
  anti_join(stop_words)

## Joining, by = "word"

counts <- mypost_text %>%
  filter(author == "Sara Locatelli") %>%
  drop_na(word) %>%
  count(word, sort = TRUE)

counts

## # A tibble: 9,753 x 2
##    word         n
##    <chr>    <int>
##  1 happy     4702
##  2 birthday  4643
##  3 today's    666
##  4 song       648
##  5 head       636
##  6 day        337
##  7 post       321
##  8 009f       287
##  9 ð          287
## 10 008e       266
## # ... with 9,743 more rows
These data include all my posts, including writing “Happy birthday” on others’ timelines. I also frequently post the song in my head when I wake up in the morning (over 600 times, it seems). If I wanted to remove those, and only count the times I said “happy” or “song” outside of those posts, I’d need to apply a filter at an earlier step (sketched below). There are also some strange characters that I want to clean out of the data before doing anything else with it. Keeping only words that contain letters with str_detect easily drops the pure numbers, but tokens that mix numbers and letters, such as “008e”, contain letters too, so they survive that filter. I’ll just filter those out separately.
drop_nums <- c("008a", "008e", "009a", "009c", "009f")

counts <- counts %>%
  filter(str_detect(word, "[a-z]+"),   # keep words that contain letters (drops pure numbers)
         !word %in% drop_nums)         # drop the mixed letter-number codes by hand
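And if I wanted the post-level filter mentioned above, a minimal sketch would look something like this (the patterns are hypothetical and would need adjusting to how the posts are actually worded):

# A sketch of filtering out birthday and song-in-my-head posts before tokenizing;
# the patterns here are guesses at the wording, not tested against the real data
myposts_filtered <- myposts %>%
  drop_na(posttext) %>%
  filter(author == "Sara Locatelli",
         !str_detect(tolower(posttext), "happy birthday"),
         !str_detect(tolower(posttext), "song in my head"))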
Now I could, for instance, create a word cloud.
library(wordcloud)

counts %>%
  with(wordcloud(word, n, max.words = 50))