Movies and Series subtitles in R with subtools

[This article was first published on R_EN – Piece of K, and kindly contributed to R-bloggers.]

Every time I download subtitles for a movie or a series episode, I cannot help thinking about all this text that could be analyzed with text mining methods. As my PhD came to an end in April and my postdoc started in September, I used part of my looong summer break to work on a new R package, subtools, which aims to provide a toolbox to read, manipulate and write subtitle files in R.

In this post I will briefly present the functions of subtools. For more details, you can check the documentation.

Installation

The package is available on GitHub.
You can install it using the devtools package:

devtools::install_github("fkeck/subtools")

Read subtitles

A subtitle file can be read directly from R with the function read.subtitles. Four subtitle formats are currently supported: SubRip, (Advanced) SubStation Alpha, MicroDVD and SubViewer. The parsers are probably not optimal, but they seem to do the job as expected, at least with valid files. The package also provides wrappers to import whole directories of series subtitles (see read.subtitles.season, read.subtitles.serie and read.subtitles.multiseries).
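To give an idea of what reading a SubRip file involves, here is a minimal base R sketch. It is not the parser used by subtools: the helper parse_srt and the column names are hypothetical, and it assumes a well-formed file where blocks are separated by a single blank line.

```r
# A tiny SubRip (.srt) sample held in memory; a real file would be read with readLines().
srt <- c(
  "1",
  "00:00:01,000 --> 00:00:03,500",
  "Winter is coming.",
  "",
  "2",
  "00:00:04,000 --> 00:00:06,000",
  "You know nothing,",
  "Jon Snow."
)

# Hypothetical helper (not part of subtools): split the lines into blocks
# separated by blank lines, then extract ID, timecodes and text from each block.
parse_srt <- function(lines) {
  block <- cumsum(lines == "")   # block index increases at every blank line
  keep <- lines != ""            # drop the blank separators themselves
  blocks <- split(lines[keep], block[keep])
  rows <- lapply(blocks, function(b) {
    times <- strsplit(b[2], " --> ", fixed = TRUE)[[1]]
    data.frame(
      ID = as.integer(b[1]),
      Timecode.in = times[1],
      Timecode.out = times[2],
      Text = paste(b[-(1:2)], collapse = " "),  # join multi-line subtitles
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

subs <- parse_srt(srt)
print(subs)
```

Each subtitle becomes one row, with multi-line texts joined by a space. Real-world files (formatting tags, odd encodings, missing blank lines) need more care than this.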

Manipulate subtitles

The subtools package stores imported subtitles as simple S3 objects of class Subtitles. They are lists with two main components: subtitles (IDs, timecodes and texts) and optional meta-data. Multiple Subtitles objects can be stored as a list of class MultiSubtitles. The subtools package provides functions to easily manipulate Subtitles and MultiSubtitles objects.
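The structure described above can be pictured with a small base R sketch. The constructor new_subtitles, the element names and the column names are assumptions made for illustration. They are not taken from the package internals.

```r
# Hypothetical constructor: a list with a subtitles table and optional
# meta-data, tagged with the S3 class "Subtitles".
new_subtitles <- function(subtitles, metadata = list()) {
  structure(list(subtitles = subtitles, metadata = metadata),
            class = "Subtitles")
}

ep <- new_subtitles(
  subtitles = data.frame(
    ID = 1:2,
    Timecode.in = c("00:00:01.000", "00:00:04.000"),
    Timecode.out = c("00:00:03.500", "00:00:06.000"),
    Text = c("Winter is coming.", "You know nothing, Jon Snow."),
    stringsAsFactors = FALSE
  ),
  metadata = list(season = 1, episode = 1)
)

# Several episodes are simply a list carrying its own class.
season <- structure(list(ep, ep), class = "MultiSubtitles")

inherits(ep, "Subtitles")
length(season)
```

Because these are plain lists, the usual tools (lapply, `[[`, `$`) work on them directly.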

Basically, you can:

Extract/convert/write subtitles

Although you can conduct statistical analyses on Subtitles objects, subtools itself is not designed for text mining. The following functions let you convert subtitles to other classes/formats for analysis. You can:

Application: Game of Thrones subtitles wordcloud

To illustrate how subtools can be used to get started with a subtitles analysis project, I will create a wordcloud showing the most frequent words in the popular TV series Game of Thrones. We will use subtools to import the subtitles, the tm and SnowballC packages to pre-process the data, and finally wordcloud to generate the cloud. I will not provide the data I use here, because subtitle files are in a grey zone concerning licensing. But no worries, it’s pretty easy to find subtitles on the Internet.

library(subtools)
library(tm)
library(SnowballC)
library(wordcloud)

Because the subtitles are correctly organized in directories, we can import them with a single command using the function read.subtitles.serie. The nice thing is that this function will try to automatically extract basic meta-data (like series title, season number and episode number) from the directory and file names.

a <- read.subtitles.serie(dir = "/subs/Game of Thrones/")
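As a rough illustration of how meta-data can be guessed from paths, here is a base R sketch. It does not reproduce what read.subtitles.serie actually does: the example paths, the series/season/file directory layout and the "SxxEyy" naming pattern are all assumptions.

```r
# Hypothetical paths following a series/season/file layout.
paths <- c(
  "Game of Thrones/Season 1/GoT.S01E01.srt",
  "Game of Thrones/Season 1/GoT.S01E02.srt",
  "Game of Thrones/Season 6/GoT.S06E10.srt"
)

# Capture the season and episode numbers from the file name.
pattern <- "S([0-9]+)E([0-9]+)"
files <- basename(paths)
hits <- regmatches(files, regexec(pattern, files))

meta <- data.frame(
  serie = basename(dirname(dirname(paths))),  # two levels up: series directory
  season = as.integer(vapply(hits, `[`, character(1), 2)),
  episode = as.integer(vapply(hits, `[`, character(1), 3)),
  stringsAsFactors = FALSE
)
print(meta)
```

File names that do not follow the assumed pattern would need their own rules, so it is worth checking the extracted meta-data after import.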

a is a MultiSubtitles object with 60 Subtitles elements (episodes). We can convert it directly to a tm corpus using tmCorpus. Note that meta-data are preserved.

c <- tmCorpus(a)

And then, we can prepare the data:

c <- tm_map(c, content_transformer(tolower))
c <- tm_map(c, removePunctuation)
c <- tm_map(c, removeNumbers)
c <- tm_map(c, removeWords, stopwords("english"))
c <- tm_map(c, stripWhitespace)
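To see what these steps do to a single line of dialog, here is a rough base R approximation. It is only a sketch: the sentence and the tiny stop word list are made up, and tm's own transformations differ in the details (for instance, this version also trims leading and trailing spaces).

```r
line <- "The Night's Watch has 3 rangers, and THE Wall is cold."
stops <- c("the", "has", "and", "is")  # illustrative stop list, much shorter than tm's

x <- tolower(line)                      # lower case
x <- gsub("[[:punct:]]+", "", x)        # remove punctuation
x <- gsub("[[:digit:]]+", "", x)        # remove numbers
x <- gsub(paste0("\\b(", paste(stops, collapse = "|"), ")\\b"),
          "", x, perl = TRUE)           # remove stop words
x <- trimws(gsub("[[:space:]]+", " ", x))  # collapse runs of whitespace
x
```

Note that removing punctuation before stop words turns "night's" into "nights", so the order of the steps affects which tokens end up in the corpus.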

Compute a term-document matrix and aggregate counts by season:

TDM <- TermDocumentMatrix(c)
TDM <- as.matrix(TDM)
vec.season <- rep(1:6, each = 10)
TDM.season <- t(apply(TDM, 1, function(x) tapply(x, vec.season, sum)))
colnames(TDM.season) <- paste("S", 1:6)
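The aggregation step is easier to follow on a toy matrix. The sketch below uses made-up counts for three terms over four episodes grouped into two hypothetical seasons. Like the code above, it assumes that the columns are ordered by episode and that every season has the same number of episodes.

```r
# Toy term-document matrix: 3 terms (rows) x 4 episodes (columns).
tdm <- matrix(
  c(2, 0, 1, 3,
    0, 1, 0, 0,
    5, 5, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("wedding", "shame", "king"), paste0("ep", 1:4))
)

# Two hypothetical seasons of two episodes each.
season <- rep(1:2, each = 2)

# For each term, sum the counts of the episodes belonging to the same season.
by_season <- t(apply(tdm, 1, function(x) tapply(x, season, sum)))
colnames(by_season) <- paste0("S", 1:2)
by_season

# rowsum() gives the same totals by grouping the rows of the transposed matrix.
alt <- t(rowsum(t(tdm), group = season))
```

If the episodes are not imported in order, or if seasons differ in length, the grouping vector should be built from the season meta-data rather than with rep().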

And finally plot the cloud!

set.seed(100)
comparison.cloud(TDM.season, title.size = 1, max.words = 100, random.order = TRUE)
Subtitles wordcloud of the six seasons of Game of Thrones.
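As far as I understand the wordcloud documentation, a comparison cloud places each word in the document (here, the season) where its relative frequency deviates most from its average across documents, and scales the word by that deviation. The sketch below reproduces this logic on made-up counts. It illustrates the idea and is not the package's code.

```r
# Made-up counts for three words in two seasons.
counts <- matrix(
  c(10, 2,
     1, 9,
     5, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("wedding", "hodor", "lord"), c("S1", "S2"))
)

# Rate of each word within each season.
rates <- sweep(counts, 2, colSums(counts), "/")

# Deviation of each rate from the word's average rate across seasons.
dev <- rates - rowMeans(rates)

# Season where each word is most over-represented, and by how much.
top_season <- colnames(dev)[apply(dev, 1, which.max)]
max_dev <- apply(dev, 1, max)

data.frame(word = rownames(counts), top_season, max_dev)
```

A word used equally often everywhere, like "lord" here, gets a deviation of zero. This is why the cloud highlights season-specific words rather than the most frequent ones overall.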

A few words about this plot. Like every wordcloud, I think it’s a very simple and limited descriptive way to represent the information. However, I like it. People who have watched the TV show will look at it and say “Oh, of course!”. With one hundred words, the cloud does not reveal the storyline of GoT, but for each season I can see one or two critical events popping out (wedding, kill/joffrey, queen/shame, hodor/hold/door).
What I find funny here (perhaps interesting?) is that these very important and emotional moments are supported in the dialogs by the repetition of one or two keywords. I don’t know if this is exclusive to GoT, or if it’s a trick of my mind, or something else. I will not make any hypothesis, since I’m not a linguist. But this is the first idea that came to my mind and I wanted to write it down.

Now, there are plenty of text-mining ideas, hypotheses and methods that can be tested with movie and series subtitles. So have fun!
