Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Every time I download subtitles for a movie or a series episode, I cannot help thinking about all this text that could be analyzed with text mining methods. As my PhD came to an end in April and I started a postdoc in September, I could use some of my looong summer break to work on a new R package, subtools, which aims to provide a toolbox to read, manipulate and write subtitle files in R.
In this post I will briefly present the functions of subtools. For more details you can check the documentation.
Installation
The package is available on GitHub.
You can install it using the devtools package:
devtools::install_github("fkeck/subtools")
Read subtitles
A subtitle file can be read directly from R with the function read.subtitles. Currently, four subtitle formats are supported: SubRip, (Advanced) SubStation Alpha, MicroDVD and SubViewer. The parsers are probably not optimal but they seem to do the job as expected, at least with valid files. The package also provides some wrappers to import whole directories of series subtitles (see read.subtitles.season, read.subtitles.serie and read.subtitles.multiseries).
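To give an idea of what such a parser has to do, here is a minimal base R sketch for the SubRip layout (a numeric ID, a timecode line with "-->", then one or more text lines, with cues separated by blank lines). The function name parse_srt and the sample cues are my own illustration; this is not how subtools implements read.subtitles.

```r
# Hypothetical helper, NOT part of subtools: parse SubRip lines into a data.frame.
parse_srt <- function(lines) {
  # Drop carriage returns left over from Windows line endings
  lines <- sub("\r$", "", lines)
  # Blank lines separate cues: every blank line starts a new group
  is_blank <- !nzchar(trimws(lines))
  group <- cumsum(is_blank)
  blocks <- split(lines[!is_blank], group[!is_blank])
  # Keep only well-formed cues: ID, timecode line, at least one text line
  blocks <- Filter(
    function(b) length(b) >= 3 && grepl("-->", b[2], fixed = TRUE),
    blocks
  )
  rows <- lapply(blocks, function(b) {
    times <- trimws(strsplit(b[2], "-->", fixed = TRUE)[[1]])
    data.frame(
      id = as.integer(b[1]),
      start = times[1],
      end = times[2],
      # Multi-line cues are joined into a single string
      text = paste(b[-(1:2)], collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, unname(rows))
}

# Made-up sample content in SubRip layout
srt <- c(
  "1",
  "00:00:01,000 --> 00:00:03,500",
  "Hello there.",
  "",
  "2",
  "00:00:04,000 --> 00:00:06,000",
  "How are you,",
  "my friend?"
)
subs <- parse_srt(srt)
print(subs)
```

In practice you would pass the result of readLines() on a .srt file. The sketch ignores encodings and malformed files, which is where a real parser spends most of its effort.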
Manipulate subtitles
The subtools package stores imported subtitles as simple S3 objects of class Subtitles. They are lists with two main components: subtitles (IDs, timecodes and texts) and optional meta-data. Multiple Subtitles objects can be stored as a list of class MultiSubtitles. The subtools package provides functions to easily manipulate Subtitles and MultiSubtitles objects.
Basically, you can:
- combine subtitles objects with combineSubs
- extract parts of subtitles with [.Subtitles
- clean subtitles content with cleanTags, cleanCaptions and cleanPatterns
- reorganize subtitles as sentences with sentencify
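As a rough illustration of what the cleaning step involves, the base R sketch below strips formatting tags and bracketed captions from a few made-up subtitle lines. The function names strip_tags and strip_captions and the regular expressions are assumptions of mine; they are stand-ins for the idea, not the code behind cleanTags or cleanCaptions.

```r
# Illustrative stand-ins (hypothetical names), NOT the subtools functions.
strip_tags <- function(x) {
  x <- gsub("<[^>]+>", "", x)   # HTML-like tags such as <i>...</i>
  gsub("\\{[^}]*\\}", "", x)    # brace-style override tags such as {\an8}
}

strip_captions <- function(x) {
  # Remove [bracketed] or (parenthesised) captions
  x <- gsub("\\[[^]]*\\]|\\([^)]*\\)", "", x)
  # Collapse leftover whitespace and trim both ends
  trimws(gsub("[[:space:]]+", " ", x))
}

# Made-up subtitle lines
raw <- c(
  "<i>Hello there.</i>",
  "{\\an8}[door creaks] Who is it?",
  "(sighs) Nobody."
)
clean <- strip_captions(strip_tags(raw))
print(clean)
```

Removing everything in parentheses is a blunt rule: it also deletes legitimate parenthetical dialogue, so patterns like these are worth checking against the actual files.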
Extract/convert/write subtitles
Although you can conduct statistical analyses on subtitles objects, the package subtools is not designed for text mining. The following functions allow you to convert subtitles to other classes/formats in order to analyze them. You can:
- extract text content as a simple character string with rawText
- convert subtitles and meta-data to a virtual corpus with tmCorpus if you want to work with the standard text mining framework tm
- convert subtitles and meta-data to a data.frame with subDataFrame. If you want to use tidy data principles with tidytext and dplyr, you should probably start here.
- finally, it’s also possible to write subtitles objects to a file with write.subtitles, though it is unclear to me whether there is any sense in doing that.
Application: Game of Thrones subtitles wordcloud
To illustrate how subtools can be used to get started with a subtitles analysis project, I propose to create a wordcloud showing the most frequent words in the popular TV series Game of Thrones. We will use subtools to import the subtitles, the tm and SnowballC packages to pre-process the data and finally wordcloud to generate the cloud. I will not provide the data I use here, because subtitle files are in a grey zone concerning licensing. But no worries, it’s pretty easy to find subtitles on the Internet.
library(subtools)
library(tm)
library(SnowballC)
library(wordcloud)
Because the subtitles are correctly organized in directories, we can import them with a single command using the function read.subtitles.serie. The nice thing is that this function will try to automatically extract basic meta-data (like series title, season number and episode number) from directory and file names.
a <- read.subtitles.serie(dir = "/subs/Game of Thrones/")
a is a MultiSubtitles object with 60 Subtitles elements (episodes). We can convert it directly to a tm corpus using tmCorpus. Note that meta-data are preserved.
c <- tmCorpus(a)
And then, we can prepare the data:
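# Lower-case everything, then drop punctuation, digits and English stopwords
c <- tm_map(c, content_transformer(tolower))
c <- tm_map(c, removePunctuation)
c <- tm_map(c, removeNumbers)
c <- tm_map(c, removeWords, stopwords("english"))
# Collapse the extra whitespace left by the removals
c <- tm_map(c, stripWhitespace)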
Compute a term-document matrix and aggregate counts by season:
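# Term-document matrix: terms in rows, episodes in columns
TDM <- TermDocumentMatrix(c)
TDM <- as.matrix(TDM)
# Season index of each episode (6 seasons of 10 episodes, in order)
vec.season <- rep(1:6, each = 10)
# Sum the counts of each term within each season
TDM.season <- t(apply(TDM, 1, function(x) tapply(x, vec.season, sum)))
colnames(TDM.season) <- paste("S", 1:6)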
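The aggregation step is easier to follow on a toy matrix. The sketch below uses made-up counts for three terms over four episodes split into two seasons, and applies the same apply/tapply idiom. It also shows rowsum as an alternative that gives the same numbers. All object names here are mine.

```r
# Toy term-document matrix: 3 terms (rows) x 4 episodes (columns), made-up counts
tdm <- matrix(
  c(2, 0, 1, 3,
    0, 5, 0, 1,
    1, 1, 4, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("king", "wedding", "wall"), paste0("ep", 1:4))
)

# Season index of each episode: episodes 1-2 are season 1, episodes 3-4 season 2
season <- rep(1:2, each = 2)

# For each term (row), sum the counts of the episodes belonging to each season.
# apply() returns seasons in rows, hence the transpose.
by_season <- t(apply(tdm, 1, function(x) tapply(x, season, sum)))
colnames(by_season) <- paste0("S", 1:2)
print(by_season)

# Same totals with rowsum(), which sums the rows of a matrix by group
alt <- t(rowsum(t(tdm), season))
```

This only works if the columns are ordered by season and every season has the same number of episodes, which is what rep(1:6, each = 10) assumes for the real data. Note also that paste("S", 1:6) produces labels with a space ("S 1"), whereas paste0 gives "S1".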
And finally plot the cloud!
# Fix the seed so the word layout is reproducible
set.seed(100)
comparison.cloud(TDM.season, title.size = 1, max.words = 100, random.order = TRUE)
A few words about this plot. Like every wordcloud, I think it’s a very simple and limited descriptive way to represent the information. However, I like it. People who have watched the TV show will look at it and say "Oh, of course!". In one hundred words, the cloud does not reveal the plot of GoT, but for each season I can see one or two critical events popping out (wedding, kill/joffrey, queen/shame, hodor/hold/door).
What I find funny here (perhaps interesting?) is that these very important and emotional moments are supported in the dialogue by the repetition of one or two keywords. I don’t know if this is exclusive to GoT, or if it’s a trick of my mind, or something else. And I will not make any hypothesis, as I’m not a linguist. But this is the first idea that came to my mind and I wanted to write it down.
Now, there are plenty of text-mining ideas, hypotheses and methods that can be tested with movie and series subtitles. So have fun!