R⁶ — Using pandoc from R + A Neat Package For Reading Subtitles
Once I realized that my planned, larger post would not come to fruition today, I took the R⁶ post (i.e. “minimal expository, keen focus”) route, prompted by a Twitter discussion with some R mates who needed to convert “lightly formatted” Microsoft Word (docx) documents to markdown. Something like this:
to:
Does pandoc work?
=================

Simple document with **bold** and *italics*.
This is definitely a job that pandoc can handle.
pandoc is a Haskell (yes, Haskell) program created by John MacFarlane and is an amazing tool for transcoding documents. And, if you’re a “modern” R/RStudio user, you likely use it every day because it’s ultimately what powers rmarkdown / knitr.
Yes, you read that correctly. Your beautiful PDF, Word and HTML R reports are powered by (and would not be possible without) Haskell.
Doing the aforementioned conversion from docx to markdown is super-simple from R:
rmarkdown::pandoc_convert("simple.docx", "markdown", output="simple.md")
Give the help on rmarkdown::pandoc_convert() a read as well as the very thorough and helpful documentation over at pandoc.org (http://pandoc.org) to see the power available at your command.
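If you have a whole folder of Word files, the one-liner generalizes easily. What follows is only a sketch, not anything from the Twitter thread: `md_path_for()` and `convert_docx_dir()` are hypothetical helper names I made up for illustration, and passing `--wrap=none` (a pandoc flag that turns off hard line wrapping) is just one option you might want.

```r
# Hypothetical helper: swap a .docx extension for .md
md_path_for <- function(docx_path) {
  sub("\\.docx$", ".md", docx_path, ignore.case = TRUE)
}

# Hypothetical helper: convert every .docx file in a directory to markdown
convert_docx_dir <- function(dir = ".") {
  # Bail out early if rmarkdown cannot find a pandoc binary
  if (!rmarkdown::pandoc_available()) {
    stop("pandoc was not found; install it or run this from RStudio")
  }

  docx_files <- list.files(dir, pattern = "\\.docx$",
                           full.names = TRUE, ignore.case = TRUE)

  for (f in docx_files) {
    # Absolute paths avoid surprises with pandoc_convert()'s working directory
    input <- normalizePath(f)
    rmarkdown::pandoc_convert(
      input   = input,
      to      = "markdown",
      from    = "docx",
      output  = md_path_for(input),
      options = c("--wrap=none")  # assumption: you want unwrapped lines
    )
  }

  # Return the output paths without printing them
  invisible(md_path_for(docx_files))
}
```

Calling `convert_docx_dir("reports")` should then leave a `.md` file next to each `.docx` in a (hypothetical) `reports` folder.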
Just One More Thing
This section — technically — violates the R⁶ principle so you can stop reading if you’re a purist 🙂
There’s a neat, not-on-CRAN package by François Keck called subtools (https://github.com/fkeck/subtools) which can slice, dice and reformat digital content subtitles. There are multiple formats for these subtitle files and it seems to be able to handle them all.
There was a post (earlier in April) about Ranking the Negativity of Black Mirror Episodes. That post uses Python and I’ve never had time to fully replicate it in R.
Here’s a snippet (sans expository) that can get you started pulling subtitles into R and tidytext. I would have written scraper code but the various subtitle aggregation sites make that a task suited for something like my splashr package and I just had no cycles to write the code. So, I grabbed the first season of “The Flash” and used the Bing sentiment lexicon from tidytext to see how the season looked.
The overall scoring for a given episode is naive and can definitely be improved upon.
Definitely drop a link to anything you create in the comments!
# devtools::install_github("fkeck/subtools")
library(subtools)
library(tidytext)
library(hrbrthemes)
library(tidyverse)

data(stop_words)

bing <- get_sentiments("bing")
afinn <- get_sentiments("afinn")

# one .srt subtitle file per episode
fils <- list.files("flash/01", pattern = "srt$", full.names = TRUE)

pb <- progress_estimated(length(fils))

# read each episode, tokenize it into words and attach the sentiment lexicons
map_df(1:length(fils), ~{
  pb$tick()$print()
  read.subtitles(fils[.x]) %>%
    sentencify() %>%
    .$subtitles %>%
    unnest_tokens(word, Text) %>%
    anti_join(stop_words, by="word") %>%
    inner_join(bing, by="word") %>%
    inner_join(afinn, by="word") %>%
    mutate(season = 1, ep = .x)
}) %>%
  as_tibble() -> season_sentiments

# share of positive vs. negative words within each episode; the explicit
# group_by() is needed because count() returns ungrouped data in current dplyr
count(season_sentiments, ep, sentiment) %>%
  group_by(ep) %>%
  mutate(pct = n/sum(n),
         pct = ifelse(sentiment == "negative", -pct, pct)) %>%
  ungroup() -> bing_sent

ggplot() +
  geom_ribbon(data = filter(bing_sent, sentiment=="positive"),
              aes(ep, ymin=0, ymax=pct, fill=sentiment), alpha=3/4) +
  geom_ribbon(data = filter(bing_sent, sentiment=="negative"),
              aes(ep, ymin=0, ymax=pct, fill=sentiment), alpha=3/4) +
  scale_x_continuous(expand=c(0,0.5), breaks=seq(1, 23, 2)) +
  scale_y_continuous(expand=c(0,0), limits=c(-1,1),
                     labels=c("100%\nnegative", "50%", "0", "50%", "positive\n100%")) +
  labs(x="Season 1 Episode", y=NULL, title="The Flash — Season 1",
       subtitle="Sentiment balance per episode") +
  scale_fill_ipsum(name="Sentiment") +
  guides(fill = guide_legend(reverse=TRUE)) +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.y=element_text(vjust=c(0, 0.5, 0.5, 0.5, 1)))
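Since the per-episode scoring is admittedly naive, here is one possible refinement to play with: a single "net sentiment" number per episode, computed as (positive - negative) / (positive + negative). This is only a sketch. The `toy_counts` data frame is made up and merely stands in for the output of `count(season_sentiments, ep, sentiment)`, and the `net` column name is my own choice.

```r
library(dplyr)
library(tidyr)

# Made-up counts standing in for count(season_sentiments, ep, sentiment)
toy_counts <- tibble(
  ep        = c(1, 1, 2, 2),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(60, 40, 30, 70)
)

# One row per episode with a net score in [-1, 1]:
# -1 means every matched word was negative, +1 means every one was positive
net_sentiment <- toy_counts %>%
  pivot_wider(names_from = sentiment, values_from = n,
              values_fill = list(n = 0)) %>%
  mutate(net = (positive - negative) / (positive + negative))

net_sentiment
```

With the toy numbers, episode 1 nets out at -0.2 and episode 2 at 0.4. A single signed value per episode like this is easy to feed to `geom_col()` or `geom_line()` if the ribbon chart feels too busy.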