Retrieve & process TV News chyrons with newsflash
The Internet Archive recently announced a new service they’ve dubbed ‘Third Eye’. This service scrapes the chyrons that annoyingly scroll across the bottom-third of TV news broadcasts. IA has a vast historical archive of TV news that they’ll eventually process, but — for now — the more recent broadcasts from four channels are readily available. There’s tons of information about the project on its main page where you can interactively work with the API if that’s how you roll.
Since my newsflash package already had a “news” theme and worked with the joint IA-GDELT project TV data, it seemed to be a good home for a Third Eye interface to live.
Basic usage
You can read long-form details of the Third Eye service on their site. The TLDR is that they provide two feeds:
- a “raw” one which has massive duplicates and tons of errors
- a “clean” one that filters out duplicates, cleans up the text and is much better to work with
You can retrieve either with newsflash::read_chyrons(), but the default is to use the clean feed. If you are studying text processing and/or NLP/text-cleanup via machine learning, then the raw feed may be very interesting for you. I suspect most data journalists will want to use the clean feed that also powers the IA chyron Twitter bots.
Since it’s the Internet Archive, they’re awesome at providing metadata about their data. Heck, even their metadata has metadata about metadata. We can use the fact that they provide a metadata feed to enable listing available chyron archive dates:
library(newsflash) # devtools::install_github("hrbrmstr/newsflash")
library(hrbrthemes)
library(tidyverse)

list_chyrons()
## # A tibble: 61 x 3
##            ts    type     size
##        <date>   <chr>    <dbl>
##  1 2017-09-30 cleaned   539061
##  2 2017-09-30     raw 17927121
##  3 2017-09-29 cleaned   635812
##  4 2017-09-29     raw 19234407
##  5 2017-09-28 cleaned   414067
##  6 2017-09-28     raw 12663606
##  7 2017-09-27 cleaned   613474
##  8 2017-09-27     raw 20442644
##  9 2017-09-26 cleaned   659930
## 10 2017-09-26     raw 19942951
## # ... with 51 more rows
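To give a feel for working with that listing without hitting the network, here is a minimal sketch. It assumes a data frame shaped like the list_chyrons() output above, with four rows hand-copied from it. It picks out the cleaned archives and compares raw and cleaned sizes per day. The object names (archives, cleaned, sizes) are made up for illustration.

```r
# Four rows hand-copied from the list_chyrons() output shown above
archives <- data.frame(
  ts   = as.Date(c("2017-09-30", "2017-09-30", "2017-09-29", "2017-09-29")),
  type = c("cleaned", "raw", "cleaned", "raw"),
  size = c(539061, 17927121, 635812, 19234407),
  stringsAsFactors = FALSE
)

# Keep only the cleaned archives, newest first
cleaned <- archives[archives$type == "cleaned", ]
cleaned <- cleaned[order(cleaned$ts, decreasing = TRUE), ]

# Pair each day's raw and cleaned entries and compute a size ratio
raw   <- archives[archives$type == "raw", ]
sizes <- merge(cleaned, raw, by = "ts", suffixes = c("_cleaned", "_raw"))
sizes$ratio <- sizes$size_raw / sizes$size_cleaned

sizes[, c("ts", "size_cleaned", "size_raw", "ratio")]
```

Base R is used here only to keep the sketch dependency-free; the same steps map directly onto filter() and arrange() if you already have the tidyverse loaded.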
Reading the chyrons in only requires passing in a Date object or a YYYY-mm-dd format date string:
chyrons <- read_chyrons("2017-09-30")

glimpse(chyrons)
## Observations: 2,729
## Variables: 5
## $ ts       <dttm> 2017-09-30 00:00:00, 2017-09-30 00:00:00, 2017-09-30 00:00:00, 2017-09-30...
## $ channel  <chr> "BBCNEWS", "CNNW", "FOXNEWSW", "BBCNEWS", "CNNW", "MSNBCW", "BBCNEWS", "CN...
## $ duration <int> 18, 42, 26, 10, 47, 19, 14, 62, 26, 11, 45, 17, 35, 11, 62, 32, 35, 35, 15...
## $ details  <chr> "BBCNEWS_20170929_233000_Race_and_Pace/start/1800", "CNNW_20170929_230000_...
## $ text     <chr> "TRUMP CABINET SECRETARY QUITS\\n'MIRACLE NEEDED' ON BREXIT", "TRUMP BRAGS...
You get five columns in a data frame on a successful retrieval:
- ts (POSIXct) chyron timestamp
- channel (character) news channel the chyron appeared on
- duration (integer) see Description
- details (character) Internet Archive details path
- text (character) the chyron text
We’ll talk about the details path in a bit. The text is likely what you want, so here’s a sample:
head(chyrons$text, 30)
##  [1] "TRUMP CABINET SECRETARY QUITS\\n'MIRACLE NEEDED' ON BREXIT"
##  [2] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL\\nAnderson Cooper"
##  [3] "ALIFORNIA STUDENT SWIPES 'MAGA' HAT"
##  [4] "US HEALTH SECRETARY QUITS. Mr Price apologised for use O126 private \\ufb02ights since May\\nUS HEALTH SECRETARY QUITS. Private flights cost taxpayers 4OO,OOO dollars\\nLAURA BICKER. Washington"
##  [5] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nTRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"
##  [6] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"
##  [7] "US HEALTH SECRETARY QUITS. Private flights cost taxpayers 4OO,OOO dollars\\nUS HEALTH SECRETARY QUITS. Government otficials required to take commercial \\ufb02ights\\nUS HEALTH SECRETARY QUITS. Scandal emerged after..."
##  [8] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL"
##  [9] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL\\nTRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL"
## [10] "US HEALTH SECRETARY QUITS. Scandal emerged after investigation by Politico magazine\\nUS HEALTH SECRETARY QUITS. Tom Price resigned over use of private planes\\nUS HEALTH SECRETARY QUITS. Mr Price apologised for use O126..."
## [11] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nHHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL. . Ryan Nobles (J\\\\N Washington Correspondent"
## [12] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I"
## [13] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL\\nREPORTER WHO BROKE PRICE SCANDAL SPEAKS OUT"
## [14] "US HEALTH SECRETARY QUITS. Tom Price resigned over use of private planes\\nUS HEALTH SECRETARY QUITS. Scandal emerged after investigation by Politico magazine\\nUS HEALTH SECRETARY QUITS. Mr Price apologised for..."
## [15] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL"
## [16] "BIZARRE LIBERAL MELTDOWNS I\\nTUCKER & THE CAT IN THE HAT I. . _ < 'rnnwnn FAD! cnm tint-\\ufb01nk"
## [17] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL\\nTOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"
## [18] "HHS SECY. PRICE OUT AFI'ER PRIVATE JET SCANDAL\\nTRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"
## [19] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I\\nBIZARRE LIBERAL MELTDOWN I"
## [20] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL"
## [21] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL"
## [22] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS I\\nSCHOOL LIBRARIAN REJECTS DR. SEUSS. BOOKS GIFTED BY MELANIA TRUMP. . _' tnnx'nkr"
## [23] "TRUMP: \\\"I CERTAINLY DON'T LIKE THE OPTICS\\\" OF PRICE SCANDAL\\nTOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"
## [24] "YEMEN WAR CRIMES. UN Human Rights Council agrees on investigation\\nINIGO MENDEZ DE VIGO. Spanish Education Minister"
## [25] "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL\\nSAN JUAN MAYOR: \\\"THIS IS NOT A GOOD NEWS STORY\\\"\\nSAN JUAN MAYOR: \\\"THIS IS NOT A GOOD NEWS STORY\\\". . Mavor Carmen Yulin Cruz San Juan,..."
## [26] "BRARIAN REJECTS \\\"RACIST\\\" DR. SEUSS BOOKS"
## [27] "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"
## [28] "TRUMP ASIA TOUR. US President to visit Japan, South Korea and China\\nYEMEN WAR CRIMES. UN Human Rights Council agrees on investigation"
## [29] "SAN JUAN MAYOR: \\\"MAD AS HELL\\\" OVER HURRICANE RESPONSE\\nSAN JUAN MAYOR: \\\"MAD AS HELLII OVER HURRICANE RESPONSE. . Dr. Saniav Gupta (J\\\\N Chief Medical Correspondent"
## [30] "SAN JUAN MAYOR: \\\"MAD AS HELL\\\" OVER HURRICANE RESPONSE"
Be warned: even the “clean” text is often kinda messy.
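Many of the sample entries pack several on-screen lines into one string, and the same line often recurs across entries. A first cleanup pass might split each entry into its individual lines and drop exact repeats. The sketch below does that on two entries hand-copied from the sample. One assumption to flag: whether the separator arrives as a real newline or as a literal backslash followed by "n" depends on how the feed was parsed, so the pattern accepts both.

```r
# Two entries hand-copied from the sample output above
txt <- c(
  "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL\\nREPORTER WHO BROKE PRICE SCANDAL SPEAKS OUT",
  "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"
)

# Split on either a literal backslash-n ("\\\\n" as a regex in R source)
# or a real newline character ("\n")
pieces <- strsplit(txt, "\\\\n|\n")

# Flatten, trim stray whitespace, and drop exact duplicates
chyron_lines <- unique(trimws(unlist(pieces)))
chyron_lines
```

This only removes exact duplicates; OCR variants of the same line (the "AFI'ER" style errors in the sample) would need fuzzy matching, which is out of scope for this sketch.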
For now, there are only four channels, so it’s easy to show a quick example. Since chyrons are supposed to be super-important things you need to know NOW, let’s see how many times Puerto Rico was mentioned on them in the above archive. NOTE: This is a quick example, not a thorough one. I’m searching for some key letter combinations to see just mentions of something looking like “Puerto Rico”. “San Juan” and other text that might be associated with the topic aren’t being considered for this toy example.
mutate(
  chyrons,
  hour = lubridate::hour(ts),
  text = tolower(text),
  mention = grepl("erto ri", text)
) %>%
  filter(mention) %>%
  count(hour, channel) %>%
  ggplot(aes(hour, n)) +
  geom_segment(aes(xend=hour, yend=0)) +
  scale_x_continuous(name="Hour (GMT)", breaks=seq(0, 23, 6),
                     labels=sprintf("%02d:00", seq(0, 23, 6))) +
  scale_y_continuous(name="# Chyrons", limits=c(0,30)) +
  facet_wrap(~channel, scales="free") +
  labs(title="Chyrons mentioning 'Puerto Rico' per hour per channel",
       subtitle="Chyron date: 2017-09-30",
       caption="Source: Internet Archive Third Eye project & <github.com/hrbrmstr/newsflash>") +
  theme_ipsum_rc(grid="Y")
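As noted, the "erto ri" pattern ignores related phrases such as "San Juan". A small sketch of how widening the pattern changes the count, using three strings hand-copied from the sample output rather than the full archive. The broader pattern is only an illustrative guess at related terms, not a vetted topic filter.

```r
# Three entries hand-copied from the sample output shown earlier
sample_text <- c(
  "TRUMP BRAGS ABOUT PUERTO RICO RESPONSE AS FED-UP. SURVIVORS PLEAD FOR ELECTRICITY, WATER, FUEL",
  "SAN JUAN MAYOR: \\\"MAD AS HELL\\\" OVER HURRICANE RESPONSE",
  "TOM PRICE RESIGNS AMID PRIVATE JET SCANDAL"
)

lowered <- tolower(sample_text)

# The pattern used in the plot above
narrow <- grepl("erto ri", lowered)

# Illustrative wider pattern: also count "san juan" mentions
broad <- grepl("erto ri|san juan", lowered)

c(narrow = sum(narrow), broad = sum(broad))
```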
Details, details, details
Entries in the details column look like this:
head(chyrons$details)
## [1] "BBCNEWS_20170929_233000_Race_and_Pace/start/1800"
## [2] "CNNW_20170929_230000_Erin_Burnett_OutFront/start/3600"
## [3] "FOXNEWSW_20170929_230000_The_Story_With_Martha_MacCallum/start/3600"
## [4] "BBCNEWS_20170930_000000_BBC_News/start/60"
## [5] "CNNW_20170930_000000_Anderson_Cooper_360/start/60"
## [6] "MSNBCW_20170930_000000_All_In_With_Chris_Hayes/start/60"
They are path fragments that can be attached to a URL prefix to see the news clip from that station on that day/time. newsflash::view_clip() does that work for you:
view_clip(chyrons$details[2])
The URL for that is https://archive.org/details/CNNW_20170929_230000_Erin_Burnett_OutFront/start/3600/end/3660 in the event the iframe load failed or you really like being annoyed with cable news shows.
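If you just want the URLs (say, to put in a spreadsheet), the mapping from a details fragment to a clip URL can be sketched by hand. clip_url() below is a hypothetical helper, not a newsflash function. The 60-second window is inferred from the single example above (start/3600 became end/3660) and may not match what view_clip() does in every case.

```r
# Hypothetical helper (not part of newsflash): build an archive.org clip URL
# from a details path fragment. The default 60-second clip length is an
# assumption inferred from the one example URL in the post.
clip_url <- function(details, clip_seconds = 60L) {
  # Pull the numeric start offset off the end of the fragment
  start <- as.integer(sub(".*/start/([0-9]+)$", "\\1", details))
  end   <- start + as.integer(clip_seconds)
  sprintf("https://archive.org/details/%s/end/%d", details, end)
}

clip_url("CNNW_20170929_230000_Erin_Burnett_OutFront/start/3600")
```

Because sub() and sprintf() are vectorized, the same call works on a whole details column at once.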
FIN
Grab the package on GitHub, kick the tyres and don’t hesitate to file issues, questions or jump on board with package development. There’s plenty of room for improvement before it hits CRAN and your ideas are most welcome.