
Seinfeld Characters – A Post About Nothing

[This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers.]

This post is dedicated to my mother – Seinfeld’s greatest fan.

Seinfeld is a classic TV sitcom. It featured four main characters in relatively normal, everyday, run-of-the-mill scenarios. In the spirit of Seinfeld, this post will also “be about nothing.”

Load Required Libraries

library(scales)  
library(RMySQL)  
library(stringr)  
library(tidyr)  
library(dplyr)  
library(igraph)  
library(ggplot2)  

I used Python to create a web scraper that gathered scripts from various sites on the internet and inserted them into a local MySQL database.

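Data From Local MySQL

# Connect to the local MySQL database that the scraper populated
db = dbConnect(MySQL(),
               user='root',
               password='root',
               dbname='tvscripts',
               host='localhost')
# Pull every raw script line along with its episode title
rs = dbSendQuery(db, "select episodeTitle,rawLine from seinfeld_raw")
data = dbFetch(rs, n=-1)
# Release the result set and close the connection once the data is in memory
dbClearResult(rs)
dbDisconnect(db)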

print(data$rawLine[600:616])  

Scraping the raw lines without parsing them was simple enough. Parsing before inserting into MySQL caused some difficulties, so it made more sense to parse the lines once they were loaded into R.
A sample of the raw data:

 [1] "Jerry: I don't know. He's an importer."                            
 [2] "Vanessa: Importer?"                                                
 [3] "George: ...And exporter."                                          
 [4] "Jerry: He's an importer/exporter."                                 
 [5] "George: I'm, uh, I'm an architect."                                
 [6] "Vanessa: Really. What do you design?"                              
 [7] "George: Uh, railroads, uh..."                                      
 [8] "Vanessa: I thought engineers do that."                             
 [9] "George: They can..."                                               
[10] "Jerry: Y'know I'm sorry you had to leave so early the other night."
[11] "Vanessa: Oh, me too. My cousin had to go back to Boston."          
[12] "Jerry: Oh, that guy was your *cousin*!"                            
[13] "Vanessa: Yeah, and that woman was your--"                          
[14] "Jerry: Friend!"                                                    
[15] "George: I'll just, uh, get a paper..."                             
[16] "Jerry: So, um, do you date immature men?"                          
[17] "Vanessa: Almost exclusively..."                                       

I modified the raw data to extract the name of the character speaking each line. I also removed lines with no recognizable character name, along with a few labels that are not characters (TV, DVD and MAN).

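# Take the text before the first colon, drop its first space, and keep the
# first run of two or more capital letters as the character name
data$characterName = str_extract(sub(" ","",
                                     sub(":.*","",data$rawLine)),
                                 "[A-Z][A-Z]+")
# Drop lines with no name match, then labels that are not characters
data = data[!is.na(data$characterName),]
data = data %>%
  filter(!characterName %in% c('TV', 'DVD', 'MAN'))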
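
A small sketch of what the name pattern matches, assuming the raw lines come in both mixed-case and all-caps styles. The three `raw` strings are hypothetical (the first mirrors the sample above), and the second pattern is only an assumed alternative, not the one used for the charts in this post. As written, `"[A-Z][A-Z]+"` needs at least two consecutive capital letters, so a mixed-case name such as "Jerry" yields NA and the line is dropped.

```r
library(stringr)

# Hypothetical raw lines: a mixed-case name, an all-caps name, a stage direction
raw = c("Jerry: He's an importer/exporter.",
        "GEORGE: I'm, uh, I'm an architect.",
        "[Setting: Monk's]")

# Text before the first colon, with the first space removed, as in the post
prefix = sub(" ", "", sub(":.*", "", raw))

# The post's pattern: a run of two or more capital letters
capsOnly = str_extract(prefix, "[A-Z][A-Z]+")
print(capsOnly)

# Assumed alternative: a whole prefix made of a capital followed by letters
eitherCase = str_extract(prefix, "^[A-Z][A-Za-z]+$")
print(eitherCase)
```

If mixed-case names should be kept, the alternative pattern (or calling `toupper()` on the prefix first) would be one way to do it, at the cost of also matching labels such as "Setting" when they appear without brackets.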

I used a shift function to create a new column that is the same as the character column but moved up by one row, so each row also records who speaks next. This helps show the conversation between two people. The approach is inherently flawed because the end of one scene runs into the beginning of the next. I assumed this would not affect the results much, since those instances are likely to be spread evenly across characters.

shift <- function(x, n){  
  c(x[-(seq(n))], rep(NA, n))
}

#One character to the next in new column
data$characterNext = shift(data$characterName, 1)  
data$characterInteraction = paste(data$characterName  
                                  ,data$characterNext,sep="-->")
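
To make the shift concrete, here is a minimal sketch on a hypothetical six-line speaker sequence in which one scene ends with ELAINE and the next begins with KRAMER. It shows the two side effects of this approach: the scene boundary produces an ELAINE-->KRAMER pair that never happened as a conversation, and `paste()` turns the trailing NA into the text "NA".

```r
shift <- function(x, n){
  c(x[-(seq(n))], rep(NA, n))
}

# Hypothetical speakers; a scene break falls between ELAINE and KRAMER
speakers = c("JERRY", "GEORGE", "JERRY", "ELAINE", "KRAMER", "NEWMAN")

nextSpeaker = shift(speakers, 1)
speakerPairs = paste(speakers, nextSpeaker, sep = "-->")
print(speakerPairs)

# Keep only rows that actually have a following speaker
completePairs = speakerPairs[!is.na(nextSpeaker)]
print(completePairs)
```

With thousands of lines, a single "NAME-->NA" row is unlikely to reach a top-20 chart, but filtering on the shifted column before pasting avoids it entirely. `dplyr::lead()` produces the same shifted vector, if a built-in is preferred.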

I listed the 20 characters with the most recorded lines. The percentages are each character's share of the lines spoken by those 20.

Observations

df = data %>%
  group_by(characterName) %>%
  summarise(n=n()) %>%
  arrange(desc(n)) %>%
  top_n(20, n) %>%               # 20 characters with the most lines
  mutate(freq = n / sum(n))      # share within those 20, not of all lines

p = ggplot(df,aes(reorder(characterName,n),y=freq))  
p + geom_bar(stat = 'identity') +  
  coord_flip() + 
  scale_y_continuous(labels=percent_format()) +
  geom_text(aes(label=paste(round(100*freq,0),"%",sep=''),hjust=0)) +
  labs(x='',y='Volume') + 
  ggtitle('Total Lines')
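
Because `freq` is computed after the top-20 cut, it is a share of the top 20 rather than of the whole script. The sketch below contrasts the two on a hypothetical ten-line data set, using `count()` and `slice_max()` from dplyr in place of the `group_by()`/`top_n()` combination; the toy data and the column names `freqAll` and `freqTop` are assumptions for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the parsed script data
toy = data.frame(characterName = c(rep("JERRY", 5), rep("GEORGE", 3),
                                   "ELAINE", "NEWMAN"))

shares = toy %>%
  count(characterName) %>%           # one row per character, count in column n
  mutate(freqAll = n / sum(n)) %>%   # share of all lines, before truncating
  slice_max(n, n = 2) %>%            # keep the top 2 speakers
  mutate(freqTop = n / sum(n))       # share within the top 2, as in the post

print(shares)
```

In this toy case JERRY has 50% of all lines but 62.5% of the top-2 lines, so the choice of denominator changes how the chart should be read.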


I then counted pairs of consecutive speakers. The data is directional, so Jerry speaking to George is counted separately from George speaking to Jerry.

Observations

df = data %>%  
  group_by(characterInteraction) %>%
  summarise(n=n()) %>%
  arrange(desc(n)) %>%
  top_n(20) %>%
  mutate(freq = n / sum(n))

p = ggplot(df,aes(reorder(characterInteraction,n),y=freq))  
p + geom_bar(stat = 'identity') +  
  coord_flip()+ scale_y_continuous(labels=percent_format()) +
  geom_text(aes(label=paste(round(100*freq,0),"%",sep=''),hjust=0)) +
  labs(x='',y='Volume') + 
  ggtitle('Lines Between Two Characters (Directional)')

I used the shift function again to look at who speaks two lines later. This gives a hint as to whether a conversation is between two characters or more. Again, this is directional data.

data$characterNext2 = shift(data$characterName, 2)  
data$characterInteraction2 = paste(data$characterName,  
                                   data$characterNext2,sep="-->")

df = data %>%  
  group_by(characterInteraction2) %>%
  summarise(n=n()) %>%
  arrange(desc(n)) %>%
  top_n(20) %>%
  mutate(freq = n / sum(n))

p = ggplot(df,aes(reorder(characterInteraction2,n),y=freq))  
p + geom_bar(stat = 'identity') +  
  coord_flip()+ scale_y_continuous(labels=percent_format()) + 
  geom_text(aes(label=paste(round(100*freq,0),"%",sep=''),hjust=0)) +
  labs(x='',y='Volume') + 
  ggtitle('Lines Skipping One Character')
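
One way to read the two-line shift: when the same character speaks again two lines later, the exchange is probably a back-and-forth between two people. The sketch below measures that on a hypothetical sequence made of a two-person exchange followed by a three-person one; the `returnShare` summary is an assumed addition and is not computed in the post.

```r
shift <- function(x, n){
  c(x[-(seq(n))], rep(NA, n))
}

# Hypothetical sequence: Jerry and George alternate, then three people rotate
speakers = c("JERRY", "GEORGE", "JERRY", "GEORGE", "JERRY",
             "ELAINE", "KRAMER", "JERRY", "ELAINE", "KRAMER")

next2 = shift(speakers, 2)

# TRUE where the same character speaks again two lines later
sameSpeaker = speakers == next2

# Share of lines (that have a line two steps ahead) returning to the same speaker
returnShare = mean(sameSpeaker, na.rm = TRUE)
print(returnShare)
```

A high share suggests mostly two-person dialogue, while pairs such as JERRY-->KRAMER at a distance of two point to a third voice in between.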


Combining three lines in a row creates a nice view of groups that speak in order.

Observations

#Three interactions in a row
data$threeInteraction = paste(data$characterName,  
                              data$characterNext,
                              data$characterNext2,sep="-->")
df = data %>%  
  group_by(threeInteraction) %>%
  summarise(n=n()) %>%
  arrange(desc(n)) %>%
  top_n(20) %>%
  mutate(freq = n / sum(n))

p = ggplot(df,aes(reorder(threeInteraction,n),y=freq))  
p + geom_bar(stat = 'identity') +  
  coord_flip() + 
  scale_y_continuous(labels=percent_format()) +
  geom_text(aes(label=paste(round(100*freq,0),"%",sep=''),hjust=0)) +
  labs(x='',y='Volume') + 
  ggtitle('Lines Between 3 Chars (Directional)')


The igraph library can visualize how two vectors are related to each other. Nodes represent the characters, and edges represent lines (relationships) between two of them.

Observations

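# Keep the 11 characters with the most lines
df = data %>%
  group_by(characterName) %>%
  summarise(n=n()) %>%
  arrange(desc(n)) %>%
  top_n(11, n)
topCharacters = df$characterName
# Keep only exchanges where both speakers are top characters
df.data = data %>%
  filter(characterName %in% topCharacters) %>%
  filter(characterNext %in% topCharacters)

# One edge per pair of consecutive lines
df = data.frame(A = df.data$characterName, B = df.data$characterNext)
df = na.omit(df)
g4 = graph_from_data_frame(d=df,directed=FALSE)
# With remove.multiple and remove.loops both FALSE, simplify() leaves the graph unchanged
g4 = simplify(g4,
              remove.multiple=FALSE,
              remove.loops=FALSE,
              edge.attr.comb=c(weight='sum',type='ignore'))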
plot(g4,  
     vertex.label.color='black',
     vertex.frame.color=NA,
     vertex.label.=2,
     vertex.label.dist = 0.5)
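
The graph above keeps every exchange as its own edge. An alternative, sketched here on hypothetical data, is to give each edge a weight of 1 and let `simplify()` merge parallel edges by summing those weights, so that each pair of characters ends up with a single weighted edge. This is an assumed variation, not the graph shown in this post.

```r
library(igraph)

# Hypothetical speaker/next-speaker pairs, including a reversed pair and a self-loop
edges = data.frame(A = c("JERRY", "GEORGE", "JERRY", "JERRY", "JERRY"),
                   B = c("GEORGE", "JERRY", "GEORGE", "ELAINE", "JERRY"))

g = graph_from_data_frame(d = edges, directed = FALSE)
E(g)$weight = 1   # each pair of consecutive lines counts once

# Merge parallel edges, summing their weights, and drop self-loops
gw = simplify(g,
              remove.multiple = TRUE,
              remove.loops = TRUE,
              edge.attr.comb = list(weight = "sum"))

print(data.frame(as_edgelist(gw), weight = E(gw)$weight))
```

Passing `edge.width = E(gw)$weight` to `plot()` would then draw thicker lines between characters who exchange more lines; with real counts the weights would likely need rescaling first.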

Relationship of Top Seinfeld Characters

Conclusion
Ultimately, the “show about nothing” didn’t contain many surprises. Further analysis by season could perhaps reveal additional insights. Sentiment analysis would be useful in determining the “tone” of episodes and characters. Decision trees based on lines or bigrams could perhaps predict which character is speaking. Maybe there will be more to come…

Code used in this post is on my GitHub
