This post is dedicated to my mother – Seinfeld’s greatest fan. Seinfeld is a classic TV sitcom. It featured four main characters surrounded by relatively normal, everyday, run-of-the-mill scenarios. In the spirit of Seinfeld, this post will also “be about nothing.”
I used Python to build a web scraper that gathered scripts from various sites on the internet and loaded them into a local MySQL database.
Data From Local MySQL
Scraping the raw lines without parsing them was simple enough. Parsing before inserting into MySQL created some difficulties, so it made sense to parse the data once it was loaded into R. A sample of the raw data:
I modified the raw data to extract the character name and the line spoken. I also removed some lines to clean up the data a bit.
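As a rough illustration of that extraction step (not the code from this post, which did this in R), here is a minimal Python sketch. It assumes each raw line looks like `NAME: dialogue`, which is a guess about the layout; the sample lines, `LINE_PATTERN`, and `parse_lines` are all hypothetical.

```python
import re

# Hypothetical raw lines; the "NAME: dialogue" layout is an assumption.
raw_lines = [
    "JERRY: first placeholder line",
    "(stage direction with no speaker)",
    "GEORGE: second placeholder line",
    "",
    "ELAINE: third placeholder line",
]

# Speaker name in capitals, then a colon, then the spoken line.
LINE_PATTERN = re.compile(r"^([A-Z][A-Z .'-]*):\s*(.+)$")


def parse_lines(lines):
    """Return (character, line) tuples, skipping lines that do not match."""
    parsed = []
    for raw in lines:
        match = LINE_PATTERN.match(raw.strip())
        if match:
            # Normalise "JERRY" to "Jerry" so names group consistently.
            parsed.append((match.group(1).title(), match.group(2)))
    return parsed


dialogue = parse_lines(raw_lines)
```

Lines without a speaker, such as stage directions and blanks, are simply dropped, which is one plausible reading of "removed some lines".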
I used a shift function to create a new column that is the same as the character column but moved up by one row, so each row shows who speaks next. This should help to show the conversation between two people. The approach is inherently flawed because the end of one scene runs into the beginning of the next. I assumed this wouldn't affect the results, since these instances are likely evenly distributed across characters.
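The shift itself is easy to picture with a small example. The sketch below is Python rather than the R used for the analysis, and the `lead` helper and the speaker list are made up for illustration.

```python
def lead(values, n=1, fill=None):
    """Shift values up by n rows, padding the end with fill."""
    return values[n:] + [fill] * min(n, len(values))


# Hypothetical sequence of speakers, one entry per line of dialogue.
speakers = ["Jerry", "George", "Jerry", "Elaine", "Kramer"]

# Same column moved up one row: who speaks next.
next_speaker = lead(speakers)

# Each pair is (current speaker, next speaker).
pairs = list(zip(speakers, next_speaker))
```

The last row has no following line, so its partner is `None`. The scene-boundary problem is visible here too: nothing in the list says whether two neighbouring lines belong to the same scene.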
I created a list of characters with the most lines recorded.
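Counting lines per character is a simple tally. A minimal Python sketch with hypothetical data (not the post's R code or real counts):

```python
from collections import Counter

# Hypothetical speaker column, one entry per line of dialogue.
speakers = ["Jerry", "George", "Jerry", "Elaine", "Jerry", "Kramer", "George"]

# Tally lines per character and keep the two most frequent.
line_counts = Counter(speakers)
top_characters = line_counts.most_common(2)
```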
Observations
Jerry is obviously the main character – the show is named after him after all
The main characters (Jerry, George, Elaine, and Kramer) make up roughly 90% of the top 20 characters' lines
George and Jerry’s parents are actually a major part of the show (they showed up in the top 10)
I created a list of two characters speaking to each other. This is directional data (so Jerry speaking to George is separate from George speaking to Jerry).
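To make "directional" concrete, here is a small Python sketch that counts ordered speaker pairs. The speaker list is hypothetical, and this is an illustration rather than the R code behind the post.

```python
from collections import Counter

# Hypothetical speaker column, one entry per line of dialogue.
speakers = ["Jerry", "George", "Jerry", "George", "Elaine", "Jerry"]

# Pair each speaker with the next one; order matters, so
# ("Jerry", "George") and ("George", "Jerry") are counted separately.
directed_pairs = Counter(zip(speakers, speakers[1:]))
```

Here Jerry followed by George occurs twice, while George followed by Jerry occurs once.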
Observations
Jerry is involved in the top 6 two-person interactions
It seems as if the others don’t speak to each other nearly as often as they do to Jerry
Given how this data was created (by pairing consecutive lines), it makes sense that the character links are so closely related
I used a shift function once again to see how the conversation flows two lines later. This gives a hint as to whether the conversation is between two characters or more. Again, this is directional data.
Combining three lines in a row created a nice view of groups that speak in order.
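The same idea extends to three lines in a row. The Python sketch below uses hypothetical data and keeps only sequences with three distinct speakers, which is my reading of "3 unique characters"; it is not the post's R code.

```python
from collections import Counter

# Hypothetical speaker column, one entry per line of dialogue.
speakers = [
    "Jerry", "George", "Elaine",
    "Jerry", "George", "Elaine",
    "Jerry", "Kramer", "Jerry",
]

# Sliding window of three consecutive speakers.
triples = zip(speakers, speakers[1:], speakers[2:])

# Keep only windows where all three speakers differ, then count them.
three_way = Counter(t for t in triples if len(set(t)) == 3)
```

A window such as Jerry, Kramer, Jerry is a two-person exchange, so it is filtered out.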
Observations
Kramer did not show up in the top 20 sequences involving 3 unique characters
Kramer likely interacts one-on-one more often than the other main characters do
The igraph library allows for the visualization of how two vectors are related to each other. Nodes represent the characters, and the edges represent the relationships between them.
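igraph handles the layout and drawing, but the input is essentially a weighted edge list. The Python sketch below builds that structure by hand to show what the nodes and edges are. It does not use igraph's API, and the counts are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical directed pair counts (speaker, next speaker).
pair_counts = Counter({
    ("Jerry", "George"): 3,
    ("George", "Jerry"): 2,
    ("Jerry", "Elaine"): 2,
    ("Elaine", "Peterman"): 1,
})

# Edge list sorted by weight, heaviest first: (source, target, weight).
edges = sorted(
    ((src, dst, w) for (src, dst), w in pair_counts.items()),
    key=lambda edge: edge[2],
    reverse=True,
)

# Undirected neighbour sets: who is linked to whom, ignoring direction.
neighbors = defaultdict(set)
for src, dst, _ in edges:
    neighbors[src].add(dst)
    neighbors[dst].add(src)

# Degree is the number of distinct characters each node is linked to.
degree = {name: len(linked) for name, linked in neighbors.items()}
```

Characters with a high degree sit near the centre of the plot, while a character like Peterman in this toy data hangs off a single neighbour.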
Observations
The main characters, Jerry, George, Elaine, and Kramer, are the center of attention and have the most relationships
Parents have strong relationships with their kids and each other
Newman seems to play a large role but primarily interacts only with the main characters
Peterman interacts almost solely with Elaine – which makes sense because he’s her boss
Susan primarily interacts with George, which makes sense because she is engaged to him. She also interacts with Jerry but to a lesser extent
Relationship of Top Seinfeld Characters
Conclusion
Ultimately, the “show about nothing” didn’t contain many surprises. Further analysis of the seasons could perhaps show some additional insights. Sentiment analysis would be useful in determining the “tone” of episodes and characters. Decision trees based on lines or bi-grams could perhaps predict which character is speaking. Maybe there will be more to come… Code used in this post is on my GitHub.