Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
596 episodes. 40 seasons. 1 package!
I’m a pretty big fan of Survivor and have religiously watched every season since the first. With 40 seasons under its belt, there’s a tonne of data to dive into. However, getting that data in one place has been tedious. Hence, the survivoR package.
survivoR is a collection of datasets detailing events across all 40 seasons of the US version of Survivor, including castaway information, vote history, immunity and reward challenge winners, jury votes, and viewer numbers.
Installation
Currently, the package lives on GitHub and can be installed with the following code.
devtools::install_github("doehm/survivoR")
CRAN: TBA
Dataset overview
Below are all the datasets that are contained within the package.
Season summary
A data frame containing summary details of each season of Survivor, including the winner, runners-up and location. This is a nested data frame because there may be 1 or 2 runners-up. Nesting keeps the grain at 1 row per season.
library(survivoR)
library(tidyverse)  # dplyr, tidyr, stringr, ggplot2
library(snakecase)  # to_title_case()

season_summary
#> # A tibble: 40 x 17
#> season_name season location country tribe_setup full_name winner runner_ups
#> <chr> <int> <chr> <chr> <chr> <glue> <chr> <list>
#> 1 Survivor: ~ 1 Pulau T~ Malays~ Two tribes~ Richa~ Richard <tibble [~
#> 2 Survivor: ~ 2 Herbert~ Austra~ Two tribes~ Tina ~ Tina <tibble [~
#> 3 Survivor: ~ 3 Shaba N~ Kenya Two tribes~ Ethan~ Ethan <tibble [~
#> 4 Survivor: ~ 4 Nuku Hi~ Polyne~ Two tribes~ Vecep~ Vecepia <tibble [~
#> 5 Survivor: ~ 5 Ko Taru~ Thaila~ Two tribes~ Brian~ Brian <tibble [~
#> 6 Survivor: ~ 6 Rio Neg~ Brazil Two tribes~ Jenna~ Jenna <tibble [~
#> 7 Survivor: ~ 7 Pearl I~ Panama Two tribes~ Sandr~ Sandra <tibble [~
#> 8 Survivor: ~ 8 Pearl I~ Panama Three trib~ Amber~ Amber <tibble [~
#> 9 Survivor: ~ 9 Efate, ~ Vanuatu Two tribes~ Chris~ Chris <tibble [~
#> 10 Survivor: ~ 10 Koror, ~ Palau A schoolya~ Tom W~ Tom <tibble [~
#> # ... with 30 more rows, and 9 more variables: final_vote <chr>,
#> #   timeslot <chr>, premiered <date>, premier_viewers <dbl>, ended <date>,
#> #   finale_viewers <dbl>, reunion_viewers <dbl>, rank <dbl>, viewers <dbl>

# Note: the printout above lists premier_viewers, finale_viewers,
# reunion_viewers and viewers, while this code selects viewers_* columns.
# Use whichever names your installed version of the package has.
season_summary %>%
  select(season, viewers_premier, viewers_finale, viewers_reunion, viewers_mean) %>%
  pivot_longer(cols = -season, names_to = "episode", values_to = "viewers") %>%
  mutate(
    episode = to_title_case(str_replace(episode, "viewers_", ""))
  ) %>%
  ggplot(aes(x = season, y = viewers, colour = episode)) +
  geom_line() +
  geom_point(size = 2) +
  theme_minimal() +
  scale_colour_survivor(16) +
  labs(
    title = "Survivor viewers over the 40 seasons",
    x = "Season",
    y = "Viewers (Millions)",
    colour = "Episode"
  )
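To make the nested grain concrete, here is a minimal sketch of expanding a list column of runners-up with tidyr's unnest(). The seasons and names are made up, and the inner column name runner_up is an assumption about what the nested tibbles contain.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for season_summary. Seasons and names are invented, and the
# inner column name `runner_up` is an assumption, not taken from the package.
toy_summary <- tibble(
  season = c(1L, 2L),
  winner = c("Winner A", "Winner B"),
  runner_ups = list(
    tibble(runner_up = "Finalist C"),
    tibble(runner_up = c("Finalist D", "Finalist E"))
  )
)

# Nested form: one row per season
nrow(toy_summary)

# Expanded form: one row per season and runner-up
toy_long <- toy_summary %>%
  unnest(runner_ups)

toy_long
```

Keeping the runners-up nested means joins on season never duplicate rows, and the expansion only happens when you ask for it.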
Castaways
Season and demographic information about each castaway. Within a season, the data is ordered from the first voted out to the sole survivor, as indicated by the order column, which represents the order in which castaways left the island. A castaway may leave by being voted off the island, being evacuated for medical reasons, or quitting. When demographic information is missing, it likely means that the castaway re-entered the game at a later stage by winning the opportunity to return. Castaways who have played in multiple seasons feature more than once, with the age and location reflecting that point in time.
castaways %>%
  filter(season == 40)
#> # A tibble: 22 x 15
#> season_name season castaway nickname age city state day original_tribe
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 Survivor: ~ 40 Natalie~ Natalie <NA> <NA> <NA> 2 Sele
#> 2 Survivor: ~ 40 Amber M~ Amber 40 Pens~ Flor~ 3 Dakal
#> 3 Survivor: ~ 40 Danni B~ Danni 43 Shaw~ Kans~ 6 Sele
#> 4 Survivor: ~ 40 Ethan Z~ Ethan 45 Hill~ New ~ 9 Sele
#> 5 Survivor: ~ 40 Tyson A~ Tyson <NA> <NA> <NA> 11 Dakal
#> 6 Survivor: ~ 40 Rob Mar~ Rob 43 Pens~ Flor~ 14 Sele
#> 7 Survivor: ~ 40 Parvati~ Parvati 36 Los ~ Cali~ 16 Sele
#> 8 Survivor: ~ 40 Sandra ~ Sandra 44 Rive~ Flor~ 16 Dakal
#> 9 Survivor: ~ 40 Yul Kwon Yul 44 Los ~ Cali~ 18 Dakal
#> 10 Survivor: ~ 40 Wendell~ Wendell 35 Phil~ Penn~ 21 Dakal
#> # ... with 12 more rows, and 6 more variables: merged_tribe <chr>,
#> #   result <chr>, jury_status <chr>, order <int>, swapped_tribe <chr>,
#> #   swapped_tribe2 <chr>
Vote history
This data frame contains a complete history of votes cast across all seasons of Survivor. This allows you to see who voted for whom at which tribal council. It also includes details on who had individual immunity as well as who had their votes nullified by a hidden immunity idol. Together, these capture the key events of each season.
While there are consistent events across the seasons such as the tribe swap, there are some unique events such as the ‘mutiny’ in Survivor: Cook Islands (Season 13) or the ‘Outcasts’ in Survivor: Pearl Islands (season 7). When castaways change tribes by some means other than a tribe swap, it is still recorded as ‘swapped’ to maintain a standard.
The data is recorded as ‘swapped’ with a trailing digit if a swap has occurred more than once. This includes absorbed tribes, such as when 3 tribes are reduced to 2, or when Stephanie was ‘absorbed’ in Survivor: Palau (season 10) after everyone but her was voted off the tribe (making Palau one of the classic seasons of Survivor). To indicate a change in tribe status, these events are also considered ‘swapped’.
This data frame is at the tribal council by castaway grain, so there is a vote for everyone that attended the tribal council. However, there are some edge cases such as when the ‘steal a vote’ advantage is played. In this case, there is a second row for the castaway indicating their second vote.
In the case of a tie and a revote, the first vote is recorded and the result is recorded as ‘Tie’. The deciding vote is recorded as normal. Where there is a double tie, it is recorded as ‘Tie2’ (for lack of a better name). When a double tie goes to rocks, the vote is either ‘Black rock’ or ‘White rock’. In the older episodes of Survivor, when there were two ties in a row, there was a countback of votes rather than a rock draw.
vh <- vote_history %>%
  filter(
    season == 40,
    episode == 10
  )
vh
#> # A tibble: 9 x 11
#> season_name season episode day tribe_status castaway immunity vote
#> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 Survivor: ~ 40 10 25 merged Tony individ~ Tyson
#> 2 Survivor: ~ 40 10 25 merged Michele <NA> Tyson
#> 3 Survivor: ~ 40 10 25 merged Sarah <NA> Deni~
#> 4 Survivor: ~ 40 10 25 merged Sarah <NA> Tyson
#> 5 Survivor: ~ 40 10 25 merged Ben <NA> Tyson
#> 6 Survivor: ~ 40 10 25 merged Nick <NA> Tyson
#> 7 Survivor: ~ 40 10 25 merged Kim <NA> Soph~
#> 8 Survivor: ~ 40 10 25 merged Sophie <NA> Deni~
#> 9 Survivor: ~ 40 10 25 merged Tyson <NA> Soph~
#> # ... with 3 more variables: nullified <lgl>, voted_out <chr>, order <dbl>

vh %>%
  count(vote)
#> # A tibble: 5 x 2
#> vote n
#> <chr> <int>
#> 1 Denise 2
#> 2 Immune 1
#> 3 None 1
#> 4 Sophie 2
#> 5 Tyson 5
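Because the grain is tribal council by castaway, a castaway with more than one row at a single council has cast an extra vote. The sketch below shows one way to surface those cases. It rebuilds only the castaway and vote columns from the episode 10 printout, and it groups by episode for simplicity; on the real data the order column is probably the safer key, since an episode can contain more than one tribal council.

```r
library(dplyr)
library(tibble)

# Castaway and vote columns transcribed from the season 40, episode 10
# printout; all other columns are omitted.
vh_toy <- tibble(
  season = 40,
  episode = 10,
  castaway = c("Tony", "Michele", "Sarah", "Sarah", "Ben",
               "Nick", "Kim", "Sophie", "Tyson"),
  vote = c("Tyson", "Tyson", "Denise", "Tyson", "Tyson",
           "Tyson", "Sophie", "Denise", "Sophie")
)

# Castaways with more than one row at the same council
extra_votes <- vh_toy %>%
  count(season, episode, castaway, name = "n_votes") %>%
  filter(n_votes > 1)

extra_votes
```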
Events in the game such as fire challenges, rock draws, steal-a-vote advantages, or countbacks (in the early days) often mean a castaway attended tribal council without casting a vote for an individual. Instead, they may have won or lost a challenge, cast no vote, and so on. These events are recorded in the vote field. I have included a function, clean_votes(), for when only the votes cast for individuals are needed. If the input data frame has the vote column, it can simply be piped.
vh %>%
  clean_votes() %>%
  count(vote)
#> # A tibble: 3 x 2
#> vote n
#> <chr> <int>
#> 1 Denise 2
#> 2 Sophie 2
#> 3 Tyson 5
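For intuition only, a filter along these lines gives the same counts on a toy version of the data. This is not the package's implementation of clean_votes(). The helper name drop_non_votes is hypothetical, and the list of non-vote labels is an assumption pieced together from the labels mentioned in this post.

```r
library(dplyr)
library(tibble)

# Labels this post mentions as non-vote entries. The full set the package
# uses is not listed here, so treat this vector as an assumption.
non_votes <- c("Immune", "None", "Black rock", "White rock")

# Hypothetical helper: keep only rows where a vote names an individual
drop_non_votes <- function(df) {
  df %>%
    filter(!is.na(vote), !vote %in% non_votes)
}

# Toy vote column matching the counts printed above
toy_votes <- tibble(
  vote = c("Denise", "Denise", "Immune", "None",
           "Sophie", "Sophie", rep("Tyson", 5))
)

toy_counts <- toy_votes %>%
  drop_non_votes() %>%
  count(vote)

toy_counts
```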
Immunity
A nested tidy data frame of immunity challenge results. Each row in this dataset is a tribal council. It is nested because more than one person or tribe may win immunity, most often when there are 3 or more tribes in the first phase of the game. You can extract the immunity winners by expanding the data frame. There may be duplicates in the rare event that there are multiple eliminations after a single immunity challenge.
immunity %>%
  filter(season == 40) %>%
  unnest(immunity)
#> # A tibble: 23 x 8
#> season_name season episode title voted_out day order immunity
#> <chr> <dbl> <dbl> <chr> <chr> <dbl> <int> <chr>
#> 1 Survivor: Winner~ 40 1 Greatest of ~ Natalie 2 1 Dakal
#> 2 Survivor: Winner~ 40 1 Greatest of ~ Amber 3 2 Sele
#> 3 Survivor: Winner~ 40 2 It's Like a ~ Danni 6 3 Dakal
#> 4 Survivor: Winner~ 40 3 Out for Blood Ethan 9 4 Dakal
#> 5 Survivor: Winner~ 40 4 I Like Reven~ Tyson 11 5 Sele
#> 6 Survivor: Winner~ 40 5 The Buddy Sy~ Rob 14 6 Sele
#> 7 Survivor: Winner~ 40 5 The Buddy Sy~ Rob 14 6 Dakal
#> 8 Survivor: Winner~ 40 6 Quick on the~ Parvati 16 7 Yara
#> 9 Survivor: Winner~ 40 6 Quick on the~ Sandra 16 8 Yara
#> 10 Survivor: Winner~ 40 7 We're in the~ Yul 18 9 Yara
#> # ... with 13 more rows
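The duplicates mentioned above show up in rows 8 and 9 of the printout, where two eliminations follow one Yara win. Here is a small sketch of expanding the nested column and then collapsing those repeats. The toy rows come from the printout, the list column is assumed to hold character vectors, and deduplicating by episode assumes one immunity challenge per episode, which will not always hold.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Rows 6 to 9 of the printout, rebuilt as a nested toy data frame.
# Assumption: the list column holds character vectors of winners.
toy_immunity <- tibble(
  episode = c(5, 6, 6),
  voted_out = c("Rob", "Parvati", "Sandra"),
  order = c(6L, 7L, 8L),
  immunity = list(c("Sele", "Dakal"), "Yara", "Yara")
)

# One row per tribal council and immunity winner
expanded <- toy_immunity %>%
  unnest(immunity)

# One row per episode and winner, dropping the repeated Yara win
winners <- expanded %>%
  distinct(episode, immunity)

winners
```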
Rewards
A nested tidy data frame of reward challenge results, where each row is a reward challenge. Typically in the merge phase, if a single person wins a reward they are allowed to bring others along with them. The first castaway in the expanded list is the winner, and subsequent castaways are those the winner brought along, although not always. Occasionally in the merge phase the castaways are split into two teams for the reward, in which case all castaways listed win the reward rather than a single person. If reward is missing, there was no reward challenge for the episode.
rewards %>%
  filter(season == 40) %>%
  unnest(reward)
#> # A tibble: 29 x 6
#> season_name season episode title day reward
#> <chr> <dbl> <dbl> <chr> <dbl> <chr>
#> 1 Survivor: Winners at ~ 40 1 Greatest of the Greats 2 Dakal
#> 2 Survivor: Winners at ~ 40 1 Greatest of the Greats 3 <NA>
#> 3 Survivor: Winners at ~ 40 2 It's Like a Survivor Econ~ 6 Dakal
#> 4 Survivor: Winners at ~ 40 3 Out for Blood 9 Dakal
#> 5 Survivor: Winners at ~ 40 4 I Like Revenge 11 Sele
#> 6 Survivor: Winners at ~ 40 5 The Buddy System on Stero~ 14 <NA>
#> 7 Survivor: Winners at ~ 40 6 Quick on the Draw 16 Yara
#> 8 Survivor: Winners at ~ 40 7 We're in the Majors 18 Yara
#> 9 Survivor: Winners at ~ 40 7 We're in the Majors 18 Sele
#> 10 Survivor: Winners at ~ 40 8 This is Where the Battle ~ 21 Tyson
#> # ... with 19 more rows
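Since the first castaway in each expanded list is usually the winner, the position within a challenge can be turned into a role label. The sketch below uses invented castaway names and a hypothetical challenge_id column added before unnesting so that each row of the nested data keeps its identity. It ignores the team-reward exception described above, where every listed castaway is a winner.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy nested rewards. Names are made up; NA stands for "no reward challenge".
toy_rewards <- tibble(
  episode = c(8, 9, 10),
  reward = list(
    c("Castaway A", "Castaway B", "Castaway C"),
    NA_character_,
    "Castaway D"
  )
)

reward_roles <- toy_rewards %>%
  mutate(challenge_id = row_number()) %>%  # hypothetical key, one per challenge
  unnest(reward) %>%
  filter(!is.na(reward)) %>%
  group_by(challenge_id) %>%
  mutate(role = if_else(row_number() == 1, "winner", "brought along")) %>%
  ungroup()

reward_roles
```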
Jury votes
This data frame contains the history of jury votes. It is more verbose than it needs to be. However, having a 0-1 column indicating if a vote was placed for the finalist makes it easier to summarise castaways that received no votes.
jury_votes %>%
  filter(season == 40)
#> # A tibble: 48 x 5
#> season_name season castaway finalist vote
#> <chr> <dbl> <chr> <chr> <dbl>
#> 1 Survivor: Winners at War 40 Sarah Michele 0
#> 2 Survivor: Winners at War 40 Sarah Natalie 0
#> 3 Survivor: Winners at War 40 Sarah Tony 1
#> 4 Survivor: Winners at War 40 Ben Michele 0
#> 5 Survivor: Winners at War 40 Ben Natalie 0
#> 6 Survivor: Winners at War 40 Ben Tony 1
#> 7 Survivor: Winners at War 40 Denise Michele 0
#> 8 Survivor: Winners at War 40 Denise Natalie 0
#> 9 Survivor: Winners at War 40 Denise Tony 1
#> 10 Survivor: Winners at War 40 Nick Michele 0
#> # ... with 38 more rows

jury_votes %>%
  filter(season == 40) %>%
  group_by(finalist) %>%
  summarise(votes = sum(vote))
#> # A tibble: 3 x 2
#> finalist votes
#> <chr> <dbl>
#> 1 Michele 0
#> 2 Natalie 4
#> 3 Tony 12
Viewers
A data frame containing the viewer information for every episode across all seasons. It also includes the rating and viewer share information for viewers aged 18 to 49 years.
viewers %>%
  filter(season == 40)
#> # A tibble: 14 x 9
#> season_name season episode_number_~ episode title episode_date viewers
#> <chr> <dbl> <dbl> <dbl> <chr> <date> <dbl>
#> 1 Survivor: ~ 40 583 1 Grea~ 2020-02-12 6.68
#> 2 Survivor: ~ 40 584 2 It's~ 2020-02-19 7.16
#> 3 Survivor: ~ 40 585 3 Out ~ 2020-02-26 7.14
#> 4 Survivor: ~ 40 586 4 I Li~ 2020-03-04 7.08
#> 5 Survivor: ~ 40 587 5 The ~ 2020-03-11 6.91
#> 6 Survivor: ~ 40 588 6 Quic~ 2020-03-18 7.83
#> 7 Survivor: ~ 40 589 7 We'r~ 2020-03-25 8.18
#> 8 Survivor: ~ 40 590 8 This~ 2020-04-01 8.23
#> 9 Survivor: ~ 40 591 9 War ~ 2020-04-08 7.85
#> 10 Survivor: ~ 40 592 10 The ~ 2020-04-15 8.14
#> 11 Survivor: ~ 40 593 11 This~ 2020-04-22 8.16
#> 12 Survivor: ~ 40 594 12 Frie~ 2020-04-29 8.08
#> 13 Survivor: ~ 40 595 13 The ~ 2020-05-06 7.57
#> 14 Survivor: ~ 40 596 14 It A~ 2020-05-13 7.94
#> # ... with 2 more variables: rating_18_49 <dbl>, share_18_49 <dbl>
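As a quick illustration of what the episode-level grain allows, the sketch below summarises a season's mean audience and its most-watched episode. It uses only the episode and viewers columns transcribed from the season 40 printout, so it runs without the package installed.

```r
library(dplyr)
library(tibble)

# Episode and viewers (millions) transcribed from the season 40 printout
viewers_toy <- tibble(
  season = 40,
  episode = 1:14,
  viewers = c(6.68, 7.16, 7.14, 7.08, 6.91, 7.83, 8.18,
              8.23, 7.85, 8.14, 8.16, 8.08, 7.57, 7.94)
)

season_stats <- viewers_toy %>%
  group_by(season) %>%
  summarise(
    mean_viewers = mean(viewers),
    peak_episode = episode[which.max(viewers)],
    peak_viewers = max(viewers)
  )

season_stats
```

On the full viewers data frame the same pipeline, without the toy data, would give one summary row per season.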
Tribe colours
This data frame contains the tribe names and colours for each season, including the RGB values. These colours can be joined with the other data frames to customise colours for plots. Another option is to add tribal colours to ggplots with the scale functions.
tribe_colours
#> # A tibble: 139 x 7
#> season_name season tribe_name r g b tribe_colour
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 Survivor: Winners at War 40 Sele 0 103 214 #0067D6
#> 2 Survivor: Winners at War 40 Dakal 216 14 14 #D80E0E
#> 3 Survivor: Winners at War 40 Yara 4 148 81 #049451
#> 4 Survivor: Winners at War 40 Koru 0 0 0 #000000
#> 5 Survivor: Island of the Ido~ 39 Lairo 243 148 66 #F39442
#> 6 Survivor: Island of the Ido~ 39 Vokai 217 156 211 #D99CD3
#> 7 Survivor: Island of the Ido~ 39 Lumuwaku 48 78 210 #304ED2
#> 8 Survivor: Edge of Extinction 38 Manu 16 80 186 #1050BA
#> 9 Survivor: Edge of Extinction 38 Lesu 0 148 128 #009480
#> 10 Survivor: Edge of Extinction 38 Kama 250 207 34 #FACF22
#> # ... with 129 more rows
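Joining the colours onto another data frame might look something like the sketch below. The toy tables copy the season 40 rows from the printouts above. The column names follow those printouts and may differ in other versions of the package. It also builds a named vector, which is the form ggplot2's scale_fill_manual(values = ...) expects.

```r
library(dplyr)
library(tibble)

# Season 40 rows copied from the tribe_colours printout
toy_colours <- tibble(
  season = 40,
  tribe_name = c("Sele", "Dakal", "Yara", "Koru"),
  r = c(0, 216, 4, 0),
  g = c(103, 14, 148, 0),
  b = c(214, 14, 81, 0),
  tribe_colour = c("#0067D6", "#D80E0E", "#049451", "#000000")
)

# First three castaways from the season 40 castaways printout
toy_castaways <- tibble(
  season = 40,
  nickname = c("Natalie", "Amber", "Danni"),
  original_tribe = c("Sele", "Dakal", "Sele")
)

# The hex code is just the RGB triplet in base 16
hex_check <- rgb(toy_colours$r, toy_colours$g, toy_colours$b, maxColorValue = 255)

# Named palette: names are tribes, values are hex colours
palette_40 <- setNames(toy_colours$tribe_colour, toy_colours$tribe_name)

# Attach each castaway's original tribe colour
castaway_colours <- toy_castaways %>%
  left_join(
    toy_colours %>% select(season, tribe_name, tribe_colour),
    by = c("season", "original_tribe" = "tribe_name")
  )

castaway_colours
```

Joining on season as well as tribe name matters on the full data, since a tribe name is only guaranteed to be meaningful within its own season.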
ggplot2 scale functions
Included are ggplot2 scale functions (of the form scale_*_survivor()) to add tribe colours to a ggplot. Simply pass the desired season number to use that season's tribe colours. If the fill or colour aesthetic is the tribe name, it needs to be passed to the scale function as scale_fill_survivor(…, tribe = tribe) (for now), where tribe is a column in the input data frame. If the fill or colour aesthetic is independent of the actual tribe names, tribe does not need to be specified and the tribe colours are simply used as a colour palette, as in the viewers line graph above, which used the Micronesia colour palette.
library(glue)  # glue() is not attached by library(tidyverse)

ssn <- 35

# Labels for the finalists: nickname plus original tribe
labels <- castaways %>%
  filter(
    season == ssn,
    str_detect(result, "Sole|unner")
  ) %>%
  select(nickname, original_tribe) %>%
  mutate(label = glue("{nickname} ({original_tribe})")) %>%
  select(label, nickname)

# Jury votes per finalist, split by the juror's original tribe
jury_votes %>%
  filter(season == ssn) %>%
  left_join(
    castaways %>%
      filter(season == ssn) %>%
      select(nickname, original_tribe),
    by = c("castaway" = "nickname")
  ) %>%
  group_by(finalist, original_tribe) %>%
  summarise(votes = sum(vote)) %>%
  left_join(labels, by = c("finalist" = "nickname")) %>%
  {
    ggplot(., aes(x = label, y = votes, fill = original_tribe)) +
      geom_bar(stat = "identity", width = 0.5) +
      scale_fill_survivor(ssn, tribe = .$original_tribe) +
      theme_minimal() +
      labs(
        x = "Finalist (original tribe)",
        y = "Votes",
        fill = "Original\ntribe",
        title = "Votes received by each finalist"
      )
  }
Visualise the events of each season
This data provides a way to analyse each season, and the plays within each episode, in more depth. For example, using the vote history data we could construct a graph of who voted for whom, where each castaway is a node and each edge points to the person they voted for. With this representation it’s possible to use clustering algorithms to identify alliances in the data. Other uses include estimating the probability of players jumping ship and identifying pivotal votes. This is particularly interesting for the first 1 or 2 tribals of the merge, to see if players stick with their original tribe or jump ship.
library(ggraph)  # geom_edge_arc(), geom_node_*(), circle(), theme_graph()

# ft (a font family) is not defined in the post; "sans" is an assumed fallback
ft <- "sans"

ssn <- 40
df <- vote_history %>%
  filter(
    season == ssn,
    order == 13
  )

# One node per castaway at this tribal council
nodes <- df %>%
  distinct(castaway) %>%
  mutate(id = 1:n()) %>%
  rename(label = castaway)

# One edge per (voter, target) pair, tagged with the voter's original tribe
edges <- df %>%
  count(castaway, vote) %>%
  left_join(
    nodes %>%
      rename(from = id),
    by = c("castaway" = "label")
  ) %>%
  left_join(
    nodes %>%
      rename(to = id),
    by = c("vote" = "label")
  ) %>%
  mutate(arrows = "to") %>%
  rename(value = n) %>%
  left_join(
    castaways %>%
      filter(season == ssn) %>%
      select(nickname, original_tribe),
    by = c("castaway" = "nickname")
  )

# Node labels: castaway name and number of votes received
labels <- edges %>%
  select(from, to, castaway, original_tribe) %>%
  distinct(from, castaway, original_tribe) %>%
  arrange(castaway) %>%
  left_join(
    edges %>%
      count(vote),
    by = c("castaway" = "vote")
  )

# Named colour vector. The tribe_colours printout above calls this column
# tribe_name, so adjust the name to match your installed version.
cols <- tribe_colours$tribe_colour
names(cols) <- tribe_colours$tribe

ggraph(
  edges %>%
    rename(`Original tribe` = original_tribe),
  layout = "linear") +
  geom_edge_arc(aes(colour = `Original tribe`), arrow = arrow(length = unit(4, "mm"), type = "closed"), end_cap = circle(10, 'mm')) +
  geom_node_point(size = 26, colour = cols[labels$original_tribe]) +
  geom_node_point(size = 24, colour = "black") +
  geom_node_text(aes(label = labels$castaway), colour = "grey", size = 4, vjust = 0, family = ft) +
  geom_node_text(aes(label = labels$n), colour = "grey", size = 4, vjust = 2, family = ft) +
  scale_edge_colour_manual(values = cols[unique(edges$original_tribe)]) +
  scale_colour_manual(values = cols[unique(edges$original_tribe)]) +
  theme_graph()
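As a rough starting point for the alliance idea, the sketch below counts how often each pair of castaways voted for the same person. It uses base R and the castaway and vote columns from the season 40, episode 10 printout earlier in the post. This is one simple similarity measure chosen for illustration, not a method taken from the package.

```r
# Castaway and vote columns transcribed from the season 40, episode 10 printout
votes <- data.frame(
  episode = 10,
  castaway = c("Tony", "Michele", "Sarah", "Sarah", "Ben",
               "Nick", "Kim", "Sophie", "Tyson"),
  vote = c("Tyson", "Tyson", "Denise", "Tyson", "Tyson",
           "Tyson", "Sophie", "Denise", "Sophie"),
  stringsAsFactors = FALSE
)

# Voter x (council, target) incidence matrix. Pasting the episode onto the
# target keeps votes from different councils apart when more data is added.
m <- unclass(table(votes$castaway, paste(votes$episode, votes$vote)))

# Voter x voter matrix: number of times two castaways named the same target
co_votes <- m %*% t(m)
diag(co_votes) <- 0

co_votes
```

Accumulated over several tribal councils, a matrix like this could be converted to a distance and passed to a clustering routine such as hclust(), with the caveat that a single council, as here, says very little on its own.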
New features and future seasons
I intend to update the survivoR package each week during the airing of future seasons. For Survivor and data nuts like myself, this will enable a deeper analysis of each episode, and just neat ways to visualise the evolution of the game.
New features will be added, such as details on exiled castaways across the seasons. If you have a request for specific data let me know in the comments and I’ll see what I can do. Also, if you’d like to contribute by adding to existing datasets or contribute a new dataset, please contact me directly on the contacts page.
Issues
Given the variable nature of the game of Survivor and how the rules are tweaked each season, there are bound to be edge cases where the data is not quite right. Please log an issue on GitHub, or let me know directly in the comments, and I will correct the datasets.
References
Data in the survivoR package was mostly sourced from Wikipedia. Other data, such as the tribe colours, was manually recorded and entered by me.
Torch graphic in hex: Fire Torch Vectors by Vecteezy
The post survivoR | Data from the TV series in R appeared first on Daniel Oehm | Gradient Descending.