Penguins Dataset Overview – iris alternative in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
If there’s a dataset that data scientists and data analysts reach for most often while learning or teaching something, it’s either iris (more R users) or titanic (more Python users).
The iris dataset isn’t the most used just because it’s easily accessible; it’s also something you can use to demonstrate many data science concepts such as correlation, regression and classification.
The objective of this post is to introduce you to the penguins dataset and get you started with a few code snippets so that you can take off on your own!
Very recently, there’s been a growing sentiment in the community to move away from iris due to Ronald Fisher’s eugenicist past.
At this very time, we’re blessed with another iris-like dataset about penguins, thanks to Allison Horst, who packaged it as the R package palmerpenguins under a CC-0 license.
Video Walkthrough
YouTube – https://www.youtube.com/watch?v=4zUmlZg9Dd4
Please subscribe to the channel and leave feedback if it’s useful. It’ll be really good to hear from you!
Installation
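When this post was written, palmerpenguins had yet to make it to CRAN, so you can install it from GitHub. It has since been released on CRAN, so install.packages("palmerpenguins") works as well.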
remotes::install_github("allisonhorst/palmerpenguins")
Accessing Data
After a successful installation, you will find two datasets attached to the package – penguins and penguins_raw. You can check out their help pages (for example, ?penguins_raw) to understand more about each dataset.
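As a quick check that both datasets are available, something like the following sketch can be run after installation. It relies only on base R's data() and dim(); the variable name bundled is just an illustrative choice, and the exact labels printed for the bundled items may differ between package versions.

```r
library(palmerpenguins)

# List the datasets bundled with the package
bundled <- data(package = "palmerpenguins")$results[, "Item"]
print(bundled)

# Compare the dimensions of the cleaned and the raw data
print(dim(penguins))
print(dim(penguins_raw))
```

Both tables describe the same penguins; penguins_raw simply keeps more of the original columns.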
Loading Library
library(tidyverse)
library(palmerpenguins)
Meta – Glimpse of penguins dataset
The penguins dataset has the following 7 columns and 344 rows:
names(penguins)
## [1] "species"           "island"            "bill_length_mm"
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"
## [7] "sex"
Of the 7 columns, 3 are categorical (species, island, sex) and the rest are numeric.
glimpse(penguins)
## Rows: 344
## Columns: 7
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex               <fct> male, female, female, NA, female, male, female, mal…
Penguins Data Column Definition
species a factor denoting penguin species (Adélie, Chinstrap and Gentoo)
island a factor denoting island in Palmer Archipelago, Antarctica (Biscoe, Dream or Torgersen)
bill_length_mm a number denoting bill length (millimeters)
bill_depth_mm a number denoting bill depth (millimeters)
flipper_length_mm an integer denoting flipper length (millimeters)
body_mass_g an integer denoting body mass (grams)
sex a factor denoting penguin sex (female, male)
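The definitions above can be checked against the data itself. The sketch below uses only base R (sapply(), class(), levels()); the variable names col_classes and species_levels are illustrative. Note that newer releases of the package also include a year column, so the class listing may show eight columns rather than seven.

```r
library(palmerpenguins)

# Class of every column (factor, numeric or integer)
col_classes <- sapply(penguins, class)
print(col_classes)

# Levels of the three categorical columns
species_levels <- levels(penguins$species)
island_levels  <- levels(penguins$island)
sex_levels     <- levels(penguins$sex)

print(species_levels)
print(island_levels)
print(sex_levels)
```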
Missing Values
A good thing about penguins over iris is that it has missing values (NA). That’s quite important to have in a dataset used for educational purposes!
penguins %>%
  # count the missing values in every column (across() needs dplyr >= 1.0.0)
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(cols = 1:7, names_to = 'columns', values_to = 'NA_count') %>%
  arrange(desc(NA_count)) %>%
  ggplot(aes(y = columns, x = NA_count)) +
  geom_col(fill = 'darkorange') +
  geom_label(aes(label = NA_count)) +
  theme_minimal() +
  labs(title = 'Penguins - NA Count')
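If you only need the numbers rather than the plot, a base-R sketch along these lines gives the same per-column NA counts. colSums(), is.na() and complete.cases() are standard base R functions; the names na_counts and n_complete are illustrative.

```r
library(palmerpenguins)

# Number of missing values in each column, largest first
na_counts <- colSums(is.na(penguins))
print(sort(na_counts, decreasing = TRUE))

# Number of rows without any missing value
n_complete <- sum(complete.cases(penguins))
print(n_complete)
```

Most of the missing values sit in the sex column, which is why drop_na() removes more rows than the handful with missing measurements.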
Simple Scatter Plot
Like iris
, You can simply make a scatter plot using base-R’s plot()
plot(penguins)
Bar Plot
In this bar plot, we visualize the count of each species in the penguins dataset.
penguins %>%
  count(species) %>%
  ggplot() +
  geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  theme_minimal() +
  labs(title = 'Penguins Species & Count')
Bar Plot for each Species
In this bar plot, we visualize the species distribution for each sex (with a faceted plot).
penguins %>%
  drop_na() %>%
  count(sex, species) %>%
  ggplot() +
  geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  facet_wrap(~sex) +
  theme_minimal() +
  labs(title = 'Penguins Species ~ Gender')
Correlation Matrix
penguins %>%
  select_if(is.numeric) %>%
  drop_na() %>%
  cor()
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2350529         0.6561813   0.5951098
## bill_depth_mm         -0.2350529     1.0000000        -0.5838512  -0.4719156
## flipper_length_mm      0.6561813    -0.5838512         1.0000000   0.8712018
## body_mass_g            0.5951098    -0.4719156         0.8712018   1.0000000
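The matrix above pools all three species together. As a hedged follow-up, the sketch below computes the bill length versus bill depth correlation separately for each species with base R's split() and cor(); the names overall and by_species are illustrative. The pooled correlation is negative, while the within-species correlations come out positive, which is a useful reminder to look at groups before interpreting a pooled coefficient.

```r
library(palmerpenguins)

# Pooled correlation, using only rows where both values are present
overall <- cor(penguins$bill_length_mm, penguins$bill_depth_mm,
               use = "complete.obs")

# The same correlation computed separately for each species
by_species <- sapply(split(penguins, penguins$species), function(d) {
  cor(d$bill_length_mm, d$bill_depth_mm, use = "complete.obs")
})

print(overall)
print(by_species)
```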
Scatter Plot – Penguins Size Relation wrt Species
In this scatter plot, we’ll visualize the relationship between flipper_length_mm and body_mass_g with respect to each species.
library(tidyverse)
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species), size = 3, alpha = 0.8) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(title = "Penguin size, Palmer Station LTER",
       subtitle = "Flipper length and body mass for Adelie, Chinstrap and Gentoo Penguins",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin species",
       shape = "Penguin species") +
  theme_minimal()
Scatter Plot – Penguins Size Relation wrt Island
library(tidyverse)
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = island, shape = species), size = 3, alpha = 0.8) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(title = "Penguin size, Palmer Station LTER",
       subtitle = "Flipper length and body mass for each island",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin island",
       shape = "Penguin species") +
  theme_minimal()
References
citation('palmerpenguins')
##
## To cite palmerpenguins in publications use:
##
##   Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism
##   and Environmental Variability within a Community of Antarctic
##   Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081.
##   https://doi.org/10.1371/journal.pone.0090081
##
## A BibTeX entry for LaTeX users is
##
##   @Article{,
##     title = {Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis)},
##     author = {Gorman KB and Williams TD and Fraser WR},
##     journal = {PLoS ONE},
##     year = {2014},
##     volume = {9(3)},
##     number = {e90081},
##     pages = {-13},
##     url = {https://doi.org/10.1371/journal.pone.0090081},
##   }