Penguins Dataset Overview – iris alternative in R

[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers.]

If any dataset has been used most by data scientists and data analysts while learning or teaching something, it’s either iris (more among R users) or titanic (more among Python users).

The iris dataset is popular not just because it’s easily accessible, but also because you can use it to demonstrate many data science concepts such as correlation, regression and classification.

The objective of this post is to introduce you to the penguins dataset and get you started with a few code snippets so that you can take off on your own!

Very recently, there’s been growing sentiment in the community to move away from iris due to Ronald Fisher’s eugenicist past.

Just in time, we’re blessed with another iris-like dataset, this one about penguins, thanks to Allison Horst, who packaged it as the R package palmerpenguins under a CC-0 license.

Video Walkthrough

YouTube: https://www.youtube.com/watch?v=4zUmlZg9Dd4

Please subscribe to the channel and leave feedback if it’s useful. It’ll be really good to hear from you!

Installation

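When this post was written, palmerpenguins had yet to make it to CRAN, so it is installed from GitHub below (it has since been released on CRAN, so install.packages("palmerpenguins") should also work).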

remotes::install_github("allisonhorst/palmerpenguins")

Accessing Data

After successful installation, you will find two datasets attached with the package: penguins and penguins_raw. You can check out their help pages (for example, ?penguins_raw) to understand more about each dataset.
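As a quick check that the installation worked, the bundled datasets can be listed from R itself. This is a minimal sketch that assumes palmerpenguins is already installed; data() is a base R function, and the variable name bundled is only illustrative.

```r
library(palmerpenguins)

# List the datasets shipped with the package (data() comes with base R)
bundled <- data(package = "palmerpenguins")$results[, "Item"]
print(bundled)

# Peek at the first rows of the cleaned dataset
head(penguins)
```

If both penguins and penguins_raw show up in the listing, the package is ready to use.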

Loading Library

library(tidyverse)
library(palmerpenguins)

Meta – Glimpse of penguins dataset

The penguins dataset has the following 7 columns and 344 rows:

names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"

Of the 7 columns, 3 are categorical (species, island, sex) and the rest are numeric.
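The split between categorical and numeric columns can also be checked programmatically. The sketch below uses only base R helpers on the penguins data; the names col_types and categorical_cols are illustrative, and newer releases of the package may carry additional columns beyond the ones described in this post.

```r
library(palmerpenguins)

# First class of each column, e.g. "factor", "numeric" or "integer"
col_types <- sapply(penguins, function(col) class(col)[1])
print(col_types)

# Names of the categorical (factor) columns
categorical_cols <- names(penguins)[sapply(penguins, is.factor)]
print(categorical_cols)
```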

glimpse(penguins)
## Rows: 344
## Columns: 7
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex               <fct> male, female, female, NA, female, male, female, mal…

Penguins Data Column Definition

species: a factor denoting penguin species (Adélie, Chinstrap and Gentoo)

island: a factor denoting island in Palmer Archipelago, Antarctica (Biscoe, Dream or Torgersen)

bill_length_mm: a number denoting bill length (millimeters)

bill_depth_mm: a number denoting bill depth (millimeters)

flipper_length_mm: an integer denoting flipper length (millimeters)

body_mass_g: an integer denoting body mass (grams)

sex: a factor denoting penguin sex (female, male)

Missing Values

A good thing about penguins over iris is that it has missing values (NA). Real-world data is rarely complete, so having missing values to deal with is quite important in a dataset used for educational purposes!

penguins %>%
  # Count the missing values in every column
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  # Reshape to one row per column so the counts can be plotted
  pivot_longer(cols = everything(), names_to = 'columns', values_to = 'NA_count') %>%
  arrange(desc(NA_count)) %>%
  ggplot(aes(y = columns, x = NA_count)) +
  geom_col(fill = 'darkorange') +
  geom_label(aes(label = NA_count)) +
  theme_minimal() +
  labs(title = 'Penguins - NA Count')
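As a cross-check on the chart, the same per-column counts can be computed without any tidyverse functions. This is a small base R sketch; na_counts and rows_with_na are illustrative names, not part of the package.

```r
library(palmerpenguins)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(penguins))
print(sort(na_counts, decreasing = TRUE))

# Number of rows that have at least one missing value
rows_with_na <- sum(!complete.cases(penguins))
print(rows_with_na)
```

The second number is useful to know before calling drop_na(), since it tells you how many rows that call will discard.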

Simple Scatter Plot

Like iris, you can simply make a scatter plot matrix of all the columns using base R’s plot():

plot(penguins)

Bar Plot

In this bar plot, we can visualize the count of each species in the penguins dataset.

penguins %>%
  count(species) %>%
  ggplot() + geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  theme_minimal() +
  labs(title = 'Penguins Species & Count')

Bar Plot for each Species

In this bar plot, we can visualize the species distribution for each sex (with a faceted plot).

penguins %>%
  drop_na() %>%
  count(sex, species) %>%
  ggplot() + geom_col(aes(x = species, y = n, fill = species)) +
  geom_label(aes(x = species, y = n, label = n)) +
  scale_fill_manual(values = c("darkorange","purple","cyan4")) +
  facet_wrap(~sex) +
  theme_minimal() +
  labs(title = 'Penguins Species ~ Gender')
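The faceted plot calls drop_na(), so penguins with a missing sex are left out of the bars. To see the counts behind the plot, including those missing cases, a base R cross-tabulation is enough. This is a sketch; species_by_sex is an illustrative name.

```r
library(palmerpenguins)

# Cross-tabulate species against sex;
# useNA = "ifany" adds a column for rows where sex is missing
species_by_sex <- table(penguins$species, penguins$sex, useNA = "ifany")
print(species_by_sex)
```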

Correlation Matrix

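penguins %>%
  # Keep only the numeric columns
  select(where(is.numeric)) %>%
  # cor() returns NA for any pair with missing values, so drop those rows first
  drop_na() %>%
  cor()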
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2350529         0.6561813   0.5951098
## bill_depth_mm         -0.2350529     1.0000000        -0.5838512  -0.4719156
## flipper_length_mm      0.6561813    -0.5838512         1.0000000   0.8712018
## body_mass_g            0.5951098    -0.4719156         0.8712018   1.0000000
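The same matrix can be reproduced in base R by letting cor() handle the missing values itself. The sketch below assumes the four measurement columns shown in the output above and selects them by name; measure_cols and cor_matrix are illustrative names.

```r
library(palmerpenguins)

# The four measurement columns from the correlation matrix above
measure_cols <- c("bill_length_mm", "bill_depth_mm",
                  "flipper_length_mm", "body_mass_g")

# use = "complete.obs" drops any row with an NA before computing correlations
cor_matrix <- cor(penguins[, measure_cols], use = "complete.obs")
print(round(cor_matrix, 3))
```

Selecting the columns by name keeps the result stable even if a later version of the dataset gains extra numeric columns.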

Scatter Plot – Penguins Size Relation wrt Species

In this scatter plot, we’ll try to visualize the relationship between flipper_length_mm and body_mass_g with respect to each species.

# Flipper length vs. body mass, with color and shape mapped to species
ggplot(data = penguins,
       aes(x = flipper_length_mm,
           y = body_mass_g)) +
  geom_point(aes(color = species,
                 shape = species),
             size = 3,
             alpha = 0.8) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(title = "Penguin size, Palmer Station LTER",
       subtitle = "Flipper length and body mass for Adelie, Chinstrap and Gentoo Penguins",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin species",
       shape = "Penguin species") +
  theme_minimal()
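Since the post mentions regression as one of the concepts such a dataset can demonstrate, the relationship in this scatter plot can also be summarised with a simple linear model. This is only a sketch using base R’s lm() on all species pooled together, not something from the original walkthrough; fit is an illustrative name.

```r
library(palmerpenguins)

# Ordinary least squares fit; lm() drops rows with missing values by default
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Intercept and slope (grams of body mass per extra mm of flipper)
print(coef(fit))

# Share of the variance in body mass explained by flipper length
print(summary(fit)$r.squared)
```

For a single predictor, the R-squared is just the square of the flipper length and body mass correlation reported in the correlation matrix.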

Scatter Plot – Penguins Size Relation wrt Island

# Same axes as above, but with color mapped to island and shape to species
ggplot(data = penguins,
       aes(x = flipper_length_mm,
           y = body_mass_g)) +
  geom_point(aes(color = island,
                 shape = species),
             size = 3,
             alpha = 0.8) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(title = "Penguin size, Palmer Station LTER",
       subtitle = "Flipper length and body mass for each island",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin island",
       shape = "Penguin species") +
  theme_minimal()

References

citation('palmerpenguins')
## 
## To cite palmerpenguins in publications use:
## 
##   Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism
##   and Environmental Variability within a Community of Antarctic
##   Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081.
##   https://doi.org/10.1371/journal.pone.0090081
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis)},
##     author = {Gorman KB and Williams TD and Fraser WR},
##     journal = {PLoS ONE},
##     year = {2014},
##     volume = {9(3)},
##     number = {e90081},
##     pages = {-13},
##     url = {https://doi.org/10.1371/journal.pone.0090081},
##   }
