filter(species_uncertainty == "Uncertain") |> count(species, sort = TRUE, name = "Count") |> slice_head(n = 6) |> kbl() species Count delphis/coeruleoalba 48 phocoena? 42 melaena? 20 delphis? 18 truncatus? 18 acutorostrata? 12 The date variable has many NA’s. Fortunately, the components to construct many of these are in the year_val, month_val and day_val variables. With a little wrangling and imputation, we can coalesce these variables into a new date. This will be useful since plots of sample species by year, month and week of stranding suggest a de-constructed date could be a useful predictor. strandings_df2 |> select(date, year_val, month_val, day_val) |> summary() ## date year_val month_val day_val ## Min. :1913-02-13 Min. : 0 Min. : 0.000 Min. : 0.00 ## 1st Qu.:1933-09-09 1st Qu.:1933 1st Qu.: 4.000 1st Qu.: 9.00 ## Median :1954-04-13 Median :1955 Median : 7.000 Median :16.00 ## Mean :1955-01-08 Mean :1954 Mean : 6.766 Mean :15.66 ## 3rd Qu.:1979-03-21 3rd Qu.:1979 3rd Qu.:10.000 3rd Qu.:22.00 ## Max. :1989-12-25 Max. :1989 Max. :12.000 Max. :31.00 ## NA's :121 strandings_df3 group_by(species) |> mutate( month_val = if_else(month_val == 0, mean(month_val) |> as.integer(), month_val), day_val = if_else(day_val == 0, mean(day_val) |> as.integer(), day_val), day_val = if_else(day_val == 0, 1L, day_val), date2 = date_build(year_val, month_val, day_val, invalid = "NA"), ) |> ungroup() |> mutate(date3 = coalesce(date, date2)) |> arrange(id) |> mutate(date = if_else(is.na(date), lag(date3), date3)) |> select(-date2, -date3, -ends_with("_val")) example_species get_week() ) |> filter(species %in% example_species) |> count(species, {{ var }}) |> ggplot(aes(species, {{ var }})) + geom_violin( alpha = 0.7, fill = cols[3], show.legend = FALSE ) + coord_flip() + labs( title = glue("Variation in {str_to_title(as.character(var))}", " of Stranding for Known Species"), x = NULL, y = glue("{str_to_title(as.character(var))}") ) } c("year", "month", "week") |> map(sym) |> map(plot_date_feature) ## [[1]] ## ## [[2]] ## ## [[3]] Do latitude and longitude carry useful predictive information? A geospatial visualisation of strandings shows some species do gravitate towards particular stretches of coastline, e.g. “acutus” and “albirostris” to the east, and “coeruleoalba” to the south-west. Some species may also be more prone to mass stranding, so something that indicates whether a species has such a history (in these data) may be worth including in the mix. uki mutate(mass_possible = min(mass_single, na.rm = TRUE)) |> ungroup() Some records are missing the length measurement of the mammal. Nonetheless, where present, this is likely to be predictive, and may help, for example, separate species labelled as “delphis/coeruleoalba” where the length is at the extreme ends of the “delphis” range as we see below. And the range of length may differ by mammal sex. known_species |> mutate(sex = case_when( sex == "M" ~ "Male", sex == "F" ~ "Female", TRUE ~ "Unknown" )) |> filter(species_lumped != "Other") |> count(species_lumped, length, sex) |> mutate(species_lumped = fct_reorder(species_lumped, desc(length), min, na.rm = TRUE)) |> ggplot(aes(species_lumped, length)) + geom_violin(aes(fill = if_else(str_detect(species_lumped, "^de|^co"), TRUE, FALSE)), show.legend = FALSE) + facet_wrap(~ sex) + scale_fill_manual(values = cols[c(1, 5)]) + coord_flip() + labs(title = "Variation in Species Length by Sex", x = NULL, y = "Length (metres)") With map coordinates not always available, county could be, with a little string cleaning, a useful additional predictor. strandings_df4 |> count(county) |> filter(str_detect(county, "Shet|Northumberland")) |> kbl(col.names = c("County", "Count")) County Count Fair Isle, Shetland Isles 1 Northumberland 89 Northumberland. 1 Shetland Islands, Scotland 232 Shetland Isles, Scotland 35 Shetland, Scotland 1 Shetlands, Scotland 1 regex_pattern , " Isl.*$", " &.*$", "[-.]$") strandings_df5 mutate( county = str_remove_all(county, str_c(regex_pattern, collapse = "|")), county = recode( county, "Carnarvon" = "Caernarvon", "E.Lothian" = "East Lothian", "Shetlands" = "Shetland", "W.Glamorgan" = "West Glamorgan", "S.Glamorgan" = "South Glamorgan" ) ) strandings_df4 |> summarise(counties_before = n_distinct(county)) ## # A tibble: 1 × 1 ## counties_before ## ## 1 146 strandings_df5 |> summarise(counties_after = n_distinct(county)) ## # A tibble: 1 × 1 ## counties_after ## ## 1 109 Whilst count also appears to hold, based on the plot pattern below, species-related information, I’m not going to use it as a predictor as we do not know enough about how it was derived, as reflected in these variable descriptions. strandings_df5 |> ggplot(aes(species, count, colour = species_uncertainty)) + geom_jitter(alpha = 0.5, size = 2) + coord_flip() + scale_y_log10() + scale_colour_manual(values = cols[c(1, 5)]) + labs(title = "How 'Count' Relates to Species", x = NULL, y = "Count (log10)", colour = "Species") + theme(legend.position = "top") Modelling So, I’ll set aside the rows where the species is uncertain (to be used later for new predictions), and I’ll train a model on 75% of known species, and test it on the remaining 25%. I’ll use the following predictors: latitude and longitude Mammal length and sex mass_possible indicating a species history of mass strandings date reported converted into decimal, week, month and year county may be useful, especially where the longitude and latitude are missing fam_genus which narrows the range of likely species I’d like to also make use of the textrecipes package. I can tokenise the textual information in comment and location to see if these add to the predictive power of the model. I’ll tune the model using tune_race_anova which quickly discards hyperparameter combinations showing little early promise. known_species filter(species_uncertainty == "Known") |> mutate(across( c( "species", "mass_single", "mass_possible", "county", "location", "sex", "fam_genus" ), factor )) set.seed(123) data_split mutate(species = fct_drop(species)) |> initial_split(strata = species) train training() test testing() predictors update_role(species, new_role = "outcome") |> update_role(all_of(predictors), new_role = "predictor") |> update_role(!has_role("outcome") & !has_role("predictor"), new_role = "id") |> step_date(date, features = c("decimal", "week", "month", "year"), label = FALSE) |> step_tokenize(location, comment) |> step_stopwords(location, comment) |> step_tokenfilter(location, comment, max_tokens = tune()) |> #100 step_tf(location, comment) |> step_zv(all_predictors()) |> step_dummy(all_nominal_predictors()) xgb_model set_mode("classification") |> set_engine("xgboost", count = FALSE, verbosity = 0, tree_method = "hist") xgb_wflow add_recipe(recipe) |> add_model(xgb_model) set.seed(9) folds plot_race() + labs(title = "Early Elimination of Parameter Combinations") set.seed(123) xgb_fit finalize_workflow(tune_result |> select_best(metric = "accuracy")) |> fit(train) Having fitted the model with the 3080 records in the training data, I’ll test its accuracy on the 1028 records of known species the model has not yet seen. Without spending time on alternative models, we’re getting a reasonable result for the “porpoise” of this post, as reflected in both the accuracy metric and confusion matrix. xgb_results augment(new_data = test) xgb_results |> accuracy(species, .pred_class) ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## ## 1 accuracy multiclass 0.993 xgb_results |> conf_mat(species, .pred_class) |> autoplot(type = "heatmap") + scale_fill_gradient2( mid = "white", high = cols[1], midpoint = 0 ) + labs(title = "Confusion Matrix") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) The top variable importance scores include fam_genus, many of the comment tokens, plus length, mass-possible, date_decimal, date_year, and latitude. vi(xgb_fit |> extract_fit_parsnip()) |> arrange(desc(Importance)) |> mutate(ranking = row_number()) |> slice_head(n = 40) |> kbl() Variable Importance ranking fam_genus_Phocoena 0.1215845 1 fam_genus_Globicephala 0.0816541 2 tf_comment_unidentified 0.0736307 3 fam_genus_Delphinus 0.0710860 4 fam_genus_Tursiops 0.0500141 5 tf_comment_false 0.0498405 6 fam_genus_Lagenorhynchus 0.0448997 7 tf_comment_finned 0.0341306 8 tf_comment_sided 0.0302409 9 tf_comment_long 0.0244866 10 fam_genus_Hyperoodon 0.0240353 11 length 0.0230962 12 fam_genus_Grampus 0.0227513 13 tf_comment_beaked 0.0219021 14 tf_comment_lesser 0.0215853 15 tf_comment_rorqual 0.0199182 16 tf_comment_dolphin 0.0198597 17 fam_genus_Orcinus 0.0194480 18 tf_comment_porpoise 0.0192418 19 tf_comment_bottle 0.0165892 20 fam_genus_Pseudorca 0.0159957 21 fam_genus_Physeter 0.0159090 22 tf_comment_risso's 0.0131135 23 mass_possible_S 0.0127573 24 fam_genus_Mesoplodon 0.0123743 25 tf_comment_fin 0.0122737 26 fam_genus_Ziphius 0.0122521 27 fam_genus_odontocete 0.0109274 28 fam_genus_cetacean 0.0104412 29 tf_comment_killer 0.0099741 30 fam_genus_Stenella 0.0087018 31 date_decimal 0.0081614 32 tf_comment_nosed 0.0075943 33 date_year 0.0072361 34 tf_comment_whale 0.0071684 35 tf_comment_white 0.0067214 36 mass_single_S 0.0041988 37 tf_comment_common 0.0041767 38 tf_comment_cuvier's 0.0036603 39 tf_comment_sowerby's 0.0036372 40 Do the predictions look reasonable? The class probability is spread across 27 species. I’m going to set a high threshold of 0.9, meaning the predicted species needs to be a pretty confident prediction. xgb_preds augment(new_data = strandings_df5 |> filter(species_uncertainty == "Uncertain")) species_levels select(starts_with(".pred_"), -.pred_class) |> names() |> as.factor() subset_df mutate( .class_pred = make_class_pred( .pred_acutorostrata, .pred_acutus, .pred_albirostris, .pred_ampullatus, .pred_bidens, .pred_borealis, .pred_breviceps, .pred_cavirostris, .pred_coeruleoalba, .pred_crassidens, .pred_delphis, .pred_electra, .pred_europaeus, .pred_griseus, .pred_leucas, .pred_macrocephalus, .pred_melaena, .pred_melas, .pred_mirus, .pred_monoceros, .pred_musculus, .pred_novaeangliae, .pred_orca, .pred_phocoena, .pred_physalus, .pred_sp.indet., .pred_truncatus, levels = species_levels, min_prob = .9 ) ) subset_df |> group_by(species, .class_pred) |> summarise(n = n()) |> arrange(species, desc(n)) |> kbl(col.names = c("Actual Label", "Predicted Label", "Count")) Actual Label Predicted Label Count acutorostrata? .pred_acutorostrata 12 acutorostrata/borealis .pred_acutorostrata 1 acutus? .pred_acutus 3 albirostris? .pred_albirostris 9 ampullatus? .pred_ampullatus 3 bidens? .pred_bidens 2 bidens? [EQ] 1 cavirostris? .pred_cavirostris 7 coeruleoalba? .pred_coeruleoalba 1 delphis? .pred_delphis 18 delphis/coeruleoalba [EQ] 48 griseus? .pred_griseus 2 macrocephalus? .pred_macrocephalus 2 melaena? .pred_melaena 20 orca? .pred_orca 4 phocoena? .pred_phocoena 42 physalus? .pred_physalus 4 physalus/acutorostrata [EQ] 1 truncatus? .pred_truncatus 18 truncatus/albirostris [EQ] 5 The majority of the 203 uncertain records are predicted to be as suspected in the original labelling. The remainder are classed as equivocal as they have not met the high bar of a 0.9-or-above probability for a single species. R Toolbox Summarising below the packages and functions used in this post enables me to separately create a toolbox visualisation summarising the usage of packages and functions across all posts. Package Function base as.character[2]; as.integer[3]; c[11]; comment[3]; conflicts[1]; cumsum[1]; function[2]; is.na[1]; length[2]; log10[1]; mean[2]; min[1]; names[1]; search[1]; seq[3]; set.seed[4]; sum[1]; summary[1] clock as_iso_year_week_day[1]; date_build[1]; date_parse[1]; get_month[1]; get_week[1]; get_year[1] doParallel registerDoParallel[1] dplyr filter[12]; across[2]; arrange[5]; case_when[1]; coalesce[1]; count[4]; desc[5]; group_by[4]; id[1]; if_else[9]; mutate[18]; n[2]; n_distinct[2]; recode[1]; row_number[1]; select[3]; slice_head[2]; summarise[4]; ungroup[2] finetune control_race[1]; plot_race[1]; tune_race_anova[1] forcats fct_drop[1]; fct_lump_n[1]; fct_reorder[1] ggplot2 aes[6]; coord_flip[3]; coord_map[1]; element_text[2]; facet_wrap[2]; geom_jitter[2]; geom_map[1]; geom_violin[2]; ggplot[4]; labs[6]; map_data[1]; scale_colour_manual[2]; scale_fill_gradient2[1]; scale_fill_manual[1]; scale_size_manual[1]; scale_y_log10[1]; theme[3]; theme_bw[1]; theme_set[1]; theme_void[1] glue glue[2] janitor clean_names[1] kableExtra kbl[5] parsnip boost_tree[1]; set_engine[1]; set_mode[1] probably make_class_pred[1] purrr map[3]; map2_dfr[1]; possibly[1] readr parse_number[1]; read_csv[1]; read_lines[1] recipes all_nominal_predictors[1]; all_predictors[1]; has_role[2]; step_date[1]; step_dummy[1]; step_zv[1]; update_role[3] rsample initial_split[1]; testing[1]; training[1]; vfold_cv[1] stats var[3] stringr str_c[6]; str_count[1]; str_detect[5]; str_remove[2]; str_remove_all[2]; str_starts[1]; str_to_title[2] textrecipes step_stopwords[1]; step_tf[1]; step_tokenfilter[1]; step_tokenize[1] tibble as_tibble[1]; tibble[2]; enframe[1] tictoc tic[1]; toc[1] tidyr crossing[1]; unnest[1] tune finalize_workflow[1]; select_best[1] wesanderson wes_palette[1] workflows add_model[1]; add_recipe[1]; workflow[1] yardstick accuracy[2]; conf_mat[1]; metric_set[1] Citation Natural History Museum (2019). Data Portal Query on “UK cetacean strandings 1913-1989” created at 2019-08-10 16:41:12.475340 PID https://doi.org/10.5519/qd.iwg63595. Subset of “Historical UK cetacean strandings dataset (1913-1989)” (dataset) PID https://doi.org/10.5519/0028204." />

Sea Monsters that Lost their Way

[This article was first published on R on Quantum Jitter, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Natural History Museum began recording cetacean (whales, dolphins and porpoises) strandings in 1913. I’d like to explore this 1913-1989 dataset.

library(tidyverse)
library(tidymodels)
library(probably)
library(finetune)
library(textrecipes)
library(stopwords)
library(wesanderson)
library(kableExtra)
library(clock)
library(glue)
library(janitor)
library(vip)
library(tictoc)
library(doParallel)

registerDoParallel(cores = 6)

theme_set(theme_bw())

(cols <- wes_palette(name = "Darjeeling2"))
strandings_df <- read_csv("strandings.csv") |>
  clean_names() |> 
  mutate(
    date = date_parse(date, format = "%d/%m/%Y"),
    length = parse_number(length_et),
    species_lumped = fct_lump_n(species, 20),
    across(ends_with("_val"), as.integer)
  )

# glimpse(strandings_df)

Exploratory

Some of the species labels contain a question mark or forward slash. This indicates uncertainty, so it might be fun to see if a machine learning model (multiclass classification) could learn from the known species and suggest an appropriate species where it’s uncertain.

strandings_df2 <- 
  strandings_df |> 
  mutate(species_uncertainty =
      if_else(str_detect(species, "[?/]"), "Uncertain", "Known"))

strandings_df2 |> 
  filter(species_uncertainty == "Uncertain") |> 
  count(species, sort = TRUE, name = "Count") |> 
  slice_head(n = 6) |> 
  kbl()
species Count
delphis/coeruleoalba 48
phocoena? 42
melaena? 20
delphis? 18
truncatus? 18
acutorostrata? 12

The date variable has many NA’s. Fortunately, the components to construct many of these are in the year_val, month_val and day_val variables. With a little wrangling and imputation, we can coalesce these variables into a new date. This will be useful since plots of sample species by year, month and week of stranding suggest a de-constructed date could be a useful predictor.

strandings_df2 |> 
  select(date, year_val, month_val, day_val) |> 
  summary()

##       date               year_val      month_val         day_val     
##  Min.   :1913-02-13   Min.   :   0   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.:1933-09-09   1st Qu.:1933   1st Qu.: 4.000   1st Qu.: 9.00  
##  Median :1954-04-13   Median :1955   Median : 7.000   Median :16.00  
##  Mean   :1955-01-08   Mean   :1954   Mean   : 6.766   Mean   :15.66  
##  3rd Qu.:1979-03-21   3rd Qu.:1979   3rd Qu.:10.000   3rd Qu.:22.00  
##  Max.   :1989-12-25   Max.   :1989   Max.   :12.000   Max.   :31.00  
##  NA's   :121

strandings_df3 <- strandings_df2 |>
  group_by(species) |> 
  mutate(
    month_val = if_else(month_val == 0, mean(month_val) |> 
                          as.integer(), month_val),
    day_val = if_else(day_val == 0, mean(day_val) |> as.integer(), day_val),
    day_val = if_else(day_val == 0, 1L, day_val),
    date2 = date_build(year_val, month_val, day_val, invalid = "NA"),
  ) |> 
  ungroup() |> 
  mutate(date3 = coalesce(date, date2)) |> 
  arrange(id) |> 
  mutate(date = if_else(is.na(date), lag(date3), date3)) |> 
  select(-date2, -date3, -ends_with("_val"))

example_species <-
  c("musculus", "melas", "crassidens", "borealis", "coeruleoalba")

known_species <- strandings_df3 |> 
  filter(species_uncertainty == "Known")

plot_date_feature <- function(var) {
  known_species |>
    mutate(
      year = get_year(date),
      month = get_month(date),
      week = as_iso_year_week_day(date) |> get_week()
    ) |>
    filter(species %in% example_species) |>
    count(species, {{ var }}) |>
    ggplot(aes(species, {{ var }})) +
    geom_violin(
      alpha = 0.7,
      fill = cols[3],
      show.legend = FALSE
    ) +
    coord_flip() +
    labs(
      title = glue("Variation in {str_to_title(as.character(var))}",
                   " of Stranding for Known Species"),
      x = NULL, y = glue("{str_to_title(as.character(var))}")
    )
}

c("year", "month", "week") |> 
  map(sym) |> 
  map(plot_date_feature)

## [[1]]
## 
## [[2]]
## 
## [[3]]

Do latitude and longitude carry useful predictive information?

A geospatial visualisation of strandings shows some species do gravitate towards particular stretches of coastline, e.g. “acutus” and “albirostris” to the east, and “coeruleoalba” to the south-west.

Some species may also be more prone to mass stranding, so something that indicates whether a species has such a history (in these data) may be worth including in the mix.

uki <- map_data("world", region = c("uk", "ireland"))

labels <- c("Mass", "Single")

uki |> 
  ggplot() +
  geom_map(aes(long, lat, map_id = region), map = uki, 
           colour = "black", fill = "grey90", size = 0.1) +
  geom_jitter(aes(longitude, latitude, colour = mass_single, 
                  size = mass_single), 
              alpha = 0.5, data = known_species) +
  facet_wrap(~ species_lumped, nrow = 3) +
  coord_map("mollweide") +
  scale_size_manual(values = c(1, 0.5), labels = labels) +
  scale_colour_manual(values = cols[c(3, 2)], labels = labels) +
  theme_void() +
  theme(legend.position = "top", 
        strip.text = element_text(colour = "grey50")) +
  labs(title = "Strandings by Species", 
       colour = NULL, size = NULL)
# Add history of mass stranding
strandings_df4 <- strandings_df3 |> 
  group_by(species) |>
  mutate(mass_possible = min(mass_single, na.rm = TRUE)) |>
  ungroup()

Some records are missing the length measurement of the mammal. Nonetheless, where present, this is likely to be predictive, and may help, for example, separate species labelled as “delphis/coeruleoalba” where the length is at the extreme ends of the “delphis” range as we see below. And the range of length may differ by mammal sex.

known_species |>
  mutate(sex = case_when(
    sex == "M" ~ "Male",
    sex == "F" ~ "Female",
    TRUE       ~ "Unknown"
  )) |> 
  filter(species_lumped != "Other") |> 
  count(species_lumped, length, sex) |> 
  mutate(species_lumped = fct_reorder(species_lumped, 
                                      desc(length), min, na.rm = TRUE)) |> 
  ggplot(aes(species_lumped, length)) + 
  geom_violin(aes(fill = if_else(str_detect(species_lumped, "^de|^co"), 
                                 TRUE, FALSE)), show.legend = FALSE) +
  facet_wrap(~ sex) +
  scale_fill_manual(values = cols[c(1, 5)]) +
  coord_flip() +
  labs(title = "Variation in Species Length by Sex", 
       x = NULL, y = "Length (metres)")

With map coordinates not always available, county could be, with a little string cleaning, a useful additional predictor.

strandings_df4 |> 
  count(county) |> 
  filter(str_detect(county, "Shet|Northumberland")) |> 
  kbl(col.names = c("County", "Count"))
County Count
Fair Isle, Shetland Isles 1
Northumberland 89
Northumberland. 1
Shetland Islands, Scotland 232
Shetland Isles, Scotland 35
Shetland, Scotland 1
Shetlands, Scotland 1
regex_pattern <-
  c("[,/].*$",
    "(?<!Che|Hamp|Lanca|North York)-?shire",
    " Isl.*$",
    " &.*$",
    "[-.]$")

strandings_df5 <- strandings_df4 |>
  mutate(
    county = str_remove_all(county, str_c(regex_pattern, collapse = "|")),
    county = recode(
      county,
      "Carnarvon" = "Caernarvon",
      "E.Lothian" = "East Lothian",
      "Shetlands" = "Shetland",
      "W.Glamorgan" = "West Glamorgan",
      "S.Glamorgan" = "South Glamorgan"
    )
  ) 

strandings_df4 |>
  summarise(counties_before = n_distinct(county))

## # A tibble: 1 × 1
##   counties_before
##             <int>
## 1             146

strandings_df5 |>
  summarise(counties_after = n_distinct(county))

## # A tibble: 1 × 1
##   counties_after
##            <int>
## 1            109

Whilst count also appears to hold, based on the plot pattern below, species-related information, I’m not going to use it as a predictor as we do not know enough about how it was derived, as reflected in these variable descriptions.

strandings_df5 |>
  ggplot(aes(species, count, colour = species_uncertainty)) +
  geom_jitter(alpha = 0.5, size = 2) +
  coord_flip() +
  scale_y_log10() +
  scale_colour_manual(values = cols[c(1, 5)]) +
  labs(title = "How 'Count' Relates to Species", 
       x = NULL, y = "Count (log10)", colour = "Species") +
  theme(legend.position = "top")

Modelling

So, I’ll set aside the rows where the species is uncertain (to be used later for new predictions), and I’ll train a model on 75% of known species, and test it on the remaining 25%. I’ll use the following predictors:

  • latitude and longitude
  • Mammal length and sex
  • mass_possible indicating a species history of mass strandings
  • date reported converted into decimal, week, month and year
  • county may be useful, especially where the longitude and latitude are missing
  • fam_genus which narrows the range of likely species

I’d like to also make use of the textrecipes package. I can tokenise the textual information in comment and location to see if these add to the predictive power of the model.

I’ll tune the model using tune_race_anova which quickly discards hyperparameter combinations showing little early promise.

known_species <- strandings_df5 |>
  filter(species_uncertainty == "Known") |>
  mutate(across(
    c(
      "species",
      "mass_single",
      "mass_possible",
      "county",
      "location",
      "sex",
      "fam_genus"
    ),
    factor
  ))

set.seed(123)

data_split <-
  known_species |>
  mutate(species = fct_drop(species)) |> 
  initial_split(strata = species)

train <- data_split |> training()

test <- data_split |> testing()

predictors <-
  c(
    "latitude",
    "longitude",
    "length",
    "mass_single",
    "mass_possible",
    "county",
    "location",
    "comment",
    "sex",
    "fam_genus"
  )

recipe <-
  train |>
  recipe() |>
  update_role(species, new_role = "outcome") |>
  update_role(all_of(predictors), new_role = "predictor") |>
  update_role(!has_role("outcome") & !has_role("predictor"), 
              new_role = "id") |>
  step_date(date, features = c("decimal", "week", "month", "year"), 
            label = FALSE) |>
  step_tokenize(location, comment) |>
  step_stopwords(location, comment) |>
  step_tokenfilter(location, comment, max_tokens = tune()) |> #100
  step_tf(location, comment) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors())

xgb_model <-
  boost_tree(trees = tune(), # 440
             mtry = tune(), # 0.6
             learn_rate = 0.02) |> 
  set_mode("classification") |>
  set_engine("xgboost", 
             count = FALSE,
             verbosity = 0,
             tree_method = "hist")

xgb_wflow <- workflow() |>
  add_recipe(recipe) |>
  add_model(xgb_model)

set.seed(9)

folds <- vfold_cv(train, strata = species)

set.seed(10)

tic()

tune_result <- xgb_wflow |>
  tune_race_anova(
    resamples = folds,
    grid = crossing(
      trees = seq(200, 520, 40),
      mtry = seq(0.5, 0.7, 0.1),
      max_tokens = seq(80, 120, 20)
      ),
    control = control_race(),
    metrics = metric_set(accuracy)
  )

toc()

## 476.418 sec elapsed

tune_result |> 
  plot_race() + 
  labs(title = "Early Elimination of Parameter Combinations")
set.seed(123)

xgb_fit <- xgb_wflow |>
  finalize_workflow(tune_result |> 
                      select_best(metric = "accuracy")) |> 
  fit(train)

Having fitted the model with the 3080 records in the training data, I’ll test its accuracy on the 1028 records of known species the model has not yet seen.

Without spending time on alternative models, we’re getting a reasonable result for the “porpoise” of this post, as reflected in both the accuracy metric and confusion matrix.

xgb_results <- xgb_fit |> 
  augment(new_data = test)

xgb_results |>
  accuracy(species, .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.993

xgb_results |>
  conf_mat(species, .pred_class) |>
  autoplot(type = "heatmap") +
  scale_fill_gradient2(
    mid = "white",
    high = cols[1],
    midpoint = 0
  ) +
  labs(title = "Confusion Matrix") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The top variable importance scores include fam_genus, many of the comment tokens, plus length, mass-possible, date_decimal, date_year, and latitude.

vi(xgb_fit |> extract_fit_parsnip()) |> 
  arrange(desc(Importance)) |> 
  mutate(ranking = row_number()) |> 
  slice_head(n = 40) |> 
  kbl()
Variable Importance ranking
fam_genus_Phocoena 0.1215845 1
fam_genus_Globicephala 0.0816541 2
tf_comment_unidentified 0.0736307 3
fam_genus_Delphinus 0.0710860 4
fam_genus_Tursiops 0.0500141 5
tf_comment_false 0.0498405 6
fam_genus_Lagenorhynchus 0.0448997 7
tf_comment_finned 0.0341306 8
tf_comment_sided 0.0302409 9
tf_comment_long 0.0244866 10
fam_genus_Hyperoodon 0.0240353 11
length 0.0230962 12
fam_genus_Grampus 0.0227513 13
tf_comment_beaked 0.0219021 14
tf_comment_lesser 0.0215853 15
tf_comment_rorqual 0.0199182 16
tf_comment_dolphin 0.0198597 17
fam_genus_Orcinus 0.0194480 18
tf_comment_porpoise 0.0192418 19
tf_comment_bottle 0.0165892 20
fam_genus_Pseudorca 0.0159957 21
fam_genus_Physeter 0.0159090 22
tf_comment_risso's 0.0131135 23
mass_possible_S 0.0127573 24
fam_genus_Mesoplodon 0.0123743 25
tf_comment_fin 0.0122737 26
fam_genus_Ziphius 0.0122521 27
fam_genus_odontocete 0.0109274 28
fam_genus_cetacean 0.0104412 29
tf_comment_killer 0.0099741 30
fam_genus_Stenella 0.0087018 31
date_decimal 0.0081614 32
tf_comment_nosed 0.0075943 33
date_year 0.0072361 34
tf_comment_whale 0.0071684 35
tf_comment_white 0.0067214 36
mass_single_S 0.0041988 37
tf_comment_common 0.0041767 38
tf_comment_cuvier's 0.0036603 39
tf_comment_sowerby's 0.0036372 40

Do the predictions look reasonable?

The class probability is spread across 27 species. I’m going to set a high threshold of 0.9, meaning the predicted species needs to be a pretty confident prediction.

xgb_preds <- xgb_fit |> 
  augment(new_data = strandings_df5 |> 
            filter(species_uncertainty == "Uncertain"))

species_levels <- xgb_preds |> 
  select(starts_with(".pred_"), -.pred_class) |> 
  names() |> 
  as.factor()

subset_df <- xgb_preds |>
  mutate(
    .class_pred = make_class_pred(
      .pred_acutorostrata,
      .pred_acutus,
      .pred_albirostris,
      .pred_ampullatus,
      .pred_bidens,
      .pred_borealis,
      .pred_breviceps,
      .pred_cavirostris,
      .pred_coeruleoalba,
      .pred_crassidens,
      .pred_delphis,
      .pred_electra,
      .pred_europaeus,
      .pred_griseus,
      .pred_leucas,
      .pred_macrocephalus,
      .pred_melaena,
      .pred_melas,
      .pred_mirus,
      .pred_monoceros,
      .pred_musculus,
      .pred_novaeangliae,
      .pred_orca,
      .pred_phocoena,
      .pred_physalus,
      .pred_sp.indet.,
      .pred_truncatus,
      levels = species_levels,
      min_prob = .9
    )
  )

subset_df |>
  group_by(species, .class_pred) |> 
  summarise(n = n()) |> 
  arrange(species, desc(n)) |> 
  kbl(col.names = c("Actual Label", "Predicted Label", "Count"))
Actual Label Predicted Label Count
acutorostrata? .pred_acutorostrata 12
acutorostrata/borealis .pred_acutorostrata 1
acutus? .pred_acutus 3
albirostris? .pred_albirostris 9
ampullatus? .pred_ampullatus 3
bidens? .pred_bidens 2
bidens? [EQ] 1
cavirostris? .pred_cavirostris 7
coeruleoalba? .pred_coeruleoalba 1
delphis? .pred_delphis 18
delphis/coeruleoalba [EQ] 48
griseus? .pred_griseus 2
macrocephalus? .pred_macrocephalus 2
melaena? .pred_melaena 20
orca? .pred_orca 4
phocoena? .pred_phocoena 42
physalus? .pred_physalus 4
physalus/acutorostrata [EQ] 1
truncatus? .pred_truncatus 18
truncatus/albirostris [EQ] 5

The majority of the 203 uncertain records are predicted to be as suspected in the original labelling. The remainder are classed as equivocal as they have not met the high bar of a 0.9-or-above probability for a single species.

R Toolbox

Summarising below the packages and functions used in this post enables me to separately create a toolbox visualisation summarising the usage of packages and functions across all posts.

Package Function
base as.character[2]; as.integer[3]; c[11]; comment[3]; conflicts[1]; cumsum[1]; function[2]; is.na[1]; length[2]; log10[1]; mean[2]; min[1]; names[1]; search[1]; seq[3]; set.seed[4]; sum[1]; summary[1]
clock as_iso_year_week_day[1]; date_build[1]; date_parse[1]; get_month[1]; get_week[1]; get_year[1]
doParallel registerDoParallel[1]
dplyr filter[12]; across[2]; arrange[5]; case_when[1]; coalesce[1]; count[4]; desc[5]; group_by[4]; id[1]; if_else[9]; mutate[18]; n[2]; n_distinct[2]; recode[1]; row_number[1]; select[3]; slice_head[2]; summarise[4]; ungroup[2]
finetune control_race[1]; plot_race[1]; tune_race_anova[1]
forcats fct_drop[1]; fct_lump_n[1]; fct_reorder[1]
ggplot2 aes[6]; coord_flip[3]; coord_map[1]; element_text[2]; facet_wrap[2]; geom_jitter[2]; geom_map[1]; geom_violin[2]; ggplot[4]; labs[6]; map_data[1]; scale_colour_manual[2]; scale_fill_gradient2[1]; scale_fill_manual[1]; scale_size_manual[1]; scale_y_log10[1]; theme[3]; theme_bw[1]; theme_set[1]; theme_void[1]
glue glue[2]
janitor clean_names[1]
kableExtra kbl[5]
parsnip boost_tree[1]; set_engine[1]; set_mode[1]
probably make_class_pred[1]
purrr map[3]; map2_dfr[1]; possibly[1]
readr parse_number[1]; read_csv[1]; read_lines[1]
recipes all_nominal_predictors[1]; all_predictors[1]; has_role[2]; step_date[1]; step_dummy[1]; step_zv[1]; update_role[3]
rsample initial_split[1]; testing[1]; training[1]; vfold_cv[1]
stats var[3]
stringr str_c[6]; str_count[1]; str_detect[5]; str_remove[2]; str_remove_all[2]; str_starts[1]; str_to_title[2]
textrecipes step_stopwords[1]; step_tf[1]; step_tokenfilter[1]; step_tokenize[1]
tibble as_tibble[1]; tibble[2]; enframe[1]
tictoc tic[1]; toc[1]
tidyr crossing[1]; unnest[1]
tune finalize_workflow[1]; select_best[1]
wesanderson wes_palette[1]
workflows add_model[1]; add_recipe[1]; workflow[1]
yardstick accuracy[2]; conf_mat[1]; metric_set[1]

Citation

Natural History Museum (2019). Data Portal Query on “UK cetacean strandings 1913-1989” created at 2019-08-10 16:41:12.475340 PID https://doi.org/10.5519/qd.iwg63595. Subset of “Historical UK cetacean strandings dataset (1913-1989)” (dataset) PID https://doi.org/10.5519/0028204.

To leave a comment for the author, please follow the link and comment on their blog: R on Quantum Jitter.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)