forcats::fct_lump_n() with weights “overall”
Sometimes I want to lump together categories that don’t have much impact on my analysis. The forcats::fct_lump*() functions are a good way to do this.
But I often struggle to work out how to use weights to rank the categories. That’s because the main use case of fct_lump() is a factor (or character vector) with many repeated values: it keeps the n most frequent levels and combines the rest as “Other”.
Example
Let’s look at an example:
library(tidyverse)
set.seed(42)
values <- sample(letters[1:10], 40, replace = TRUE)
values
##  [1] "a" "e" "a" "i" "j" "d" "b" "j" "a" "h" "g" "d" "i" "e" "d" "j" "b" "c" "i"
## [20] "i" "d" "e" "e" "d" "b" "h" "c" "j" "a" "j" "h" "f" "j" "h" "d" "d" "f" "b"
## [39] "e" "d"
So values is a vector of 40 letters. Now we want to keep the 5 most frequent letters and combine all other letters as “Other”.
forcats::fct_lump(values, 5)
##  [1] a     e     a     i     j     d     b     j     a     h     Other d
## [13] i     e     d     j     b     Other i     i     d     e     e     d
## [25] b     h     Other j     a     j     h     Other j     h     d     d
## [37] Other b     e     d
## Levels: a b d e h i j Other
forcats::fct_lump(values, 5) %>%
  table()
## .
##     a     b     d     e     h     i     j Other
##     4     4     8     5     4     4     6     5
Because of ties, more than 5 letters are kept: a, b, h and i each occur 4 times, so they share the same rank. But that’s okay. The ties.method argument controls how ties are handled.
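To make the tie handling concrete, here is a minimal sketch on a small hand-written factor (the vector x is made up for illustration and is not the sampled values from above). It compares the default ties.method = "min" with ties.method = "first", which breaks ties by level order.

```r
library(forcats)

# Hypothetical toy data: "a" occurs 3 times, "b" and "c" twice, "d" and "e" once.
x <- factor(c("a", "a", "a", "b", "b", "c", "c", "d", "e"))

# Default ties.method = "min": "b" and "c" share rank 2, so both are kept.
lump_default <- fct_lump_n(x, n = 2)
levels(lump_default)
# "a" "b" "c" "Other"

# ties.method = "first": ties are broken by level order, so exactly n levels stay.
lump_first <- fct_lump_n(x, n = 2, ties.method = "first")
levels(lump_first)
# "a" "b" "Other"
```

Which method fits depends on whether you prefer a strict cut-off or would rather keep every level that is tied with the last kept one.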
Weights
Instead of simply counting occurrences, it’s also possible to use weights. But in my usual cases I have to compute those weights first.
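As a small illustration of what the w argument does, the sketch below uses hypothetical pre-aggregated data (the names category and n_obs are made up): each category appears only once, so plain counting cannot rank anything, while the weights can.

```r
library(forcats)

# Hypothetical pre-aggregated data: one entry per category plus a count.
category <- c("a", "b", "c", "d", "e")
n_obs    <- c(50, 30, 3, 2, 1)

# Without weights every category occurs exactly once, so all levels tie
# and nothing is lumped.
unweighted <- fct_lump_n(category, n = 2)
levels(unweighted)
# "a" "b" "c" "d" "e"

# With w = n_obs the levels are ranked by their summed weights instead.
weighted <- fct_lump_n(category, n = 2, w = n_obs)
as.character(weighted)
# "a" "b" "Other" "Other" "Other"
```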
Here’s my case. Let’s say I’m analyzing a website’s pageviews per browser. So I get one value for each day and browser.
set.seed(42)
chrome  <- tibble(day = 1:5, useragent = "chrome",  pageviews = round(rnorm(5, 1000, 100)))
firefox <- tibble(day = 1:5, useragent = "firefox", pageviews = round(rnorm(5, 600, 100)))
edge    <- tibble(day = 1:5, useragent = "edge",    pageviews = round(rnorm(5, 600, 100)))
junk_1  <- tibble(day = 1:5, useragent = "junk 1",  pageviews = round(rnorm(5, 100, 20)))
junk_2  <- tibble(day = 1:5, useragent = "junk 2",  pageviews = round(rnorm(5, 100, 20)))
junk_3  <- tibble(day = 1:5, useragent = "junk 3",  pageviews = round(rnorm(5, 100, 20)))

data <- bind_rows(chrome, firefox, edge, junk_1, junk_2, junk_3)

data %>%
  arrange(day) %>%
  head()
## # A tibble: 6 × 3
##     day useragent pageviews
##   <int> <chr>         <dbl>
## 1     1 chrome         1137
## 2     1 firefox         589
## 3     1 edge            730
## 4     1 junk 1          113
## 5     1 junk 2           94
## 6     1 junk 3           91
Each day has a row for every browser. Here I have three main browsers (chrome, firefox and edge) and three obscure ones (called junk 1 to junk 3). I want to combine the obscure ones as “Other” because I’m only interested in the main (or top) three browsers.
I like to rank my browsers by their total pageviews over all days (that’s the “overall” in the title) and then lump them.
data_lumped <- data %>%
  group_by(useragent) %>%
  mutate(browser_total = sum(pageviews)) %>%
  ungroup() %>%
  mutate(
    ua = fct_lump_n(f = useragent, n = 3, w = browser_total)
  ) %>%
  arrange(day, ua)
data_lumped
## # A tibble: 30 × 5
##      day useragent pageviews browser_total ua
##    <int> <chr>         <dbl>         <dbl> <fct>
##  1     1 chrome         1137          5220 chrome
##  2     1 edge            730          3179 edge
##  3     1 firefox         589          3327 firefox
##  4     1 junk 1          113           431 Other
##  5     1 junk 2           94           517 Other
##  6     1 junk 3           91           447 Other
##  7     2 chrome          944          5220 chrome
##  8     2 edge            829          3179 edge
##  9     2 firefox         751          3327 firefox
## 10     2 junk 1           94           431 Other
## # ℹ 20 more rows
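A possible shortcut, sketched here on hand-written numbers rather than the simulated data: fct_lump_n() sums the weights per level on its own, so passing the raw pageviews as w should rank the browsers the same way as the precomputed per-browser total. The tibble pv and its values are hypothetical and only serve this comparison.

```r
library(dplyr)
library(forcats)

# Hypothetical pageviews for two days and five browsers.
pv <- tibble(
  day       = rep(1:2, each = 5),
  useragent = rep(c("chrome", "firefox", "edge", "junk 1", "junk 2"), times = 2),
  pageviews = c(1000, 600, 500,  90, 110,
                 900, 700, 650, 120,  80)
)

# Variant used in this post: precompute the per-browser total as the weight.
via_total <- pv %>%
  group_by(useragent) %>%
  mutate(browser_total = sum(pageviews)) %>%
  ungroup() %>%
  mutate(ua = fct_lump_n(useragent, n = 3, w = browser_total))

# Shortcut: pass the raw pageviews and let fct_lump_n() sum them per level.
via_raw <- pv %>%
  mutate(ua = fct_lump_n(useragent, n = 3, w = pageviews))

# Both variants should produce the same lumped factor.
same_result <- identical(via_total$ua, via_raw$ua)
same_result
```

The explicit browser_total column is still handy when you want to see the ranking in the data, as in the output above.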
Now it’s simple grouping and summarizing:
data_lumped %>%
  group_by(day, ua) %>%
  summarise(pageviews = sum(pageviews), .groups = "drop") %>%
  arrange(day, ua)
## # A tibble: 20 × 3
##      day ua      pageviews
##    <int> <fct>       <dbl>
##  1     1 chrome       1137
##  2     1 edge          730
##  3     1 firefox       589
##  4     1 Other         298
##  5     2 chrome        944
##  6     2 edge          829
##  7     2 firefox       751
##  8     2 Other         253
##  9     3 chrome       1036
## 10     3 edge          461
## 11     3 firefox       591
## 12     3 Other         209
## 13     4 chrome       1063
## 14     4 edge          572
## 15     4 firefox       802
## 16     4 Other         284
## 17     5 chrome       1040
## 18     5 edge          587
## 19     5 firefox       594
## 20     5 Other         351