forcats::fct_lump_n() with weights “overall”

[This article was first published on rstats-tips.net, and kindly contributed to R-bloggers].

Sometimes I want to combine categories that don’t have much impact on my analysis. The forcats::fct_lump*() functions are a good way to do this. But I often struggle to remember how to use weights to rank the categories. That’s because the main use case of fct_lump() is a factor vector with many repeated values: it keeps the n most frequent levels and combines the rest as “Other”.

Example

Let’s look at an example:

library(tidyverse)
set.seed(42)
values <- sample(letters[1:10], 40, replace = TRUE)
values
##  [1] "a" "e" "a" "i" "j" "d" "b" "j" "a" "h" "g" "d" "i" "e" "d" "j" "b" "c" "i"
## [20] "i" "d" "e" "e" "d" "b" "h" "c" "j" "a" "j" "h" "f" "j" "h" "d" "d" "f" "b"
## [39] "e" "d"

So values is a character vector of 40 letters. Now we want to keep the 5 most frequent letters and combine all the others as “Other”.

forcats::fct_lump(values, 5)
##  [1] a     e     a     i     j     d     b     j     a     h     Other d    
## [13] i     e     d     j     b     Other i     i     d     e     e     d    
## [25] b     h     Other j     a     j     h     Other j     h     d     d    
## [37] Other b     e     d    
## Levels: a b d e h i j Other
forcats::fct_lump(values, 5) %>% table()
## .
##     a     b     d     e     h     i     j Other 
##     4     4     8     5     4     4     6     5

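Because of ties, more than 5 letters are kept: a, b, h and i each occur 4 times, so they share fourth place and all of them stay. That’s okay here. If you need exactly n levels, the ties.method argument controls how ties are handled.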
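Here is a small sketch of how ties.method changes the result. The vector x is a made-up toy example (not the values from above), chosen so that "b" and "c" tie for second place. I use fct_lump_n() here, the variant of fct_lump() that lumps by rank.

```r
library(forcats)

# Hypothetical toy vector: counts are a = 3, b = 2, c = 2, d = 1, e = 1
x <- c("a", "a", "a", "b", "b", "c", "c", "d", "e")

# Default ties.method = "min": tied levels share the better rank,
# so both "b" and "c" are kept even though n = 2
keep_ties <- fct_lump_n(x, n = 2)
levels(keep_ties)

# ties.method = "first": ties are broken by level order,
# so only "a" and "b" are kept and "c" goes to "Other"
break_ties <- fct_lump_n(x, n = 2, ties.method = "first")
levels(break_ties)
```

Which method fits depends on the analysis. Breaking ties by level order is arbitrary, so I would only use it when the exact number of levels really matters.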

Weights

Instead of simply counting occurrences, it’s also possible to rank the levels by weights, using the w argument. But in my typical cases I have to compute those weights first.

Here’s my case. Let’s say I’m analyzing the pageviews of a website per browser. So I get one value for each day and browser.

set.seed(42)

# Three main browsers with high traffic ...
chrome <- tibble(day = 1:5, useragent = "chrome", pageviews = round(rnorm(5, 1000, 100)))
firefox <- tibble(day = 1:5, useragent = "firefox", pageviews = round(rnorm(5, 600, 100)))
edge <- tibble(day = 1:5, useragent = "edge", pageviews = round(rnorm(5, 600, 100)))
# ... and three obscure ones with low traffic
junk_1 <- tibble(day = 1:5, useragent = "junk 1", pageviews = round(rnorm(5, 100, 20)))
junk_2 <- tibble(day = 1:5, useragent = "junk 2", pageviews = round(rnorm(5, 100, 20)))
junk_3 <- tibble(day = 1:5, useragent = "junk 3", pageviews = round(rnorm(5, 100, 20)))

# Stack everything into one long table: one row per day and browser
data <- bind_rows(chrome, firefox, edge, junk_1, junk_2, junk_3)

data %>% 
  arrange(day) %>% 
  head()
## # A tibble: 6 × 3
##     day useragent pageviews
##   <int> <chr>         <dbl>
## 1     1 chrome         1137
## 2     1 firefox         589
## 3     1 edge            730
## 4     1 junk 1          113
## 5     1 junk 2           94
## 6     1 junk 3           91

Each day has one row per browser. Here I have three main browsers (chrome, firefox and edge) and three obscure ones (called junk 1 to junk 3). I want to combine the obscure ones as “Other” because I’m only interested in the top three browsers.

I want to rank the browsers by their total pageviews across all days and then lump them.

data_lumped <- data %>% 
  group_by(useragent) %>% 
  mutate(browser_total = sum(pageviews)) %>% 
  ungroup() %>% 
  mutate(
    ua = fct_lump_n(f = useragent, n = 3, w = browser_total)
  ) %>% 
  arrange(day, ua)
data_lumped
## # A tibble: 30 × 5
##      day useragent pageviews browser_total ua     
##    <int> <chr>         <dbl>         <dbl> <fct>  
##  1     1 chrome         1137          5220 chrome 
##  2     1 edge            730          3179 edge   
##  3     1 firefox         589          3327 firefox
##  4     1 junk 1          113           431 Other  
##  5     1 junk 2           94           517 Other  
##  6     1 junk 3           91           447 Other  
##  7     2 chrome          944          5220 chrome 
##  8     2 edge            829          3179 edge   
##  9     2 firefox         751          3327 firefox
## 10     2 junk 1           94           431 Other  
## # ℹ 20 more rows
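A side note on the weights, as a sketch rather than a definitive statement: as far as I read the forcats documentation, w holds one weight per value (not per level), and the weights are added up per level before ranking. If that holds, passing pageviews directly should lump the browsers the same way as the precomputed total. The table toy below is made up for illustration and is not the data from above.

```r
library(dplyr)
library(forcats)

# Hypothetical, hand-written pageviews for two days and five browsers
toy <- tibble(
  day = rep(1:2, each = 5),
  useragent = rep(c("chrome", "firefox", "edge", "junk 1", "junk 2"), times = 2),
  pageviews = c(1000, 600, 650, 90, 110,
                 950, 700, 500, 120, 80)
)

toy <- toy %>%
  group_by(useragent) %>%
  mutate(browser_total = sum(pageviews)) %>%
  ungroup() %>%
  mutate(
    # weights precomputed per browser, as in the post
    ua_total = fct_lump_n(useragent, n = 3, w = browser_total),
    # raw per-row weights, assuming forcats sums them per level
    ua_raw = fct_lump_n(useragent, n = 3, w = pageviews)
  )

# Compare the two lumped factors
identical(toy$ua_total, toy$ua_raw)
```

I still find the explicit browser_total column handy, because it shows in the table which totals the ranking is based on.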

Now it’s simple grouping and summarizing:

data_lumped %>% 
  group_by(day, ua) %>% 
  summarise(pageviews = sum(pageviews), .groups = "drop") %>% 
  arrange(day, ua)
## # A tibble: 20 × 3
##      day ua      pageviews
##    <int> <fct>       <dbl>
##  1     1 chrome       1137
##  2     1 edge          730
##  3     1 firefox       589
##  4     1 Other         298
##  5     2 chrome        944
##  6     2 edge          829
##  7     2 firefox       751
##  8     2 Other         253
##  9     3 chrome       1036
## 10     3 edge          461
## 11     3 firefox       591
## 12     3 Other         209
## 13     4 chrome       1063
## 14     4 edge          572
## 15     4 firefox       802
## 16     4 Other         284
## 17     5 chrome       1040
## 18     5 edge          587
## 19     5 firefox       594
## 20     5 Other         351
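The grouping and summarizing step can probably be shortened with dplyr::count() and its wt argument, which sums a column per group. This is a minimal sketch on a made-up table (toy is hypothetical, not the data from above) that compares both ways.

```r
library(dplyr)
library(forcats)

# Hypothetical, hand-written pageviews for two days and five browsers
toy <- tibble(
  day = rep(1:2, each = 5),
  useragent = rep(c("chrome", "firefox", "edge", "junk 1", "junk 2"), times = 2),
  pageviews = c(1000, 600, 650, 90, 110,
                 950, 700, 500, 120, 80)
)

# Lump to the top 3 browsers by total pageviews, as in the post
toy_lumped <- toy %>%
  group_by(useragent) %>%
  mutate(browser_total = sum(pageviews)) %>%
  ungroup() %>%
  mutate(ua = fct_lump_n(useragent, n = 3, w = browser_total))

# The long way: group, then sum
by_summarise <- toy_lumped %>%
  group_by(day, ua) %>%
  summarise(pageviews = sum(pageviews), .groups = "drop")

# The short way: count() with wt sums pageviews per day and ua
by_count <- toy_lumped %>%
  count(day, ua, wt = pageviews, name = "pageviews")

by_count
```

Both tables should contain the same rows. Which version to use is mostly a matter of taste.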