Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
When we are dealing with unbalanced classes in Machine Learning projects there are many approaches that you can follow. Just to main some of them:
- Undersampling: We try to reduce the observations from the majority class so that the final dataset to be balanced
- Oversampling: We try to generate more observations from the minority class usually by replicating the samples from the minority class so that the final dataset to be balanced.
- Synthetic Data Generation (SMOTE): We generate artificial data using bootstrapping and k-Nearest Neighbors algorithms.
Generate the Unbalanced Data
The scenario is that we are dealing with 3 email campaigns that have different CTRs and we want to apply undersampling to normalize the CTR by the campaign so that to avoid any skewness and biased when we will build the Machine Learning model. The hypothetical dataset is the following:
- Campaign A: 5000 Observations with 10% CTR (approx)
- Campaign B: 10000 Observations with 20% CTR (approx)
- Campaign C: 1000 Observations with 30% CTR (approx)
Let’s try to generate this random sample in R.
library(tidyverse) set.seed(5) df = rbind(data.frame(Campaign = "A", Click = rbinom(n=5000, size=1, prob=0.1)), data.frame(Campaign = "B", Click = rbinom(n=10000, size=1, prob=0.2)), data.frame(Campaign = "C", Click = rbinom(n=1000, size=1, prob=0.3))) head(df)
Output:
Campaign Click 1 A 0 2 A 0 3 A 1 4 A 0 5 A 0 6 A 0
Let’s get the CTR by Campaign
df%>%group_by(Campaign)%>% summarise(CTR=mean(Click))
Output:
# A tibble: 3 x 2 Campaign CTR <chr> <dbl> 1 A 0.106 2 B 0.198 3 C 0.302
As we can see the A campaign has 10.6% CTR, the B 19.8% and the C 30.2%. Let’s add also a random column called attribute which takes the values “X”, “Y”, “Z” since we will deal with datasets with more than two columns.
df$Attribute<-sample(c("X","Y", "Z"), size = dim(df)[1], replace = TRUE, prob = c(0.2, 0.6, 0.2)) head(df) Campaign Click Attribute 1 A 0 Y 2 A 0 Y 3 A 1 Z 4 A 0 Y 5 A 0 Z 6 A 0 Z
Now, our goal is to apply undersampling so that each campaign will have around 50% CTR
Undersampling by Group
We will use the map2
function from the purrr
package which belongs to the tidyverse
family:
campaign_summary <- df %>% group_by(Campaign)%>% summarize(rr=sum(Click)/n(), pos= sum(Click)) df_neg_sample<- df %>% filter(Click==0) %>% group_by(Campaign) %>% nest() %>% #group all data by campaign name ungroup() %>% inner_join(campaign_summary, by="Campaign") sampled_df_neg<-df_neg_sample %>% mutate(samp = map2(data, pos, sample_n, replace = FALSE)) %>%# sample based on the campaing summary select(-data) %>% #remove original nested data unnest(samp) %>% select(c(-"rr",-"pos")) df_pos <- df %>% filter(Click==1) #positive samples new_df <- rbind(df_pos,sampled_df_neg) #balanced set positive negative within each campaign head(new_df) Campaign Click Attribute 1 A 1 Z 2 A 1 Y 3 A 1 Y 4 A 1 Y 5 A 1 Y 6 A 1 Y tail(new_df) Campaign Click Attribute 5613 C 0 Y 5614 C 0 Y 5615 C 0 Z 5616 C 0 Y 5617 C 0 Y 5618 C 0 X
Let’s check if the new_df
is balanced by campaign. We will group by campaign and we will show the CTR and the number of observations:
new_df%>%group_by(Campaign)%>%summarise(CTR=mean(Click), Observations=n()) Campaign CTR Observations <chr> <dbl> <int> 1 A 0.5 1064 2 B 0.5 3950 3 C 0.5 604
As we can see, we sacrificed a sample but we have a balanced number of classes for every campaign (50-50).
Undersampling by Group using the ROSE Package
We can use also the ROSE package. Below we will apply a for loop by campaign so that to get a balanced sample using the undersampling technique.
library(ROSE) balanced_sample = NULL for (c in unique(df$Campaign)) { tmp_df = df%>%filter(Campaign==c) tmp<-ovun.sample(Click ~ ., data = tmp_df, method = "under", p = 0.5, seed = 5)$data balanced_sample<-rbind(balanced_sample, tmp) }
Let’s check if the balanced_sample
is actually balanced.
balanced_sample%>%group_by(Campaign)%>%summarise(CTR=mean(Click), Observations=n())
Output:
Campaign CTR Observations <chr> <dbl> <int> 1 A 0.504 1056 2 B 0.496 3978 3 C 0.510 592
Awesome. We showed two different approaches of how you can apply undersampling by group.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.