How to Randomly Split a Userbase Using Modulo

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers.]

In many cases, there is a need to split a userbase into 2 or more buckets. For example:

  • UCG: To quantify and evaluate the performance of promotional campaigns, many companies create a Universal Control Group (UCG), a random sample of the userbase that does not receive any offer or message.
  • Bucketize: For testing purposes, it is common to split the userbase into buckets so that they can be compared over the long term.
  • Samples for Machine Learning: A userbase can become too large for a machine learning model to run on, so it is common to draw random samples.

The requirements

For the cases that we mentioned above, the splitting algorithm must satisfy the following two requirements:

  1. The mapping function must assign an existing user to the same group every time we encounter that user. For instance, if UserID 152514 was initially assigned to the UCG, it must always stay in the UCG.
  2. The mapping function must assign every new user to a group.

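We can fulfil both requirements by applying the modulo operation: the remainder depends only on the user's ID, so it never changes for an existing user and can be computed for any new one.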
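As a minimal sketch of why modulo meets both requirements, the hypothetical helper `assign_group()` below maps a numeric user ID to a group. The modulus of 100 and the cutoff of 20 are illustrative assumptions chosen for readability, not the values used in the example later in this article.

```r
# Hypothetical helper: map a numeric user ID to a group via its remainder.
# `modulus` and `cutoff` are illustrative assumptions.
assign_group <- function(user_id, modulus = 100, cutoff = 20) {
  ifelse(user_id %% modulus < cutoff, "UCG", "Control")
}

# Requirement 1: an existing user always lands in the same group
first_call  <- assign_group(152514)
second_call <- assign_group(152514)
identical(first_call, second_call)  # TRUE

# Requirement 2: a user ID never seen before still gets a group
assign_group(987654321)
```

No lookup table is needed: the group is recomputed from the ID whenever it is required.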

Example of Splitting the Userbase with Modulo

Let’s see how we can split the userbase into two buckets. Say that we want 20% of the users to be in the UCG and the remaining 80% to be in Control. UserIDs are usually hashed for GDPR compliance, so below we generate some random data with MD5-hashed names:

library(tidyverse)
library(digest)
library(Rmpfr)
set.seed(5)


# 100,000 synthetic users
df <- tibble(Row_Number = seq(1, 100000))

# Per row: MD5-hash a random 10-letter string and draw a random timestamp
df <- df %>%
  rowwise() %>%
  mutate(Hash_Name = digest(paste(sample(LETTERS, 10, replace = TRUE), collapse = ""),
                            algo = "md5", serialize = FALSE),
         Event_Date = lubridate::as_datetime(runif(1, 1546290000, 1577739600)))


head(df)

Output:

# A tibble: 6 x 3
# Rowwise: 
  Row_Number Hash_Name                        Event_Date         
       <int> <chr>                            <dttm>             
1          1 275db34231203750f10adb24c76b9619 2019-06-10 06:15:33
2          2 9a449c58ac6baed3b3648f0f3b5f8084 2019-03-27 21:38:34
3          3 e28e89ab554739a982c862cccf024464 2019-12-02 15:43:48
4          4 45b9aea890d3b98419cae72bb497e94b 2019-10-18 18:58:23
5          5 c4ce7434621d08f5195fbd1bfc1c20c2 2019-08-09 06:14:45
6          6 0b8a304be1015cacfcf31dd40ef6a381 2019-04-10 08:07:28

To spread the remainders evenly, it is better to choose a prime number as the modulus. For this example we will take 997, which is prime. We also need to convert the MD5 hash to a number. A 128-bit MD5 value is too large to be represented exactly as a regular R numeric (double), so we use the Rmpfr library, which provides arbitrary-precision arithmetic. To sum up:

  • We will convert the MD5 hash to a number
  • We will divide that number by 997 and store the remainder

df$Remainder <- as.numeric(mpfr(df$Hash_Name, base=16) %% 997)
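The same remainder can also be obtained without arbitrary-precision arithmetic by processing the hexadecimal string one digit at a time and reducing modulo 997 after each step, so the intermediate value never grows large. The following is a base-R sketch of that idea; `hex_mod()` is a hypothetical helper name, not part of Rmpfr or digest, and it is not the approach used in this article.

```r
# Hypothetical helper: remainder of a hexadecimal string modulo `modulus`,
# computed digit by digit so the intermediate value stays below 16 * modulus.
hex_mod <- function(hex, modulus = 997) {
  digits <- strtoi(strsplit(tolower(hex), "")[[1]], base = 16L)
  if (anyNA(digits)) stop("not a hexadecimal string: ", hex)
  Reduce(function(rem, d) (rem * 16 + d) %% modulus, digits, 0)
}

# Small values that are easy to check by hand
hex_mod("ff")   # 255 %% 997 = 255
hex_mod("3e5")  # 0x3e5 = 997, so the remainder is 0

# Apply to a vector of MD5 hashes (the first two from the output above)
hashes <- c("275db34231203750f10adb24c76b9619",
            "9a449c58ac6baed3b3648f0f3b5f8084")
remainders <- vapply(hashes, hex_mod, numeric(1), USE.NAMES = FALSE)
```

This trades the vectorised Rmpfr call for a per-hash loop, so it may be slower on large tables but avoids the extra dependency.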

Is it Random?

This approach generates pseudo-random numbers. Let’s check whether the remainders (from 0 to 996) are evenly distributed.

hist(df$Remainder)
[Figure: histogram of df$Remainder]

We can apply a Chi-Square test too.

chisq.test(table(df$Remainder))

Output:

	Chi-squared test for given probabilities

data:  table(df$Remainder)
X-squared = 995.2, df = 996, p-value = 0.5012

The p-value is 0.5012, so we cannot reject the hypothesis that the remainders are uniformly distributed. In that sense, the generated numbers can be considered random.
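To make explicit what the test computes, the sketch below reproduces the chi-square statistic by hand on simulated remainders. The simulated vector `remainder` stands in for `df$Remainder` and is an assumption for illustration. Tabulating against the fixed levels 0 to 996 keeps remainders with zero counts in the table, which `table()` on the raw vector would silently drop.

```r
set.seed(1)
modulus <- 997
# Simulated stand-in for df$Remainder (assumption for illustration)
remainder <- sample(0:(modulus - 1), 100000, replace = TRUE)

# Count every possible remainder, including any that never occur
observed <- table(factor(remainder, levels = 0:(modulus - 1)))
expected <- length(remainder) / modulus   # uniform expectation per bin

# Pearson chi-square statistic and its p-value with modulus - 1 degrees of freedom
statistic <- sum((observed - expected)^2 / expected)
p_value   <- pchisq(statistic, df = modulus - 1, lower.tail = FALSE)

# Same numbers from the built-in test
builtin <- chisq.test(observed)
all.equal(unname(builtin$statistic), statistic)
```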

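Now, we can split our userbase into UCG and Control as follows:

If the remainder is less than 200, the user goes to UCG; otherwise the user goes to Control. Since there are 997 possible remainders, this puts 200/997 ≈ 20.06% of them in the UCG.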

df$Group <- ifelse(df$Remainder<200, 'UCG', 'Control')

df
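The same rule extends to more than two buckets. As a sketch, the hypothetical helper `assign_bucket()` below turns a named vector of target shares into cutoffs on the 0 to 996 scale. With shares of 20% and 80% it reproduces the rule above (remainders 0 to 199 go to UCG). The three-way split is an invented example.

```r
# Hypothetical helper: map remainders to buckets given target shares.
assign_bucket <- function(remainder, shares, modulus = 997) {
  stopifnot(!is.null(names(shares)), abs(sum(shares) - 1) < 1e-8)
  # Boundaries between consecutive buckets on the 0..(modulus - 1) scale
  cutoffs <- cumsum(shares)[-length(shares)] * modulus
  unname(names(shares)[findInterval(remainder, cutoffs) + 1L])
}

# Two buckets: same result as ifelse(remainder < 200, "UCG", "Control")
assign_bucket(c(0, 199, 200, 996), c(UCG = 0.2, Control = 0.8))

# Three buckets (illustrative shares)
assign_bucket(c(50, 150, 500), c(A = 0.1, B = 0.2, C = 0.7))
```

Because the cutoffs are derived from the shares, changing the split only requires changing the `shares` vector, although doing so reassigns users whose remainders fall between the old and new cutoffs.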

Check the Proportions

Finally, we want to make sure that the proportion is 80% vs 20% for Control and UCG respectively.

prop.table(table(df$Group))

Output:

Control     UCG 
0.80002 0.19998 
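Because the cutoff is 200 out of 997 possible remainders, the expected UCG share is 200/997 rather than exactly 20%. As a rough check, the sketch below compares the UCG count implied by the output above (19,998 of 100,000 users) with that expectation using an exact binomial test. Treating users as independent draws is an assumption.

```r
n_users   <- 100000
ucg_count <- 19998            # 0.19998 * 100000, from the proportions above
expected_share <- 200 / 997   # share of remainders below the cutoff

check <- binom.test(ucg_count, n_users, p = expected_share)
check$estimate   # observed UCG share
check$p.value    # a large p-value means no evidence of a skewed split
```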

Conclusion

We can use the modulo function to split a userbase in a reproducible and efficient way.
