[This article was first published on You Know, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Part I showed the concept and general technique of a method of assigning n email addresses to x cells pseudo-randomly, without the need for maintaining a log of each assignment.
The earlier post considered the basic case in which each cell is assigned approximately the same number of email addresses. In practice, cell sizes often vary. Below is a technique that works well when the total number of email addresses needed is less than the product of the cell sizes’ greatest common divisor and the average email address length. For example, with cell sizes of 500, 500, and 1,000, the total is 2,000 and the greatest common divisor is 500, so the condition holds: 2,000 < 500 * 25 (25 being roughly the average address length).
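The rule of thumb above can be checked with a few lines of arithmetic. The following is a minimal sketch in base R; the helpers `gcd2` and `gcd_vec` and the average length of 25 are assumptions made for illustration, not part of the original post.

```r
# Hypothetical helpers: greatest common divisor via Euclid's algorithm
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)
gcd_vec <- function(x) Reduce(gcd2, x)

# Assumed average email address length, as in the example above
avg.len <- 25

# Returns TRUE when total addresses < GCD of the cell sizes * average length
rule.holds <- function(cell.sizes, avg.len) {
  sum(cell.sizes) < gcd_vec(cell.sizes) * avg.len
}

even.sizes <- c(500, 500, 1000)   # GCD is 500
odd.sizes  <- c(499, 501, 1000)   # GCD is 1

rule.holds(even.sizes, avg.len)   # condition holds
rule.holds(odd.sizes, avg.len)    # condition fails: the GCD is too small
```

Dividing both sides by the greatest common divisor gives an equivalent reading: the number of distinct remainders needed (total divided by GCD) should stay below the typical address length.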
Assign n Email Addresses to x Cells, Intrinsically; Part 2 (Variable Cell Sizes)
Sample Use Case: Marketing requests that an email address list be divided randomly into a given number of cells so that each cell receives a different version of copy.
Below is a technique that takes n email addresses and pseudo-randomly assigns each to one of x cells. The advantage of this method is that the user does not need to maintain a log of each email address's assigned cell since the cell assignment can be reproduced at any time.
This technique is extended from Part 1 to accommodate cells of varying sizes.
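As a rough preview of how unequal cells can be handled, each cell size can be divided by the greatest common divisor to get a number of "slots" per cell. The sketch below uses base R only; `gcd2`, `slots.per.cell`, `slot.share`, and `size.share` are names chosen here for illustration, and the cell sizes are the ones used later in this post.

```r
# Hypothetical helper: greatest common divisor via Euclid's algorithm
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)

cell.sizes <- c(500, 500, 1500, 2000)
cell.gcd <- Reduce(gcd2, cell.sizes)

# Number of remainder slots each cell receives
slots.per.cell <- cell.sizes / cell.gcd

# Share of slots per cell versus share of addresses requested per cell
slot.share <- slots.per.cell / sum(slots.per.cell)
size.share <- cell.sizes / sum(cell.sizes)

all.equal(slot.share, size.share)
```

Because the slot shares match the requested size shares, roughly evenly spread remainders would fill the cells in about the right proportions.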
First, load in a randomly generated list of email addresses.
    set.seed(4444)
    library(numbers)  # provides mGCD(), used further down

    # Build a data frame of n fictitious email addresses
    fict.email <- function(n = 5) {
      fict.emails <- data.frame(email = NA)
      for (i in 1:n) {
        fict.emails[i, "email"] <- paste0(
          paste(sample(letters, sample(3:25, 1, TRUE), TRUE), collapse = ""),
          "@",
          paste(sample(letters, sample(3:15, 1, TRUE), TRUE), collapse = ""),
          ".",
          paste(sample(letters, sample(2:3, 1, TRUE), TRUE), collapse = "")
        )
      }
      fict.emails
    }

    emails <- sample(fict.email(10000))

Next, assign the cell sizes.
    cell.sizes <- c(500, 500, 1500, 2000)

Get the number of characters of each email address; this matters because the count stays constant for each entry. Next, find the greatest common divisor of the cell sizes. Then use the modulo function to calculate the remainder of each character count when divided by the total of the cell sizes over their greatest common divisor.
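    cells <- length(cell.sizes)
    cell.gcd <- mGCD(cell.sizes)

    # Character count of each address; this never changes for a given address
    em.len <- nchar(emails$email)

    # Remainders fall between 0 and sum(cell.sizes)/cell.gcd - 1
    em.mod <- em.len %% (sum(cell.sizes)/cell.gcd)

Combine mod values into cell numbers.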
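    # Each cell owns a run of consecutive remainders, sized in proportion to the cell.
    # Remainders start at 0, so the ranges are 0-based and remainder 0 is not left unassigned.
    ranges <- data.frame(start = 0, end = 0)
    for (j in 1:cells) {
      ranges[j, "start"] <- (sum(cell.sizes[1:j]) - cell.sizes[j])/cell.gcd
      ranges[j, "end"] <- sum(cell.sizes[1:j])/cell.gcd - 1
    }

    # Label every address with the cell whose range contains its remainder
    for (k in 1:cells) {
      emails$cell[em.mod >= ranges$start[k] & em.mod <= ranges$end[k]] <- k
    }

Split the data frame into the required cell sizes. These lists are the final output.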
    email.lists <- split(emails, emails$cell)
    for (l in 1:cells) {
      # Keep the first cell.sizes[l] addresses assigned to cell l
      email.lists[[l]] <- email.lists[[l]][[1]][1:cell.sizes[l]]
    }

Now each email address has been assigned to a specific cell.
Each email address will always belong to the same cell because the number of characters it has will not change.
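The reproducibility claim can be illustrated with a small standalone function that derives a cell from the address alone. This is a minimal sketch in base R, not the code used above: `gcd2` and `assign_cell` are hypothetical names, the three addresses are made up, and remainder 0 is mapped to the first slot.

```r
# Hypothetical helper: greatest common divisor via Euclid's algorithm
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)

# Hypothetical function: map addresses to cells using only their character counts
assign_cell <- function(email, cell.sizes) {
  cell.gcd <- Reduce(gcd2, cell.sizes)
  slots.per.cell <- cell.sizes / cell.gcd
  # Lookup vector: slot position -> cell number, e.g. 1 2 3 3 3 4 4 4 4
  slot.to.cell <- rep(seq_along(cell.sizes), times = slots.per.cell)
  # Remainders are 0-based, so shift by one to index the lookup vector
  slot <- nchar(email) %% sum(slots.per.cell)
  slot.to.cell[slot + 1]
}

cell.sizes <- c(500, 500, 1500, 2000)
addresses <- c("ann@example.com", "bob.smith@mail.org", "c@d.io")

# Assign once, then again with the list in a different order
first.run  <- assign_cell(addresses, cell.sizes)
second.run <- assign_cell(rev(addresses), cell.sizes)

# The cell depends only on the address, not on list order or on any stored log
identical(first.run, rev(second.run))
```

Nothing in `assign_cell` depends on the random seed or on the other addresses in the list, which is why no assignment log is needed.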