[This article was first published on R snippets, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I have learned a lot from my last RGolf post. Therefore today I have another problem from practice.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
You have a data set on values of contracts signed by ten salesmen. It has three columns: person id (p), contract value (v) and time (t).
Here is the code that generates the test data set.
n <- 4000
y <- 10
set.seed(1)
d <- data.frame(p = sample(letters[1:y], n, rep=T),
v = runif(n),
t = sample(seq(as.Date(“2000-1-1”),
by=”day”, len=n)))
head(d)
# p v t
# 1 c 0.18776846 2001-05-28
# 2 d 0.50475902 2001-05-25
# 3 f 0.02728685 2008-07-06
# 4 j 0.49629785 2004-08-06
# 5 c 0.94735171 2007-10-31
# 6 i 0.38118213 2001-09-04
For each salesmen we want to find the 90-day period in which she generated the highest sum of sales.
The task is to generate a data frame that has three columns: person id (p), maximal sum of sales (v), start of the 90-day period (t) and save it to a variable named r.
In situations where there are multiple such 90-day periods you are to report the date of the earliest contract that was signed during such a period.
The rules of engagement are exactly as last time:
(1) generate data frame r as few keystrokes as possible,
(2) one line of code may not be longer than 80 characters,
(3) the solution must be in base R only (no package loading is allowed).
Here is my attempt consisting of 225 characters:
x=max(d$t);m=matrix(0,x+89,y)
m[cbind(d$t, d$p)]=d$v
s=sapply(1:x,function(z)
colSums(m[z+0:89,]))*t(m>0)[,1:x]
u=apply(s,1,function(z)c(max(z),which.max(z)))
r=data.frame(p=letters[1:y],v=u[1,],t=u[2,]+as.Date(“1970-01-01”))
It produces the following output:
r
# p v t
# 1 a 10.198044 2009-10-11
# 2 b 9.265335 2002-08-28
# 3 c 9.735401 2008-02-21
# 4 d 9.479942 2004-09-07
# 5 e 11.036041 2010-09-16
# 6 f 10.602446 2002-04-04
# 7 g 9.927153 2007-08-11
# 8 h 10.917856 2007-04-26
# 9 i 10.027341 2008-10-26
# 10 j 9.965738 2004-12-28
Here is a more verbose version that is easier to read:
d_o <- d[order(d$t),] # order d by time
testr <- NULL
for (p in levels(d_o$p)) {
d_o_p <- d_o[d_o$p == p,] # subset ordered d by player
best_v_sum <- -Inf
best_i <- NA
for (i in 1:nrow(d_o_p)) {
d_o_p_t <- d_o_p[d_o_p$t < d_o_p$t[i] + 90 &
d_o_p$t >= d_o_p$t[i],]
# subset ordered d by player and rolling window
if (sum(d_o_p_t$v) > best_v_sum) {
best_v_sum <- sum(d_o_p_t$v)
best_i <- i
}
}
testr <- rbind(testr, data.frame(p = p, v = best_v_sum,
t = d_o_p$t[best_i]))
}
all.equal(r,testr)
< !----->< !----->
# [1] TRUE
Any competing solution should pass all.equal test.
As last time – please post your solutions in comments to the post.
To leave a comment for the author, please follow the link and comment on their blog: R snippets.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.