rtdists 0.7-2: response time distributions now with Rcpp and faster
It took us quite a while, but we have finally released a new version of rtdists to CRAN, which provides a few significant improvements. As a reminder, rtdists "[p]rovides response time distributions (density/PDF, distribution function/CDF, quantile function, and random generation): (a) Ratcliff diffusion model based on C code by Andreas and Jochen Voss and (b) linear ballistic accumulator (LBA) with different distributions underlying the drift rate."
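For orientation, here is a minimal sketch of the four function types for the diffusion model; the parameter values (a = 1, v = 2, t0 = 0.3) are arbitrary values chosen purely for illustration:

library(rtdists)
# density (PDF), distribution function (CDF), quantile function, and
# random generation for the diffusion model, with arbitrary example parameters
ddiffusion(0.6, response = "upper", a = 1, v = 2, t0 = 0.3)  # density at rt = 0.6
pdiffusion(0.6, response = "upper", a = 1, v = 2, t0 = 0.3)  # defective CDF at rt = 0.6
qdiffusion(0.5, response = "upper", a = 1, v = 2, t0 = 0.3)  # rt at which the defective CDF reaches 0.5
rdiffusion(10, a = 1, v = 2, t0 = 0.3)                       # 10 random rt/response pairs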
The main reason it took us relatively long to push the new version was that we had a problem with the C code for the diffusion model that we needed to sort out first. Specifically, the CDF (i.e., pdiffusion) in versions prior to 0.5-2 did not produce correct results in many cases (one consequence of this is that the model predictions given in the previous blog post are wrong). As a temporary fix, we resorted to the correct but slow numerical integration of the PDF (i.e., ddiffusion) to obtain the CDF in version 0.5-2 and later. Importantly, the error appears not to have been present in fastdm, which is the source of the C code we use. Matthew Gretton carefully investigated the original C code, changed it such that it connects to R via Rcpp, and realized that there are two different variants of the CDF: a fast variant and a precise variant. Up to this point we had only used the fast variant and, as it turns out, this was responsible for our incorrect results. By default we now use the precise variant (which seems to be only marginally slower), as it produces correct results for all cases we have tested (and we have tested quite a few).
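As an aside, the temporary fix from version 0.5-2 is easy to sketch: because the defective CDF is the integral of the defective PDF, it can be obtained by numerically integrating ddiffusion. The following is a minimal sketch of that idea with arbitrary parameter values, not the actual package code:

# defective CDF via numerical integration of the defective PDF;
# correct, but much slower than the analytical C implementation
slow_pdiffusion <- function(rt, ...) {
  integrate(ddiffusion, lower = 0, upper = rt, ...)$value
}
slow_pdiffusion(0.6, response = "upper", a = 1, v = 2, t0 = 0.3)
# should closely match the analytical CDF:
pdiffusion(0.6, response = "upper", a = 1, v = 2, t0 = 0.3)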
In addition to several minor changes (see the NEWS file for the full list), we made two more noteworthy changes. First, all diffusion functions as well as rLBA received a major performance update, mainly in situations with trial-wise parameters. It should now almost always be fastest to call the diffusion functions (e.g., ddiffusion) only once with vectorized parameters instead of calling them several times for different sets of parameters. The speed-up with the new version depends on the number of unique parameter sets, but even with only a few different sets it should be clearly noticeable. For completely trial-wise parameters the speed-up should be quite dramatic.
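To illustrate, here is a minimal sketch of the recommended calling pattern with made-up data and arbitrary parameter values; both variants return the same densities, but the single vectorized call should be considerably faster:

# made-up data: 100 RTs with trial-wise drift rates
rts <- runif(100, 0.4, 1)
v_trial <- runif(100, 1, 3)
# one vectorized call (fast):
d_vec <- ddiffusion(rts, response = "upper", a = 1, v = v_trial, t0 = 0.3)
# equivalent trial-by-trial loop (slow):
d_loop <- sapply(seq_along(rts), function(i)
  ddiffusion(rts[i], response = "upper", a = 1, v = v_trial[i], t0 = 0.3))
all.equal(d_vec, d_loop)  # TRUE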
Second, I also updated the vignette, which now uses the tidyverse in, I believe, a somewhat more reasonable manner. Specifically, it is now built on nested data (via tidyr::nest) and purrr::map instead of relying heavily on dplyr::do. The problem I had with dplyr::do is that it often leads to somewhat ugly syntax. The changes in the vignette are mainly due to me reading Chapter 25 of the great R for Data Science book by Wickham and Grolemund. However, I still prefer lattice over ggplot2.
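For readers unfamiliar with that idiom, here is a minimal sketch of the nest()-plus-map() pattern using the speed_acc data; computing the mean RT per participant stands in for an actual model fit:

library(dplyr)
library(tidyr)
library(purrr)
library(rtdists)
data(speed_acc)
# nest each participant's trials into a list-column, then map a function
# over the nested data frames (mean RT is a placeholder for a model fit)
speed_acc %>%
  group_by(id) %>%
  nest() %>%
  mutate(mean_rt = map_dbl(data, ~ mean(.$rt)))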
Example Analysis
To show the now-correct behavior of the diffusion CDF, let me repeat the example from the last post. As a reminder, we somewhat randomly pick one participant from the speed_acc data set and fit both the diffusion model and the LBA to the data.
require(rtdists)

# Exp. 1; Wagenmakers, Ratcliff, Gomez, & McKoon (2008, JML)
data(speed_acc)
# remove excluded trials:
speed_acc <- droplevels(speed_acc[!speed_acc$censor,])
# create numeric response variable where 1 is an error and 2 a correct response:
speed_acc$corr <- with(speed_acc, as.numeric(stim_cat == response))+1
# select data from participant 11, accuracy condition, non-word trials only
p11 <- speed_acc[speed_acc$id == 11 &
                 speed_acc$condition == "accuracy" &
                 speed_acc$stim_cat == "nonword",]

prop.table(table(p11$corr))
#          1          2
# 0.04166667 0.95833333

ll_lba <- function(pars, rt, response) {
  d <- dLBA(rt = rt, response = response,
            A = pars["A"], b = pars["A"]+pars["b"], t0 = pars["t0"],
            mean_v = pars[c("v1", "v2")], sd_v = c(1, pars["sv"]),
            silent=TRUE)
  if (any(d == 0)) return(1e6)
  else return(-sum(log(d)))
}

start <- c(runif(3, 0.5, 3), runif(2, 0, 0.2), runif(1))
names(start) <- c("A", "v1", "v2", "b", "t0", "sv")
p11_norm <- nlminb(start, ll_lba, lower = c(0, -Inf, 0, 0, 0, 0),
                   rt=p11$rt, response=p11$corr)
p11_norm[1:3]
# $par
#          A         v1         v2          b         t0         sv
#  0.1182940 -2.7409230  1.0449963  0.4513604  0.1243441  0.2609968
#
# $objective
# [1] -211.4202
#
# $convergence
# [1] 0

ll_diffusion <- function(pars, rt, response) {
  densities <- ddiffusion(rt, response=response,
                          a=pars["a"], v=pars["v"], t0=pars["t0"],
                          sz=pars["sz"], st0=pars["st0"], sv=pars["sv"])
  if (any(densities == 0)) return(1e6)
  return(-sum(log(densities)))
}

start <- c(runif(2, 0.5, 3), 0.1, runif(3, 0, 0.5))
names(start) <- c("a", "v", "t0", "sz", "st0", "sv")
p11_diff <- nlminb(start, ll_diffusion, lower = 0,
                   rt=p11$rt, response=p11$corr)
p11_diff[1:3]
# $par
#         a         v        t0        sz       st0        sv
# 1.3206011 3.2727202 0.3385602 0.4621645 0.2017950 1.0551706
#
# $objective
# [1] -207.5487
#
# $convergence
# [1] 0
As is common, we pass the negative summed log-likelihood to the optimization algorithm (here the trusty nlminb), and hence lower values of objective indicate a better fit. The results show that the LBA provides a somewhat better account. The interesting question is whether this somewhat better fit also translates into a visibly better fit when comparing observed and predicted quantiles.
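Concretely, the comparison boils down to the two objective values returned above:

# negative summed log-likelihoods of the two fits; lower is better
p11_norm$objective  # LBA:       -211.4202
p11_diff$objective  # diffusion: -207.5487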
# quantiles:
q <- c(0.1, 0.3, 0.5, 0.7, 0.9)

## observed data:
(p11_q_c <- quantile(p11[p11$corr == 2, "rt"], probs = q))
#    10%    30%    50%    70%    90%
# 0.4900 0.5557 0.6060 0.6773 0.8231
(p11_q_e <- quantile(p11[p11$corr == 1, "rt"], probs = q))
#    10%    30%    50%    70%    90%
# 0.4908 0.5391 0.5905 0.6413 1.0653

### LBA:
# predicted error rate
(pred_prop_correct_lba <- pLBA(Inf, 2, A = p11_norm$par["A"],
                               b = p11_norm$par["A"]+p11_norm$par["b"],
                               t0 = p11_norm$par["t0"],
                               mean_v = c(p11_norm$par["v1"], p11_norm$par["v2"]),
                               sd_v = c(1, p11_norm$par["sv"])))
# [1] 0.9581342

(pred_correct_lba <- qLBA(q*pred_prop_correct_lba, response = 2,
                          A = p11_norm$par["A"],
                          b = p11_norm$par["A"]+p11_norm$par["b"],
                          t0 = p11_norm$par["t0"],
                          mean_v = c(p11_norm$par["v1"], p11_norm$par["v2"]),
                          sd_v = c(1, p11_norm$par["sv"])))
# [1] 0.4871710 0.5510265 0.6081855 0.6809796 0.8301286

(pred_error_lba <- qLBA(q*(1-pred_prop_correct_lba), response = 1,
                        A = p11_norm$par["A"],
                        b = p11_norm$par["A"]+p11_norm$par["b"],
                        t0 = p11_norm$par["t0"],
                        mean_v = c(p11_norm$par["v1"], p11_norm$par["v2"]),
                        sd_v = c(1, p11_norm$par["sv"])))
# [1] 0.4684374 0.5529575 0.6273737 0.7233961 0.9314820

### diffusion:
# same result as when using Inf, but faster:
(pred_prop_correct_diffusion <- pdiffusion(rt = 20, response = "upper",
                                           a=p11_diff$par["a"], v=p11_diff$par["v"],
                                           t0=p11_diff$par["t0"], sz=p11_diff$par["sz"],
                                           st0=p11_diff$par["st0"], sv=p11_diff$par["sv"]))
# [1] 0.964723

(pred_correct_diffusion <- qdiffusion(q*pred_prop_correct_diffusion, response = "upper",
                                      a=p11_diff$par["a"], v=p11_diff$par["v"],
                                      t0=p11_diff$par["t0"], sz=p11_diff$par["sz"],
                                      st0=p11_diff$par["st0"], sv=p11_diff$par["sv"]))
# [1] 0.4748271 0.5489903 0.6081182 0.6821927 0.8444566

(pred_error_diffusion <- qdiffusion(q*(1-pred_prop_correct_diffusion), response = "lower",
                                    a=p11_diff$par["a"], v=p11_diff$par["v"],
                                    t0=p11_diff$par["t0"], sz=p11_diff$par["sz"],
                                    st0=p11_diff$par["st0"], sv=p11_diff$par["sv"]))
# [1] 0.4776565 0.5598018 0.6305120 0.7336275 0.9770047

### plot predictions
par(mfrow=c(1,2), cex=1.2)
plot(p11_q_c, q*prop.table(table(p11$corr))[2], pch = 2, ylim=c(0, 1), xlim = c(0.4, 1.3),
     ylab = "Cumulative Probability", xlab = "Response Time (sec)", main = "LBA")
points(p11_q_e, q*prop.table(table(p11$corr))[1], pch = 2)
lines(pred_correct_lba, q*pred_prop_correct_lba, type = "b")
lines(pred_error_lba, q*(1-pred_prop_correct_lba), type = "b")
legend("right", legend = c("data", "predictions"), pch = c(2, 1), lty = c(0, 1))
plot(p11_q_c, q*prop.table(table(p11$corr))[2], pch = 2, ylim=c(0, 1), xlim = c(0.4, 1.3),
     ylab = "Cumulative Probability", xlab = "Response Time (sec)", main = "Diffusion")
points(p11_q_e, q*prop.table(table(p11$corr))[1], pch = 2)
lines(pred_correct_diffusion, q*pred_prop_correct_diffusion, type = "b")
lines(pred_error_diffusion, q*(1-pred_prop_correct_diffusion), type = "b")
The fit plot compares observed quantiles (triangles) with predicted quantiles (circles connected by lines). Here we decided to plot the 10%, 30%, 50%, 70%, and 90% quantiles. In each plot, the x-axis shows RTs and the y-axis cumulative probabilities. From this it follows that the upper line and points correspond to the correct trials (which are common) and the lower line and points to the incorrect trials (which are uncommon). For both models the fit looks pretty good, especially for the correct trials. However, it appears that the LBA does a slightly better job of predicting the very fast and the slow trials here, which may be responsible for its better fit in terms of summed log-likelihood. In contrast, the diffusion model seems somewhat better at predicting the long tail of the error trials.
Checking the CDF
Finally, we can also check whether the analytical CDF does in fact correspond to the empirical CDF of the data. For this we can compare the analytical CDF function pdiffusion to the empirical CDF obtained from random deviates. One thing to be careful about is that pdiffusion provides the 'defective' CDF, which only approaches one if one adds the CDFs for both response boundaries. Consequently, to compare the empirical CDF for one response with the analytical CDF, we need to scale the latter to also go from 0 to 1 (simply by dividing it by its maximal value). Here we will use the parameter values obtained in the previous fit.
rand_rts <- rdiffusion(1e5, a=p11_diff$par["a"], v=p11_diff$par["v"],
                       t0=p11_diff$par["t0"], sz=p11_diff$par["sz"],
                       st0=p11_diff$par["st0"], sv=p11_diff$par["sv"])
plot(ecdf(rand_rts[rand_rts$response == "upper","rt"]))

normalised_pdiffusion = function(rt,...) pdiffusion(rt,...)/pdiffusion(rt=Inf,...)
curve(normalised_pdiffusion(x, response = "upper",
                            a=p11_diff$par["a"], v=p11_diff$par["v"],
                            t0=p11_diff$par["t0"], sz=p11_diff$par["sz"],
                            st0=p11_diff$par["st0"], sv=p11_diff$par["sv"]),
      add=TRUE, col = "yellow", lty = 2)
This figure shows that the analytical CDF (in yellow) lies perfectly on top of the empirical CDF (in black). If it does not for you, you are still using an old version of rtdists. We have also added a series of unit tests to rtdists that compare the empirical CDF to the analytical CDF (using ks.test) for a variety of parameter values, to catch such a problem should it ever occur again.
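For illustration, such a check might look roughly as follows; this is a sketch with arbitrary parameter values (not the actual package tests) that reuses the normalised_pdiffusion helper defined above:

# simulate from the model and compare the upper-boundary RTs against the
# normalised analytical CDF; a large p-value indicates no detectable discrepancy
sim <- rdiffusion(1e4, a = 1, v = 2, t0 = 0.3)
ks.test(sim[sim$response == "upper", "rt"], normalised_pdiffusion,
        response = "upper", a = 1, v = 2, t0 = 0.3)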
References
Wagenmakers, E.-J., Ratcliff, R., Gomez, P., & McKoon, G. (2008). A diffusion model account of criterion shifts in the lexical decision task. Journal of Memory and Language, 58(1), 140-159.
Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly Media.