Site icon R-bloggers

Flexible simulation in simstudy with customized distribution functions

[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Really, the only problem with the simstudy package (😄) is that there is a hard limit to the possible probability distributions that are available (the current count is 15 – see here for a complete description). However, it turns out that there is more flexibility than first meets the eye, and we can easily accommodate a limitless number as long as you are willing to provide some extra code.

I am going to illustrate this with two examples, first by implementing a truncated normal distribution, and second by implementing the flexible non-linear data generating algorithm that I described last time.

Before we get going, here are the necessary libraries:

library(simstudy)
library(data.table)
library(msm)
library(ggplot2)
library(mgcv)

General concept

In the data definition step, it is possible to specify any valid R function in the formula argument. If dist is specified as “nonrandom”, then simstudy will generate data based on that function. (Yes, the specification as “nonrandom” is a bit awkward since we are defining a stochastic data generating process in this case; in future versions I plan to allow dist to be specified as “custom” to make this less dissonant.)

In this example, I want to be able to generate data from a truncated normal distribution. There is an existing function rtnorm in the msm package that I can take advantage here. What I have done is essentially create a wrapper function that makes a single draw from the truncated distribution with a specified mean, standard deviation, and pair of truncation bounds:

trunc_norm <- function(mean, sd, lower, upper) {
  rtnorm(n = 1, mean = mean, sd = sd, lower = lower, upper = upper)
}

Now that trunc_norm has been created, I am free to use this is in a data definition statement. And even more important, the call to trunc_norm can depend on other variables; in this case, I have created binary variable x that will determine the upper and lower bounds of the distribution. When \(x=0\), the \(N(0, 3.5^2)\) distribution is truncated at -5 and 5, and when \(x=1\), the distribution is truncated at -8 and 8.

defI <- defData(varname = "x", formula = 0.5, dist = "binary")
defI <- defData(defI, varname = "y", 
  formula = "trunc_norm(mean = 0, sd = 3.5, 
               lower = -5 + -3*x, upper = 5 + 3*x)",
  dist = "nonrandom")

The generated data appear to have the properties that we would expect:

dd <- genData(1000, defI)

Application to non-linear data generation

Last time, I described an approach to generate a variable \(y\) that has a non-linear response with respect to an input variable \(x\). At the end of that post, I created two functions, one of which can be referred to in the defData statement to generate the data. (I plan on implementing these functions in simstudy, but I was eager to get the concept out there in case any one has some suggestions or could use this feature right away.)

In the first step, I need to generate a smooth function by specifying a few points. I do this by calling getNLfunction. (If you want the code for this, let me know, but I actually provided most of it last week.) The variable nlf is an object that contains the function:

dpoints <- data.table(x = c(20, 30, 53, 65, 80), y = c(15, 44, 60, 55, 35))
nlf <- getNLfunction(dpoints)

The function genNL makes predictions based on the nlf object and adds a little Gaussian noise. We use the same approach as we did above for the truncated normal to generate different responses \(y\) based on the level of \(x\):

def <- defData(varname = "x", formula = "20;80", dist = "uniform")
def <- defData(def, varname = "y", 
  formula = "genNL(nf = ..nlf, x, sd = 10)", dist = "nonrandom")

dd <- genData(300, def)

And if we introduce much less noise, we get much closer to the original underlying function specified by our points:

def <- defData(varname = "x", formula = "20;80", dist = "uniform")
def <- defData(def, varname = "y", 
  formula = "genNL(nf = ..nlf, x, sd = 0.5)", dist = "nonrandom")

dd <- genData(300, def)

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.