Simulate data with R
[This article was first published on The Beginner Programmer, and kindly contributed to R-bloggers].
Last semester I was attending a boring class: even though the professor was really clever, he was always bouncing around the main theme and never got straight to the point. While thinking about everything but the class, I had an idea: given a set of data, say X and Y, you can easily fit a linear regression model, e.g. the regression line, and learn something about the data. You also get information on the errors the linear model makes in predicting the data. Once you know the distribution of those errors, you can simulate data similar to the original from the regression line, simply by adding a random error (whose distribution is now known) to the predicted values.
Furthermore, we know from the regression line that the expected error is 0.
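This is a property of least squares itself: whenever the model includes an intercept, the residuals sum to zero by construction. A quick check on made-up numbers (the data below are purely illustrative):

```r
# Illustrative data only -- any X/Y pair will do
set.seed(7)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
mean(residuals(fit))  # numerically zero
```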
Here is the code to implement this idea in R. You can get the data to work on at the bottom of the page.
data <- read.csv("data.csv", header = TRUE)
attach(data)
plot(Y ~ X, col = "blue")
# Fit a linear model and plot the regression line
model <- lm(Y ~ X)
abline(model, lwd = 2, col = "red")
# Calculate the residuals...
resid <- signif(residuals(model), 5)
# ...and the predicted (fitted) values
predicted_values <- predict(model)
# Set the bin bounds for the residuals
bounds <- seq(from = min(resid), to = max(resid), by = 0.2)
# Compute the relative frequency of residuals in each bin,
# i.e. the empirical error distribution
error_distribution <- table(cut(resid, bounds)) / sum(table(cut(resid, bounds)))
# Plot the error distribution
plot(error_distribution, main = "Frequencies")
# From the plot, the errors look roughly uniformly distributed,
# so runif() can serve as an error simulator.
# Generate one simulated error per observation
simulated_errors <- runif(length(predicted_values), min = min(resid), max = max(resid))
# Simulate values of Y by adding the errors to the predicted values
simulated_values <- predicted_values + simulated_errors
# Here are our simulated values
simulated_values
# Plot the data and the model again, then add the simulated values
plot(Y ~ X, col = "blue")
abline(model, lwd = 2, col = "red")
points(X, simulated_values, col = "red")
detach(data)
The result should look something like this: the actual data in blue and the simulated data in red.
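If you don't have the data file at hand, the same steps run end to end on synthetic data. The X and Y below are made up for illustration, not the article's data.csv:

```r
# Synthetic stand-in for data.csv (assumed, for illustration only)
set.seed(42)
X <- 1:50
Y <- 3 + 0.8 * X + rnorm(50, sd = 2)

model <- lm(Y ~ X)
res <- residuals(model)
predicted_values <- predict(model)

# Draw one error per observation, uniformly over the observed residual range
simulated_errors <- runif(length(predicted_values), min = min(res), max = max(res))
simulated_values <- predicted_values + simulated_errors

plot(Y ~ X, col = "blue")
abline(model, lwd = 2, col = "red")
points(X, simulated_values, col = "red")
```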
Hope this was useful. If you know the name of this method, please leave a comment and let me know. Click here to get the data I used.
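For what it's worth, a close relative of this trick that draws the simulated errors from the observed residuals themselves (rather than from a fitted uniform) is usually called the residual bootstrap. A minimal sketch, again on made-up data rather than the article's dataset:

```r
# Illustrative data (assumed, not the article's data.csv)
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30)
model <- lm(y ~ x)

# Resample the observed residuals with replacement and add them
# back to the fitted values -- the "residual bootstrap"
boot_errors <- sample(residuals(model), replace = TRUE)
boot_y <- predict(model) + boot_errors
```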