Simulate data with R
[This article was first published on The Beginner Programmer, and kindly contributed to R-bloggers].
Last semester I was attending a boring class: even though the professor was really clever, he was always bouncing around the main theme and never got straight to the point. While thinking about everything but the class, I had an idea: given a set of data, say X and Y, you can easily fit a linear regression model, e.g. the regression line, and learn something about the data. You also get information on the errors the linear model makes in predicting the data. Once you know the distribution of those errors, you can simulate data similar to the original from the regression line, simply by adding a random error (whose distribution is now known) to the predicted values.
Furthermore, we know from the regression line that the expected error is 0.
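This is a property of least squares itself: whenever the model includes an intercept, the residuals sum to zero by construction. A quick check on made-up numbers (the data below are purely illustrative):

```r
# Illustrative data only -- any X/Y pair will do
set.seed(7)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
mean(residuals(fit))  # numerically zero
```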
Here is the code to implement this idea in R. You can get the data to work on at the bottom of the page.
data <- read.csv("data.csv", header = TRUE)
attach(data)
plot(Y ~ X, col = "blue")
# Fit a linear model and plot the regression line
model <- lm(Y ~ X)
abline(model, lwd = 2, col = "red")
# Calculate the residuals...
resid <- signif(residuals(model), 5)
# ...and the predicted (fitted) values
predicted_values <- predict(model)
# Set the bin bounds for the residuals
bounds <- seq(from = min(resid), to = max(resid), by = 0.2)
# Compute the relative frequency of residuals in each bin,
# i.e. the empirical error distribution
error_distribution <- table(cut(resid, bounds)) / sum(table(cut(resid, bounds)))
# Plot the error distribution
plot(error_distribution, main = "Frequencies")
# From the plot, the errors look roughly uniformly distributed,
# so runif() can serve as an error simulator.
# Generate one simulated error per observation
simulated_errors <- runif(length(predicted_values), min = min(resid), max = max(resid))
# Simulate values of Y by adding the errors to the predicted values
simulated_values <- predicted_values + simulated_errors
# Here are our simulated values
simulated_values
# Plot the data and the model again, then add the simulated values
plot(Y ~ X, col = "blue")
abline(model, lwd = 2, col = "red")
points(X, simulated_values, col = "red")
detach(data)
The result should look something like this: the actual data in blue and the simulated data in red.
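If you don't have the data file at hand, the same steps run end to end on synthetic data. The X and Y below are made up for illustration, not the article's data.csv:

```r
# Synthetic stand-in for data.csv (assumed, for illustration only)
set.seed(42)
X <- 1:50
Y <- 3 + 0.8 * X + rnorm(50, sd = 2)

model <- lm(Y ~ X)
res <- residuals(model)
predicted_values <- predict(model)

# Draw one error per observation, uniformly over the observed residual range
simulated_errors <- runif(length(predicted_values), min = min(res), max = max(res))
simulated_values <- predicted_values + simulated_errors

plot(Y ~ X, col = "blue")
abline(model, lwd = 2, col = "red")
points(X, simulated_values, col = "red")
```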
Hope this was useful. If you know the name of this method, please leave a comment and let me know. Click here to get the data I used.
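For what it's worth, a close relative of this trick that draws the simulated errors from the observed residuals themselves (rather than from a fitted uniform) is usually called the residual bootstrap. A minimal sketch, again on made-up data rather than the article's dataset:

```r
# Illustrative data (assumed, not the article's data.csv)
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30)
model <- lm(y ~ x)

# Resample the observed residuals with replacement and add them
# back to the fitted values -- the "residual bootstrap"
boot_errors <- sample(residuals(model), replace = TRUE)
boot_y <- predict(model) + boot_errors
```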