I often need data sets to try out functions and code, or for teaching. I have a few stand-bys, such as the mtcars and CO2 data sets that ship with R, but sometimes I need something specific: a long format data set, a bunch of categorical or numeric variables, repeated measures, or missing values to test a function against. Then I spend valuable time searching for the right data set. About a year ago my answer was to keep a file of several data sets that I knew fit various situations, but eventually I grew tired of loading a data set each time I needed to test something. So I wrote a function that randomly generates a data set with categorical, numeric, interval, and repeated measures data. I recently extended the function to offer optional missing values, long or wide format, and proportion data, and I tried to speed it up for creating larger data sets. It generally suits my needs, and I think it can serve others too.
The main function, DFgen, relies on two helper functions, props and NAins. I keep these helpers outside of DFgen because they are useful in their own right. I'll briefly explain each function, provide the code, and give a few tests to try it out.
The props Function
The props function generates a data frame of proportions whose rows sum to 1. It takes two required arguments, ncol and nrow, which set the dimensions of the data frame, and an optional var.names argument that names the columns; otherwise they are named X1..Xn. One caveat: the function is a poorer choice for many columns, because each proportion is drawn from whatever the previous columns left over, so the later columns tend toward 0. Dason of talkstats.com provides an alternative that is slower but better suited to numerous columns.
#############################################################
# function to generate random proportions whose rowSums = 1 #
#############################################################
props <- function(ncol, nrow, var.names=NULL){
    if (ncol < 2) stop("ncol must be greater than 1")
    p <- function(n){
        y <- 0
        z <- sapply(seq_len(n-1), function(i) {
                x <- sample(seq(0, 1-y, by=.01), 1)
                y <<- y + x
                return(x)
            }
        )
        w <- c(z , 1-sum(z))
        return(w)
    }
    DF <- data.frame(t(replicate(nrow, p(n=ncol))))
    if (!is.null(var.names)) colnames(DF) <- var.names
    return(DF)
}

##############
# TRY IT OUT #
##############
props(ncol=5, nrow=5)
props(ncol=3, nrow=25)
props(ncol=3, nrow=5, var.names=c("red", "blue", "green"))
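For the many-column case, one common approach is to draw independent uniform values and divide each row by its sum. The sketch below is my own illustration and is not Dason's code; the name props_norm is hypothetical. It keeps the same arguments as props, but unlike props it does not restrict values to multiples of .01.

```r
# Hypothetical helper (not from the original post): draw uniform values and
# divide each row by its row sum so that every row adds up to 1.
props_norm <- function(ncol, nrow, var.names = NULL) {
    if (ncol < 2) stop("ncol must be greater than 1")
    m <- matrix(runif(ncol * nrow), nrow = nrow, ncol = ncol)
    # A vector divisor is recycled down the columns, so row i is
    # divided by rowSums(m)[i]
    DF <- data.frame(m / rowSums(m))
    if (!is.null(var.names)) colnames(DF) <- var.names
    DF
}

set.seed(1)
p <- props_norm(ncol = 20, nrow = 5)
rowSums(p)   # each entry should equal 1 up to floating-point error
```

Because every column is drawn the same way before normalizing, no column is systematically starved the way the last columns of props are.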
The NAins Function
The NAins function takes a data frame and randomly inserts a certain proportion of missing (NA) values. The function has two arguments: df, which is the data frame, and prop, which is the proportion of NA values to be inserted into the data frame (default is .1).
Special thanks again to Dason of talkstats.com for helping with a speed boost for this function. NAins accounts for a considerable share of the run time of DFgen, and he provided the code that really gained some speed.
################################################################
# RANDOMLY INSERT A CERTAIN PROPORTION OF NAs INTO A DATAFRAME #
################################################################
NAins <- NAinsert <- function(df, prop = .1){
    n <- nrow(df)
    m <- ncol(df)
    num.to.na <- ceiling(prop*n*m)
    # draw distinct cell numbers, then convert each to a row and a column
    id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
    rows <- id %/% m + 1
    cols <- id %% m + 1
    # seq_len() rather than seq() so nothing is touched when num.to.na is 0
    sapply(seq_len(num.to.na), function(x){
            df[rows[x], cols[x]] <<- NA
        }
    )
    return(df)
}

##############
# TRY IT OUT #
##############
NAins(mtcars, .1)
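The cell-number trick in NAins can also be written without the `<<-` assignment. The following is a minimal sketch of that idea, not Dason's code; the name na_insert_sketch is hypothetical. It draws the same kind of distinct cell numbers, then blanks the chosen rows one column at a time.

```r
# Hypothetical variant of NAins: same sampling scheme, but the NAs are
# assigned column by column instead of cell by cell with <<-.
na_insert_sketch <- function(df, prop = 0.1) {
    n <- nrow(df)
    m <- ncol(df)
    num.to.na <- ceiling(prop * n * m)
    # distinct cell numbers 0..(n*m - 1), mapped to row and column indices
    id <- sample(0:(m * n - 1), num.to.na, replace = FALSE)
    rows <- id %/% m + 1
    cols <- id %% m + 1
    for (j in unique(cols)) {
        df[rows[cols == j], j] <- NA
    }
    df
}

set.seed(42)
out <- na_insert_sketch(mtcars, 0.1)
sum(is.na(out))   # number of cells set to NA
```

Since the cells are sampled without replacement and mtcars has no missing values to begin with, the NA count equals ceiling(prop * nrow * ncol).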
The DFgen Function
The DFgen function randomly generates a data set of n rows with predefined variables. Calling DFgen() with no arguments specified produces an n=10 data set such as the following:
> set.seed(10)
> DFgen()
      id   group hs.grad  race gender age m.status   political n.kids income score time1 time2 time3
1   ID.1   treat     yes white   male  19    never  republican      1 111000 -1.24 51.39 52.15 53.76
2   ID.2 control     yes black   male  30 divorced independent      0 122000 -0.46 32.21 35.07 33.10
3   ID.3 control     yes white   male  32  married  republican      1   2000 -0.83 43.36 45.46 46.22
4   ID.4   treat      no white   male  30 divorced  republican      1  65000  0.34 71.63 72.06 74.49
5   ID.5 control     yes white female  18  married  republican      3  96000  1.07  9.26 12.24 11.02
6   ID.6   treat     yes asian female  30  married independent      3 135000  1.22 24.10 26.45 24.74
7   ID.7   treat     yes white female  26    never    democrat      5  16000  0.74 28.76 31.72 31.39
8   ID.8   treat     yes white   male  40  married  republican      1 113000 -0.48 28.24 29.10 37.12
9   ID.9   treat     yes white   male  23  married independent      2  80000  0.56 62.99 65.09 67.72
10 ID.10   treat      no asian   male  22  married    democrat      1  96000 -1.25 43.74 46.79 44.04
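The function also takes these optional arguments:
- type ("wide", the default, or "long"; may be given with or without quotes)
- na.rate (a decimal value between 0 and 1; default is 0) that randomly inserts missing data (great for teaching demos and testing corner cases)
- proportion (TRUE or the default FALSE; can be abbreviated to prop) that adds food, housing, and other budget expenditure columns as proportions
- digits that controls the number of decimal places numeric columns are rounded to (default is 2)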
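Two R behaviors make these arguments convenient to type: DFgen captures type with substitute(), so it accepts the value with or without quotes, and R's partial argument matching lets prop stand in for proportion. The toy function below (arg_demo is a hypothetical name, not part of the original code) isolates just those two behaviors.

```r
# Hypothetical toy function showing how DFgen reads `type` and `proportion`
arg_demo <- function(type = wide, proportion = FALSE) {
    # substitute() returns the unevaluated argument, so a bare symbol
    # and a quoted string both end up as the same character value
    TYPE <- as.character(substitute(type))
    list(type = TYPE, proportion = proportion)
}

arg_demo()                  # default, never evaluates `wide` as a variable
arg_demo(type = long)       # bare symbol
arg_demo(type = "long")     # quoted string
arg_demo(prop = TRUE)       # `prop` partially matches `proportion`
```

One consequence of the substitute() approach is that passing a variable that holds the type (for example, x <- "long"; DFgen(type = x)) is read as the name "x" rather than its value.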
############################################################
# GENERATE A RANDOM DATA SET.  CAN BE SET TO LONG OR WIDE. #
# DATA SET HAS FACTORS AND NUMERIC VARIABLES AND CAN       #
# OPTIONALLY GIVE BUDGET EXPENDITURES AS A PROPORTION.     #
# CAN ALSO TELL A PROPORTION OF CELLS TO BE MISSING VALUES #
############################################################
# NOTE RELIES ON THE props FUNCTION AND THE NAins FUNCTION #
############################################################
DFgen <- DFmaker <- function(n=10, type=wide, digits=2,
    proportion=FALSE, na.rate=0) {

    rownamer <- function(dataframe){
        x <- as.data.frame(dataframe)
        rownames(x) <- NULL
        return(x)
    }

    dfround <- function(dataframe, digits = 0){
        df <- dataframe
        df[,sapply(df, is.numeric)] <- round(df[,sapply(df, is.numeric)], digits)
        return(df)
    }

    TYPE <- as.character(substitute(type))
    time1 <- sample(1:100, n, replace = TRUE) + abs(rnorm(n))

    DF <- data.frame(id = paste0("ID.", 1:n),
        group = sample(c("control", "treat"), n, replace = TRUE),
        hs.grad = sample(c("yes", "no"), n, replace = TRUE),
        race = sample(c("black", "white", "asian"), n,
            replace = TRUE, prob=c(.25, .5, .25)),
        gender = sample(c("male", "female"), n, replace = TRUE),
        age = sample(18:40, n, replace = TRUE),
        m.status = sample(c("never", "married", "divorced", "widowed"),
            n, replace = TRUE, prob=c(.25, .4, .3, .05)),
        political = sample(c("democrat", "republican",
            "independent", "other"), n, replace = TRUE,
            prob=c(.35, .35, .20, .1)),
        n.kids = rpois(n, 1.5),
        income = sample(c(seq(0, 30000, by=1000),
            seq(0, 150000, by=1000)), n, replace=TRUE),
        score = rnorm(n),
        time1,
        time2 = c(time1 + 2 * abs(rnorm(n))),
        time3 = c(time1 + (4 * abs(rnorm(n)))))

    if (proportion) {
        DF <- cbind(DF[, 1:10],
            props(ncol=3, nrow=n, var.names=c("food", "housing", "other")),
            DF[, 11:14])
    }

    if (na.rate!=0) {
        DF <- cbind(DF[, 1, drop=FALSE], NAins(DF[, -1], prop=na.rate))
    }

    DF <- switch(TYPE,
        wide = DF,
        long = {DF <- reshape(DF, direction = "long", idvar = "id",
                    varying = c("time1","time2", "time3"),
                    v.names = c("value"),
                    timevar = "time", times = c("time1", "time2", "time3"))
                rownamer(DF)},
        stop("Invalid Data \"type\""))

    return(dfround(DF, digits=digits))
}

##############
# TRY IT OUT #
##############
DFgen()
DFgen(type="long")
DFmaker(20000)
DFgen(prop=TRUE)
DFgen(na.rate=.3)
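To see what the long option does in isolation, here is a small sketch that applies the same reshape() call to a hand-built three-row data frame. The toy data are made up for illustration; only the reshape() arguments are taken from DFgen.

```r
# Toy wide data: one row per id, one column per time point
wide <- data.frame(id = paste0("ID.", 1:3),
                   time1 = c(10, 20, 30),
                   time2 = c(11, 21, 31),
                   time3 = c(12, 22, 32))

# Same reshape() arguments as in DFgen: stack the three time columns into
# a single "value" column, with "time" recording which column each came from
long <- reshape(wide, direction = "long", idvar = "id",
                varying = c("time1", "time2", "time3"),
                v.names = "value",
                timevar = "time", times = c("time1", "time2", "time3"))
rownames(long) <- NULL   # what the rownamer helper does inside DFgen

long
```

Each id now appears once per time point, so an n-row wide data set becomes a 3n-row long data set.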