Selecting a subset of variables in a data frame
I frequently work with datasets that have many variables, and I often need to apply some function to a subset of the variables in a data frame. To simplify this task I wrote a short function that lets me specify which variables to include and which to exclude.
I choose the subset of variables based on the following types of conditions:
- variable/column type (factor, numeric, character; I know there are other types, so feel free to improve the function)
- column name pattern (usually columns describing similar concepts have the same prefix)
- variable is not excluded (I do not want some variables to be part of the result)
With R it was surprisingly easy to write the function varlist():
varlist <- function (df=NULL, type=c("numeric","factor","character"),
                     pattern="", exclude=NULL) {
  vars <- character(0)
  if (any(type %in% "numeric")) {
    vars <- c(vars, names(df)[sapply(df, is.numeric)])
  }
  if (any(type %in% "factor")) {
    vars <- c(vars, names(df)[sapply(df, is.factor)])
  }
  if (any(type %in% "character")) {
    vars <- c(vars, names(df)[sapply(df, is.character)])
  }
  vars[(!vars %in% exclude) & grepl(vars, pattern=pattern)]
}
The function has the following parameters:
- df (the data frame)
- type (column type: numeric, factor, character, or any combination given as a vector; all three by default)
- pattern (a regular expression; only matching variable names are kept, and the default "" matches every name)
- exclude (a vector of names to exclude)
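The parameters are easiest to see on a small example. The sketch below uses a made-up data frame called toy (it is not the German Credit Data, and its column names are only assumptions for illustration) and repeats the varlist() definition so that it runs on its own:

```r
# varlist() repeated from above so that this snippet is self-contained
varlist <- function(df = NULL, type = c("numeric", "factor", "character"),
                    pattern = "", exclude = NULL) {
  vars <- character(0)
  if (any(type %in% "numeric"))   vars <- c(vars, names(df)[sapply(df, is.numeric)])
  if (any(type %in% "factor"))    vars <- c(vars, names(df)[sapply(df, is.factor)])
  if (any(type %in% "character")) vars <- c(vars, names(df)[sapply(df, is.character)])
  vars[(!vars %in% exclude) & grepl(vars, pattern = pattern)]
}

# Hypothetical toy data frame, for illustration only
toy <- data.frame(
  credit_amount  = c(250, 1366, 2320),
  credit_history = factor(c("good", "bad", "good")),
  age            = c(25L, 40L, 33L),
  gb             = factor(c("g", "b", "g")),
  comment        = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

varlist(toy)                                   # every column, grouped by type
varlist(toy, type = "numeric")                 # "credit_amount" "age"
varlist(toy, type = c("factor", "character"))  # "credit_history" "gb" "comment"
varlist(toy, pattern = "^credit")              # "credit_amount" "credit_history"
varlist(toy, type = "factor", exclude = "gb")  # "credit_history"
```

Note that the result is ordered by type (numeric, then factor, then character), not by column position in the data frame.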
I will demonstrate how this works on the "German Credit Data" dataset:
german_data <- read.table(file="
Now we can start playing with varlist():
## All variables starting with cred
varlist(german_data,pattern="^cred")
## All numeric variables
varlist(german_data,type="numeric")
## All factor variables except variable gb and variables starting with c
varlist(german_data,type="factor",exclude=c("gb",varlist(german_data,pattern="^c")))
## Same as previous, only using pattern instead of c()
## (gb is anchored as ^gb$ so that only the column named exactly gb is excluded)
varlist(german_data,type="factor",exclude=varlist(german_data,pattern="^c|^gb$"))
Once we have the list of column names, it is easy to use sapply to do the real work:
> sapply(german_data[,varlist(german_data,type="numeric",pattern="credit")], summary)
        credit_amount existing_credits
Min.              250            1.000
1st Qu.          1366            1.000
Median           2320            1.000
Mean             3271            1.407
3rd Qu.          3972            2.000
Max.            18420            4.000
Of course, we can also pass our own function to sapply:
> sapply(german_data[,varlist(german_data,type="numeric",pattern="credit")], function (x) length(unique(x)))
   credit_amount existing_credits
             921                4
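One caveat with the german_data[, ...] indexing used above: when the selection contains exactly one column name, R drops the result to a plain vector, and sapply then iterates over the individual values instead of over columns. A minimal sketch of the difference, using a made-up data frame toy and a hard-coded vars vector standing in for a one-name varlist() result (both are assumptions for illustration):

```r
# Hypothetical toy data frame, for illustration only
toy <- data.frame(credit_amount = c(250, 1366, 2320),
                  age           = c(25, 40, 33))

vars <- "credit_amount"   # stands in for a varlist() result with one name

# toy[, vars] is dropped to a numeric vector, so sapply() visits each value
per_value <- sapply(toy[, vars], function(x) length(unique(x)))

# toy[vars] (or toy[, vars, drop = FALSE]) stays a data frame,
# so sapply() visits each column
per_column <- sapply(toy[vars], function(x) length(unique(x)))

per_value    # 1 1 1
per_column   # credit_amount: 3
```

Indexing with toy[vars] behaves the same for one name or many, so it is probably the safer habit when the number of selected columns is not known in advance.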
Let me know if you find this useful or have other solutions when dealing with many variables.