Selecting subset of variables in data frame

[This article was first published on R (en) - Analytik dat, and kindly contributed to R-bloggers].

I frequently work with datasets that have many variables, and I often need to apply some function to a subset of the variables in a data frame. To simplify this task I wrote a short function that lets me specify which variables to include and which to exclude.

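I choose the subset of variables based on the following condition types: the variable type (numeric, factor or character), a pattern in the variable name, and a list of variables to exclude.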

With R it was surprisingly easy to write the function varlist():

varlist <- function (df=NULL,type=c("numeric","factor","character"), pattern="", exclude=NULL) {
   # Collect candidate column names, one block per requested type
   vars <- character(0)
   if (any(type %in% "numeric")) {
     vars <- c(vars,names(df)[sapply(df,is.numeric)])
   }
   if (any(type %in% "factor")) {
     vars <- c(vars,names(df)[sapply(df,is.factor)])
   }
   if (any(type %in% "character")) {
     vars <- c(vars,names(df)[sapply(df,is.character)])
   }
   # Keep names that are not excluded and that match the regular expression
   vars[(!vars %in% exclude) & grepl(vars,pattern=pattern)]
}
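The core of varlist() is the idiom names(df)[sapply(df, is.numeric)]. As a minimal sketch of what that idiom does, here it is on a small made-up data frame (the data frame df and its columns are assumptions for illustration, not part of the German Credit Data):

```r
# Toy data frame, invented for this illustration
df <- data.frame(
  age     = c(23L, 45L, 31L),               # integer, counts as numeric
  amount  = c(250.5, 1366, 2320),           # double
  purpose = factor(c("car", "tv", "car")),  # factor, not numeric
  note    = c("a", "b", "c"),               # character
  stringsAsFactors = FALSE
)

# sapply() applies is.numeric to every column and returns one
# TRUE/FALSE per column, named after the column
is_num <- sapply(df, is.numeric)
print(is_num)

# Indexing names(df) with that logical vector keeps only the matching names
num_vars <- names(df)[is_num]
print(num_vars)
```

The factor and character branches of varlist() work the same way with is.factor and is.character. Because the three results are concatenated, the returned names are grouped by type rather than kept in column order.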

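The function has the following parameters:

df: the data frame whose variables are selected
type: the variable types to include, any of "numeric", "factor" and "character" (all three by default)
pattern: a regular expression matched against the variable names (the default "" matches every name)
exclude: a character vector of variable names to leave out of the result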

I will demonstrate how this works on the “German Credit Data” dataset:

german_data <- read.table(file="

Now we can start playing with varlist():

## All variables starting with cred
varlist(german_data,pattern="^cred")
## All numeric variables
varlist(german_data,type="numeric")
## All factor variables except variable gb and variables starting with c
varlist(german_data,type="factor",exclude=c("gb",varlist(german_data,pattern="^c")))
## Same as previous, only using pattern instead of c(); ^gb$ matches the name gb exactly
varlist(german_data,type="factor",exclude=varlist(german_data,pattern="^c|^gb$"))
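The same kinds of calls can be tried without the credit data. The following is a sketch on a small made-up data frame: the object toy and its values are hypothetical, and the column names only loosely mimic the real dataset. Note that on R 4.0.0 and later, data.frame() and read.table() leave strings as character unless stringsAsFactors=TRUE is set, so the factor columns are created explicitly here.

```r
# varlist() repeated from the post so that this snippet runs on its own
varlist <- function(df = NULL, type = c("numeric", "factor", "character"),
                    pattern = "", exclude = NULL) {
  vars <- character(0)
  if (any(type %in% "numeric"))   vars <- c(vars, names(df)[sapply(df, is.numeric)])
  if (any(type %in% "factor"))    vars <- c(vars, names(df)[sapply(df, is.factor)])
  if (any(type %in% "character")) vars <- c(vars, names(df)[sapply(df, is.character)])
  vars[(!vars %in% exclude) & grepl(vars, pattern = pattern)]
}

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  credit_amount    = c(250, 1366, 2320),
  existing_credits = c(1L, 1L, 2L),
  credit_history   = factor(c("ok", "late", "ok")),
  purpose          = factor(c("car", "tv", "car")),
  gb               = factor(c("good", "bad", "good")),
  comment          = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Names starting with "cred", regardless of type
starts_cred <- varlist(toy, pattern = "^cred")

# Numeric columns only
numeric_vars <- varlist(toy, type = "numeric")

# Factor columns, minus gb and everything starting with "c"
factor_vars <- varlist(toy, type = "factor",
                       exclude = c("gb", varlist(toy, pattern = "^c")))

print(starts_cred)
print(numeric_vars)
print(factor_vars)
```

In this toy data only purpose survives the last call, because credit_history starts with "c" and gb is excluded by name.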

Once we have the list of column names, it is easy to use sapply and do the real work:

> sapply(german_data[,varlist(german_data,type="numeric",pattern="credit")], summary)
        credit_amount existing_credits
Min.              250            1.000
1st Qu.          1366            1.000
Median           2320            1.000
Mean             3271            1.407
3rd Qu.          3972            2.000
Max.            18420            4.000

Of course, we can also pass our own function to sapply:

> sapply(german_data[,varlist(german_data,type="numeric",pattern="credit")], function (x) length(unique(x)))
   credit_amount existing_credits 
             921                4 
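One thing to watch with the df[,names] form used above: when only a single name is selected, R drops the result to a plain vector, and sapply then runs over its elements instead of over columns. A minimal sketch of the difference, using a hypothetical data frame toy and a hand-written vars standing in for the output of varlist():

```r
# Hypothetical stand-in for german_data with two numeric columns
toy <- data.frame(credit_amount    = c(250, 1366, 2320, 1366),
                  existing_credits = c(1L, 1L, 2L, 4L))

# Pretend varlist() matched exactly one column
vars <- "credit_amount"

# Without drop = FALSE the single column collapses to a numeric vector,
# so the function is applied to each value separately
per_element <- sapply(toy[, vars], function(x) length(unique(x)))

# With drop = FALSE the selection stays a one-column data frame,
# so the function is applied to the column as a whole
per_column <- sapply(toy[, vars, drop = FALSE], function(x) length(unique(x)))

print(per_element)
print(per_column)
```

Indexing with toy[vars] (no comma) also keeps the data frame structure, so either form is a reasonable choice when the number of matched columns is not known in advance.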

Let me know if you find this useful or have other solutions when dealing with many variables.
