A Python-Like walk() Function for R
A really nice function available in Python is walk() (os.path.walk() in Python 2), which recursively descends a directory tree, calling a user-supplied function in each directory within the tree. It might be used, say, to count the number of files, or to remove all small files, and so on. I had students in my undergraduate class write such a function in R for homework, and thought it might be interesting to present it here.
Among other things, readers who are not familiar with recursive function calls will learn about those here. I must add that all readers, even those with such background, will find this post to be rather subtle and requiring extra patience, but I believe you will find it worthwhile.
Let’s start with an example, in which we count the number of files with a given name:
countinst <- function(startdir,flname) {
   walk(startdir,checkname,
      list(flname=flname,tot=0))
}

checkname <- function(drname,filelist,arg) {
   if (arg$flname %in% filelist)
      arg$tot <- arg$tot + 1
   arg
}
Say we try all this in a directory containing a subdirectory mydir, which contains a file x and two subdirectories, d1 and d2. The latter in turn contains another file named x. We then make the call countinst('mydir','x'). As can be seen above, that call will in turn make the call
walk('mydir',checkname,list(flname='x',tot=0))
The walk() function, which I will present below, will start in mydir, and will call the user-specified function checkname() at every directory it encounters in the tree rooted at mydir, in this case mydir, d1 and d2.
At each such directory, walk() will pass to the user-specified function, in this case checkname(), first the current directory name, then a character vector reporting the names of all files (including directories) in that directory. In mydir, for instance, this vector will be c('d1','d2','x'). In addition, walk() will pass to checkname() an argument, formally named arg above, which will serve as a vehicle for accumulating running totals, in this case the total number of files of the given name.
So, let’s look at walk() itself:
walk <- function(currdir,f,arg) {
   # "leave trail of bread crumbs"
   savetop <- getwd()
   setwd(currdir)
   fls <- list.files()
   arg <- f(currdir,fls,arg)
   # subdirectories of this directory
   dirs <- list.dirs(recursive=FALSE)
   for (d in dirs) arg <- walk(d,f,arg)
   setwd(savetop)  # go back to calling directory
   arg
}
The comments are hopefully self-explanatory, but the key point is that within walk(), there is another call to walk()! This is recursion.
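To watch the recursion unfold, here is a minimal sketch of one possible variation. The names walk_safe and trace_dirs are hypothetical and not part of the code above. The variant assumes that restoring the working directory via on.exit() is desirable, so that an error inside the user-supplied function does not leave the session stranded in a subdirectory. The callback simply records each directory name it is handed.

```r
# Hypothetical variant of walk(); the name walk_safe is an assumption.
walk_safe <- function(currdir, f, arg) {
  savetop <- setwd(currdir)            # setwd() returns the previous directory
  on.exit(setwd(savetop), add = TRUE)  # restore it even if f() stops with an error
  arg <- f(currdir, list.files(), arg)
  for (d in list.dirs(recursive = FALSE))
    arg <- walk_safe(d, f, arg)        # the recursive call
  arg
}

# Hypothetical callback: append each visited directory name to arg.
trace_dirs <- function(drname, filelist, arg) c(arg, drname)

# Build a small throwaway tree under tempdir() and walk it.
start <- getwd()
root <- file.path(tempdir(), "walkdemo")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "d1"), recursive = TRUE)
dir.create(file.path(root, "d2"))
visited <- walk_safe(root, trace_dirs, character(0))
print(visited)
```

The printed vector should list the root first and then its two subdirectories, one entry per call to walk_safe(), and the working directory afterwards should be the one the session started in.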
Note how arg accumulates, as needed in this application. If on the other hand we wanted, say, simply to remove all files named ‘x’, we would just put in a dummy variable for arg. And though we didn’t need drname here, in some applications it would be useful.
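As a rough sketch of the file-removal case, the callback below (the name rmx is hypothetical) deletes any regular file named 'x' and passes arg through untouched, so NULL serves as the dummy. It relies on the fact that walk() has already changed into the directory being visited, so the bare file name suffices. walk() is repeated here only so that the block runs on its own, and the tree is built under tempdir() so that nothing of value is deleted.

```r
# walk() as given in the post, repeated so this block is self-contained.
walk <- function(currdir, f, arg) {
  savetop <- getwd()
  setwd(currdir)
  fls <- list.files()
  arg <- f(currdir, fls, arg)
  dirs <- list.dirs(recursive = FALSE)
  for (d in dirs) arg <- walk(d, f, arg)
  setwd(savetop)
  arg
}

# Hypothetical callback: remove a regular file named 'x' if one is present.
# The dir.exists() check skips a directory that happens to be named 'x'.
rmx <- function(drname, filelist, arg) {
  if ("x" %in% filelist && !dir.exists("x")) file.remove("x")
  arg  # dummy, returned unchanged
}

# Throwaway copy of the example tree: mydir/x, mydir/d1, mydir/d2/x.
root <- file.path(tempdir(), "mydir")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "d1"), recursive = TRUE)
dir.create(file.path(root, "d2"))
file.create(file.path(root, "x"), file.path(root, "d2", "x"))

invisible(walk(root, rmx, NULL))
remaining <- list.files(root, recursive = TRUE)  # files only, no directories
print(remaining)
```

After the walk, no files should remain in the tree, while the directories d1 and d2 are left in place.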
For compute-intensive tasks, recursion is not very efficient in R, but it can be quite handy in certain settings.
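For this particular counting task, one non-recursive alternative is to let list.files() do the descent itself. The sketch below uses a hypothetical helper, countinst_flat, which is not from the post. It assumes that only regular files matter: with recursive=TRUE, list.files() omits directories by default, so a directory that happens to have the target name would not be counted, unlike in checkname().

```r
# Hypothetical helper: count files named flname anywhere under startdir,
# using the built-in recursive listing instead of explicit recursion.
countinst_flat <- function(startdir, flname) {
  paths <- list.files(startdir, recursive = TRUE)  # relative paths of files
  sum(basename(paths) == flname)
}

# Throwaway copy of the example tree: mydir/x, mydir/d1, mydir/d2/x.
root <- file.path(tempdir(), "mydir_flat")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "d1"), recursive = TRUE)
dir.create(file.path(root, "d2"))
file.create(file.path(root, "x"), file.path(root, "d2", "x"))

n <- countinst_flat(root, "x")
print(n)  # 2 for this tree
```

This handles only the counting example; the general walk() remains the more flexible tool when an arbitrary function must run in each directory.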
If you would like to conveniently try the above example, here is some test code:
test <- function() {
   unlink('mydir',recursive=TRUE)
   dir.create('mydir')
   file.create('mydir/x')
   dir.create('mydir/d1')
   dir.create('mydir/d2')
   file.create('mydir/d2/x')
   print(countinst('mydir','x'))
}