Code Snippet: Extracting a Subsample from a Large File
Last week a reader of the r-help mailing list posted a query titled “Importing random subsets of a data file.” With a very large file, it is often much easier and faster, and really just as good, to work with a much smaller subset of the data.
Fellow readers then posted rather sophisticated solutions, such as storing the file in a database. Here I’ll show how to perform this task much more simply. And if you haven’t been exposed to R’s text file reading functions before, it will be a chance for you to learn a bit.
I’m assuming here that we want to avoid storing the entire file in memory at once, which may be difficult or impossible. In other words, functions like read.table() are out.
I’m also assuming that you don’t know exactly how many records are in the file, though you probably have a rough idea. (If you do know this number, I’ll outline an alternative approach at the end of this post.)
Finally, due to lack of knowledge of the total number of records, I’m also assuming that extracting every kth record is sufficiently “random” for you.
So, here is the code (downloadable from here):
subsamfile <- function(infile,outfile,k,header=T) {
   # open connections for streaming, line-by-line I/O
   ci <- file(infile,"r")
   co <- file(outfile,"w")
   if (header) {  # copy the header line, if there is one
      hdr <- readLines(ci,n=1)
      writeLines(hdr,co)
   }
   recnum <- 0  # number of data records read so far
   numout <- 0  # number of records written so far
   while (TRUE) {
      inrec <- readLines(ci,n=1)
      if (length(inrec) == 0) {  # end of file?
         close(ci)
         close(co)
         return(numout)
      }
      recnum <- recnum + 1
      if (recnum %% k == 0) {  # keep every kth record
         numout <- numout + 1
         writeLines(inrec,co)
      }
   }
}
Very straightforward code. We use file() to open connections to the input and output files, and read the input file one line at a time by specifying the argument n = 1 in each call to readLines(). Each input record is a character string. To sense the end-of-file condition on the input file, we test whether the input record has length 0. (Any record actually read, even an empty line, will have length 1, i.e. it is returned as a 1-element vector of mode character, again due to setting n = 1; only at end of file does readLines() return a zero-length vector.)
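As a quick illustration, here is how a call might look; the file names and the value of k are just placeholders, not from the original query:

# hypothetical usage: keep every 10th record of a large file with a header row
nkept <- subsamfile("big.csv","sub.csv",k=10)
nkept  # number of data records written to sub.csv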
On a Linux or Mac platform, we can determine the number of records in the file ahead of time by running wc -l infile (either directly or via R’s system()). This may take a long time, but if we are willing to incur that time, then the above code could be changed to extract random records. We’d do something like cullrecs <- sample(1:ntotrecs,m,replace=FALSE) where m is the desired number of records to extract, and then whenever recnum matches the next element of cullrecs, we’d write that record to outfile.
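For concreteness, here is a rough sketch of that random-extraction variant. It is only a sketch under the assumptions above: the function name subsamrand is made up for this post, ntotrecs is the record count you obtained beforehand (excluding the header, if any), and m is the desired sample size.

# ntotrecs might be obtained beforehand, e.g. on Linux/Mac:
#   ntotrecs <- as.integer(sub(" .*","",trimws(system("wc -l infile",intern=TRUE))))
subsamrand <- function(infile,outfile,ntotrecs,m,header=T) {
   ci <- file(infile,"r")
   co <- file(outfile,"w")
   if (header) {
      hdr <- readLines(ci,n=1)
      writeLines(hdr,co)
   }
   # choose which record numbers to keep, sorted so we can match them
   # one by one while streaming through the file
   cullrecs <- sort(sample(1:ntotrecs,m,replace=FALSE))
   nextone <- 1
   recnum <- 0
   while (nextone <= m) {
      inrec <- readLines(ci,n=1)
      if (length(inrec) == 0) break  # end of file
      recnum <- recnum + 1
      if (recnum == cullrecs[nextone]) {
         writeLines(inrec,co)
         nextone <- nextone + 1
      }
   }
   close(ci)
   close(co)
   return(nextone - 1)  # number of records actually written
}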
Will you be at the JSM next week? My talk is on Tuesday, but I’ll be there throughout the meeting. If you’d like to exchange some thoughts on R or statistics, I’d enjoy chatting with you.