Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I was using writeLines() to dump the results of a large number of classification results and noticed that the output file was not as expected: some records were missing, some repeated. It took me a while to figure out exactly what the problem was (the multicore library? network latency? an unclosed sink()? etc.) but once I finally stripped the script bare, it turned out that I could reproduce the bug with a combination of multicore and writeLines().
Here is how my script works: a function (accuracyWithSomeFeatures()) is applied a number of times (numberFeatures). For each application, I output numberFeatures rows, each with 3 columns: a identifier for this iteration, the value passed to the function (from 1 to numberFeatures) and the number generated by the function. This process is repeated numberExperiments times.
When it’s finished, I should therefore end up with a text file of numberFeatures x numberExperiments.
Well this doesn’t happen if the function is applied in parallel AND I use writeLines() to write to the file.
Only this combination reproduces the bug. If I use multicore (or parallel in 2.14) and cat() or don’t use multicore but writeLines() etc., the output file is of the right shape.
What’s even weirder is that when the bug happens, the number of lines of the output file is not constant; the repeat/mangling is not consistent. For example, with numberFeatures< -317; numberExperiments< -467, a wc -l gives 217175, 217417, 217357 lines after 3 consecutive runs. None of these numbers are multiples of 317 or 467 so something is going seriously wrong.
The effect cannot be due to parallel writing either: the output file is populated after the parallelised generation of the data.
I’m leaving the code here if you fancy solving a bit of R mystery. Please let me know if you find the reason for this apparent bug, I’m sure there’s something to learn here. (Or it’s staring at me in the face. I’m ready for that too).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 | if(as.numeric(version$major)*100+as.numeric(version$minor)<214) library(multicore) else library(parallel) classifyInParallel<-TRUE withCat<-FALSE numberFeatures<-317 numberExperiments<-467 outputFile<-"~/debugWriteLines.csv" unlink(outputFile) if(!withCat) outputFile<-file(outputFile,'a') replicate(numberExperiments,{ experimentID<-paste(sample(letters,6,replace=TRUE),collapse='') accuracyWithSomeFeatures<-function(k) 100*runif(1) if(classifyInParallel){ accuracy<-mclapply(1:numberFeatures,accuracyWithSomeFeatures) accuracy<-unlist(accuracy) } else{ accuracy<-sapply(1:numberFeatures,accuracyWithSomeFeatures) } content<-paste(experimentID, 1:numberFeatures,accuracy,sep='\t') if(withCat) cat(content, sep='\n', file=outputFile, append=TRUE) else writeLines( content, con=outputFile) NULL }) if (!withCat) close(outputFile) cat("\nclassifyInParallel:",classifyInParallel) cat("\nwithCat:",withCat) cat("\nThere should be",numberFeatures*numberExperiments,"lines in total.") |
My system:
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 2
minor 14.0
year 2011
month 10
day 31
svn rev 57496
language R
version.string R version 2.14.0 (2011-10-31)
(original image by Kaptain Kobold)
Update (17 Nov 2011):
I now have an even barer script. writeLines() gets it wrong only if the output file is opened and closed once, i.e. before and after the main loop. It works if the file is opened and closed at each iteration. Again, this is not an issue of writing to a file from multiple processors, since writeLines() is outside mclapply().
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | if(as.numeric(version$major)*100+as.numeric(version$minor)<214) library(multicore) else library(parallel) numberFeatures<-17 numberExperiments<-13 outputFilename<-"~/debugWriteLines.csv" cat("\nThere should be",numberFeatures*numberExperiments,"lines in total.") # opening and closing at each save cat("\n\tOpening and closing at each iteration.") unlink(outputFilename) for(experiment in 1:numberExperiments){ outputFile<-file(outputFilename,'a') mclapply(1:numberFeatures,function(k) runif(1)) content<-paste(experiment, 1:numberFeatures,sep='\t') writeLines(content,con=outputFile) close(outputFile) } cat(" Number of lines:",length(readLines(outputFilename))) # opening and closing once cat("\n\tOpening and closing once.") unlink(outputFilename) outputFile<-file(outputFilename,'a') for(experiment in 1:numberExperiments){ mclapply(1:numberFeatures,function(k) runif(1)) # do something in parallel content<-paste(experiment, 1:numberFeatures,sep='\t') writeLines(content,con=outputFile)} close(outputFile) numberLines<-length(readLines(outputFilename)) cat(" Number of lines:",numberLines) cat("\n\t\tNumber of lines / number of experiments:",numberLines/numberExperiments) cat("\n\t\tNumber of lines / number of features:",numberLines/numberFeatures) cat("\n\t\tNumber of lines / (number of experiments * number of features):",numberLines/(numberExperiments*numberFeatures)) |
produces:
There should be 221 lines in total.
Opening and closing at each save. Number of lines: 221
Opening and closing once. Number of lines: 2465
Number of lines / number of experiments: 189.6
Number of lines / number of features: 145
Number of lines / (number of experiments * number of features): 11.15
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.