Site icon R-bloggers

The mysterious case of the misbehaving writeLines() (or: a cat saves the day)

[This article was first published on Christophe Ladroue » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Dear readers and R experts, I submit to you a mysterious R quirk which has been baffling me for the best part of a week. I found a work-around but I’d love it if someone could explain this strangest of behaviour.

I was using writeLines() to dump the results of a large number of classification results and noticed that the output file was not as expected: some records were missing, some repeated. It took me a while to figure out exactly what the problem was (the multicore library? network latency? an unclosed sink()? etc.) but once I finally stripped the script bare, it turned out that I could reproduce the bug with a combination of multicore and writeLines().

Here is how my script works: a function (accuracyWithSomeFeatures()) is applied a number of times (numberFeatures). For each application, I output numberFeatures rows, each with 3 columns: a identifier for this iteration, the value passed to the function (from 1 to numberFeatures) and the number generated by the function. This process is repeated numberExperiments times.
When it’s finished, I should therefore end up with a text file of numberFeatures x numberExperiments.
Well this doesn’t happen if the function is applied in parallel AND I use writeLines() to write to the file.

Only this combination reproduces the bug. If I use multicore (or parallel in 2.14) and cat() or don’t use multicore but writeLines() etc., the output file is of the right shape.

What’s even weirder is that when the bug happens, the number of lines of the output file is not constant; the repeat/mangling is not consistent. For example, with numberFeatures< -317; numberExperiments< -467, a wc -l gives 217175, 217417, 217357 lines after 3 consecutive runs. None of these numbers are multiples of 317 or 467 so something is going seriously wrong.

The effect cannot be due to parallel writing either: the output file is populated after the parallelised generation of the data.

I’m leaving the code here if you fancy solving a bit of R mystery. Please let me know if you find the reason for this apparent bug, I’m sure there’s something to learn here. (Or it’s staring at me in the face. I’m ready for that too).

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
 
if(as.numeric(version$major)*100+as.numeric(version$minor)<214) library(multicore) else library(parallel)
 
classifyInParallel<-TRUE
withCat<-FALSE
 
numberFeatures<-317
numberExperiments<-467
 
outputFile<-"~/debugWriteLines.csv"
unlink(outputFile)
if(!withCat) outputFile<-file(outputFile,'a')
 
replicate(numberExperiments,{
  experimentID<-paste(sample(letters,6,replace=TRUE),collapse='')
 
  accuracyWithSomeFeatures<-function(k) 100*runif(1)
 
    if(classifyInParallel){
      accuracy<-mclapply(1:numberFeatures,accuracyWithSomeFeatures)
      accuracy<-unlist(accuracy)
    } else{
      accuracy<-sapply(1:numberFeatures,accuracyWithSomeFeatures)
    }
 
  content<-paste(experimentID, 1:numberFeatures,accuracy,sep='\t')
 
  if(withCat)
    cat(content,
      sep='\n',
      file=outputFile,
      append=TRUE)  
  else
    writeLines(
      content,
      con=outputFile)  
 
  NULL
})
 
if (!withCat) close(outputFile)
 
cat("\nclassifyInParallel:",classifyInParallel)
cat("\nwithCat:",withCat)
cat("\nThere should be",numberFeatures*numberExperiments,"lines in total.")

My system:
platform x86_64-pc-linux-gnu arch x86_64 os linux-gnu system x86_64, linux-gnu status major 2 minor 14.0 year 2011 month 10 day 31 svn rev 57496 language R version.string R version 2.14.0 (2011-10-31)

(original image by Kaptain Kobold)

Update (17 Nov 2011):
I now have an even barer script. writeLines() gets it wrong only if the output file is opened and closed once, i.e. before and after the main loop. It works if the file is opened and closed at each iteration. Again, this is not an issue of writing to a file from multiple processors, since writeLines() is outside mclapply().

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
if(as.numeric(version$major)*100+as.numeric(version$minor)<214) library(multicore) else library(parallel)
 
numberFeatures<-17
numberExperiments<-13
 
outputFilename<-"~/debugWriteLines.csv"
 
 
cat("\nThere should be",numberFeatures*numberExperiments,"lines in total.")
 
# opening and closing at each save
cat("\n\tOpening and closing at each iteration.")
unlink(outputFilename)
 
for(experiment in 1:numberExperiments){
  outputFile<-file(outputFilename,'a')
  mclapply(1:numberFeatures,function(k) runif(1))
  content<-paste(experiment, 1:numberFeatures,sep='\t')
  writeLines(content,con=outputFile)
  close(outputFile)
  }
cat(" Number of lines:",length(readLines(outputFilename)))
 
# opening and closing once
cat("\n\tOpening and closing once.")
 
unlink(outputFilename)
outputFile<-file(outputFilename,'a')
 
for(experiment in 1:numberExperiments){
  mclapply(1:numberFeatures,function(k) runif(1)) # do something in parallel
  content<-paste(experiment, 1:numberFeatures,sep='\t')
  writeLines(content,con=outputFile)}
 
close(outputFile)
 
numberLines<-length(readLines(outputFilename))
cat(" Number of lines:",numberLines)
cat("\n\t\tNumber of lines / number of experiments:",numberLines/numberExperiments)
cat("\n\t\tNumber of lines / number of features:",numberLines/numberFeatures)
cat("\n\t\tNumber of lines / (number of experiments * number of features):",numberLines/(numberExperiments*numberFeatures))

produces:
There should be 221 lines in total. Opening and closing at each save. Number of lines: 221 Opening and closing once. Number of lines: 2465 Number of lines / number of experiments: 189.6 Number of lines / number of features: 145 Number of lines / (number of experiments * number of features): 11.15

To leave a comment for the author, please follow the link and comment on their blog: Christophe Ladroue » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.