Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This is a short story about a non-trivial bug in the processx package, and how I fixed it. It is a good showcase of the some debugging tools.
The bug
processx is an R package to start and manage external processes. It is used by the callr package to run code in another R session. The original bug report by Will Landau has a nice, clean reproducible example:
https://github.com/r-lib/processx/pull/237 seems to solve https://github.com/r-lib/processx/issues/236 for
processx
, but not forcallr
.> fun <- function() { > parallel::mclapply(1:2, function(x) x) > } > env <- c( > callr::rcmd_safe_env(), > PROCESSX_NOTIFY_OLD_SIGCHLD = "true" > ) > tmp <- callr::r(fun, env = env, show = TRUE) > #> Error while shutting down parallel: unable to terminate some child processes
A red herring
The references to PR #237 and issue #236 are a red herring, and a tricky one! The error message is exactly the same there, but the reasons are very different. That issue is about an interference between processx and the parallel package. This one must be something else, because callr does not load processx (or any other R package) in the R subprocess it creates:
callr::r(function() loadedNamespaces()) ## [1] "compiler" "graphics" "utils" "grDevices" ## [5] "stats" "datasets" "methods" "base"
About SIGCHLD
signals
To understand what is going on here, we need to know a bit about signals
and subprocesses. A signal is an asynchronous notification, sent to a
process by a(nother) process or the operating system. A signal means
that an important event has happened, and the process’s normal execution
is interrupted. For some signal types, the process gets a chance to handle
the signal gracefully and then continue execution. The SIGCHLD
signal is
such a signal. It is sent by the operating system when a subprocess of
the (parent) process has finished its execution. When the parent process
receives SIGCHLD
, the subprocess has already finished, but it is still
in the OS’s process table. It is dead but still hanging around: it is a
zombie process.
To completely eliminate the subprocess, the parent process needs to either
read out its exit status, or tell the OS that it is not interested in the
exit status. The parent process can also pre-emptively tell the OS that
it is not interested in the exit status of its subprocesses, and in this
case no SIGCHLD
signals are delivered to it at all.
Of course it is my bad
To make sure that this issue is specific to processx or callr, I tried if
a subprocess started with system()
has the same issue, and it does not:
system("R -q -e 'parallel::mclapply(1:2, function(x) x)'")
Whereas processx/callr does:
tmp <- callr::r( function() parallel::mclapply(1:2, function(x) x), show = TRUE ) ## Error while shutting down parallel: unable to terminate some child processes
Still, it is possible (however unlikely ????) that the bug
is in the parallel package, e.g. because it does not set up the signal
handler for SIGCHLD
properly, when the R session was started by
processx/callr.
The parallel package sets up a finalizer, that runs when R exits. This finalizer tries to eliminate all subprocesses that were started by parallel, and if it fails to do that, it emits the error message that we see above.
When eliminating subprocesses, parallel sends them a SIGKILL
signal, which
is not possible to catch and handle, so it is sure that their execution
has finished, and they are in a zombie state. In fact, mclapply()
already
tries to clean up the subprocesses it has started, so they are probably
already in the zombie state when mclapply()
returns. This is
easy to check with the ps package:
tmp <- callr::r( function() { parallel::mclapply(1:2, function(x) x) print(sapply(ps::ps_children(ps::ps_handle()), ps::ps_status)) }, show = TRUE ) ## [1] "zombie" "zombie" ## Error while shutting down parallel: unable to terminate some child processes
Indeed, both subprocesses of the callr R process are zombies.
Clearly, the callr R process did not receive or did not handle their
SIGCHLD
signals. To make sure that the SIGCHLD
signal handler is
properly set up in parallel, we need to debug parallel’s C code, in the
callr subprocess.
We will use the lldb debugger here, as that is the default on macOS. gdb has similar functionality, you probably want to use gdb for gcc and Linux.
To debug the callr subprocess, we tell lldb to wait for a process called R:
❯ lldb -w -n R (lldb) process attach --name "R" --waitfor
Then in another terminal, we run callr::r()
:
callr::r(function() parallel::mclapply(1:2, function(x) x))
As soon as the callr subprocess starts, lldb will stop it (using a SIGSTOP
signal that cannot be caught):
❯ lldb -w -n R (lldb) process attach --name "R" --waitfor Process 4196 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP frame #0: 0x00007fff74b75a4b libsystem_info.dylib`xdrmem_getlong_aligned + 75 libsystem_info.dylib`xdrmem_getlong_aligned: -> 0x7fff74b75a4b <+75>: callq 0x7fff74b759f0 ; _OSSwapInt32 0x7fff74b75a50 <+80>: movq -0x18(%rbp), %rdx 0x7fff74b75a54 <+84>: movl %eax, (%rdx) 0x7fff74b75a56 <+86>: movq -0x10(%rbp), %rdx Target 0: (R) stopped. Executable module set to "/Library/Frameworks/R.framework/Resources/bin/exec/R". Architecture set to: x86_64h-apple-macosx-. (lldb)
Now we can set break points on parallel C functions. For a better debugging experience, it makes sense to recompile R without optimization, and with debug symbols, so we can actually see the original C source code in the debugger. But for now we can get away without that.
(lldb) b mc_fork Breakpoint 1: no locations (pending). WARNING: Unable to resolve breakpoint to any actual locations. (lldb) b mc_cleanup Breakpoint 2: no locations (pending). WARNING: Unable to resolve breakpoint to any actual locations.
Ideally, we would set a break point on the setup_sig_handler()
function,
that sets up the signal handler, but this function is optimized out by
the compiler. Lets continue running the process:
(lldb) c Process 4196 resuming 1 location added to breakpoint 1 1 location added to breakpoint 2 Process 4196 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 frame #0: 0x000000010e89ea10 parallel.so`mc_fork parallel.so`mc_fork: -> 0x10e89ea10 <+0>: pushq %rbp 0x10e89ea11 <+1>: movq %rsp, %rbp 0x10e89ea14 <+4>: pushq %r15 0x10e89ea16 <+6>: pushq %r14 Target 0: (R) stopped. (lldb)
lldb stops the process at mc_fork()
. parallel is just about to start a
subprocess here. We step throught mc_fork()
a bit (there is not easy way
to step until the end of a function in lldb, AFAICT). We go on until we
see that sigaction(2)
is called, this should set up the signal handler:
(lldb) n Process 4196 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = instruction step over frame #0: 0x000000010e89eac9 parallel.so`mc_fork + 185 parallel.so`mc_fork: -> 0x10e89eac9 <+185>: callq 0x10e8a089a ; symbol stub for: sigaction 0x10e89eace <+190>: movl $0x80000, -0x58(%rbp) ; imm = 0x80000 0x10e89ead5 <+197>: leaq -0x58(%rbp), %rsi 0x10e89ead9 <+201>: leaq -0x44(%rbp), %rdx Target 0: (R) stopped. (lldb) n Process 4196 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = instruction step over frame #0: 0x000000010e89eace parallel.so`mc_fork + 190 parallel.so`mc_fork: -> 0x10e89eace <+190>: movl $0x80000, -0x58(%rbp) ; imm = 0x80000 0x10e89ead5 <+197>: leaq -0x58(%rbp), %rsi 0x10e89ead9 <+201>: leaq -0x44(%rbp), %rdx 0x10e89eadd <+205>: movl $0x1, %edi Target 0: (R) stopped.
If you look at the source code, sigaction(2)
is called from
setup_sig_handler()
, but the compiler inlined that function.
Now that lldb has resolved the break points to the loaded parallel.so
shared lib, we can double check that:
(lldb) image lookup -n setup_sig_handler (lldb)
Never mind, sigaction(2)
was already called, so the signal handler
should be set up. Let’s check it. We can call sigaction(2)
to query the
current SIGCHLD
handler:
(lldb) call (void*) malloc(sizeof(struct sigaction)) (void *) $0 = 0x00007fe6c7e305f0 (lldb) call (int) __sigaction(20, NULL, $0) (int) $1 = 0 (lldb) p ((struct sigaction *)$0)->__sigaction_u (__sigaction_u) $2 = { __sa_handler = 0x000000010e8a0750 (parallel.so`parent_sig_handler) __sa_sigaction = 0x000000010e8a0750 (parallel.so`parent_sig_handler) }
20 is the number of SIGCHLD
, see man signal
, and for the (platform
dependent) structure of struct sigaction
see man sigaction
.
In any case, parallel.so
’s parent_sig_handler
is indeed set up properly.
Now we tell lldb to stop on receiving a SIGCHLD
signal:
(lldb) process handle SIGCHLD --notify true --pass true --stop true NAME PASS STOP NOTIFY =========== ===== ===== ====== SIGCHLD true true true
and we are ready to continue the process. It will stop again at mc_fork()
,
because we are starting two subprocesses in mclapply()
. Then we continue
again:
(lldb) c Process 4196 resuming Process 4196 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 frame #0: 0x000000010e89ea10 parallel.so`mc_fork parallel.so`mc_fork: -> 0x10e89ea10 <+0>: pushq %rbp 0x10e89ea11 <+1>: movq %rsp, %rbp 0x10e89ea14 <+4>: pushq %r15 0x10e89ea16 <+6>: pushq %r14 Target 0: (R) stopped. (lldb) c Process 4196 resuming Process 4196 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1 frame #0: 0x000000010e89e680 parallel.so`mc_cleanup parallel.so`mc_cleanup: -> 0x10e89e680 <+0>: pushq %rbp 0x10e89e681 <+1>: movq %rsp, %rbp 0x10e89e684 <+4>: pushq %r15 0x10e89e686 <+6>: pushq %r14 Target 0: (R) stopped. (lldb)
Interestingly, we got to mc_cleanup
, even though the SIGCHLD
signal(s)
should have arrived first. From another R session, we can check that the
subprocesses of parallel are zombies already:
❯ sapply(ps::ps_children(ps::ps_handle(4196L)), ps::ps_status) [1] "zombie" "zombie"
We can continue still, and hope that the SIGCHLD
s will still arrive,
but they won’t:
(lldb) c Process 4196 resuming Process 4196 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1 frame #0: 0x000000010e89e680 parallel.so`mc_cleanup parallel.so`mc_cleanup: -> 0x10e89e680 <+0>: pushq %rbp 0x10e89e681 <+1>: movq %rsp, %rbp 0x10e89e684 <+4>: pushq %r15 0x10e89e686 <+6>: pushq %r14 Target 0: (R) stopped. (lldb) c Process 4196 resuming Process 4196 exited with status = 0 (0x00000000) (lldb)
Clearly, the signal handler is properly set up in parallel, but the signals are not delivered to the process. No doubt, this is a bug in the processx or callr code.
Some hypotheses
There are very few possible reasons for the OS not sending out SIGCHLD
signals. The first is that the parent process explicitly tells the OS that
it is not interested. This is done by setting the signal handler to
SIG_IGN
with sigaction(2)
. This is unlikely to be the case for us,
since we saw that parallel did set up a signal handler properly. By
searching the R source code and the source code of processx and callr for
SIG_IGN
, it is obvious that none of them does this.
The other reason is that the SIGCHLD
signal is blocked by the process.
Blocking a signal means that the process tells the OS, that it is currently
not ready to process it, and it should be delivered later, when the process
has unblocked the signal. If SIGCHLD
was blocked in the callr subprocess,
that would be a good explanation for the OS not sending it.
Validating the hypothesis
The sigprocmask(2)
system call can be used to query or manipulate the
set of blocked signals. So we can re-run our callr::r()
reprex, with
lldb on the callr subprocess again, and examine the state of the
SIGCHLD
signal:
(lldb) call (void*) malloc(sizeof(sigset_t)) (void *) $3 = 0x00007f92704c4300 (lldb) call (int) sigprocmask(0, NULL, $3) (int) $4 = 0 (lldb) p (int) sigismember((const sigset_t*) $3, 20) (int) $7 = 1
SIGCHLD
is indeed blocked! We got it!
Fixing the bug
Clearly, if base::system()
works and callr::r()
and processx::run()
do not, that means that processx causes the SIGCHLD
to be blocked in the
subprocess.
When processx starts a subprocess, it first calls fork(2)
to create a
copy of the current process, and then execvp(3)
to replace that with
another executable. fork()
creates an identical copy, i.e. the signal
handlers and the blocked signals are the same in the subprocess.
execvp(3)
resets the signal handlers to their defaults, which is great,
but apparently, it does not reset the set of blocked signals. So if
SIGCHLD
was blocked when processx called fork(2)
then it will be blocked
in the subprocess as well. Indeed,
this is how processx started the subprocess
before fixing this bug:
processx__block_sigchld(); pid = fork(); if (pid == -1) { /* ERROR */ err = -errno; if (signal_pipe[0] >= 0) close(signal_pipe[0]); if (signal_pipe[1] >= 0) close(signal_pipe[1]); if (cpty) close(pty_master_fd); processx__unblock_sigchld(); R_THROW_SYSTEM_ERROR_CODE(err, "Cannot fork when running '%s'", ccommand); } /* CHILD */ if (pid == 0) { if (cpty) close(pty_master_fd); processx__child_init(handle, pipes, num_connections, ccommand, cargs, signal_pipe[1], cstdin, cstdout, cstderr, pty_name, cenv, &options, ctree_id); R_THROW_SYSTEM_ERROR("Cannot start child process when running '%s'", ccommand); }
processx blocks SIGCHLD
before forking, for simplicity, so it does not
have to worry about concurrency. processx__child_init()
does a number
of things, but it does not unblock this signal, so it stays blocked.
The fix
is clearly to unblock the signal in the child process, before calling
execvp(3)
in processx__child_init()
:
/* CHILD */ if (pid == 0) { /* LCOV_EXCL_START */ if (cpty) close(pty_master_fd); processx__unblock_sigchld(); processx__child_init(handle, pipes, num_connections, ccommand, cargs, signal_pipe[1], cstdin, cstdout, cstderr, pty_name, cenv, &options, ctree_id); R_THROW_SYSTEM_ERROR("Cannot start child process when running '%s'", ccommand); /* LCOV_EXCL_STOP */ }
Conclusion
Debugging is hard, but debuggers like lldb help you greatly. It is worth to invest some time into learning what they can do: wait for a process to attach to, call functions and system calls, catch signals, and a lot more.
The fact that the signal mask is not reset at execvp(3)
is of course
documented in the manual page of execve(2)
(on macOS). This is the system
call that execvp(3)
uses internally. I guess another lesson is to
read the manual…
If you enjoyed reading this, you might like similar debugging stories by Jim Hester and Rich FitzJohn.
Resources
Some resources for debugging C/C++ code in R packages:
- Winston Chang’s https://github.com/wch/r-debug#readme has Docker containers and lots of tips for debugging various problems. The containers also have gdb and an R build with debug symbols.
- Jim Hester’s video about getting started with lldb: https://www.jimhester.com/post/2019-04-05-lldb-debugging/
- The official lldb tutorial: https://lldb.llvm.org/use/tutorial.html
- Apple’s quick start guide for lldb: https://developer.apple.com/library/archive/documentation/IDEs/Conceptual/gdb_to_lldb_transition_guide/document/Introduction.html This uses XCode, but it is a good summary of the features.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.