
Don’t Panic! A Scientific Approach to Debugging Production Failure


Your production system just broke down. What should you do now? Can you imagine your Shiny application, Flask app, or API service breaking down?

As a beginning programmer or operations (or DevOps) person, it can be overwhelming to deal with the logs, messages, metrics, and other possibly relevant information coming at you at such a moment.

And when something fails you want it back in a working state as fast as possible. People depend on this stuff!

You get into panic mode.

That is not helpful. Be careful when in panic mode! Don’t do stupid stuff like I did.

Silly stuff I’ve done in panic mode

When you are in panic mode you focus on what is right in front of you and make suboptimal decisions. Here are some I have made.

Assuming it is the same as the previous time

When an application failed for the second time in 2 weeks, my thought was:

“It failed last time because we ran out of memory on the machine. If I restart the machine again it will work again.”

But in this case the cause was different: we had forgotten to update the credentials for this machine.

If I had read the logs, I would have seen a clear error indicating that the connection could not be made because the credentials were missing.

My panic decision was to default to a cause I had seen before and fix that. If I had taken a minute to read the logs I would have known better.
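The boring fix for that habit is to make the failure explain itself. Here is a minimal sketch in R (the function and environment variable names are made up, not my actual setup): fail loudly and specifically when credentials are missing, so the log points at the real cause.

```r
# Minimal sketch, not real production code: fail with a specific, searchable
# message when credentials are missing, so the log says "credentials" instead
# of leaving you to guess "memory, probably".
connect_to_database <- function() {
  user     <- Sys.getenv("DB_USER",     unset = NA)
  password <- Sys.getenv("DB_PASSWORD", unset = NA)
  if (is.na(user) || is.na(password)) {
    stop("Database credentials missing: set DB_USER and DB_PASSWORD. ",
         "This is a configuration problem, not a memory problem.",
         call. = FALSE)
  }
  # In a real application a DBI::dbConnect() call would go here.
  invisible(TRUE)
}

ok <- tryCatch(
  connect_to_database(),
  error = function(e) {
    # This message is what you would later find in the logs.
    message(Sys.time(), " [ERROR] ", conditionMessage(e))
    FALSE
  }
)
```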

Updates were not propagated

(This was not really production, but I still panicked while developing.) My front-end application did not properly pass information to the back-end.

I had changed the configuration, but nginx was not set to reload automatically when the configuration changed (this was in a Kubernetes cluster, so the config lived in a ConfigMap that is updated inside the container automatically; I forgive myself for not realizing that nginx does not automatically reload it). I looked up several configuration options and tried them all, and none were effective!

I should have read the logs more carefully. The logging level was set to debug, so there were many messages, and I overlooked the one that told me how the routing was actually implemented.

My panic decision was to assume my changes did not work, when in reality the changes were not picked up.
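A cheap safeguard against this class of mistake is to make the running application announce which configuration it actually loaded. A minimal sketch in R, assuming the app reads a config file at startup (the path here is hypothetical):

```r
# Minimal sketch, assuming the app loads "config.yml" at startup (hypothetical
# path): log a fingerprint of the file that is actually in use, so you can tell
# "my change is wrong" apart from "my change was never picked up".
config_path <- "config.yml"
if (file.exists(config_path)) {
  info <- file.info(config_path)
  message(Sys.time(), " [INFO] loading ", config_path,
          " (modified: ", info$mtime,
          ", md5: ", unname(tools::md5sum(config_path)), ")")
} else {
  message(Sys.time(), " [WARN] config file ", config_path, " not found")
}
```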

A scientific approach to debugging and production breakage (the OODA loop)

Even when you are in panic mode, try to keep a scientific approach to debugging. Check your assumptions, generate hypotheses about what causes the problem, and test those hypotheses. This is not a scientific paper, though: there must be rigor, but also speed. Employ the ‘theory of close enough’: formulate hypotheses quickly and verify them quickly.

Observe (1) what happened, (2) what is happening, and (3) what should happen. Orient yourself to the problem by applying your knowledge of the system to the observations. Decide what the cause is, and Act on it. Then Observe again to see if you were right. This generic approach is known as the OODA loop (Observe, Orient, Decide, Act) and was originally described by an American Air Force colonel. It did not stay in the military: the approach is used in litigation, law enforcement, and business. I think it works as a generic strategy for debugging too. The main point is that you should not jump to conclusions too quickly.

So how does this work in practice:

You get a message from someone else that your application is not working. (This is not great; it is bad if someone else noticed before you did.)

- Observe: what happened, what is happening now, and what should be happening? Read the logs and the metrics.
- Orient: apply your knowledge of the system to those observations.
- Decide: pick the most likely cause; that is your hypothesis.
- Act: apply a fix based on that hypothesis, then Observe again.
- Hypothesis confirmed: the system works again. Write down what you saw and what you did.
- Hypothesis refuted: the system is still broken. Go back to Observe and form a new hypothesis.
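If you prefer to see the loop as code rather than prose, here is a minimal sketch in R; the four functions you pass in are placeholders for reading the logs, applying your knowledge of the system, picking a cause, and applying a fix.

```r
# Minimal sketch of the OODA loop as a debugging checklist. The four functions
# are placeholders: "observe" reads logs and metrics, "orient" applies your
# knowledge of the system, "decide" picks the most likely cause, and "act"
# applies a fix and reports whether the system works again.
ooda_debug <- function(observe, orient, decide, act, max_rounds = 5) {
  for (round in seq_len(max_rounds)) {
    symptoms   <- observe()
    context    <- orient(symptoms)
    hypothesis <- decide(context)
    fixed      <- act(hypothesis)
    if (isTRUE(fixed)) {
      message("Hypothesis confirmed after ", round, " round(s).")
      return(invisible(TRUE))
    }
    message("Hypothesis refuted; observing again.")
  }
  warning("No fix after ", max_rounds, " rounds; call for help.")
  invisible(FALSE)
}
```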

Learning from failures

The approach above will help you work through the problem quickly and probably bring the system back into a working state. By writing down what you observed and what you thought, you can learn from the failure.

When the crisis is over, take a break, drink a cup of tea, and look back at the failure with fresh eyes.

Blameless Post-Mortem

A post-mortem, or autopsy, is a medical procedure in which a doctor determines the cause of death. Your systems were maybe not really dead, but it is good to dig deep and find out why the failure occurred.

The best post-mortems are blameless. Our software and our teams work in a system, a system you try to make as robust as possible. That a failure occurred means something went unnoticed or broke an assumption you made. You try to figure out which barriers were broken, which procedures are faulty, and which information was missed. Maybe your system broke down because someone switched off the power to a rack, but why was that possible? Maybe your Flask app broke down because someone sent in a malformed document with right-to-left text, but should the app deal with this gracefully? Should it block malformed documents earlier?
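To make that last question concrete, here is a minimal sketch assuming a plumber API (the endpoint and field names are hypothetical, not the actual service): validate the document up front and answer with a clear 400, instead of letting a malformed document take the handler down.

```r
# Minimal sketch, assuming a plumber API with a hypothetical /document endpoint:
# validate the input early and reject malformed documents with a clear 400,
# rather than failing somewhere deeper in the handler.
library(plumber)

#* Accept a document for processing
#* @post /document
function(req, res) {
  body <- tryCatch(jsonlite::fromJSON(req$postBody), error = function(e) NULL)
  if (is.null(body) || !is.character(body$text) ||
      length(body$text) != 1L || !nzchar(body$text)) {
    res$status <- 400
    return(list(error = "Malformed document: expected JSON with a non-empty 'text' field"))
  }
  list(status = "accepted", characters = nchar(body$text))
}
```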

The only way to learn from these (often painful) experiences is to write down what happened and formulate plans to prevent it from happening again.

Write down a time-line of the incident.

After you have written the time-line down, formulate actions you can take to prevent a recurrence.

Failures are just another way to learn about your systems. A new opportunity to learn and to improve.


