Coding, Tech and Developers Blog
Many among us are deeply involved with one or even multiple distributed systems. These environments exhibit a level of complexity that is very different from what we were accustomed to in classic monolithic applications. This becomes especially apparent when things go wrong in your system; and they always do. Debugging a distributed system is hard, so let's talk about it.
I have worked on distributed systems of different natures as well as monolithic applications over the past few years, and one insight has clearly emerged: if you do not prepare and build your (distributed) application with debugging in mind, you will encounter significant difficulties when production issues arise. Allow me to share some thoughts on this matter.
First of all, let us develop a common understanding of what the term "distributed system" describes:
A distributed system is defined as a system in which its components are situated in different network locations. Typically, the various modules of such a system communicate with one another through a messaging mechanism, which could be implemented using events, commands, traditional web requests, or similar methods.
In that sense, a distributed system could be the sum of very different services interacting with each other (service-oriented architecture), a simple client-server application where the two parts are deployed independently of each other, or instances of the same application running on different machines in some kind of multiplayer scenario.
Here, we will focus on the first category of distributed systems, but many of the key points in this article apply to all of them.
There is a reason for these types of systems to be complex in nature. Oftentimes, the complexity of a software system is broken down into three main categories: sequential, concurrent, and distributed.
See for example this talk by Bert Schrijver.
When it comes to distributed systems, we are obviously faced with both concurrent and distributed problems, which give us a hard time when tracking down a particularly tricky bug. As opposed to the classical, concurrent monolith, there are many new and exciting things that can go wrong.
In terms of what can go wrong, there are two major types of problems that we developers are faced with from day to day: external and internal problems.
The external problems were in general very well described by the Fallacies of Distributed Computing almost 30 years ago. In today's cloud environments, I would summarize them roughly as:
Assuming that the surroundings of your application (e.g., network, infrastructure, provisioning, costs) are stable is foolish.
In a way, that is just a variation of Murphy's law. These are the problems we have to keep in mind when dealing with network communication, retry policies, and try-catch blocks.
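To make this concrete, here is a minimal sketch of such a retry policy with exponential backoff. The helper name and parameters are my own illustration, not a specific library; in practice, a library such as Polly gives you this (plus jitter, circuit breakers, and more):

```csharp
using System;
using System.Threading.Tasks;

public static class Retry
{
    // Runs an async operation, retrying on failure with exponential backoff.
    // After the final attempt, the exception propagates to the caller.
    public static async Task<T> RunAsync<T>(Func<Task<T>> operation, int maxAttempts = 3)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Back off before the next attempt: 200 ms, 400 ms, 800 ms, ...
                await Task.Delay(TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt)));
            }
        }
    }
}
```

The exception filter (`when`) keeps the stack trace of the final failure intact, which matters when you later try to understand the incident from logs.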
On the other side are those problems that are business-related. Errors and bugs of this kind are usually caused by developers either missing a requirement or making assumptions about the business within our code that cannot be justified from a technical point of view.
For example, we might implement the consumer of a certain message like this:

```csharp
public void Consume(ProductOrdered message)
{
    this.TotalStock -= message.Products.Count;

    if (this.TotalStock < 0)
    {
        throw new InvalidOperationException("Stock cannot be negative");
    }
}
```
This simple guard clause makes an assumption based on what we think is reasonable in a sales system. However, the whole ecosystem of the application might not care at all about what is reasonable for `TotalStock`. Messages might be delayed, messages might arrive out of order, customers might hit `Buy` at precisely the same time. There are a number of reasons why we might arrive at this line trying to decrement the total stock below zero. From a business perspective that is still an error, but we need to be careful about making this code throw an exception right away.
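One defensive alternative is to record the violation instead of throwing, and let a compensating process deal with it. The following is only a sketch: the message types and the in-memory `Published` list are simplified stand-ins for your real message bus.

```csharp
using System.Collections.Generic;

public record ProductOrdered(string ProductId, int Quantity);
public record BackorderRequested(string ProductId, int Missing);

public class StockConsumer
{
    public int TotalStock { get; private set; }

    // Stand-in for a real publisher; collects outgoing messages.
    public List<object> Published { get; } = new();

    public StockConsumer(int initialStock) => TotalStock = initialStock;

    public void Consume(ProductOrdered message)
    {
        TotalStock -= message.Quantity;

        if (TotalStock < 0)
        {
            // Throwing here would only get the message retried or dead-lettered.
            // Record the business-rule violation instead and let a compensating
            // process (e.g. backordering) handle it.
            Published.Add(new BackorderRequested(message.ProductId, -TotalStock));
        }
    }
}
```

Whether you compensate, park the message, or alert a human is a business decision; the point is that "impossible" states must be handled deliberately rather than crashing the consumer.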
You and your teams build your application, everything looks fine, your customers are using your site, and then, suddenly, a critical bug is found and you need to fix it. How should you proceed?
Let me share my first step when debugging distributed systems:
Don't debug distributed systems.
Now, this statement is not a joke. Since debugging distributed systems can be very hard, especially if your ecosystem consists of a great number of isolated components, I want to encourage you to make actual debugging not your first option. Imagine someone at Amazon tracking down a bug and saying, "Right, I'll just spin up a dozen different services on my machine, written in four different languages, and attach a different debugger to each of them." Not going to happen.
While there will be scenarios where debugging might be required, it should generally be regarded as a last resort. Rather, you should focus on the following topics first when tracking down an issue in your system: observability, testing, and documentation.
Let's take a look at each of these topics.
In order to really solve the problem at hand, you need to understand it first. That seems obvious, but we are usually tempted to dive into a debugging session right away. Save yourself some time and make sure you have leveraged all of your options to gather information about the issue before starting to code. To do that, you need a few tools at hand, which you should already think about in the design phase of your system.
Do you know whether the observability of your system is high? Ask yourself the following:
At any time, can you assess and understand the internal state of your system by asking questions from the outside?
If not, you need to increase the observability in your code. You need to make sure that your system emits enough information in case something goes awry. Three different tools are available to do this:
Metrics assess the current (and maybe also past) state of your applications in (scalar) numbers. This might be in the form of health checks quantifying for example memory consumption, number of incoming requests, etc. They are good indicators that there is a problem, but not good at telling you what's actually wrong.
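As a sketch of what this looks like in .NET, the built-in `System.Diagnostics.Metrics` API lets you define instruments; the meter and instrument names below are illustrative, and an exporter (e.g., OpenTelemetry with Prometheus) would be needed to make the numbers visible from the outside.

```csharp
using System.Diagnostics.Metrics;

public static class ShopMetrics
{
    // A Meter groups related instruments under one name.
    private static readonly Meter Meter = new("MyShop.Orders");

    // A monotonically increasing counter of consumed order messages.
    public static readonly Counter<long> OrdersReceived =
        Meter.CreateCounter<long>("orders.received");
}

// In the message consumer:
// ShopMetrics.OrdersReceived.Add(1);
```

A sudden drop in `orders.received` tells you *that* something is wrong long before you know *what* is wrong, which is exactly the role metrics play.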
The classic. In monolithic applications, logs have been the number one debugging tool for decades. We have all added statements like `log.Info("I am here")` to our code. In distributed systems, however, things get tricky: analyzing logs from two or more applications, trying to synchronize timestamps, and figuring out what went wrong is hard. There is a better way to do this.
In distributed ecosystems, distributed tracing can be a game changer. Traces introduce the dimension of causality to your code. "What triggered this operation, and how long did it take?" is a question that can easily be answered and visualized if your applications export information to a central place. If you want to know more about this, there is a pretty decent talk by Martin Thwaites which I highly recommend.
Let me paraphrase:
Traces will help you understand the big picture of your application's information flow. Make sure to check that out. For .NET people, the official documentation is a good place to start.
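A minimal sketch of emitting a span with .NET's built-in tracing primitives (`ActivitySource`); the names are illustrative, and an exporter such as OpenTelemetry would ship these spans to a central backend where the cross-service picture is assembled:

```csharp
using System.Diagnostics;

public class OrderHandler
{
    // One ActivitySource per component; listeners subscribe by name.
    private static readonly ActivitySource Source = new("MyShop.Orders");

    public void Handle(string orderId)
    {
        // StartActivity returns null when no listener is registered,
        // hence the null-conditional call below.
        using var activity = Source.StartActivity("HandleOrder");
        activity?.SetTag("order.id", orderId);

        // ... actual work; nested StartActivity calls become child spans ...
    }
}
```

Because the trace context is propagated across process boundaries (e.g., via message headers), the backend can stitch the spans of all services into one causal timeline.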
There is a saying in our industry that goes like this:
For every bug in production, you are probably missing a test in development.
That statement may not be true in all situations, but there is certainly some truth to it. Of course, there are things that are difficult to emulate in your development environment, but at least we need to make sure to have all imaginable tests for the business part of our application in place.
In the case of distributed systems, that holds especially true for testing the input and output of your application. Whether you use request-response techniques or messaging to communicate with other modules, make sure that you not only test how the system reacts to incoming data, but also that the output of your code is correct. Did you assert that exactly one `PrepareShipping` message was sent? Did it contain precisely the data that you expected?
If you are lazy here, that can be fatal in production scenarios. And if that happens, the tests for your module should be the first thing to come back to. Can you reproduce that bug with an integration test?
To be able to effectively debug and test your application over its lifetime, there are a few things we should keep in mind already when designing and writing our code.
Assuming you have succeeded in finding the issue in your code, I have one last recommendation: document your findings. Documentation can be done in any way that you and your team consider helpful; it is more important that you do it at all. Many problems are likely to reappear later, perhaps for somebody else. If that person has the chance to search for it, maybe in a wiki, in code, or in a ticket system, that might save many hours of testing and debugging.
In this article, I have laid out my personal thoughts and advice when it comes to debugging distributed systems. Not everything might be of value to you, but I hope that some of the information I shared is helpful anyway.
One last thing I want to mention about errors in our systems is this:
If we want to have a reliable system, we have to have seen it fail often.
In essence: there will be bugs, there will be errors, someone will break production. We cannot avoid that. The only thing we can do is make sure we are prepared to react quickly and in an informed manner when it happens.