Debugging distributed systems is hard!

Dennis Frühauff on July 12th, 2023

Many among us are deeply involved with one or even multiple distributed systems. These environments exhibit a level of complexity that is very different from what we were accustomed to in a classic monolithic application. This becomes especially apparent when things go wrong in your system; and they always do. Debugging a distributed system is hard, so let's talk about it.


I have worked on distributed systems of different natures as well as monolithic applications over the past few years, and one insight has clearly emerged: if you do not prepare and build your (distributed) application with debugging in mind, you will encounter significant difficulties when production issues arise. Allow me to share some thoughts on this matter.


What is a distributed system?

First of all, let us all develop a common understanding of what the term "distributed system" describes:


A distributed system is a system whose components are situated in different network locations. Typically, the various modules of such a system communicate with one another through a messaging mechanism, which could be implemented using events, commands, traditional web requests, or similar methods.


In that sense, a distributed system could be the sum of very different services interacting with each other (a service-oriented architecture), a simple client-server application whose two parts are deployed independently of each other, or instances of the same application running on different machines in some kind of multiplayer scenario.


Here, we will focus on the first category of distributed systems, but many of the key points in this article apply to all of them.


There is a reason these types of systems are complex in nature. Oftentimes, the complexity of a software system is broken down into three main categories:


  • Level 1: Non-concurrent,
  • Level 2: Concurrent,
  • Level 3: Distributed.

See for example this talk by Bert Schrijver.


When it comes to distributed systems, we are faced with both concurrent and distributed problems, which give us a hard time when tracking down a particularly tricky bug. As opposed to the classic, concurrent monolith, there are many new and exciting things that can go wrong.


What can go wrong? Network fallacies and business problems

In terms of what can go wrong, there are two major types of problems that we developers face from day to day: external and internal problems.


The external problems are, in general, very well described by the Fallacies of Distributed Computing, formulated almost 30 years ago. In today's cloud environments, I would summarize them loosely as:


Assuming that the surroundings of your application (network, infrastructure, provisioning, costs, etc.) are stable is foolish.


In a way, that is just a variation of Murphy's law. These are the problems we have to keep in mind when dealing with network communication, retry policies, and try-catch blocks.
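To make this first category a bit more tangible, here is a minimal sketch of defending against one of these fallacies: a small retry around an HTTP call. The endpoint URL and the retry parameters are purely illustrative assumptions; in a real code base, a resilience library such as Polly would be the more idiomatic choice.


using System;
using System.Net.Http;
using System.Threading.Tasks;

public class InventoryClient
{
    private readonly HttpClient client = new();

    // Retries a transient network failure a few times with exponential backoff.
    // The URL and the attempt count are illustrative assumptions.
    public async Task<string> GetStockAsync()
    {
        const int maxAttempts = 3;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                var response = await client.GetAsync("https://inventory.example.com/stock");
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                // The network is not reliable: back off and try again
                // instead of failing on the first hiccup.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
        }
    }
}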


On the other side are the problems that are business-related. Errors and bugs related to the business are usually caused by developers either missing a requirement or making assumptions about the business within the code that cannot be justified from a technical point of view.


For example, we might implement the consumer of a certain message like this:


public void Consume(ProductOrdered message)
{
    this.TotalStock -= message.Products.Count;

    // Guard clause: we assume the total stock can never drop below zero.
    if (this.TotalStock < 0)
    {
        throw new InvalidOperationException("Stock cannot be negative");
    }
}

This simple guard clause makes an assumption based on what we think is reasonable for a sales system. However, the whole ecosystem around the application might not care at all about what is reasonable for TotalStock. Messages might be delayed, messages might arrive out of order, and customers might hit Buy at precisely the same time. There are a number of reasons why we might arrive at this line trying to decrement the total stock below zero. From a business perspective that is still an error, but we need to be careful about making this code throw an exception right away.
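What a more defensive variant could look like is sketched below. Note that the logger and the backorder service are hypothetical dependencies added purely for illustration; the point is to record the violation and hand it to a follow-up process instead of throwing immediately.


public void Consume(ProductOrdered message)
{
    this.TotalStock -= message.Products.Count;

    if (this.TotalStock < 0)
    {
        // Delayed, reordered, or concurrent messages can legitimately push the
        // counter below zero. Record the fact and trigger a compensating action
        // instead of throwing and causing endless retries of this message.
        // 'logger' and 'backorderService' are hypothetical dependencies.
        this.logger.LogWarning("Total stock dropped to {TotalStock}, flagging order for backorder", this.TotalStock);
        this.backorderService.Flag(message);
    }
}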


How to debug?

You and your teams build your application, everything looks fine, your customers are using your site, and then, suddenly, a critical bug is found and you need to fix it. How should you proceed?


Let me share my first step when debugging distributed systems:


Don't debug distributed systems.


Now, this statement is not a joke. Since debugging distributed systems can be very hard, especially if your ecosystem consists of a great number of isolated components, I want to encourage you not to make actual debugging your first option. Imagine someone at Amazon tracking down a bug and saying, "Right, I'll just spin up a dozen different services on my machine, written in four different languages, and attach a different debugger to each of them." Not going to happen.


While there will be scenarios where debugging might be required, it should generally be regarded as a last resort. Rather, you should focus on the following topics first when tracking down an issue in your system:


  • Observability: Understanding the problem thoroughly
  • Test, Test, Test
  • Divide and Conquer
  • Design for Failure

Let's take a look at each of these topics.


Observability: Understanding the problem thoroughly

In order to really be able to solve the problem at hand, you need to understand it first. That seems obvious, but we are usually tempted to dive into a debugging session right away. Save yourself some time and make sure you have leveraged all of your options to gather information about the issue before starting to code. To do that, you want to have a few tools at hand, which you need to think about already in the design phase of your system.


What is observability?

Do you know whether the observability of your system is high? Ask yourself the following:


At any time, can you assess and understand the internal state of your system by asking questions from the outside?


If not, you need to increase the observability of your system. You need to make sure that your system emits enough information in case something goes awry. Three different tools are available to do this:


Metrics

Metrics assess the current (and perhaps also past) state of your applications in (scalar) numbers. This might be in the form of health checks quantifying, for example, memory consumption, the number of incoming requests, etc. They are good indicators that there is a problem, but not good at telling you what is actually wrong.
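As a small sketch, this is what such a metric can look like in .NET using the built-in System.Diagnostics.Metrics API; the meter and instrument names are my own illustrative choices.


using System.Diagnostics.Metrics;

public class OrderMetrics
{
    // The meter groups the instruments; a collector (e.g., an OpenTelemetry
    // exporter) can scrape these values and alert on them.
    private static readonly Meter Meter = new("Shop.OrderService");

    private static readonly Counter<long> OrdersReceived =
        Meter.CreateCounter<long>("orders_received");

    public void OrderReceived() => OrdersReceived.Add(1);
}

An exporter can then collect these counters centrally and alert you when, for example, incoming orders suddenly drop to zero.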


Logs

The classic. In monolithic applications, logs have been the number one debugging tool for decades. We have all added statements like log.Info("I am here") to our code. In distributed systems, however, things get tricky. Analyzing logs from two or more applications, trying to synchronize timestamps, and figuring out what is wrong is hard. There is a better way to do this.
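Before we get to that better way: where you do keep logs, structured logging at least makes them searchable and correlatable in a central store. A minimal sketch, assuming an injected Microsoft.Extensions.Logging ILogger and an illustrative OrderId property that was not part of the earlier example:


// Named properties instead of string concatenation, so a central log store
// can be queried for a specific order across all services.
// 'message.OrderId' is an illustrative property, not part of the earlier example.
logger.LogInformation(
    "Received {MessageType} for order {OrderId}",
    nameof(ProductOrdered), message.OrderId);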


Traces

In distributed ecosystems, distributed tracing can be your game changer. Traces introduce the dimension of causality to your code. "What triggered this operation, and how long did it take?" is a question that can easily be answered and visualized if your applications export this information to a central place. If you want to know more about this, there is a pretty decent talk by Martin Thwaites that I highly recommend.
Let me paraphrase:


  • Use distributed tracing.
  • You don't know what you need to know, so put as much information as possible into the context of a trace.
  • If you leverage tracing fully, you don't need logs anymore.

Traces will help you understand the big picture of your applications' information flow. Make sure to check that out. For .NET people, the official documentation is a good place to start.
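In .NET, for instance, the built-in ActivitySource from System.Diagnostics is the hook that OpenTelemetry exporters pick up. A minimal sketch, assuming your messaging framework restores the incoming trace context, with illustrative source and tag names:


using System.Diagnostics;

public class ProductOrderedHandler
{
    private static readonly ActivitySource ActivitySource = new("Shop.OrderService");

    public void Consume(ProductOrdered message)
    {
        // Starts a span under the current trace context, so a tracing backend
        // can show this handler as part of the whole call chain.
        using var activity = ActivitySource.StartActivity("ConsumeProductOrdered");
        activity?.SetTag("order.product.count", message.Products.Count);

        // ... handle the message ...
    }
}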


Test, Test, Test

There is a saying in our industry that goes like this:


For every bug in production, you are probably missing a test in development.


That statement may not be true in every situation, but there is certainly some truth to it. Of course, there are things that are difficult to emulate in your development environment, but at the very least we need to make sure that every imaginable test for the business part of our application is in place.
In the case of distributed systems, that holds especially true for testing the input and output of your application. Whether you use request-response techniques or messaging to communicate with other modules, make sure that you not only test how the system reacts to incoming data, but also that the output of your code is correct. Did you assert that exactly one PrepareShipping message was sent? Did it contain precisely the data that you expected?
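What such an assertion looks like depends heavily on your messaging framework. Below is a rough sketch in the style of an xUnit test against a hypothetical in-memory fake bus; FakeMessageBus, ProductOrderedHandler, and the message shapes are stand-ins, not any specific framework's API.


[Fact]
public void Ordering_products_sends_exactly_one_PrepareShipping_message()
{
    // Hypothetical test doubles standing in for your real messaging harness.
    var bus = new FakeMessageBus();
    var handler = new ProductOrderedHandler(bus);

    handler.Consume(new ProductOrdered { Products = { "product-42" } });

    // Assert the output, not just the absence of an exception:
    // exactly one PrepareShipping message, carrying the expected content.
    var shipping = Assert.Single(bus.SentMessages.OfType<PrepareShipping>());
    Assert.Single(shipping.Products);
}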


If you are lazy here, it can be fatal in production scenarios. And if that happens, the tests for your module should be the first thing to come back to. Can you reproduce the bug with an integration test?


Design for failure

In order to effectively debug and test your application during its lifetime, a few things come to mind that we should already be aware of when designing and writing our code.


  • Modularity/Separation of Concerns: A modular, or even vertically sliced, application is easier to test. Unnecessary parts can be stripped away easily, reducing the clutter around the relevant code while debugging and testing.
  • Consolidated Logging: If logging is your go-to strategy, make sure it is centralized. Every application in your ecosystem should log to the same provider. Also, make sure developers agree on how to log information, which levels to use, and, in case of structured logging, how properties should be named to give the best picture to the poor souls reading them.
  • Consolidated Error Handling: Make sure you have a common understanding of what is an error (an exception) and what is not. Throwing exceptions generously in a message-oriented system can be fatal, because it can trigger chains of retries in situations that do not resolve themselves.
  • Don't Make Assumptions: Be careful about business assumptions creeping into your code. The input you get might be nonsense from a business perspective, so be prepared to handle it either way. Also, don't rely on correct message order; a messaging provider that can guarantee correct message delivery order is rarely found.
  • Reproduce Infrastructure Locally: With everything hosted in the cloud, local development can become a nightmare if resources are shared among developers. Several people publishing messages on the same topic of an AWS queue will easily interfere with each other. Some cloud services can be hosted locally in Docker containers; in the case of AWS, this is achieved via localstack. If your cloud services cannot be mirrored on the developer's machine, consider implementing mock replacements, for example for your message bus, so you can work in isolation if necessary (see the sketch after this list). This will already be of great value for your integration tests.
  • Simulate External Services: If your application needs to talk to external services, consider implementing simulators or stubs so you can remove additional complexity from your scenario. Having worked at a company that acted as a system integrator, providing ourselves with simulators of the external systems often saved us a great amount of work.
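To illustrate the mock replacement for the message bus mentioned above, here is a rough sketch of an in-memory stand-in. The interface and the names are illustrative and deliberately much simpler than a real broker.


using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical abstraction that the application code depends on; the real
// implementation talks to the cloud broker, this one stays in memory.
public interface IMessageBus
{
    Task PublishAsync(object message);
    void Subscribe<T>(Func<T, Task> handler);
}

public sealed class InMemoryMessageBus : IMessageBus
{
    private readonly List<Func<object, Task>> handlers = new();

    public void Subscribe<T>(Func<T, Task> handler) =>
        handlers.Add(m => m is T typed ? handler(typed) : Task.CompletedTask);

    public async Task PublishAsync(object message)
    {
        // Delivers synchronously and in order; convenient for local runs and
        // integration tests, but intentionally much simpler than a real broker.
        foreach (var handler in handlers)
            await handler(message);
    }
}

The same fake can usually double as the test bus used in the integration tests discussed above.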

Documentation

Assuming you have succeeded in finding the issue in your code, I have one last recommendation: document your findings. Documentation can be done in any way that you and your team consider helpful; it is more important that you do it at all. Many problems are likely to reappear later, maybe for somebody else. If that person has the chance to search for it, maybe in a wiki, in code, or in a ticket system, that might save many hours of testing and debugging.


Conclusion

In this article, I have laid out my personal thoughts and advice when it comes to debugging distributed systems. Not everything might be of value to you, but I hope that some of the information I shared is helpful anyway.


One last thing I want to mention about errors in our systems is this:


If we want to have a reliable system, we have to have seen it fail often.


In essence, there will be bugs, there will be errors, and someone will break production. We cannot avoid that. The only thing we can do is make sure we are prepared to act quickly and in an informed way when it happens.


