Describing and Debugging Change Over Time in a System

So, Aria Stewart tweeted two questions and statement the other day:

I wanted to discuss this idea as it pertains to debugging and strategies I’ve been employing more often lately. But lets examine the topic at face value first, describing change over time in a system. The abstract “system” is where a lot of the depth in this question comes from. It could be talking about the computer you’re on, the code base you work in, data moving through your application, an organization of people, many things fall into a system of some sort, and time acts upon them all. I’m going to choose the data moving through a system as the primary topic, but also talk about code bases over time.

Another part of the question that keeps it quite open is the lack of “why”, “what”, or “how” in it. This means we could discuss why the data needs to be transformed in various ways, why we added a feature or change some code, why an organization is investing in writing software at all. We could talk about what change at each step in a data pipeline, what changes have happened in a given commit, or what goals were accomplished each month by some folks. Or, the topic could be how a compiler changes the data as it passes through, how a programmer sets about making changes to a code base, or how an organization made its decisions to go in the directions it did. All quite valid and this is but a fraction of the depth in this simple question.

Let’s talk about systems a bit. At work, we have a number of services talking via a messaging system and a relational database. The current buzz phrase for this is “micro services” but we also called them “service oriented architectures” in the past. My previous job, I worked in a much smaller system that had many components for gathering data, as well as a few components for processing that data and sending it back to the servers. Both of these systems shared common attributes which most other systems also must cope with: events that provide data to be processed happen in functionally random order, that data is fed into processors who then stage data to be consumed by other parts of the system.

When problems arise in systems like these, it can be difficult to tell what piece is causing disruption. The point where the data changes from  healthy to problematic may be a few steps removed from the layer that the problem is detected in. Also, sometimes the data is good enough to only cause subtle problems. At the start of the investigation, all you might know is something bad happened in the past. It is especially at these points when we need the description of the change that should happen to our data over time, hopefully with as much detail as possible.

The more snarky among us will point out the source code is what is running so why would you need some other description? The problem often isn’t that a given developer can’t understand code as they read it, though that may be the case. Rather, I find the problem is that code is meant to handle so many different cases and scenarios that the exact slice that I care about is often hard to track. Luckily our brains are build up these mental models that we can use to traverse our code, eliminating blocks of code because we intuitively “know” the problem couldn’t be in them, because we have an idea of how our code should work. Unfortunately, it is often at the mental model part where problems arise. The same tricks we use to read faster and then miss errors in our own writing are what can cause problems when understanding why a system is working in some way we didn’t expect.

Mental models are often incomplete due to using libraries, having multiple developers on a project, and the ravages of time clawing away at our memory. In some cases the mental model is just wrong. You may have intended to make a change but forgot to actually do it,  maybe you read some documentation in a different way than intended, possibly you made a mistake while writing the code such as a copy/paste error or off by 1 in a loop. It doesn’t really matter the source of the flaw though, because when we’re hunting a bug. The goal is to find what the flaw in both the code and the mental model are so it can be corrected, then we can try to identify why the model got out of wack in the first place.

Can we describe change in a system over time? Probably to some reasonable degree of accuracy, but likely not completely. How does all of this tie into debugging? The strategy I’ve been practicing when I hit these situations is particularly geared around the idea that my mental model and the code are not in agreement. I shut off anything that might interrupt a deep focus time such as my monitors and phone, then gather a stack of paper and a pen. I write down the reproduction steps at whatever level of detail they were given to me to use as a guide in the next process.

I then write out every single step along the path that the data will take as it appears in my mental model, preferably in order. This often means a number of arrows as I put in steps I forgot. Because I know the shape data and the reproduction steps, I can make assumptions like “we have an active connection to the database.” Assumptions are okay at this point, I’m just building up a vertical slice of the system and how it affects a single type of data. Once I’ve gotten a complete list of event on the path, I then start the coding part. I go through and add log lines that line up with list I made, or improve them when I see there is already some logging at a point. Running the code periodically to make sure my new code hasn’t caused any issues and that my mental model still holds true.

The goal of this exercise isn’t necessarily to bring the code base into alignment with my mental model, because my mental model may be wrong for a reason. But because there is a bug, so rarely am I just fixing my mental model, unless of course I discover the root cause and have to just say “working as intended.” As I go through, I make notes on my paper mental model where things vary, often forgotten steps make their way in now. Eventually, I find some step that doesn’t match up, at that point I probably know how to solve the bug, but usually I keep going, correcting the bug in the code, but continuing to analyze the system against my mental model.

I always keep going until I exhaust the steps in my mental model for a few reasons. First, since there was at least one major flaw in my mental model, there could be more, especially if that first one was obscuring other faults. Second, this is an opportunity to update my mental model with plenty of work already like writing the list and building any tools that were needed to capture the events. Last, the sort of logging and tools I build for validating my mental model, are often useful in the future when doing more debugging, so completing the path can make me better prepared for next time.

If you found this interesting, give this strategy a whirl. If you are wondering what level of detail I include in my event lists, commonly I’ll fill 1-3 pages with one event per line and some lines scratched out or with arrows drawn in the middle. Usually this documentation gets obsolete very fast. This is because it is nearly as detailed as the code, and only a thin vertical slice for very specific data, not the generalized case. I don’t try to save it or format it for other folks’ consumption. The are just notes for me.

I think this strategy is a step toward fulfilling the statement portion of Aria’s tweet, “Practice this.” One of the people you need to be concerned with the most when trying to describe change in a system, is yourself. Because if you can’t describe it to yourself, how are you ever going to describe it to others?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s