debugging distributed systems
I'm on a team that maintains and develops multiple (think 20) relatively small but heavily used distributed systems, and finding and fixing an issue is always a challenge. Every small moving part acts as a multiplier on the overall complexity:
- the architectures of different systems have nothing in common because they evolve independently at different frequencies
- the people that develop the systems change all the time
- the trendy technologies used change all the time
- business priorities change daily
- support team experience and availability change all the time
- time zones and cultures vary, because the people involved are spread across multiple continents
- software provided by vendors has to be kept in mind, and each vendor has its own outages, release schedules and working hours.
The combinatorial explosion that results from all those factors makes the surface area to explore and monitor so large that it quickly becomes unmanageable. It's somewhat similar to what happens in games where many simple pieces (chess) or player-controlled characters (MOBA games) end up interacting in complex ways.
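To make the multiplier effect concrete, here is a back-of-the-envelope sketch. Only the 20 systems come from the description above; every other count is a made-up placeholder, and treating the factors as fully independent is a simplifying assumption.

```python
from math import prod

# Hypothetical counts, for illustration only. Only "systems" (think 20)
# comes from the text; the rest are invented placeholders.
factors = {
    "systems": 20,
    "architectures": 5,
    "vendor_dependencies": 4,
    "timezones": 3,
    "support_shifts": 3,
}


def state_space(factors):
    """Number of distinct combinations if every factor varies independently."""
    return prod(factors.values())


total = state_space(factors)
print(total)  # 3600

# Adding one more factor with just two possible values doubles the space.
print(state_space({**factors, "release_trains": 2}))  # 7200
```

The exact numbers don't matter. The point is that each new factor multiplies the space instead of adding to it.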
So how to approach this? Generic life advice is perfectly applicable, nothing new under the sun. Take a step back, relax, don't let yourself be overwhelmed, divide and conquer, search exhaustively and be consistent, try not to cut corners. Of course, easier said than done. Experience is one of the best tools to have.
Unfortunately, computer science is a relatively new field in the grand scheme of things and still suffers from the consequences of trying to differentiate itself as a separate branch of "real" science. There's ego, gatekeeping, and a habit of inventing its own terminology and practices while ignoring, and deliberately not teaching, practices long established in other fields.
The "real" world is a distributed system. Learn how other people analyze and solve problems in it.
People make fun of lab experiments and results that are not directly applicable to a real-world situation, but that's a naive stance. When faced with a complex system, it's best to remove as many variables unrelated to the problem at hand as possible. A good way to do that is with a separate, isolated testing environment. It's no different with software. It seems obvious, and it is, but it's still worth repeating.
The best case is to reproduce the issue in an environment where you can analyze the root cause without worrying about destroying production. Isolating a distributed system is an art of its own, because it often drags a huge number of dependencies with it, both internal and external. Usually it means copying a bunch of databases, copying the program execution environment (recursively for dependencies, if needed), and parametrizing everything to use the new resources. This is where it pays off to have sane, well-made, parametrized, automated deployments of your projects and people who know how to operate them.
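As a minimal sketch of what "parametrize everything" can look like, assume every external resource is handed to the program through environment variables. The variable names (`APP_DB_URL`, `APP_QUEUE_URL`, `APP_VENDOR_API_URL`), the `Resources` class and the rule that production endpoints contain the string "prod" are all hypothetical. A real setup will have its own naming and its own way of recognizing production.

```python
import os
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Resources:
    """Every external dependency the program talks to, in one place."""
    db_url: str
    queue_url: str
    vendor_api_url: str


def load_resources(env=os.environ):
    # Hypothetical variable names. Nothing is hard-coded, so the same
    # build can be pointed at copied databases and stubbed vendors.
    return Resources(
        db_url=env["APP_DB_URL"],
        queue_url=env["APP_QUEUE_URL"],
        vendor_api_url=env["APP_VENDOR_API_URL"],
    )


def assert_isolated(resources, forbidden=("prod",)):
    """Refuse to run if any endpoint still looks like production.

    Assumes production endpoints are recognizable by a substring,
    which is a naming convention invented for this example.
    """
    leaks = [
        name
        for name, value in asdict(resources).items()
        if any(marker in value.lower() for marker in forbidden)
    ]
    if leaks:
        raise RuntimeError(f"sandbox still points at production: {leaks}")


sandbox_env = {
    "APP_DB_URL": "postgres://sandbox-db.internal/orders_copy",
    "APP_QUEUE_URL": "amqp://sandbox-mq.internal",
    "APP_VENDOR_API_URL": "http://localhost:8080/vendor-stub",
}
sandbox = load_resources(sandbox_env)
assert_isolated(sandbox)  # passes: nothing here looks like production
```

The guard is deliberately crude. Its only job is to fail loudly before a debugging session touches a real database because one parameter was forgotten.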