IPC will kill your app
I had to debug a core component that couldn't handle its relatively big input data stream.
The application reads input from an AWS Kinesis stream, unpacks it, validates it, processes it, outputs metrics and writes data to an AWS Kinesis Firehose, to be loaded into AWS Redshift for analytical reports later on. The input stream distributes data via its shards to many virtual machines, each running an instance of the "broken" application. The VMs are relatively big - 16 CPUs and 30 GB of memory each. There are no signs of CPU bottlenecks in the basic metrics or in htop (aka the machines are idle). Our application still somehow manages to both be too slow and run out of memory. We'll focus on processing time.
So how to attack the problem?
As is typical, part of the team started to look for problems in all the wrong places. They blamed Kinesis and the recent migration from Python 2 to Python 3, and started tweaking configuration numbers in the deployment of the application to see what happens. When there's no plan, there's always hope ... I ignored their ad-hoc guesswork. In a smaller application you can get away with changing random things until the problem shows itself; a bigger application requires a more systematic, analytical approach.
Before I change anything, I start gathering facts by analyzing all metrics and logs. We have metrics for the Kinesis streams, application logs, metrics for the Firehose streams and metrics for Redshift.
The metrics for the Kinesis stream and its shards showed increased latency. This is the first thing everyone checked, and it tricked them into thinking it was a Kinesis problem, because nobody read what the metrics actually mean. The latency increased because a) more data was coming in from the upstream apps and b) the application wasn't reading the data fast enough.
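The metric that behaves exactly like this is GetRecords.IteratorAgeMilliseconds: it measures how far behind the tip of the stream the consumer is, so it grows both when more data comes in and when the reader is slow - it points at the consumer, not at Kinesis. A minimal boto3 sketch of pulling it (the stream name is a placeholder):

```python
# Check how far behind the consumers of a stream are (stream name is made up).
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "my-input-stream"}],
    StartTime=now - timedelta(hours=6),
    EndTime=now,
    Period=300,
    Statistics=["Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    # A steadily growing iterator age means the consumer can't keep up,
    # no matter what the stream-side latency numbers look like.
    print(point["Timestamp"], round(point["Maximum"] / 1000 / 60, 1), "minutes behind")
```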
Next I checked whether the sinks (Kinesis Firehose and Redshift) were able to ingest everything being sent to them. The numbers are pretty impressive (hundreds of millions of records per load every few minutes), but there are no signs of anything struggling or causing delays. So the problem is in the middle - the app itself.
The application falls behind by 4 hours at peak. There are 4 things to do - find the root cause, design a better solution, implement it, verify it works. I'm completely new to the project, the old maintainers left the company long ago, the new maintainers have no idea what they are doing, I'm pulled in to help from a completely unrelated team, and documentation is not a thing. Tests exist for some of the business logic, but they are small and cannot reproduce the issue at hand at all. I'm working with little to no cooperation (aka good luck, you're on your own).
I usually alternate between an "ivory tower" approach and more practical runtime analysis. To me the ivory tower approach is to sit and read the code until things start to make sense. The new knowledge allows one to form a hypothesis. For more senior engineers this is often enough - they can spot the issue based on their understanding of the architecture, their past experience and their knowledge, and then maybe hand off the solution to someone less senior to implement. The hands-on practical approach is to reproduce the issue in an isolated environment and start analyzing runtime behavior with debuggers, profilers, code modifications and tests. I usually use both, because I get bored doing just one of them and because alternating makes me less likely to explore a wrong path for too long. Gathering runtime data is very important - people are easily tricked by code that looks bad while the actual issue is somewhere else entirely. The disadvantage is that the setup time is higher before you become productive and show any results, because you have to understand how to create your own deployment of the app and how to send it realistic amounts of traffic so the issue can be reproduced.
I'm in the ops team, so I have a lot of background in investigating dependencies and permissions, and eventually I managed to deploy an isolated copy of the app for myself without destroying production, staging or some random database. Reproducing the issue was somewhat harder. I had to learn enough about how the application takes its input (Kinesis streams + the KCL are third-party projects with their own documentation), what the data format is (by reading the source of the upstream app that generates the data and learning how to manually subscribe to the Kinesis data stream), and how to get enough data to overload my isolated environment. Subscribing to the production stream didn't work, as it got overloaded and started interfering with prod. Reading through the source of the upstream project showed me it keeps a usable (but not identical) copy of the data in AWS S3. I also found that the previous maintainers had left behind some abandoned, non-working code that could read data from a file. Fixing that took forever (due to both code and permission issues), but eventually I managed to feed data from S3 to my instance and reproduce the issue.
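The replay harness ended up conceptually close to the sketch below, though the bucket, key, entrypoint and record framing here are all made up - the real app expects whatever framing the KCL daemon normally produces on its stdin.

```python
# Rough sketch of the replay setup: pull a dump of the upstream data from S3
# and pipe it into an isolated instance of the app on stdin. All names below
# are placeholders.
import subprocess

import boto3

s3 = boto3.client("s3")
dump = s3.get_object(Bucket="upstream-data-copy", Key="2019/01/15/part-0000.jsonl")

app = subprocess.Popen(
    ["python", "app_main.py"],      # placeholder for the real entrypoint
    stdin=subprocess.PIPE,
)

# Stream the dump line by line so the whole file never sits in memory.
for line in dump["Body"].iter_lines():
    app.stdin.write(line + b"\n")

app.stdin.close()
app.wait()
```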
Observing how the app slows down and falls behind in prod eliminated my first hypothesis, that there's a CPU bottleneck. The 16 cores were at most 70% used and usually sat mostly idle. This usually points to an I/O issue: slow network, slow disk, slow database. In my dev environment I saw similar behavior, and there's no network in use there (input is from a file, I disabled output), the disk wasn't getting overloaded and no databases are involved. The next hypothesis is that it's an IPC (interprocess communication) issue, where the critical path is spread between separate processes and none of them does enough work to take up a full CPU.
The application is built with multiple processes. The entrypoint is the Amazon KCL library, which runs as a Java process (plus a bunch of auxiliary subprocesses we don't care about), subscribes to a Kinesis shard, starts the client application as a subprocess and feeds it data on stdin. The main application itself starts multiple subprocesses and feeds them data via pipes. This already feels like trouble - there's no easy way to set a breakpoint in a subprocess and step through it with an interactive debugger without refactoring a lot of code.
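Stripped of all business logic, the layout looks roughly like the sketch below (the stage names are mine, not the app's). Every hop is a pickle, a pipe write and a pipe read, and none of the worker processes is reachable by a debugger attached to the parent.

```python
# Minimal sketch of the process layout: the parent reads records from stdin
# (fed to it by the KCL java process) and pushes them through worker
# subprocesses connected by multiprocessing queues. Stage names are invented.
import sys
from multiprocessing import Process, Queue


def batcher(inbox: Queue, outbox: Queue) -> None:
    # Real app: validate, categorize and batch records before passing them on.
    for record in iter(inbox.get, None):
        outbox.put(record)
    outbox.put(None)                    # propagate the shutdown sentinel


def writer(inbox: Queue) -> None:
    # Real app: write batches to Kinesis Firehose.
    for batch in iter(inbox.get, None):
        pass


if __name__ == "__main__":
    q1, q2 = Queue(), Queue()
    stages = [Process(target=batcher, args=(q1, q2)),
              Process(target=writer, args=(q2,))]
    for p in stages:
        p.start()

    for line in sys.stdin:              # records arrive from the KCL daemon
        q1.put(line)
    q1.put(None)                        # sentinel: shut the pipeline down

    for p in stages:
        p.join()
```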
At this point I was bouncing between reading through the code to understand the data flow, reading through the Python multiprocessing source code to understand the library's performance implications and how exactly its IPC works, and searching Google and GitHub for a sane profiler that can work in a multiprocess environment without much code modification.
IPC was being used for three things:
- Sending jobs from one process to the next. E.g. sending messages from the process reading Kinesis data to a subprocess categorizing and batching the data, then sending the batched data from one process to the next for more processing.
- Worker processes sending "slow down" messages to processes feeding them jobs, implementing backpressure.
- A "state manager" process holding a bunch of static parameters (for subprocesses to configure themselves) and counters that get incremented by all processes for metrics collection and logging.
Because there's so much IPC and so many processes, I had to find out what exactly the bottleneck was. Eventually, using a profiler, I found 2 main processes spending most of their time in IPC send/receive calls and started investigating them. Note how much work has been done so far just to set the stage - we're only now getting to the root cause.
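Not a profiler, but a cheap first filter: sampling per-process CPU across the whole process tree quickly shows which of the ~20 processes are even worth profiling. A sketch using psutil (the PID is a placeholder for the top-level KCL java process):

```python
# Snapshot of per-process CPU usage for the whole process tree, to see which
# subprocesses actually do any work. KCL_PID is a placeholder.
import time

import psutil

KCL_PID = 12345
root = psutil.Process(KCL_PID)
tree = [root] + root.children(recursive=True)

for p in tree:
    p.cpu_percent(interval=None)        # first call only sets the baseline

time.sleep(10)                          # measure over a 10 second window

usage = [(p.cpu_percent(interval=None), p) for p in tree]
usage.sort(key=lambda item: item[0], reverse=True)
for pct, proc in usage:
    print(f"{proc.pid:>7} {pct:6.1f}%  {' '.join(proc.cmdline()[:3])}")
```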
I found a few issues:
- The processes in question were exchanging relatively small messages and were often waiting for each other in lock-step. The fix was to batch up at least 1000 messages or 1 second of data before sending it on for processing (see the sketch after this list). This gave us around a 2x improvement, because the CPUs were now actually getting used.
- Some inner loops were incrementing counters (metrics) in the state manager, which is a separate process - meaning IPC on every iteration. Pretty bad. Another 2x improvement from making the counters local to the respective process or moving the increments out of the inner loops.
- One of the processes was applying too much backpressure because its buffer was too small. When the buffer fills up, backpressure is applied, but then the buffer is drained almost instantly, so the process starves itself of work and sits idle for a while. A bigger buffer was enough to fix it.
- Of the 20 or so processes started, the majority of the work is done in 2 of them, so the machine is mostly sitting idle by design. There's no easy fix for that without rewriting everything. Luckily it wasn't needed, as I squeezed out plenty of performance just by reducing IPC.
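To make the first two fixes concrete, here's a sketch of what the sending side ends up looking like (the names are invented; the thresholds match the ones above):

```python
# Sketch of the batching + local-counters fixes. Instead of one queue.put() and
# one shared-counter increment per record, the sender batches records and keeps
# its counters locally, flushing them to the state manager once per batch.
import time
from multiprocessing import Queue

BATCH_SIZE = 1000          # flush after ~1000 messages ...
BATCH_SECONDS = 1.0        # ... or after 1 second, whichever comes first


def feed(records, outbox: Queue, shared_counters) -> None:
    batch = []
    local = {"records_read": 0}                 # plain dict, no IPC per increment
    deadline = time.monotonic() + BATCH_SECONDS

    for record in records:
        batch.append(record)
        local["records_read"] += 1              # stays in-process

        if len(batch) >= BATCH_SIZE or time.monotonic() >= deadline:
            outbox.put(batch)                   # one IPC call for the whole batch
            batch = []
            deadline = time.monotonic() + BATCH_SECONDS

            # Metrics go to the shared state manager once per batch
            # instead of once per record.
            shared_counters["records_read"] += local["records_read"]
            local["records_read"] = 0

    if batch:                                   # flush whatever is left over
        outbox.put(batch)
        shared_counters["records_read"] += local["records_read"]
```

The backpressure fix needs no code - it was just a bigger buffer (a larger maximum size on the queue feeding the starved worker).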
Data locality matters even in relatively high-level applications written in Python, where top performance isn't a main requirement.