Invincible Bug


In 2018, I encountered an invincible bug in a third-party library. The bug triggered a whole chain of events that led to a bunch of interesting conclusions.

Puzzle in the shape of a bug.

Background

Advertising-funded businesses like 2GIS or Google have only one way to survive: they must please not only users but also advertisers. Users don’t pay, but advertisers do.

Users like it when a service solves their problem well and keeps up with the times. Typically, users of the 2GIS mobile and web applications launch the application to find something. They click on search results and on buildings on the map, and they configure various parameters. Users do many different things. The company would like to understand how ordinary users (tens of millions of them) use the application, and how they do not use it.

Advertisers like it when advertising costs are transparent. For example, this means that every click on a paid search position must be tracked and presented to the advertiser in a clear report.

Therefore, our team built and maintained an ingestion system and pipeline for processing business events, such as user clicks and application launches. The first version of this system ran in production and processed a couple of hundred requests per second (each request could contain thousands of business events). We chose Kafka as the basis for the second version of the system. Kafka scales well. We cared about this because tens of millions of randomly clicking users are no joke. Plus, Kafka allows you to deliver events without duplicates or losses (we had suffered from both in the first version of the system).

At that time, we wrote most of our code in C++, and there was only one Kafka client library for C++: librdkafka. The library is open source, so we could at least try to understand how it works. Its documentation and programming interfaces were far from ideal. The splendor and misery of open source, as our team leader said. I do not blame the author of the library; on the contrary, I respect him very much. He wrote a huge and useful library almost single-handedly. It is understandable that he did not have time for everything.

One day, QA found a bug that led to the loss of events. The events were lost because the system failed completely. The bug chewed up the system the way a VCR chews up a cassette.

The story

I sat down to deal with this bug. It quickly became clear that the bug was in librdkafka. Exactly where inside librdkafka was not clear. The implementation was extremely confusing. Core multi-threaded code in C/C++, a lot of thread synchronization, a lot of non-trivial optimizations for the sake of performance, eh.

I tried to figure it out for a couple more days. It became clear that I had no chance of fixing this bug in librdkafka. A fix would have required reworking the library globally, and since there were no tests there, I could not have fully verified the result of the rework. I could not bring myself to consult the author. I was shy. The author had a sledgehammer on his avatar, and this sledgehammer somehow bothered me.

librdkafka author.

The Kafka client library for Java/Scala did not have a similar bug, but it was written completely differently. It had nothing in common with the C++ implementation, and its structure was much simpler.

Over a long weekend, I rewrote one of the services in Scala. Admittedly, the service itself was small, about 1000 lines of code. The functionality did not change, but the configuration file format had to change. I had to redo the metrics collected from the service. I had to redo the logs. The build and deployment process was noticeably different. In total, that came to another 6 thousand lines of code.

I thought I had done a heroic thing by promptly rewriting everything from scratch. And then I was greeted by two things:

  1. Carpal tunnel syndrome. I thought typists’ illnesses were rare. It turns out that this can easily happen if you type for 12+ hours without breaks for several days in a row. Even despite the relatively ergonomic keyboard (Microsoft Sculpt) and ergonomic layout (Colemak). Most likely, it would have been worse without them. A numb hand is very unpleasant.
  2. QA’s response. For some reason, it turned out that they were not happy with such a radical fix. Everything had to be retested from scratch. Arguments that the bug could not be fixed any other way did not go over well.

In the end, I cured the syndrome (it took years) and convinced QA (it took a couple of days). Once QA was convinced, I went to a related team and persuaded them to rewrite one of their services from their favorite C# to Java. The C# implementation of their service had the same bug. I thought I would be met with hostility, because C# developers on the Internet hate Java. It turned out otherwise: people understood and took it very well. It was a pleasure to review their final implementation, which was technically perfect.

Conclusions

  1. It’s good when you can consult with the author. This applies not only to open source libraries, but to anything. Don’t be shy, and don’t wait until you have improved yourself first. This is exactly your chance to improve. In general, following the thought process of the founders of a favorite technology or subject area is invaluable. It makes sense to know at least a dozen names of pioneers in your field.
  2. Global changes must be announced in advance. If you are 100% sure, you don’t need to ask for permission, but it’s worth warning people and briefly explaining. It’s easier to ask forgiveness than to get permission, as Grace Hopper said (she is often credited with popularizing the term “bug”, although the word was in use before her).
  3. A computer can seriously damage your health. It makes a lot of sense to invest at least a couple of months’ pay (that is, two monthly salaries) in the ergonomics of your workplace. Be honest with yourself: there are still hundreds of months of sitting in front of a computer ahead. Good ergonomics can turn these months from hard and painful to easy and enjoyable.
  4. Even if it seems to you that the adjacent team will burn you and your proposal, go anyway. No one burns polite and open people. Friends in related teams are great.
  5. Help your colleagues avoid ending up in overtime out of the blue.