A Case Study on Debugging High-Concurrency Bugs in a Multi-Threaded Environment

The article covers the debugging process for core dump issues, memory leaks, and performance hotspots, as well as the use of various tools such as GDB, Valgrind, AddressSanitizer, and Perf.

Introduction

In April, I came across an interview question that was reportedly asked by a Tencent campus recruitment interviewer: In a multi-threaded and high-concurrency environment, if there is a bug that occurs once every one million times on average, how do you debug this bug?

The original post on Zhihu is as follows: https://zhuanlan.zhihu.com/p/21348220.

Unfortunately, many answers on Zhihu criticize the validity of the question itself. Although I am not the interviewer in this instance, I believe this is an excellent interview question. Of course, it's just a bonus question, with no points deducted for not being able to answer it, but a good response indicates that the candidate's problem-solving approach and ability are above the average level of fresh graduates.

The reason I wrote the paragraph above is that I think most back-end server developers are likely to run into this kind of bug sooner or later. Even those who haven't can be prompted by such questions to keep thinking and reflecting. Coincidentally, I ran into a similar and even more severe bug in April. It was a hole I had dug for myself: if I could not climb out of it, the entire project could not be launched.

More than a month has passed since then, and now that I have some time, I will summarize the experience, hoping that some of the approaches and tools mentioned here can be of help to others.

Project Background

We made significant modifications to the nginx event framework and the OpenSSL protocol stack to improve the computational performance of nginx's HTTPS full handshake. Because native nginx performs RSA calculations on the local CPU, the single-core capacity of the ECDHE_RSA cipher suite is only about 400 QPS, and the concurrent performance in initial testing was very low: even with 24 cores, it could not exceed 10,000 QPS.

The core functionality was completed by the end of last year, and offline testing found no problems; the optimized version performed several times better. To verify the maximum performance, we used many clients to stress-test HTTPS concurrently, and quickly ran into some issues:

1. The first issue is that nginx core dumps with an extremely low probability (on the order of one in a hundred million), and in different places each time. In the daytime offline stress test at 20,000 QPS, it usually took two to three hours to produce a core dump. Every night before going to bed I would compile the latest debug code and start the test, and the first thing I did after waking up was check the machines, hoping there were no core dumps. Unfortunately there were usually a few to a few dozen, and they were often concentrated around a single point in time. The online gray-release test ran for six days, and it was not until the morning of the sixth day that dozens of core dumps appeared in a concentrated burst; in other words, the probability of this core dump really is on the order of one in a hundred million.

However, unlike the multi-threading in the interview question, nginx uses a multi-process, fully asynchronous event-driven programming model (it now also supports threads, but only for I/O optimization; the core mechanism is still multi-process plus asynchronous events). For a web server, the advantages of multi-process asynchronous over multi-threaded are higher performance, no thread-switching overhead, and independent address spaces, which eliminates lock contention between threads. The disadvantage, of course, is that asynchronous programming is very complex: it cuts logically continuous operations into fragments separated in time and space, which does not match the way people normally think, and it makes problems hard to trace. Asynchronous event handling also demands deeper knowledge of the network and the operating system, and it is easy to dig yourself into a hole if you are not careful.

2. The second issue is that nginx has memory leaks during high concurrency. There is no problem when the traffic is low, but increasing the test traffic will cause memory leaks.

3. The third issue is that we have made some modifications to the key code of nginx and OpenSSL, hoping to improve its performance. So, how do we find performance hotspots and bottlenecks and continuously optimize them?

An important piece of background for the first two issues: they only appear when concurrency is above roughly 10,000 QPS; at a few hundred or one to two thousand QPS, the program shows no problems at all.

Debugging Core Dump

First, let's discuss the solution for the core dump issue, mainly focusing on the following points:

1. GDB and debug-log analysis, which turned out to be of little use.
2. How to reproduce the bug?
3. Constructing a high-concurrency stress test system.
4. Constructing stable abnormal requests.

GDB and Debug Log Efficiency Is Too Low

Since there is a core dump, the issue looks easy to locate at first glance: use GDB to find the crash point and run a backtrace (bt) to see the basic cause and context. The direct cause of the core dump is very simple and common: every crash is a NULL pointer dereference. However, the function context does not explain why the values are NULL, because these pointers are used in native nginx events and modules and should never be NULL at those points. Since the root cause could not be found for the time being, the first step was just to stop the crashes. The fix is trivial: check whether the pointer is NULL, and if it is, return without dereferencing it (a sketch of this kind of guard follows), and the core dump no longer occurs at that spot.
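
A minimal sketch of that kind of stop-gap guard; the handler name and the exact fields checked here are illustrative, not the actual crash site in our code:

static void
my_event_handler(ngx_event_t *ev)              /* hypothetical handler */
{
    ngx_connection_t  *c = ev->data;

    if (c == NULL || c->ssl == NULL) {
        return;    /* defensive guard: silences the crash, root cause untouched */
    }

    /* ... original processing ... */
}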

Such defensive programming is not encouraged. If a NULL pointer dereference silently returns instead of crashing, the error can quietly affect user requests, and there is no telling when the bug would ever be exposed. In that sense, crashing immediately on a NULL dereference is actually a responsible and effective behavior.

Returning on NULL did avoid the core dump at that point, but a few hours later another core dump occurred at a different NULL pointer dereference. So I added another check, and unfortunately, a few more hours later, the program crashed somewhere else again. For several days I kept asking myself: why do NULL pointers appear at all? Why do the core dumps happen in different places? And why is there never a problem when I access the server with a browser or a command-line tool like curl?

Those familiar with the nginx code know that nginx rarely checks for NULL at function entry points or elsewhere. Key data structures in particular, such as ngx_connection_t and SSL_CTX, are initialized when the request is accepted, so they should never be NULL during subsequent normal processing.

So I became even more confused. Clearly, the core dump on a NULL value is just a symptom; the real question is why these key pointers are being set to NULL. At this point, the downside and complexity of asynchronous event programming showed itself. A client request that should be logically continuous is broken into multiple fragments by read/write and timer events. GDB can accurately record the function call stack at the moment of the core dump, but it cannot record the complete event-processing history of a request: there is no way to know which event and which function last set the pointer to NULL, or even which event last touched these data structures.

For example, the client sends a normal GET request that, because of network conditions or client behavior, is delivered in two pieces. The server reads the first piece but not all of the data; this read event calls functions A and B, then returns. When the second piece arrives, the read event fires again, calling functions A and C, and the core dump happens in function C. At that point, the backtrace no longer contains any information about function B having been called.

So, GDB cannot accurately locate the real cause of the core dump.

New Attempts at Log Debugging

At this point the mighty GDB was no longer useful. What next? Print nginx debug logs. But logging turned out to be frustrating as well: as soon as the nginx log level is raised to DEBUG, the core dump can no longer be reproduced. Why? Because DEBUG logging produces an enormous amount of output, and the frequent disk writes severely hurt nginx's performance; with DEBUG enabled, throughput drops from hundreds of thousands to a few hundred QPS. At other levels such as INFO, performance is better, but the logs carry too little information to help. Even so, logs are a great tool, so I tried the following approaches:

1. Enable debug logs only for specific client IPs, for example printing DEBUG logs for IP 10.1.1.1 and only the highest-level logs for all other IPs. Nginx supports this natively (a configuration sketch follows this list).
2. Disable the global DEBUG level and instead add my own high-priority debug statements on some critical paths, printing the debug information at the EMERG level.
3. Run nginx with a single worker process and a small number of connections, and sample the debug logs by connection number, for example only for connections whose number ends in 1.
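
A minimal sketch of the per-IP debug configuration from item 1, assuming nginx was built with --with-debug; the paths are illustrative and the IP is the example above:

error_log  logs/error.log  error;     # only severe messages for everyone else

events {
    worker_connections  1024;
    debug_connection    10.1.1.1;     # DEBUG-level logging only for this client
}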

The overall idea was still to print as much debug detail as possible without significantly hurting performance. Unfortunately, none of the methods above helped locate the problem. Of course, as I kept working with the debug logs, I became more and more familiar with the code and its logic.

How to Reproduce the Bug?

By this point, debugging efficiency was extremely low: tens of thousands of QPS of continuous stress testing, a core dump only once every few hours, then more code changes and more debug logs. Several days passed with no progress, so it became necessary to construct a stable core-dump environment offline to speed things up. Although the root cause had not been found, one very suspicious pattern had emerged: the core dumps were relatively concentrated in time, often appearing between 4-5 am and 7-8 am, dozens of dumps at a go.

Constructing a Weak Network Environment with Traffic Control

Considering that network hardware adjustments and failures happen mostly at night, I suspected these core dumps were related to network quality; in particular, momentary network instability could easily trigger the bug and produce a large batch of core dumps. At first I considered using the tc (traffic control) tool to construct a weak-network environment, but then I thought: what does a weak network actually produce? Obviously, all kinds of abnormal requests. So it would be better to construct abnormal requests directly and reproduce the problem that way. I therefore set out to build test tools and an environment that had to satisfy two conditions:

1. Strong concurrency performance, capable of sending tens of thousands or even hundreds of thousands of QPS simultaneously.
2. A certain probability of abnormal requests is required. Especially in the TCP handshake and SSL handshake stages, abnormal terminations are needed.

Traffic control (tc) is an excellent tool for constructing weak-network environments; I had used it before to test SPDY protocol performance. It can shape bandwidth, packet loss rate, latency, and other network conditions, and it ships as part of the iproute2 toolset on Linux. The troublesome part is that tc's configuration rules are quite complex. Facebook has wrapped tc into an open-source tool called ATC (Augmented Traffic Control), which those interested can try. A small example of a tc-based weak-network setup is shown below.
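
As an illustration only (the interface name and the numbers are arbitrary, not what we used), tc's netem qdisc can add latency jitter and packet loss like this:

tc qdisc add dev eth0 root netem delay 100ms 20ms loss 1%   # add delay, jitter and loss
tc qdisc show dev eth0                                       # verify the rule
tc qdisc del dev eth0 root                                   # remove it when done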

WRK Stress Testing Tool

Since the core dump may only occur under high-concurrency traffic, the first step was to find a powerful load-generation tool. WRK is an excellent open-source HTTP stress-testing tool that uses a multi-threaded, asynchronous event-driven design; its event mechanism comes from Redis's ae event framework and its protocol parsing draws on nginx code. Compared with traditional tools such as ab (Apache Bench), WRK's advantage is raw performance: a single machine can generate millions of QPS and saturate the network card without trouble. Its downsides are that it only supports HTTP-based protocols, with no support for testing other protocols such as protobuf, and that its result reporting is not very convenient.

Nginx test usage: wrk -t500 -c2000 -d30s https://127.0.0.1:8443/index.html

Building a Distributed Automated Testing System

Since these are HTTPS requests and the ECDHE_RSA key exchange is used, the client side also consumes a fair amount of CPU: a single client machine can generate only a bit over 10,000 QPS. In other words, if the server can handle 30,000 QPS, one client cannot generate that much pressure by itself, so a multi-machine distributed test system had to be built. In this system, a central control machine starts and stops the tests on multiple client machines at the same time. As mentioned earlier, debugging efficiency was too low, so the whole process needed to be automated: before going to bed, I could have several machines run all night against different protocols, ports, and cipher suites; during the day, since I was watching the tests, a run of a few minutes was enough before checking the results. A rough sketch of the central-control idea follows the feature list below.

This system has the following functions:

1. Concurrently control the start and stop of multiple test clients, and finally summarize and output the total test results.
2. Support HTTPS, HTTP protocol testing, and support web server and reverse proxy performance testing.
3. Support configuring different test times, ports, and URLs.
4. Select different SSL protocol versions and cipher suites based on the port.
5. Choose web server or reverse proxy mode based on the URL.
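
This is not the system we built, only a minimal sketch of the central-control idea using parallel-ssh; the host file, target URL, and wrk parameters are all illustrative:

pssh -h clients.txt -t 0 -i \
    "wrk -t16 -c1000 -d60s https://10.0.0.1:8443/index.html"

Summarizing the per-client wrk output into a single total is then a small post-processing step.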

Constructing Abnormal Test Requests

The stress-testing tools and the system were ready, but the core-dump environment still could not be reproduced precisely, so the next step was to construct abnormal requests. What kind of abnormal requests? Since the added code is mainly related to the SSL handshake, which happens immediately after the TCP handshake, the abnormalities should be concentrated in this stage. So I considered constructing the following three abnormal scenarios (a simplified client sketch for the first scenario follows the list):

1. Abnormal TCP connections. That is, when the client's TCP connect system call is made, there is a 10% probability of directly closing the socket.
2. Abnormal SSL connections. Consider two situations: during the first stage of the full handshake, when sending the client hello, the client has a 10% probability of directly closing the connection. During the second stage of the full handshake, when sending the clientKeyExchange, the client has a 10% probability of directly closing the TCP connection.
3. Abnormal HTTPS requests. 10% of client requests use the wrong public key to encrypt data, so nginx decryption will definitely fail.
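
The following is a minimal sketch of the first scenario only (abnormally closed TCP connections), not our actual test client; the 10% probability comes from the list above, and the server address is whatever the test target is. The SSL-stage abnormalities were produced the same way, by closing the socket at the corresponding step of the handshake.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect to the server; with ~10% probability, close the socket right after
 * the TCP handshake instead of continuing with the SSL handshake. */
static int abnormal_connect(const char *ip, int port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    if (rand() % 10 == 0) {          /* 10% of connections are aborted on purpose */
        close(fd);
        return -1;
    }

    return fd;                       /* otherwise proceed with the SSL handshake */
}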

Core Bug Fix Summary

After constructing the high-concurrency abnormal-request test system described above, a core dump could indeed be triggered within seconds. With a stable reproduction environment, fixing the bug naturally became much faster. Locating the root cause through GDB alone was still inconvenient, but the test requests now reliably satisfied the conditions that trigger the crash, and the crash could even be reproduced with debug logs enabled. So I could keep modifying the code, debugging with GDB, and adding logs, step by step tracing the root cause and getting closer to the truth.

Eventually, the root cause of the core dump was found through repeated iterations of the above steps. In fact, when writing the summary document, the root cause of the core dump is not that important. What is most important is the approach and process of solving the problem, which is worth sharing and summarizing. In many cases, after painstaking troubleshooting, the problem turns out to be a very obvious or even foolish error.

For example, the main cause of this core dump was: because the connection was not correctly marked as non-reusable, under very high concurrency the connection structure used for the asynchronous proxy calculation was reclaimed and re-initialized by nginx, which produced the NULL pointers and the core dumps in different events. A sketch of the idea behind the fix is shown below.
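
This is only a sketch of the idea, not our actual patch: ngx_reusable_connection() is the nginx function that adds a connection to or removes it from the reusable-connections queue, while the my_async_* names are hypothetical.

/* While an RSA computation is outsourced asynchronously, the connection must
 * not sit in nginx's reusable queue, or nginx may reclaim and re-initialize
 * it when free connections run short under high concurrency. */
static void my_async_calc_start(ngx_connection_t *c)
{
    ngx_reusable_connection(c, 0);   /* remove from the reusable queue */

    /* ... hand the RSA operation to the proxy and register the completion event ... */
}

static void my_async_calc_done(ngx_connection_t *c)
{
    /* ... consume the result and resume the SSL handshake ... */

    /* the connection is marked reusable again only where nginx itself expects it */
}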

Memory Leak

Although the core dump issue was resolved, another problem surfaced: memory leaked during high-concurrency testing, at a rate of roughly 500 MB per hour.

Disadvantages of Valgrind

When a memory leak or other memory problem comes up, the first tool that comes to mind is Valgrind. Valgrind is an excellent piece of software that can test a program directly, without recompiling it. It is also very powerful, able to detect common memory errors such as use of uninitialized memory, out-of-bounds access, overflows, and incorrect frees; I recommend everyone use it. Valgrind works roughly as follows: the program under test runs on a CPU simulated by Valgrind, which records every memory access and computed value and finally compares them and reports errors. I also found some memory-related errors when testing nginx with Valgrind, and here are a few experiences from doing so:

1. Nginx normally runs as a master process that forks worker processes, so use --trace-children=yes to follow the child processes (see the example command after this list).
2. When testing nginx + OpenSSL, many memory errors are reported around the rand functions, such as "Conditional jump or move depends on uninitialised value" and "Uninitialised value was created by a heap allocation". This is because the random data needs entropy and the memory is deliberately left uninitialized, which is normal. To silence these Valgrind warnings, OpenSSL has to be compiled with the -DPURIFY option.
3. If there are too many nginx worker processes, for example more than 4, Valgrind's error output becomes jumbled, so keep the number of workers as small as possible, ideally 1; ordinary memory errors are generally unrelated to the number of processes.
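
For reference, the invocation I mean looks roughly like this; the paths are illustrative, and only --leak-check and --trace-children are the options discussed above:

valgrind --leak-check=full --trace-children=yes \
    ./objs/nginx -g "daemon off; worker_processes 1;"

Running nginx in the foreground with a single worker keeps the Valgrind output readable.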

I have described Valgrind's features and usage above, but Valgrind also has a significant drawback: it slows the program down dramatically. The official documentation states that the Memcheck tool typically slows programs down by 10-50 times. In other words, if nginx's full-handshake performance is 20,000 QPS, it will be around 400 QPS under Valgrind. For ordinary memory problems the reduced performance is not a big deal, but in my case the memory leak might only appear under high load; with performance reduced that much, the leak could not be triggered at all, so I had to consider other options.

Advantages of AddressSanitizer

AddressSanitizer (ASan) is a fast memory-error detector for C/C++ programs. Its advantage over Valgrind is speed: according to the official documentation, it slows the program down by only about 2x. Those interested in how ASan works can read the article on its algorithm; in essence, the compiler instruments the code with a check around every memory access, roughly as follows:

Before compilation:
*address = ...;  //   or: ... = *address;

After compilation:
if (IsPoisoned(address)) {
    ReportError(address, kAccessSize, kIsWrite);
}
*address = ...;  // or: ... = *address;

The significant difference from Valgrind is that ASan requires recompiling the program with extra compiler flags (no source changes are needed), while Valgrind can run an existing binary without recompilation. AddressSanitizer is integrated into the Clang compiler and is only supported by GCC 4.8 and later. Our production programs are compiled with GCC 4.3 by default, so for this test I simply recompiled nginx with Clang:

--with-cc="clang" \
--with-cc-opt="-g -fPIC -fsanitize=address -fno-omit-frame-pointer"

The --with-cc option specifies the compiler and --with-cc-opt specifies the compilation options; -fsanitize=address enables AddressSanitizer. Because ASan's impact on performance is small, the instrumented nginx could still reach concurrency in the tens of thousands under high-stress testing, which made it easy to pin down the memory leak. I won't go into the cause of the leak here, because it lies in my own code around OpenSSL's error-handling logic and has no general reference value. A short note on running the instrumented binary follows.
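
One practical detail worth noting: leak reports from the sanitizer are printed when the process exits, so the workers have to be shut down cleanly. The options below are standard ASan runtime flags, and the paths are illustrative:

ASAN_OPTIONS="detect_leaks=1:log_path=/tmp/nginx-asan" ./objs/nginx
# ... run the stress test ...
./objs/nginx -s quit        # leak reports are written as the processes exit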

The most important thing is to know the use cases and methods of Valgrind and ASan so that memory-related issues can be fixed quickly when encountered.

Performance Hotspot Analysis

At this point, the modified nginx program has no core dump or memory leak risks. However, that is not the result we care most about (because that is simply how the code should be). The questions we care most about are:

1. Where is the bottleneck in the program before code optimization? To what extent can it be optimized?
2. After code optimization, is the optimization thorough? What new performance hotspots and bottlenecks will emerge?

At this point, we need some tools to detect the performance hotspots of the program.

Perf, OProfile, Gprof, SystemTap

The Linux world has many excellent performance analysis tools. I will briefly introduce a few of the most commonly used ones:

1. Perf (see the Perf wiki at https://perf.wiki.kernel.org) is probably the most comprehensive and convenient performance analysis tool. It ships with the Linux kernel source and is updated in step with it, which basically covers everyday needs; I recommend everyone use it.
2. OProfile is a somewhat dated performance tool that has largely been replaced by Perf, and its commands are not very convenient: opcontrol --no-vmlinux and opcontrol --init to set up, then opcontrol --start, opcontrol --dump, and opcontrol -h to stop, and finally opreport to view the results. That is a long list of commands and parameters, and if you forget the initialization step, the data comes out empty.
3. Gprof is mainly a performance analysis tool for application layer programs. The disadvantage is that it requires recompiling the program and has some impact on program performance. It does not support some kernel-level statistics. The advantage is that the application layer function performance statistics are more detailed and closer to our daily understanding of performance, such as the running time of each function, the number of function calls, etc., which are very user-friendly and easy to read.
4. SystemTap is actually a runtime program or system information collection framework, mainly used for dynamic tracing, and can also be used for performance analysis. It has the most powerful functionality but is relatively complex to use. It is not a simple tool but can be considered a dynamic tracing language. If the program encounters very troublesome performance issues, I recommend using SystemTap.

Here I will introduce the Perf commands in a bit more detail; Perf is installed by default on Tlinux systems. For example, perf top lists the hottest events and functions of the current system or of a given process, and perf record records and saves system or process performance events for later analysis, for example with the flame graphs introduced below.
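
For instance (PID stands for an nginx worker process ID, mirroring the record command in the next section):

perf top -p PID        # live view of the hottest functions in one worker
perf report            # browse the data saved by a previous perf record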

Flame Graph

Perf has a drawback: its output is not intuitive. Flame graphs were designed to solve exactly that problem; they display event hotspots and function call relationships as a vector (SVG) image. For example, I can plot the performance hotspots of native nginx under the ECDHE_RSA cipher suite with the following commands (the stackcollapse-perf.pl and flamegraph.pl scripts come from Brendan Gregg's open-source FlameGraph project):

1. perf record -F 99 -p PID -g -- sleep 10
2. perf script | ./stackcollapse-perf.pl > out.perf-folded
3. ./flamegraph.pl out.perf-folded > out.svg

The flame graph shows directly what share of samples each function accounts for. In the graph generated for native nginx, for example, it was clear that the rsaz_1024_mul_avx2 and rsaz_1024_sqr_avx2 functions accounted for about 75% of the samples. The optimization target was therefore obvious: can we avoid these two computations, or perform them somewhere other than the local CPU? Of course we can; our asynchronous proxy calculation solution is designed to solve exactly this problem.

In the flame graph taken after the optimization, RSA-related calculations no longer appear among the hotspot events. As for how this was achieved, I will write a dedicated article to share it when I have time.

Mindset

It took about three weeks to resolve the core dump and memory leak issues described above. The pressure was enormous and I was mentally tense; to be honest, it was a bit embarrassing that seemingly simple problems took so long to handle. I certainly wasn't happy about it and felt somewhat anxious, especially during the critical period before the project launch. Even so, I stayed confident and energetic throughout, and I kept telling myself:

1. Debugging bugs is a rare learning opportunity, so don't see it as a burden. Whether it's online or offline, actively and efficiently tracking bugs, especially challenging ones, is a valuable learning opportunity for yourself. Faced with such a great learning opportunity, you should naturally be enthusiastic and eager. In retrospect, if it weren't for this bug, I wouldn't have had a deeper understanding and use of some tools, and this document wouldn't have been created.

2. No matter what kind of bug it is, it can definitely be resolved given time. Thinking this way makes things much more relaxed, especially when taking over a new project or reworking a complex one. At first you may not be familiar with the code and the business, so there is a transition period, but the key is to keep the problems in your head. During the day there are many distractions at work; on the commute and before going to bed, your brain is calmer and your thoughts are clearer. During the day in particular, it is easy to get stuck in one line of thinking and stay trapped in a blind spot, debugging for half a day only to end up with a muddled head. When you are alone on the way to or from work, or before sleep, new ideas and solutions often come up.

3. Discuss problems openly. Don't be embarrassed to raise questions, no matter how simple or basic they seem; as long as the answer isn't something a quick search would turn up, discuss it seriously with your colleagues. During this debugging effort, several key discussions gave me great inspiration; in particular, the final reusable-connection issue was cracked thanks to a discussion with a colleague.

This article is authorized for release by Tencent WeTest. 
Link: https://zhuanlan.zhihu.com/p/21348220 
Source: Zhihu
