Exploring the Application of AI in Security Testing: A Tencent Case Study

This article uses Security Protection Products as an example, but this methodology is suitable for in-depth mining of bugs caused by combinations of multiple factors.


As a product line continues to operate, it grows in scale and its functions become more complex, and this growth demands ever higher quality. Meeting those requirements means increasing the intensity of testing, but the traditional approach of manually writing test cases for automated regression is too expensive. In recent years, AI has played an increasingly important role in more and more fields. At Tencent, we have always stayed curious about new technologies, actively learning them and applying them to our daily work. The author of this article is Junke Lin, a senior system testing engineer in Tencent's Security Department, with 16 years of software testing experience and extensive research into applying AI technology in the testing field.

The figure below shows the protection process for a typical traffic attack: a hacker attacks a business server on the Internet, and a traffic detection device detects the attack. Once an attack is detected, the defense is activated automatically, redirecting the attacked IP's traffic to the defense device. After being cleaned on the defense device, the normal traffic is forwarded back to the business server.

Pain Points Analysis of Security Product Testing

1. Hackers' attack methods are diverse and evolve rapidly, so the product must respond quickly to new attacks seen on the live network while never mistakenly blocking legitimate users. As a result, there are many protection strategies. The table below shows the number of configuration items in the main strategy files: two to three hundred in total, and the number is still growing rapidly. Each version iteration adds a large batch of new configuration items, making the processing logic very complex.

Even with very careful development, it is impossible to guarantee that every function is highly cohesive and loosely coupled; sometimes configuration items that should be unrelated end up affecting each other. When switches that should be independent interfere with one another, switching strategies can make protection behave unpredictably. We once had a failure caused by exactly this kind of unexpected interaction: a configuration item protecting UDP traffic affected the protection of HTTPS traffic, even though the two configurations had nothing to do with each other. We therefore need to verify that the product behaves stably and reliably under all kinds of strategy combinations.

2. For any specific kind of traffic, most of it is handled by one specific protection module. We can use this property to simplify the model: capture the main features when modeling and ignore the other protection details for the time being.

The industry mainly addresses this parameter-combination problem with the all-pairs (pairwise) algorithm, which combines the parameters in pairs. The generated test set covers every value combination of any two variables with the smallest possible number of test cases; in theory, this set can expose all defects caused by the interaction of any two variables. Although the algorithm generates the fewest combinations, adding a new parameter and regenerating produces a set completely unrelated to the previous one. When there are few parameters, we often use it to reduce the number of test cases while keeping good coverage. Once the number of parameters is large, however, a new combination set is generated every time, the expected results must be recalculated for every combination, and the whole process becomes very complicated. It cannot easily answer the question of "how hundreds of switches protect a specific kind of traffic under different configurations".
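To make the all-pairs idea concrete, here is a minimal greedy pairwise generator in Python. It is an illustrative sketch for small parameter sets, not the tool used in the project (real pairwise tools use much faster heuristics than the exhaustive scoring below):

```python
from itertools import combinations, product

def allpairs(params):
    """Greedy all-pairs (pairwise) generator for small parameter sets.

    params: dict mapping parameter name -> list of possible values.
    Returns rows (dicts) such that every value pair of every two
    parameters appears in at least one row.
    """
    names = list(params)
    # Every (param, value) pair combination that still needs covering.
    uncovered = set()
    for a, b in combinations(names, 2):
        for va, vb in product(params[a], params[b]):
            uncovered.add(((a, va), (b, vb)))

    rows = []
    while uncovered:
        best_row, best_gain = None, -1
        # Exhaustively score candidate rows; fine for a demo, far too
        # slow for hundreds of switches.
        for values in product(*(params[n] for n in names)):
            row = dict(zip(names, values))
            gain = sum(1 for (a, va), (b, vb) in uncovered
                       if row[a] == va and row[b] == vb)
            if gain > best_gain:
                best_row, best_gain = row, gain
        rows.append(best_row)
        uncovered = {((a, va), (b, vb)) for (a, va), (b, vb) in uncovered
                     if best_row[a] != va or best_row[b] != vb}
    return rows
```

For three parameters with 2, 2, and 3 values, the full product has 12 rows, while the pairwise set needs only about half that, and the gap widens dramatically as parameters are added.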

In our project team, test cases are added manually and executed automatically, which makes maintaining full pairwise coverage very difficult every time a configuration item is added. For example, suppose the existing cases already cover all pairs, and we add a new configuration item that can only take the values 0 and 1. To pair it with every existing parameter, we must duplicate every original case, testing once with the item set to 0 and again with it set to 1. Every added configuration item doubles the number of cases, so the suite becomes enormous. If instead a fresh all-pairs set is regenerated every time, 150 configuration switches need only about 130 combinations, whereas the incremental approach can grow toward 2^150 combinations.

How the Industry Automates Use Case Generation

Is there an industry approach that can regenerate a small set of combinations without manually recalculating the expected results? The answer is yes. With UML modeling, the model is updated and maintained alongside the version under test, and each test run regenerates fresh use cases from the model. The core value of this technique is automatic use-case generation: maximum functional coverage with the fewest cases, so the version is tested faster and more thoroughly. Its disadvantages are that maintaining the model is complicated, design defects are hard to find (the use cases merely traverse the model mechanically), and the use cases are not designed from the user's perspective.

Application of AI in the Field of Front-end Page Testing

In recent years, AI technology has developed very quickly, and it shares a trait with UML: it likes to build models. So can AI let us bypass complex modeling, coordinate the use cases as a whole to achieve maximum coverage with the fewest cases, and avoid manually calculating expected results?

To explore the application of new technologies to testing, I first skimmed the AI landscape, and a deeper study convinced me that AI's moment in the testing field has arrived. The industry already has many tools that use AI for automated testing and even for automatic use-case design. For front-end pages, there are even tools that claim the tester only needs to provide a URL and wait for the test results. Such products include Eggplant, Appvance IQ, Sauce Labs, and others.

Analysis shows that these tools mainly use computer vision to identify all the buttons on a page, build a traversal tree from the buttons on each page, and then automatically walk the possible paths (user journeys) through that tree, achieving both automated use-case design and automated testing.

Colleagues at Tencent have previously published a book, "AI Automated Testing", which describes in detail how AI is applied to testing image-based and data-based games.

The existing industry technologies are excellent, but they mainly target front-end page testing; there is no corresponding technology for back-end testing, so we began exploring how to apply AI there. After various attempts, and considering AI's strengths, we arrived at a bold idea. Without human involvement, a machine cannot understand human-designed business logic, and building UML-style models is too heavy, but AI is very good at classifying data. If the expected results cannot be calculated cheaply, why calculate them at all? The test suite simply records how each kind of traffic is handled; afterwards, AI classifies the traffic and protection results, we analyze the typical configurations in each category, and a human reviews whether the way traffic is handled under those typical configurations is reasonable.

Exploring AI's Application in Back-end Testing

Based on these ideas, we quickly drew up an implementation plan. Our goal is to increase coverage of the various factor combinations at the lowest possible cost and to dig out hidden bugs. The plan rests on two foundations: test-design theory, to cover the most scenarios with the fewest cases, and AI, to classify and analyze the responses in each scenario. Combined, the two make the plan feasible.

The implementation steps are as follows:

Step 1: Each time a new configuration item is added, regenerate the configuration based on the all-pairs algorithm.

Step 2: Use typical attack methods for each configuration and record the protection method of the tested end.

Step 3: Analyze the correlation between various protections and configurations through AI, and find out the main configuration items for various protection methods.

Step 4: Check whether the most relevant N configurations for each protection method meet the expected design.


The first part, generating all-pairs combinations based on test-design theory, was straightforward; I implemented it in half a day. To combine configuration items across multiple configuration files, I named each item as "configuration item name@file name", generated the combinations with a pairwise tool, and then converted them into configuration files with scripts.
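The conversion script might look like the sketch below, which splits one generated combination back into per-file configs using the "item@file" naming scheme. The item names, file names, and `key = value` syntax here are made up for illustration; the real strategy files differ.

```python
import os
import tempfile
from collections import defaultdict

def write_config_files(row, out_dir):
    """Split one generated combination into per-file config files.

    Keys follow the "<item>@<file>" naming scheme described above,
    e.g. "drop_os@anti_dos.conf" (names are hypothetical).
    """
    per_file = defaultdict(list)
    for key, value in row.items():
        item, filename = key.split("@", 1)   # split on the first '@'
        per_file[filename].append(f"{item} = {value}")
    os.makedirs(out_dir, exist_ok=True)
    for filename, lines in per_file.items():
        with open(os.path.join(out_dir, filename), "w") as f:
            f.write("\n".join(lines) + "\n")
    return sorted(per_file)

# One row from the pairwise tool, spanning two config files.
combo = {"drop_os@anti_dos.conf": "linux", "syn_limit@tcp.conf": 1000}
out = tempfile.mkdtemp()
files = write_config_files(combo, out)
```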

Based on the all-pairs algorithm, 250 combinations were generated in total. We selected 27 typical traffic features and launched 'GET', 'POST', 'PUT', 'DELETE', 'HEAD', 'OPTIONS', 'TRACE', and 'CONNECT' requests for each, giving 27*8=216 traffic types. We tested these 216 traffic types under all 250 configurations and recorded the protection method each time, producing 250*216=54,000 protection records for different scenarios. Each record has three parts: the configuration-item combination, the name of the traffic sent, and, in the last column, the protection method applied by the system under test.

With the data in hand, we were ready to bring in AI. However, our team had only testing experts and no AI experts, so we sought advice from AI experts within Tencent. After understanding our needs, they believed the plan was feasible, but the concrete implementation still posed a significant challenge for us: the gap between the AI field and the testing field is vast, and learning the material from scratch felt like reading a foreign language.

However, as long as we are willing to think and keep learning, there are always more solutions than difficulties. I found a data-mining tool that is friendly to AI beginners, and after repeated study and practice I was confident its components could serve our plan. The model I built is as follows:

PCA (Principal Component Analysis) can help us find the N configuration items that most strongly affect the results, sorting them by the magnitude of their influence. Its output is a one-dimensional list, whereas the processing order of the configuration switches, as designed by the developers, is a network-like structure, so this result is for reference only. The classification tree is used for quantitative analysis of each configuration item's impact, and I find the information it outputs the most valuable.
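The article's data-mining tool is not named, but the PCA-based ranking idea can be sketched with plain NumPy: encode the configuration switches as a numeric matrix, take the top principal axes, and score each item by its variance-weighted loadings. This is an illustrative reconstruction, not the tool's actual algorithm.

```python
import numpy as np

def rank_items_by_pca(X, names, k=2):
    """Rank configuration items by their weight in the top-k principal
    components.

    X: (n_samples, n_items) numeric matrix, e.g. 0/1 switch values.
    names: column names for X. Returns names sorted by influence.
    """
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal axes.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Weight each axis by its explained variance, then sum the
    # absolute loadings per column.
    weights = (s[:k] ** 2)[:, None] * np.abs(Vt[:k])
    scores = weights.sum(axis=0)
    order = np.argsort(scores)[::-1]
    return [names[i] for i in order]
```

An item whose values vary much more strongly than the rest dominates the first component and sorts to the front; a smooth falloff in the scores, like the smooth curve mentioned below, means no single item dominates.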

The PCA results are shown below. In our case, the curve is quite smooth, indicating that no configuration item has a particularly dominant impact.

The RANK component analyzes each configuration item's impact on the results. Comparing its ordering with the design flowchart provided by the developers, the two roughly match, a preliminary sign that the approach is reliable.
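Attribute-ranking components like this typically score each attribute by information gain: how much knowing a configuration item's value reduces uncertainty about the protection result. A minimal stdlib-only sketch of that idea (not the tool's exact implementation):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in label entropy from knowing one item's value."""
    n = len(labels)
    cond = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def rank_config_items(records, label_key="action"):
    """records: dicts of config values plus the protection result.
    Returns item names sorted by information gain, highest first."""
    labels = [r[label_key] for r in records]
    items = [k for k in records[0] if k != label_key]
    gains = {k: information_gain([r[k] for r in records], labels)
             for k in items}
    return sorted(items, key=gains.get, reverse=True)
```

An item that fully determines the protection result gets gain 1.0 (for a binary outcome), while an item with no influence gets gain 0, so the sort surfaces the main configuration items first.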

The following figure shows the display after classifying the execution results using the classification tree:

Let's use a typical example to illustrate how to find a problem from the AI's output. After processing the data, the AI produces a large classification-tree diagram in which every result is marked with a color: the yellow, purple, white, and green regions in the figure correspond to four different results. The root node of the yellow area indicates a total of 74 records whose protection method is dropos_**.


The most relevant configuration item for this result:

- The leaf node on the left indicates: when drop_@anti_.conf is configured as Android, iOS, or Linux, the protection method is dropos_**.
- The leaf node on the right indicates: when drop_@anti_.conf is configured as 0, the protection method is *_trans.

Checking the protection logic of the system under test, there is indeed a problem here. This function drops traffic with a specific OS fingerprint, and since I only used a Linux system to send traffic while running the cases, traffic should normally be dropped only when the item is set to Linux. The AI's analysis shows that traffic is dropped when drop_@anti_.conf is configured as Android, iOS, Win, or Linux; in other words, with Android, iOS, or Win configured, OS identification is inaccurate. Let's note this down first.

The configuration item at the bottom of the box is a secondary configuration item related to this result. Continue observing its leaf nodes, paying special attention to each leaf's ratio. In this example, the ratios are close when the item takes different values, and the tendency of the results is also clear, which is a signal of low coupling.

Open the original table according to the information in the classification tree, hide the irrelevant columns, and put the related configuration items together; the problem then becomes visible.

Using the line numbers of the problematic scenarios, find the corresponding configuration and reproduce the problem in the environment, as shown in the figure. The reproduction configuration is as follows:

Expected: traffic sent from Linux should not match the above policy and should be forwarded normally. In the actual test, the traffic was dropped because of os_**:

This example shows that, guided by AI, we successfully discovered a possible misidentification in the OS-fingerprint function under a specific scenario, which also demonstrates that using AI to analyze the data is reliable. In my view, the core value of AI for testing is presenting complex data visually, making analysis easier.


To sum up, this method addresses the pain point that "deep-level bugs caused by the mutual coupling of multiple parameters are rare but real, and finding them requires parameter-combination testing whose cost is very high": it verifies the coupling between multiple factors at a small cost. The automatically generated cases covered 54,000 scenarios and took 3.5 days to run; after analyzing the results, two bugs have been confirmed with the developers. Writing cases for those 54,000 scenarios manually, at the current pace of 30 cases per person per day, would take 4.9 years without a single day off. With this method, generating the combinations takes only a few minutes and running them takes 3.5 days; even at the current exploratory stage, we estimate the analysis can be finished within 10 days, a huge improvement in test efficiency.

About WeTest

WeTest Quality Open Platform is the official one-stop testing service platform for game developers. We are a dedicated team of experts with more than ten years of experience in quality management, committed to the highest standards of game development and product quality, and we have tested over 1,000 games.

WeTest integrates cutting-edge tools such as automated testing, compatibility testing, functionality testing, remote device, performance testing, and security testing, covering all testing stages of games throughout their entire life cycle.
