Automated Unit Test Generation for Regression Testing: A Case Study

Learn how Baidu built an automated unit test generation system for C/C++ that detects regression issues proactively. This case study covers code analysis, test data generation, failure analysis, and results from deploying across 140+ modules.
 

Source: TesterHome Community

 


 

Overview

Online system anomalies are a persistent and formidable challenge. Traditional testing methods often fail to catch these issues efficiently or cost-effectively, allowing them to reach production.

This article presents a solution proven in production: a fully automated, general-purpose unit test generation system. Built on a white-box approach, it has been deployed across more than 140 modules at Baidu, generating over 10 million lines of test code and detecting more than 1,100 potential issues (over 900 pre-existing and over 200 introduced by new code changes).

We will walk you through the solution’s architecture, covering four core components:

  1. Code Analysis: Structuring raw code into actionable data.
  2. Test Data Generation: Automatically creating high-coverage test cases.
  3. Code Generation: Writing compilable unit tests from templates.
  4. Failure Analysis: Deduplicating and prioritizing test failures for rapid diagnosis.

 

1. The Problem: Why Do Issues Still Reach Production?

Online stability is critical for both user experience and business revenue. It also directly reflects on the effectiveness of the Quality Assurance (QA) team.

To prevent issues, QA teams typically employ a mix of:

  • Stress Testing
  • Functional Testing
  • Unit Testing
  • Static Code Analysis

Despite these comprehensive measures, some issues still slip through. This raises the question: why does this gap persist?

 

2. Root Cause Analysis: The Cost and Lag of Current Methods

We conducted a comparative analysis of existing anomaly detection methods, focusing on two key pain points: high cost and low recall (see Table 1).

Table 1: Comparative Disadvantages of Common Anomaly Detection Methods

Method               | Key Disadvantages
---------------------|------------------------------------------------------------------------------
Stress Testing       | High resource consumption; reactive detection.
Functional Testing   | High development cost; difficult to create exceptional scenarios; reactive.
Unit Testing         | High development cost; heavily reliant on developer expertise and effort.
Static Code Analysis | Reactive (rules created post-incident); low precision/recall; unsustainable ecosystem.

 

Overall, current methods are either too costly or reactive, detecting issues only after they have occurred. Static Code Analysis is widely adopted, but it suffers from:

  • Reactive Detection: Rules are typically written in response to production incidents.
  • Low Precision/Recall: Manual rules often miss specific scenarios or produce false alarms.
  • Lack of Sustainability: Without an established ecosystem for sharing rules, teams duplicate effort and struggle to evaluate how effective each rule is.

Unit testing, however, offers substantial advantages:

  • Tests the smallest units of code, simplifying data construction and verification.
  • Facilitates future regression testing.
  • Consumes minimal resources.
  • Enables earlier bug detection, reducing debugging costs.

This led us to a key question: Can we maximize the benefits of unit testing while eliminating its dependency on manual coding and developer experience?

Our answer was to build an intelligent Unit Test generation system that acts as a proactive detection layer.

Figure 2.1: Stability Testing Funnel (A conceptual diagram showing a multi-layered approach, with Intelligent UT as a key proactive step after static analysis).

 

3. Our Strategy: Automating the Developer’s Workflow

In early 2019, we evaluated existing C/C++ unit test generation tools like C++test and Wings. However, neither could meet our requirements for fully automated test generation for complex data types in intricate business scenarios, nor were they easily extensible. We therefore decided to build our own solution.

We first deconstructed the manual process a developer follows to write a unit test:

  1. Identify the Function Under Test: Filter out low-risk functions (e.g., constructors, logging logic).
  2. Analyze the Code: Gather function signature and dependency information.
  3. Construct Test Data: Create input values for the test cases.
  4. Generate Test Code: Write the code that executes the test and checks results.

Our core strategy was to replicate these steps using white-box static code analysis, thereby achieving full automation.

Figure 3.1: The Manual Unit Test Writing Process (An illustration of the typical developer workflow that our system automates).

 

4. Implementation: Building the Intelligent UT System

The overall architecture of our solution is depicted in Figure 4.1. We will now detail the implementation of its four core capabilities.

Figure 4.1: Technical Architecture Diagram (A high-level diagram showing the data flow from source code through analysis, data generation, and code generation, culminating in compilation and execution).

4.1. Code Analysis: Structuring Raw Code

The goal of this stage is to use static code scanning to abstract complex function code into structured feature data, akin to a compiler’s symbol table. This data allows the system to programmatically understand the code.

4.1.1. Key Code Features Extracted

We identified the essential information to be extracted from C/C++ code:

  • Function Calls: Distinguishing between regular and class member functions.
  • Variable Declarations: Determining the type (primitive, class/struct, STL) for correct syntax.
  • Modifiers: const, static, virtual, inline – affecting instantiation and invocation.
  • File-Level Information: Necessary header files and namespaces for successful compilation.
  • Other Attributes: Properties such as deleted copy constructors and copy-assignment operators.

Table 4.1 (details omitted) summarizes the full set of features, including function names, class/namespace, parameter names/types, return type, and modifiers.

4.1.2. Feature Storage

The extracted features are stored in an XML format called Code Struct Data (CSD). This gives the other system modules a single, easily accessible source of code information.

Figure 4.2: CSD Example (A snippet of the XML-like structure, showing fields like function, param, and their respective attributes).
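
As a rough illustration of how downstream modules might consume CSD, the sketch below parses a small XML record into Python objects. The element and attribute names (`csd`, `include`, `function`, `param`, `class`, `namespace`, `return`, `modifiers`) and the signature given for `explore_filter` are assumptions based loosely on the fields listed above; the actual CSD schema is not shown in this article.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical CSD record; the real schema is not shown in this article.
CSD_SAMPLE = """
<csd file="filter.cpp">
  <include>filter.h</include>
  <function name="explore_filter" class="Filter" namespace="demo"
            return="int" modifiers="const">
    <param name="query" type="const std::string&amp;"/>
    <param name="limit" type="int"/>
  </function>
</csd>
"""

@dataclass
class Param:
    name: str
    type: str

@dataclass
class FunctionInfo:
    name: str
    cls: str            # empty string for a free (non-member) function
    namespace: str
    return_type: str
    modifiers: list
    params: list = field(default_factory=list)

def parse_csd(xml_text):
    """Turn a CSD document into a list of FunctionInfo records."""
    root = ET.fromstring(xml_text)
    functions = []
    for fn in root.iter("function"):
        functions.append(FunctionInfo(
            name=fn.get("name"),
            cls=fn.get("class", ""),
            namespace=fn.get("namespace", ""),
            return_type=fn.get("return", "void"),
            modifiers=fn.get("modifiers", "").split(),
            params=[Param(p.get("name"), p.get("type")) for p in fn.findall("param")],
        ))
    return functions
```

Calling `parse_csd(CSD_SAMPLE)` yields one record whose `params` list preserves declaration order, which is what a code generator needs to emit a correct call.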

4.1.3. Feature Collection

We required a lightweight, efficient, and open-source static analysis tool. We chose cppcheck and extended it to collect function call chain information and other global data.

Figure 4.3: Code Analysis Flow using cppcheck (A diagram illustrating how cppcheck parses the source code and generates the CSD).

4.2. Test Data Generation: Creating High-Coverage Inputs

This is the core fuzzing engine of our system. We use both generation-based and mutation-based fuzzing methods.

  • Mutation-based: Generates test cases by mutating known valid input samples (e.g., AFL and libFuzzer, both of which mutate a seed corpus under coverage guidance).
  • Generation-based: Creates test cases from a model of the input, such as an interface specification or grammar.

Our approach extends the generation-based method by using white-box information (paths, branches, and variable propagation) to guide the fuzzing process. This aims for better coverage and fewer invalid test cases.
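
To make the distinction concrete, here is a minimal sketch of the two styles. The `SPEC` dictionary and its parameter names are hypothetical, and neither function reflects the production engine; they only show where each style gets its inputs from.

```python
import random

def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Mutation-based: flip one random bit in a known-valid sample."""
    if not seed:
        return seed
    data = bytearray(seed)
    pos = rng.randrange(len(data))
    data[pos] ^= 1 << rng.randrange(8)
    return bytes(data)

# Hypothetical interface specification: parameter name -> (kind, constraints).
SPEC = {
    "limit": ("int", {"min": 0, "max": 100}),
    "query": ("str", {"max_len": 8}),
}

def generate(spec: dict, rng: random.Random) -> dict:
    """Generation-based: build an input from the specification alone."""
    case = {}
    for name, (kind, rule) in spec.items():
        if kind == "int":
            case[name] = rng.randint(rule["min"], rule["max"])
        elif kind == "str":
            length = rng.randint(0, rule["max_len"])
            case[name] = "".join(rng.choice("ab01 ") for _ in range(length))
    return case
```

A white-box extension of `generate` would replace the purely random choices with values derived from the branches and paths of the function under test, which is the direction the following subsections describe.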

Figure 4.5: Test Case Generation Architecture (A diagram showing the process: CSD + Source Code -> Path Selection -> Parameter Selection -> Candidate Data Sources -> Generation & Filtering -> Final Test Case Set).

4.2.1. Path Selection

This module performs three tasks to guide data generation:

  • Constraint Solving: Uses an SMT solver such as Z3 to find concrete input values that satisfy the symbolic branch conditions (e.g., those of if statements).
  • Path Reachability Analysis: Builds a control flow graph and prunes infeasible paths, eliminating impossible test cases.
  • Path Merging: Combines paths with overlapping conditions to reduce the total number of test cases needed (see Figure 4.6).

Figure 4.6: Program Example for Path Analysis (Illustrates how multiple branches can be covered by a single merged test case).
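
The three tasks can be sketched on a toy example. The code below is an illustration only: it uses a brute-force search over a small integer range as a stand-in for an SMT solver such as Z3, and the two branch conditions are hypothetical rather than taken from Figure 4.6.

```python
from itertools import product

# Hypothetical function under test with two independent branches on `x`:
#   if (x > 10) {...}   -> condition "gt10"
#   if (x < 5)  {...}   -> condition "lt5"
CONDITIONS = {
    "gt10": lambda x: x > 10,
    "lt5": lambda x: x < 5,
}
CANDIDATES = range(-20, 21)  # tiny search domain standing in for an SMT solver

def solve(path):
    """Return one x that drives every condition to the outcome in `path`, or None."""
    for x in CANDIDATES:
        if all(CONDITIONS[name](x) == want for name, want in path.items()):
            return x
    return None

def feasible_paths():
    """Enumerate every true/false combination and drop the unsatisfiable ones."""
    names = list(CONDITIONS)
    result = {}
    for outcome in product([True, False], repeat=len(names)):
        witness = solve(dict(zip(names, outcome)))
        if witness is not None:
            result[outcome] = witness
    return result

def merge_for_branch_coverage(paths):
    """Greedily pick inputs until every (condition, outcome) pair is exercised."""
    names = list(CONDITIONS)
    needed = {(n, v) for n in names for v in (True, False)}
    chosen = []
    while needed:
        best = max(paths, key=lambda p: len(needed & set(zip(names, p))))
        gain = needed & set(zip(names, best))
        if not gain:
            break  # remaining outcomes are unreachable
        chosen.append(paths[best])
        needed -= gain
    return chosen
```

Here the path where both conditions hold (`x > 10` and `x < 5`) is pruned as infeasible, leaving three reachable paths. Merging then finds that two inputs, `11` and `-20`, already exercise both outcomes of both branches.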

4.2.2. Candidate Data Sources

Candidate values for parameters come from:

  • Static Data: A database of common boundary and edge-case values.
  • Dynamic Data: Edge cases generated or extracted through instrumentation from logs and traffic.
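
A minimal sketch of how the two sources might be combined follows. The boundary-value table, the `key=value` log format, and the function names are all assumptions made for illustration, and the dynamic extraction handles integers only.

```python
import re

# Hypothetical static table of boundary values, keyed by C/C++ type name.
INT32_MAX = 2**31 - 1
STATIC_CANDIDATES = {
    "int": [0, 1, -1, INT32_MAX, -INT32_MAX - 1],
    "unsigned int": [0, 1, 2**32 - 1],
    "std::string": ["", " ", "a" * 1024],
}

def dynamic_candidates(log_lines, param):
    """Collect integer values observed for `param` in 'key=value' style log lines."""
    pattern = re.compile(r"\b" + re.escape(param) + r"=(-?\d+)")
    seen = []
    for line in log_lines:
        for match in pattern.finditer(line):
            value = int(match.group(1))
            if value not in seen:
                seen.append(value)
    return seen

def candidates_for(param, ctype, log_lines=()):
    """Merge static boundary values with values seen at runtime, preserving order."""
    merged = list(STATIC_CANDIDATES.get(ctype, []))
    for value in dynamic_candidates(log_lines, param):
        if value not in merged:
            merged.append(value)
    return merged
```

Static values come first so that classic boundaries are always tried, while observed values add realistic inputs that a fixed table would not contain.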

4.2.3. Test Case Generation & Filtering

Combining candidate values for multiple parameters can lead to an explosion in test cases. We tackle this in two stages:

  1. Eliminate Unused Attributes: Analyze the code to see which members of complex parameters are actually used.
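  2. Eliminate Redundant Test Cases: Use a pairwise (2-wise) testing strategy, which selects a small set of test cases in which every pair of values for every two parameters appears at least once. Empirical studies of software failures suggest that most defects are triggered by a single parameter value or by the interaction of two, so pairwise coverage retains most of the fault-detection power of exhaustive combination at a fraction of the cost.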

Figure 4.9: 2-Wise Pairwise Testing Example

Test Case | X  | Y  | Z
----------|----|----|----
1         | x1 | y1 | z1
2         | x1 | y2 | z2
3         | x2 | y1 | z2
4         | x2 | y2 | z1
5         | x3 | y1 | z2
6         | x3 | y2 | z1

Caption: The 6 test cases generated by 2-Wise testing from 3 parameters (X: 3 values, Y: 2, Z: 2), compared to 12 for a full combinatorial approach.
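
One simple way to obtain such a set is a greedy selection, sketched below. This is an illustrative algorithm, not necessarily the one used in the system: it enumerates the full Cartesian product, so it only suits small parameter spaces, and greedy selection is not guaranteed to find the smallest possible suite.

```python
from itertools import combinations, product

def all_pairs(params):
    """Every (param_a, value_a, param_b, value_b) interaction that must be covered."""
    pairs = set()
    for a, b in combinations(list(params), 2):
        for va, vb in product(params[a], params[b]):
            pairs.add((a, va, b, vb))
    return pairs

def pairs_of(case, names):
    """The value pairs that a single test case covers."""
    return {(a, case[a], b, case[b]) for a, b in combinations(names, 2)}

def pairwise(params):
    """Greedy 2-wise selection: repeatedly take the full combination that
    covers the most still-uncovered pairs."""
    names = list(params)
    uncovered = all_pairs(params)
    candidates = [dict(zip(names, combo)) for combo in product(*params.values())]
    suite = []
    while uncovered:
        best = max(candidates, key=lambda c: len(pairs_of(c, names) & uncovered))
        suite.append(best)
        uncovered -= pairs_of(best, names)
    return suite

suite = pairwise({"X": ["x1", "x2", "x3"], "Y": ["y1", "y2"], "Z": ["z1", "z2"]})
```

For the parameters in the example above (X with three values, Y and Z with two each), this greedy pass also arrives at six test cases covering all 16 value pairs, although the specific combinations it picks differ from the table.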

 

 

 

This method has eliminated over 90% of redundant test cases in our deployments. The final set is stored in JSON format for flexibility.

Figure 4.10: Test Case Set JSON Demo (A JSON snippet showing a function name and a map of parameter values for one test case).
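
As a sketch of what such a file could look like and how it might be read back, consider the following. The field names (`function`, `cases`, `id`, `params`) and the parameter values are hypothetical; only the general shape, a function name plus a map of parameter values per test case, comes from the description above.

```python
import json

# Hypothetical layout: one entry per function, each case maps parameter names to values.
test_case_set = {
    "function": "explore_filter",
    "cases": [
        {"id": 1, "params": {"query": "", "limit": 0}},
        {"id": 2, "params": {"query": "a", "limit": -1}},
    ],
}

def dump_cases(case_set):
    """Serialize deterministically so diffs between runs stay small."""
    return json.dumps(case_set, indent=2, sort_keys=True)

def load_cases(text):
    """Parse a test case set and check that every case carries a parameter map."""
    case_set = json.loads(text)
    for case in case_set["cases"]:
        if not isinstance(case.get("params"), dict):
            raise ValueError(f"case {case.get('id')} has no params map")
    return case_set
```

Keeping the data in a language-neutral format like this decouples test data generation from code generation, so either side can change without touching the other.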

Future work will incorporate meta-heuristic algorithms like genetic algorithms to handle interdependent parameters, further enhancing the system’s intelligence.

4.3. Code Generation: Writing Compilable Tests

For robustness and reliability, we chose a syntax-rule and template-based generation method. Because every test is assembled from fixed templates and known type information, the generated code is syntactically correct and compilable by construction, unlike the output of deep-learning approaches, which are not yet reliable enough for industrial use.

Figure 4.11: Code Generation Architecture (A diagram showing the flow from the Test Case Set and CSD through a Code Generator that uses a Template Engine to produce the final Unit Test Code).

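For C/C++, we generate test code using the Google Test (GTest) framework’s “death test” functionality. A death test runs the statement under test in a child process and checks how that process terminates, so an unexpected crash in the function under test is reported as a test failure rather than bringing down the whole test run.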

Figure 4.12: Complete Generation Example (Walks through the entire process from the source code of explore_filter to the final generated GTest death test).
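
The template step can be approximated in a few lines. In the sketch below, `EXPECT_EXIT` and `::testing::ExitedWithCode` are real Google Test APIs, but the shape of the generated test, the naming scheme, the header name, and the parameters assumed for `explore_filter` are all illustrative assumptions rather than the system's actual output.

```python
from string import Template

# Template for one generated test. The test passes only if the call returns
# and the child process exits with code 0; a crash makes the test fail.
TEST_TEMPLATE = Template('''\
TEST(${suite}, ${name}) {
  EXPECT_EXIT({
    ${call};
    std::exit(0);  // reached only if the call returned without crashing
  }, ::testing::ExitedWithCode(0), ".*");
}
''')

def cpp_literal(value):
    """Render a Python value as a C++ literal (bools, ints and strings only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'std::string("{escaped}")'
    raise TypeError(f"unsupported value: {value!r}")

def render_test(function, param_order, case_id, params, headers):
    """Produce the source of one GTest death test for a single test case."""
    args = ", ".join(cpp_literal(params[p]) for p in param_order)
    camel = "".join(part.capitalize() for part in function.split("_"))
    body = TEST_TEMPLATE.substitute(
        suite=f"{camel}DeathTest",   # GTest convention: death test suites end in DeathTest
        name=f"Case{case_id}",
        call=f"{function}({args})",
    )
    includes = "".join(f'#include "{h}"\n' for h in headers)
    return ("#include <cstdlib>\n#include <string>\n#include <gtest/gtest.h>\n"
            f"{includes}\n{body}")

source = render_test("explore_filter", ["query", "limit"], 1,
                     {"query": "", "limit": -1}, ["filter.h"])
```

Because argument order and types come from the analysis data rather than from free-form generation, a template like this can only emit calls that match the declared signature, which is the main reason the approach yields compilable code.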

4.4. Failure Analysis: From Raw Crashes to Actionable Reports

Analyzing test failures is a major challenge. Key issues we addressed are:

  • Poor Readability: Stack traces (especially from death tests) are often incomplete or overwhelming.
  • Duplicate Issues: The same bug can be triggered by multiple test cases or manifest in similar ways across different functions.
  • High Diagnosis Cost: Developers struggle to pinpoint the root cause from raw stack traces.
  • Lack of Prioritization: No standard criteria for determining which failures are critical.

Our solution implements a processing pipeline that performs:

  1. Storage: Capturing complete stack trace data.
  2. Analysis & Deduplication: Grouping similar failures to avoid redundant reports.
  3. Failure Prediction: Attempting to classify the type of error (e.g., null pointer, out-of-bounds).
  4. Severity Grading: Prioritizing issues for the development team.

Figure 4.16: Stack Trace Analysis Process (A flow diagram illustrating the steps from test execution failure to the generation of a structured, actionable failure report).
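
A simplified version of the deduplication, classification, and grading steps is sketched below. The frame format, the choice of the top three frames as a fingerprint, and the signal-to-severity mapping are assumptions for illustration; the real pipeline's rules are not described in this article.

```python
import hashlib
import re
from collections import defaultdict

# Assumed mapping from POSIX signal to a coarse failure type and severity.
SIGNAL_INFO = {
    "SIGSEGV": ("invalid memory access", "high"),
    "SIGABRT": ("abort or failed assertion", "medium"),
    "SIGFPE": ("arithmetic error", "medium"),
}

def normalize(frame):
    """Strip addresses and line numbers so equivalent frames compare equal."""
    frame = re.sub(r"0x[0-9a-fA-F]+", "", frame)
    frame = re.sub(r":\d+", "", frame)
    return " ".join(frame.split())

def signature(frames, depth=3):
    """Fingerprint a crash by its top `depth` normalized frames."""
    top = "|".join(normalize(f) for f in frames[:depth])
    return hashlib.sha1(top.encode()).hexdigest()[:12]

def build_report(failures):
    """Group failures by signature, then classify and grade each group."""
    groups = defaultdict(list)
    for failure in failures:
        groups[signature(failure["frames"])].append(failure)
    report = []
    for sig, members in groups.items():
        kind, severity = SIGNAL_INFO.get(members[0]["signal"], ("unknown", "low"))
        report.append({
            "signature": sig,
            "type": kind,
            "severity": severity,
            "test_cases": [m["test"] for m in members],
        })
    order = {"high": 0, "medium": 1, "low": 2}
    return sorted(report, key=lambda r: order[r["severity"]])
```

Two crashes that differ only in load addresses collapse into one report entry listing both triggering test cases, so a developer sees one issue instead of many near-identical stack traces.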

4.5. Deployment: Full and Incremental Modes

We designed two deployment modes to fit development workflows:

  • Full-Scan (Legacy) Mode: Scans the entire codebase for existing issues. Used for new module onboarding or scheduled daily/full-regression tasks.
  • Incremental Mode: Triggered by code commits in a CI/CD pipeline. It analyzes the impact scope (directly modified and indirectly affected functions) and only generates tests for that subset. This provides fast feedback without wasting resources.

A risk-assessment step further optimizes the process, filtering out low-risk changes like logging modifications.
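
The impact-scope calculation can be pictured as a walk over the reversed call graph. The sketch below assumes a call graph of the kind collected during code analysis; the function names and the `LOW_RISK` set are hypothetical, and the real risk assessment is likely more nuanced than a fixed exclusion list.

```python
from collections import deque

# Hypothetical call graph: caller -> callees.
CALL_GRAPH = {
    "handle_request": ["explore_filter", "log_request"],
    "explore_filter": ["parse_query"],
    "parse_query": [],
    "log_request": [],
    "health_check": [],
}
LOW_RISK = {"log_request"}  # e.g. logging helpers, filtered by the risk step

def reverse_graph(graph):
    """Map each function to the set of functions that call it."""
    callers = {fn: set() for fn in graph}
    for caller, callees in graph.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)
    return callers

def impact_scope(changed, graph, low_risk=frozenset()):
    """Changed functions plus all transitive callers, minus low-risk ones."""
    callers = reverse_graph(graph)
    # A change confined to a low-risk function is dropped before propagation.
    queue = deque(fn for fn in changed if fn not in low_risk)
    scope = set(queue)
    while queue:
        fn = queue.popleft()
        for caller in callers.get(fn, ()):
            if caller not in scope and caller not in low_risk:
                scope.add(caller)
                queue.append(caller)
    return scope
```

With this graph, a commit touching `parse_query` selects `parse_query`, `explore_filter`, and `handle_request` for test generation, while a commit touching only `log_request` selects nothing.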

Figure 4.17: Deployment Architecture (A comprehensive diagram showing how the core capabilities are integrated with Baidu’s internal platforms to support the entire testing lifecycle).

Figure 4.18: Task Execution Result Example (A screenshot of a CI/CD task report showing a detected crash with details like failure type, stack trace, and the triggering test case).

 

5. Results: Measurable Impact

Engineering Outcomes:

  • Proven Solution: A general-purpose solution developed and deployed for C/C++, generating over 10 million lines of test code.
  • High Coverage: Achieved over 50% function coverage and 20% branch coverage for cold-start modules.
  • Low Resource Footprint: Resource consumption is negligible compared to system-level testing.
  • Low Human Effort: The system automatically handles UT framework adaptation and compilation, eliminating manual maintenance.

Business Outcomes:

  • Wide Deployment: Rolled out across 140+ key backend modules and core libraries.
  • Legacy Issue Detection: Uncovered over 900 pre-existing issues.
  • Incremental Issue Detection: Detected over 200 issues introduced by new code changes.

 
