Since the explosion of AIGC (AI-generated content) technologies triggered by the emergence of ChatGPT in 2022, there has been enormous interest in their application potential across industries. In R&D, numerous studies have demonstrated the positive effect of GitHub Copilot on development efficiency.
In the testing field, the AIGC boom has sparked extensive research and discussion about its possible applications in software testing. Traditional software testing methods often demand massive human and time investment, and the introduction of AIGC could bring revolutionary changes to the testing industry. The advantages of AIGC in testing lie in its highly efficient automation, rapid learning ability, and capability to process large‑scale data. This paper aims to explore the application of AIGC in testing and conduct an in‑depth study of the pain points in adopting AIGC. By analyzing the potential applications of AIGC in testing, we discuss how AIGC technology impacts testing processes and quality, as well as its potential to improve efficiency, reduce costs, and enhance software quality.
AIGC has a wide range of potential applications in the testing field and can bring many benefits to testing processes and quality. This section focuses on several key scenarios of AIGC in testing and compares them with traditional testing practices to highlight improvements brought by AIGC.
In traditional testing, engineers manually write and execute test cases, which consumes a huge amount of time and resources. With AIGC, test cases can be automatically generated and executed by learning and analyzing the characteristics and behaviors of the software system. By learning and inferring from large‑scale data, AIGC can discover potential test scenarios and exceptions, improving the comprehensiveness and coverage of testing. Compared with traditional methods, AIGC offers efficiency and automation in case generation and execution, greatly reducing testing time and cost.
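To make the generation step concrete, the following is a minimal sketch of how test cases might be requested from a language model. The function names `build_test_case_prompt` and `call_llm` are hypothetical; the stub stands in for whatever chat-model API an organization actually uses.

```python
def build_test_case_prompt(feature: str, behavior: str) -> str:
    """Assemble a prompt asking a model for structured test cases."""
    return (
        "You are a software test engineer.\n"
        f"Feature under test: {feature}\n"
        f"Expected behavior: {behavior}\n"
        "Generate test cases covering normal, boundary, and exception "
        "scenarios. For each case give: title, steps, expected result."
    )

def call_llm(prompt: str) -> str:
    # Placeholder for a real model call (any chat-completion API);
    # returns a canned skeleton so this sketch stays self-contained.
    return "1. title: ...; steps: ...; expected: ..."

prompt = build_test_case_prompt(
    "login form", "rejects passwords shorter than 8 characters"
)
print(call_llm(prompt))
```

In practice the prompt would also carry system context (requirements excerpts, interface definitions) retrieved from a knowledge base, which is why document quality matters so much in later sections.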
AIGC also has great potential in defect detection and automated regression testing. Traditional defect detection usually relies on manual experience and rules, which are limited and subjective. AIGC can automatically identify and detect potential defects by learning the normal behavior of the system and common defect patterns. It can analyze test data, logs, and reports, quickly locate and report issues, and support automated regression testing to ensure system stability after fixes. These capabilities improve testing efficiency and quality while reducing manual workload and error rates.
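The "common defect patterns" idea can be approximated even without a model: a small rule-based sketch like the one below flags suspicious log lines, and an AIGC system effectively learns a far richer, data-driven version of such patterns. The patterns and labels here are illustrative, not a production detector.

```python
import re

# Illustrative defect patterns a trained model would learn from data;
# here they are hand-written rules for demonstration only.
DEFECT_PATTERNS = [
    (re.compile(r"\bERROR\b|\bFATAL\b"), "error-level log"),
    (re.compile(r"Exception|Traceback"), "unhandled exception"),
    (re.compile(r"timeout|timed out", re.I), "possible timeout"),
]

def scan_logs(lines):
    """Return (line number, label, line) for every pattern hit."""
    findings = []
    for lineno, line in enumerate(lines, 1):
        for pattern, label in DEFECT_PATTERNS:
            if pattern.search(line):
                findings.append((lineno, label, line.strip()))
    return findings

logs = [
    "INFO  service started",
    "ERROR payment failed: Exception in handler",
    "WARN  request timed out after 30s",
]
for finding in scan_logs(logs):
    print(finding)
```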
AIGC can also improve collaboration and interaction between testers and developers, as well as between testers and users. During testing, frequent communication and problem‑solving among team members are required. AIGC can act as an intelligent dialogue system, interacting with testers in real time and providing instant feedback and solutions. Testers can converse with AIGC, ask questions, seek advice, and get support. Compared with traditional collaboration, AIGC enables faster, more efficient, and more accurate information exchange, promoting teamwork and decision‑making.
1. Efficient Automation
AIGC provides strong automation in generating test cases, executing test tasks, and analyzing results. Compared with manual testing, it greatly reduces manual effort and time cost, boosting productivity.
2. Rapid Learning Ability
Through machine learning and deep learning, AIGC quickly learns the characteristics and behaviors of software systems. It extracts patterns from large datasets and applies them to test case generation, defect detection, and strategy optimization. This fast learning allows AIGC to adapt to rapidly changing systems and continuously improve test quality and accuracy.
3. Large‑Scale Data Processing
As software systems grow in scale and complexity, testing data expands exponentially. AIGC excels at processing massive data, effectively analyzing and mining potential problems and patterns. It extracts valuable insights from huge datasets, helping engineers make more accurate decisions and discover hidden defects and performance issues.
4. Improved Comprehensiveness and Coverage
Traditional testing is often limited by time and resources, making full system coverage difficult. AIGC improves comprehensiveness and coverage by intelligently generating and executing test cases, uncovering more potential issues and exceptions that traditional methods might miss.
5. Collaboration Between AI and Human Testers
As an intelligent dialogue system, AIGC supports real‑time interaction and collaboration with human testers, providing instant feedback and solutions. It helps engineers understand requirements, verify functions, and solve problems together. This intelligent collaboration improves team efficiency and communication quality, driving the optimization of testing processes.
In summary, AIGC has many strengths in testing: automatic test case generation and execution, intelligent defect detection, regression testing, and better team collaboration. Compared with traditional methods, AIGC improves efficiency, accuracy, and comprehensiveness, thus enhancing software quality to meet growing testing demands. However, realizing this potential requires overcoming a series of challenges, which we discuss in the next chapter.
Although AIGC has broad potential in testing, its real‑world implementation faces many difficulties. This chapter focuses on the major challenges at each stage of AIGC adoption in testing.
Both foundation models and fine-tuned or distilled small models require large amounts of data for training and fine-tuning. However, testing data and documents are often fragmented and unstructured, and their quality and accuracy may be problematic. Training data quality directly determines the capability of foundation models and likewise affects knowledge bases built with LLM embeddings, so a lack of sufficient high-quality data can severely limit the application of AIGC in testing.
Under the current R&D model in China’s internet industry, different phases—including requirements analysis, system design, test analysis, test execution, and release—produce different documents. The structure of these documents and the skill level of their authors directly determine their quality. Meanwhile, the complex relationships between requirements, design, development, and testing documents add further difficulties to data preparation.
Most companies have standardized templates for documents across R&D phases, which helps improve structural consistency to some extent. Even so, an author's understanding and clarity of expression directly determine the quality of the training corpora produced from these documents.
High‑quality requirements for common documents include:
Requirements documents: Accurate, clear, verifiable, and complete for development and testing teams.
System design documents: Detailed design logic, interface specifications, and data structures.
Test analysis documents: Clear test scope, objectives, and methods to ensure effective coverage.
Test execution documents: Accurate records of problems and results for defect tracking.
Release documents: Clear release plans, version changes, and user guides.
High‑quality documents are extremely helpful for embedding and training data preparation. They provide rich, accurate, and representative corpora that help models learn semantics, model context, and improve performance and generalization. Unfortunately, the rapid iteration of China’s internet industry has negatively affected internal documentation and knowledge base quality. Issues such as inaccuracy, incompleteness, poor timeliness, inconsistency, and low accessibility persistently affect the quality of data used for AIGC training, fine‑tuning, and embedding, ultimately limiting overall model performance.
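To make the embedding point concrete, here is a toy retrieval sketch. Production systems would use LLM embeddings rather than the term-frequency vectors used here, but the lesson is the same: retrieval can only surface what the corpus accurately records, so inaccurate or stale documents directly degrade answers. The corpus and query are illustrative.

```python
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    # Stand-in for an LLM embedding: a bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

corpus = [
    "requirements document: login must lock account after 5 failures",
    "test analysis: cover login lockout boundary at 4, 5, 6 attempts",
    "release notes: version 2.1 improves dashboard loading speed",
]
query = "how many login failures trigger a lockout"
best = max(corpus, key=lambda doc: cosine(vectorize(query), vectorize(doc)))
print(best)
```

If the requirements document above were out of date (say, the lockout threshold changed from 5 to 3), retrieval would still confidently return the wrong answer, which is exactly the timeliness problem described here.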
Effectively organizing and identifying contextual and extended relationships between requirements, design, development, and test documents is critical. Challenges include:
Context acquisition and integration: Context is scattered across documents and must be unified and consistent.
Inconsistent information: Changes in requirements or design may not be updated across all documents, leading to conflicts.
Cross‑team collaboration: Different teams have varying writing styles and formats, increasing integration difficulty.
Change management and version control: Frequent changes affect data preparation and model stability.
Graph content: Flowcharts, sequence diagrams, and architecture diagrams must be converted into processable text or structured data.
Daily work documents, including requirements, design, system analysis, and test documents, inevitably contain charts, images, audio, and video. This multimodal content is information-rich, but processing it, particularly in a Chinese-language context, presents many challenges:
Chinese language complexity: More complex word formation, flexible grammar, and ambiguity create difficulties.
Scarcity of Chinese datasets: Smaller scale and lower quality compared to English resources.
Lag in Chinese foundation models: Inferior language understanding, generalization, and extensibility compared to English models.
Immature multimodal processing: Limited ability to comprehensively understand text, images, audio, and video.
Annotation and quality control: Requires professional annotators and tools, with high consistency demands.
UI and API test code written by engineers often has quality issues that affect the overall performance of AIGC models in the quality domain:
Poor maintainability: Unstructured, non‑modular, poorly commented code.
Duplication and redundant logic: Increases complexity and reduces readability.
Weak error and exception handling: Makes failure causes hard to locate.
Insufficient edge case coverage: Fails to validate boundary and abnormal scenarios.
Lack of flexibility and configurability: Difficult to adapt to environment or requirement changes.
Inadequate comments and documentation: Reduces understandability and collaboration.
Low performance and efficiency: Slows down test execution.
Improving models through reinforcement learning from human feedback (RLHF) is challenging, involving difficulties in data acquisition, training, and fine-tuning:
High cost and complexity of data acquisition: High‑quality human feedback requires intensive time and labor.
Subjectivity and inconsistency in human feedback: Different evaluators may have conflicting judgments.
Exploration‑exploitation balance: Difficult to optimize model behavior.
Highly complex state and action spaces: Increase computational difficulty.
Convergence and training stability: Long training time and unstable results.
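For context on the reward-modeling step these challenges refer to: RLHF commonly fits pairwise human preferences with a Bradley-Terry style model, where the probability that a human prefers output A over output B is the sigmoid of the reward difference. A minimal sketch (the reward values are illustrative):

```python
import math

def preference_probability(r_a: float, r_b: float) -> float:
    # Bradley-Terry: P(A preferred over B) = sigmoid(r_a - r_b).
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

# Equal rewards -> 50/50; a clearly better answer -> near-certain preference.
print(round(preference_probability(1.0, 1.0), 2))  # 0.5
print(round(preference_probability(3.0, 0.0), 2))  # 0.95
```

The subjectivity problem above shows up here directly: when evaluators disagree, the preference labels used to fit these rewards become noisy and the learned reward signal unstable.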
Accurately understanding the semantics of the business under test is a key challenge. This is because tested businesses often involve complex logic, industry specifications, and domain knowledge. Challenges include:
Domain knowledge acquisition and application: Requires deep understanding of business processes, rules, and terminology.
Handling diversity and complexity: Must support various scenarios, inputs, and logic combinations.
Semantic precision and constraints: Must strictly follow business rules and expected outputs.
Context and relationship handling: Must maintain consistency across business workflows.
Adaptation to business changes: Must evolve with dynamic requirements.
Transforming domain documents into usable data for AIGC training, fine‑tuning, and embedding faces additional challenges:
Inconsistent data formats and structures: Diverse document types from different teams.
Large‑scale data collection and cleaning: Handling missing values, noise, and duplicates.
Complex annotation: Requires domain expertise for labeling test cases.
Diverse and heterogeneous data sources: Integrating data from different systems and tools.
Ensuring data quality and accuracy: Validation, denoising, correction, and quality control.
Unstructured data processing: Converting text, graphs, and tables into structured formats.
Multi‑turn dialogue data: Managing context, coherence, history, and evaluation.
The AI wave has brought a flourishing ecosystem of open‑source foundation models. Domain‑specific models rely heavily on base model capabilities, so model selection is critical before training and fine‑tuning. Popular options include ChatGLM, Vicuna, BELLE, and others. The extension of large model capabilities also depends on supporting tools, a typical example being LangChain.
Open‑source models differ greatly in natural language processing performance. A testing‑oriented AIGC model requires integrated capabilities: language understanding, image comprehension, and code understanding and generation. Choosing a suitable base model is a major challenge.
Differences in supporting frameworks and tools also determine the final performance of testing AIGC systems:
LangChain: Best adapted to ChatGPT and the OpenAI ecosystem. Open‑source models often require extra adaptation work, increasing development and training costs.
Plugins: ChatGPT plugins provide strong multimodal and real‑time information capabilities, while alternatives lag significantly.
After overcoming data collection and cleaning, training and fine‑tuning a quality‑domain AIGC model brings new difficulties:
Computing resource demand: Large models require high‑end hardware such as A100 GPUs, which are in short supply.
Model tuning and parameter configuration: Greater complexity requires more experiments and expertise.
Transfer learning and domain adaptation: Models need fine‑tuning to perform well in specific domains.
General benchmarks exist for foundation models, such as MMLU (English knowledge), C‑Eval (Chinese knowledge), GSM8K (math), and BBH (complex reasoning).
However, there is still no universal evaluation system for AIGC models in the testing domain. Accurately evaluating model performance and deciding whether it can be put into use is a new challenge. Building such a system involves:
Diverse testing scenarios: Functional, performance, security, and more.
Large‑scale data processing: Costly to collect, process, and annotate.
Metric selection: Coverage, accuracy, recall, false positive rate, etc.
Subjectivity in human evaluation: Inconsistent judgments among evaluators.
Model explainability and reliability: Must align with testing standards.
Generalization ability: Must perform well on unseen data and scenarios.
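The metric layer of such an evaluation system is the most tractable piece. A minimal sketch computing precision, recall, and false positive rate from a defect-prediction confusion matrix (the counts are illustrative):

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute basic defect-detection metrics from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall,
            "false_positive_rate": fpr}

# Example: the model flagged 8 real defects and 2 false alarms,
# missed 2 defects, and correctly passed 88 healthy cases.
print(classification_metrics(tp=8, fp=2, fn=2, tn=88))
```

The hard part, as the list above notes, is not computing these numbers but obtaining trustworthy ground-truth labels across diverse testing scenarios.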
A key application of AIGC in testing is automatically generating and executing test cases. However, generating reasonable, effective, high‑coverage test cases and automatically executing them in properly configured environments remains challenging. Two practical directions are Code2Test and Text2Test.
Code2Test generates test cases by analyzing source code. It extracts key paths, boundary conditions, and exceptions to reduce manual effort. Challenges include:
Code complexity and diversity: Different languages, frameworks, and libraries.
Dynamic code and runtime behavior: Reflection, plugins, dynamically generated code.
Implicit logic and hard‑to‑capture constraints: Runtime dependencies and environment conditions.
Error handling and exceptions: Complex failure paths.
Data and environment dependencies: Requires specific setups to run correctly.
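A static-analysis slice of Code2Test can be sketched with Python's standard `ast` module: parse a function's signature and propose boundary inputs for each annotated parameter. The boundary-value table is an illustrative assumption; real systems would also analyze control-flow paths and exception behavior.

```python
import ast

SOURCE = """
def apply_discount(price: float, percent: int) -> float:
    if percent < 0 or percent > 100:
        raise ValueError("percent out of range")
    return price * (1 - percent / 100)
"""

# Illustrative boundary values per annotated type.
BOUNDARY_VALUES = {"int": [0, 1, -1, 100], "float": [0.0, -1.0, 1e9]}

def propose_inputs(source: str):
    """Parse the first function and map each parameter to candidate inputs."""
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    cases = {}
    for arg in func.args.args:
        hint = getattr(arg.annotation, "id", None)
        cases[arg.arg] = BOUNDARY_VALUES.get(hint, [None])
    return func.name, cases

print(propose_inputs(SOURCE))
```

Even this toy version exposes the listed challenges: it sees nothing of runtime behavior, implicit constraints, or environment dependencies, which is precisely where the difficulty lies.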
Text2Test generates test cases from natural language documents such as requirements, specifications, and user stories using NLP. Challenges include:
Ambiguity in natural language: Multiple interpretations of the same text.
Missing and incomplete information: Gaps in requirements or user stories.
Large‑scale text processing: Requires efficient parsing algorithms.
Domain‑specific language and jargon: Needs specialized vocabulary understanding.
Text noise and redundancy: Requires cleaning before processing.
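To see why ambiguity is the central Text2Test problem, consider a toy sketch that handles only one rigid sentence pattern and defers everything else to a human. The regex, field names, and example requirement are illustrative.

```python
import re

# One fixed requirement phrasing; anything else is "ambiguous" here.
PATTERN = re.compile(
    r"The (?P<subject>[\w\s]+?) must (?P<action>reject|accept) "
    r"(?P<obj>[\w\s]+?) (?P<op>longer|shorter) than (?P<n>\d+) characters"
)

def text_to_test(requirement: str):
    """Turn one constrained sentence into a boundary test case, else None."""
    m = PATTERN.search(requirement)
    if m is None:
        return None  # unsupported or ambiguous phrasing -> needs a human
    n = int(m.group("n"))
    boundary = n + 1 if m.group("op") == "longer" else n - 1
    return {
        "title": (f"{m.group('subject')} {m.group('action')}s "
                  f"{m.group('obj')} of length {boundary}"),
        "input_length": boundary,
        "expected": m.group("action"),
    }

req = "The signup form must reject usernames longer than 20 characters"
print(text_to_test(req))
```

Real Text2Test replaces the regex with full natural-language understanding, and the fallback case above is exactly where the ambiguity, incompleteness, and jargon challenges concentrate.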
Human‑AI collaboration is essential during AIGC‑based test data generation, but it faces several pain points:
Language and communication barriers: Imperfect semantic understanding can lead to misunderstanding.
Risk of misleading results: AIGC output may guide testers in the wrong direction.
Trust challenges: The black‑box nature of AI reduces acceptance.
Complex human‑AI interaction: Requires clear guidance and interpretation.
Importance of human expertise: Professional testing experience remains irreplaceable.
Even with a high‑performance model, designing a product form that fits existing quality assurance workflows and meets diverse platform needs remains difficult.
The fastest MVP for testing AIGC is usually a chatbot or knowledge base based on embedding technology. However, building full‑featured testing products presents challenges:
Wide range of testing requirements: Functional, performance, security, automation, etc.
Customization and flexibility: Must adapt to different businesses and workflows.
Data and model integration: Must connect with existing testing data and systems.
UI/UX design: Must align with testers’ working habits.
Effectiveness verification and credibility: Requires reliable validation and explainability.
Product form directly affects commercialization strategy, promotion, and profitability.
When testing AIGC models are provided as SaaS services, privacy computing is often used to protect user data security. This “usable but invisible” data mode creates problems:
Model understanding barriers: No direct access to raw data limits learning.
Limited model optimization: Cannot refine based on real data distribution.
Lack of personalized tuning: Unable to adapt to user‑specific patterns.
Sampling and noise: Privacy operations distort data and affect model quality.
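The distortion trade-off can be illustrated with Laplace noise of the kind differential privacy adds to counting queries: stronger privacy (smaller epsilon) means larger noise and a less faithful statistic for the model to learn from. The epsilon values and seed below are illustrative.

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count: int, epsilon: float, seed: int = 0) -> float:
    # A counting query has sensitivity 1, so the noise scale is 1/epsilon.
    rng = random.Random(seed)
    return true_count + laplace_noise(1.0 / epsilon, rng)

true_failures = 42
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: reported count "
          f"{private_count(true_failures, eps):.2f}")
```

The same mechanism that keeps individual records invisible thus blurs the real data distribution, which is the root of the optimization and personalization limits listed above.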
This paper analyzes the difficulties of applying AIGC technology in the testing domain and aims to inspire further discussion to promote wider adoption and development of AIGC in testing.
Despite the challenges, we firmly believe that large AIGC models will dominate the future of the testing industry. Through continuous exploration and innovation—solving problems in data quality, model explainability, data processing, training, and evaluation—we can further advance the practical application of AIGC in software testing.