The pace of AI technology development has far exceeded our expectations. Every morning, we may wake up to a new AI testing tool released, a new framework open-sourced, or a new concept going viral. Instead of bringing clarity and excitement, this information explosion has created even more anxiety for software testers: Which tools should we learn? Which trends should we follow? Which opinions should we trust?
In this overwhelming wave, it is easiest to go with the flow—but also easiest to lose ourselves. Today’s trend may be forgotten tomorrow, and today’s tool may be replaced the day after. If we only chase the waves, we will never see the direction of the tide.
To avoid being swallowed by information and cognitive overload, I chose to pause the wave-chasing, step onto the shore, and draw on more than a decade of experience in software quality, combined with my current, limited understanding of AI, to clarify a fundamental question: where is software testing heading in the AI era?
This article is merely a snapshot of my thinking at this moment. As the AI era evolves at light speed, these views may soon change fundamentally alongside technological advances, my cognition, and practical experience.
In my opinion, the impact of the AI era on software testing practitioners can be divided into three key propositions, each addressing a fundamental shift in how we approach testing.
Proposition 1: Whether it is large model applications, Agents, or various intelligent decision-making platforms, when the system under test itself becomes an AI system, and when the core logic of the system shifts from deterministic "if-else" to "probability distribution", does our traditional software testing paradigm still apply? Does the testing paradigm need reconstruction? And how?
Proposition 2: When AI becomes a tool at our disposal, what can it help us do? Write test cases, generate code, analyze logs, predict risks—can these tasks that once required massive manual effort be automated or even intelligentized by AI? This is not just an evolution of "testing tools", but a leap in "testing productivity" that every tester must embrace.
Proposition 3: As more and more code is generated by AI (with the popularity of tools like Claude Code, Cursor, etc.), what kind of system will we face? Are there obvious differences in error patterns between AI-generated code and human code? Does AI understand requirements differently from humans? When development speed increases exponentially due to AI, can testing keep up?
In this article, I will focus on the first two propositions: how to test AI software systems and how to empower traditional software testing with AI. (The company I work for is a startup, and we have not yet widely adopted AI programming. In line with the principle that "practice is the sole criterion for testing truth", Proposition 3 will not be discussed for now. However, I include it here for conceptual completeness.)
Some may say: this is not a new proposition. E-commerce personalized recommendations, short-video recommendation systems—aren't they all AI systems? We have always tested them. That is true! But as the AI era arrives, the connotation of this proposition has changed, bringing new challenges and opportunities for AI system testing.
In the pre-large-model era, although many vendors applied self-developed vertical models to empower businesses, such practices had extremely high infrastructure requirements: massive data accumulation, complete A/B testing platforms, top-tier algorithm R&D teams. This meant that only leading software vendors could truly implement AI at scale. For mid-tier and smaller enterprises, the cost of self-developed AI was prohibitive.
The emergence of large models broke this barrier. Suddenly, a large number of non-software vendors and mid-to-long-tail software developers saw the possibility of implementing AI in their own businesses. A manufacturing factory, a medical device company, a law firm—all could quickly build their own AI applications.
But with this came an exponential rise in pressure on AI accuracy—a critical factor that redefines how we test AI systems.
Why? Because the scenarios have changed. In the past, AI applications were mainly in ToC fields—if a recommendation was wrong, users would just scroll away; if search results were inaccurate, users would try again. Users had relatively high "error tolerance" for AI.
But when AI enters mission-critical fields such as manufacturing, healthcare, finance, and law, error tolerance plummets. If a knowledge base Agent gives incorrect data when answering "What is the temperature limit of a certain material", the consequences for rigorous manufacturing can be catastrophic. This shift demands a new approach to AI testing—one that is tailored to specific industries.
There are exceptions, of course. Due to widespread "AI anxiety" across industries, we have indeed seen scattered use cases: AI embedded in platforms with heavy business logic, or customers symbolically launching AI features to avoid being seen as "outdated". Quality assurance for such systems still relies mainly on traditional functional testing, and accuracy is not a current priority. But such scenarios are relatively rare.
Evaluating the effectiveness of AI systems involves many steps: constructing evaluation data, determining sample sizes, selecting metrics, and building tools. But if I must pick the most important one, it is the fidelity of evaluation data.
Fidelity means we need high-quality business data as evaluation data. To obtain such data, we must understand the business to a certain extent—understand how users interact with AI systems in business scenarios, and understand what is "correct" from a business perspective (i.e., how to determine ground truth).
In recent discussions, some ToC practitioners and even managers do not understand why evaluation data is difficult to obtain. For ToC software, traffic is abundant, and large amounts of evaluation data can be gathered through online feedback. Moreover, most ToC business experts are in-house, making data construction relatively smooth.
But in most ToB scenarios, we face a harsh reality: no evaluation data exists. Customers purchase AI systems often because "they don’t know how to do it themselves and hope AI can help". Yet when we need evaluation data to verify AI effectiveness, we find customers cannot provide "standard answers". This is a classic "chicken or egg" dilemma that testers must navigate.
Even worse, even if customers are willing to cooperate, their level of collaboration is often insufficient to support high-quality data co-construction. Data annotation requires manpower, domain knowledge, and time costs—and in customers’ priority lists, their own business progress matters more. Annotating data for an auxiliary AI system is usually low-priority.
In this situation, we can only adopt a pragmatic strategy: At the early stage of a project, first design conceivable scenarios, build a preliminary evaluation set, obtain a baseline, and decide whether to launch. After launch, through continuous verification, compare the differences between preset scenarios and real customer scenarios, and gradually optimize evaluation data and industry know-how. This iterative approach ensures that AI testing evolves with the business.
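As a concrete illustration of this pragmatic strategy, the sketch below builds a tiny preliminary evaluation set and computes a baseline pass rate. The `ai_answer` stub, the scenarios, and the keyword-based scorer are all illustrative assumptions; a real project would define ground truth together with business experts and use richer metrics than keyword matching.

```python
# Minimal sketch of a preliminary evaluation set and baseline pass rate.
# The scenarios, the ai_answer() stub, and the keyword scorer are
# illustrative assumptions, not a real system.

def ai_answer(question: str) -> str:
    """Stub standing in for the AI system under test."""
    canned = {
        "What is the temperature limit of material X?": "The limit is 450 C.",
        "Which line produces part Y?": "Line 3 produces part Y.",
    }
    return canned.get(question, "I don't know.")

def passes(answer: str, required_keywords: list[str]) -> bool:
    """Naive scorer: the answer must mention every required keyword."""
    return all(kw.lower() in answer.lower() for kw in required_keywords)

# Preliminary evaluation set: conceivable scenarios designed before launch,
# each with keywords a "correct" answer must contain (the ground truth).
eval_set = [
    {"question": "What is the temperature limit of material X?", "keywords": ["450"]},
    {"question": "Which line produces part Y?", "keywords": ["Line 3"]},
    {"question": "What is the torque spec for bolt Z?", "keywords": ["torque"]},
]

results = [passes(ai_answer(case["question"]), case["keywords"]) for case in eval_set]
baseline = sum(results) / len(results)
print(f"Baseline accuracy: {baseline:.0%}")  # 2 of 3 scenarios pass
```

After launch, real customer questions replace or extend the preset scenarios, and the same loop recomputes the baseline against the improved evaluation set.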
Conclusion: understanding basic AI concepts is necessary, but an in-depth understanding of AI technology is "nice to have" rather than mandatory for software testing.
Take my own experience as an example. I once participated in testing for autonomous driving. To understand the source code, I reviewed linear algebra and probability theory. When studying algorithm implementation, I even wondered why developers used left-matrix multiplication instead of right-matrix multiplication, or whether confidence intervals were calculated. But the deeper I dived, the more I realized: such in-depth research contributed very little to business quality improvement.
For developers, telling them "positioning is inaccurate under overpasses" is more important than questioning "why use left multiplication instead of right multiplication". As long as you can define the scenarios (under overpasses, in tunnels, under strong backlight…), developers can fix them in a targeted way. Instead of spending time digging into source code, you would be better off driving around Beijing to collect those complex scenarios on-site. The latter may look technically unimpressive, but it is the action that drives real quality improvement.
My judgment is: if in-depth understanding of AI technology does not bring substantial improvement to AI evaluation, then this action is not mandatory. The core value of testing lies in defining "what the correct scenario is", not proving "why the algorithm is written this way".
My judgment: some stages of AI evaluation will be enhanced or replaced by AI, but evaluation itself will not disappear; it will evolve into new forms. This is good news for testers: our role will adapt, not vanish.
Predictable evolutionary directions include:
The generation of evaluation data will require more vertical, private-domain industry know-how to guide AI. This means testers with deep business knowledge will remain invaluable.
The design phase of evaluation still needs humans. Who defines evaluation dimensions? Who sets pass criteria for different scenarios? Who ensures evaluation data represents the real business? These questions involve business understanding, risk judgment, and value trade-offs, and are difficult for AI to take over in the short term.
Because the essence of evaluation design is to build a "digital twin" of the business in the human brain—a cognitive model that can simulate real business scenarios, user behaviors, and value judgments. Test designers must deduce system performance in different situations, predict potential deviations, and design targeted evaluation plans based on this cognitive model.
Current AI cannot fulfill this role mainly due to two limitations:
Fundamental limitations of model capability: Large language models represented by the Transformer architecture are still essentially probability-based statistical models. They can generate fluent text and match semantic similarity, but do not truly "understand" the world. Such "understanding" is critical when evaluation design requires insight into implicit business logic, prioritizing scenarios, and judging tolerable deviations. This may only change when Professor Fei-Fei Li's "World Model" is realized, enabling AI to have causal reasoning ability about the physical world and social rules.
Realistic constraints of enterprise digitalization: Building a digital twin relies on a highly digital enterprise environment—standardized processes, data assets, quantifiable businesses. However, constrained by domestic market-driven models and management levels, except for a few top enterprises, most companies have not reached this state. Key information is still scattered in documents, communication, and personal experience, difficult for AI to fully capture. In this case, only insiders can form a "dynamic snapshot" of the business through long-term immersion.
These two limitations jointly determine that, at least in the foreseeable future, the dominance of evaluation design will remain in human hands, and the focus of software testers may shift toward this direction.
AI is not just changing what we test—it’s changing how we test. For traditional software testing, AI offers unprecedented opportunities to boost efficiency, reduce manual effort, and focus on high-value work. Below, we break down the scope of traditional testing and how AI is transforming its core competencies.
Since the early 2000s, driven by Agile and DevOps, the role of software testing has undergone a profound transformation—no longer intervening only during the testing phase, but running through the entire lifecycle from requirements to production through shift-left testing and shift-right testing. This end-to-end approach is critical for modern software quality, and AI is making it more efficient.
Although different companies define testing work differently, within this framework, after shift-left and shift-right, the scope of testing can be roughly categorized as:
Plan Review: PRD review, technical solution review, Code Review, launch plan review, monitoring plan review, canary release review.
Test Design: Test scope assessment (based on code changes, business impact), test strategy writing (test layering, tech selection, environment dependencies, pass criteria), test case writing (functional flows, abnormal scenarios, boundary conditions, implicit requirements).
Test Execution: Test environment preparation, test data construction, test tool development, test execution (functional / performance / security / effectiveness / experience / high-availability testing), data collection and analysis (results, system logs, monitoring metrics).
Quality Decision: Process risk disclosure, test report writing.
Production Insight: Production monitoring system construction, canary monitoring and analysis, production issue localization and mitigation, post-incident review.
Despite differences in maturity across companies, teams, and tech stacks, some core, universal competency requirements for testers persist beneath these surface differences. These competencies form the internal "moat" of the software testing community, and AI is reshaping how we apply them.
What it is: Information is the underlying fuel for quality decisions. In most domestic companies, due to low process standardization and digitization, key information is often "fragmented" and "high-entropy". It is scattered in verbal communication, offline/online documents, internal system data and logs, code and comments—isolated and even contradictory.
Before starting specific work, testers must first reorganize and splice these isolated data in their minds to build a model that can be reasoned about. Only when these fragments form a physical system model in the brain can subsequent planning, execution, and decision-making have a framework to rely on. Otherwise, the work will be scattered and prone to omissions.
Replaceable technologies in the AI era:
Obtaining information through communication: AI cannot fully replace human-to-human communication for now. However, as technical roles deeply adopt AI tools, knowledge precipitation and experience output are gradually becoming standardized and normalized. In the long run, the current communication-dominated information acquisition mechanism may be weakened or eliminated.
Obtaining information through documents: The mainstream technical path is to use knowledge base platforms combined with RAG (Retrieval-Augmented Generation) and graph reasoning for information extraction and integration. Limitations remain: low RAG retrieval accuracy, unresolved model hallucinations, limited QA accuracy based on graph structures. Knowledge graphs are being built but lack mature best practices.
Obtaining information through internal systems, code, logs: This essentially requires two basic capabilities: (1) The ability to call local tools (Windows/Linux CLI/GUI). (2) The ability to operate business systems via the Web. The rise of tools like OpenClaw is expected to accelerate the solution to the first category. For the second, paths such as OpenClaw, Playwright automation scripts, and Web-based MCP services have emerged. With moderate human intervention and continuous iteration, these limitations are expected to be substantially broken in the short term.
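To make the document-based path above concrete, here is a toy version of RAG's retrieval step: score knowledge-base chunks against a query and hand the best match to the model as context. The bag-of-words cosine similarity and the sample documents are simplifications for illustration; production systems use learned embeddings and vector stores, and this sketch deliberately shows how brittle naive retrieval can be.

```python
# Toy illustration of the retrieval step in RAG: score knowledge-base
# chunks against a query and keep the best match as context for the model.
# Real systems use learned embeddings and a vector store; the bag-of-words
# vectors and sample documents below are illustrative assumptions.
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    """Bag-of-words term counts (a stand-in for a real embedding)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

knowledge_base = [
    "The staging environment is refreshed from production every Sunday.",
    "Performance tests must keep p99 latency under 300 ms.",
    "Canary releases route 5 percent of traffic to the new version.",
]

def retrieve(query: str, docs: list[str]) -> str:
    """Return the chunk most similar to the query (top-1 retrieval)."""
    return max(docs, key=lambda d: cosine(vectorize(query), vectorize(d)))

context = retrieve("p99 latency target for performance tests", knowledge_base)
print(context)  # the performance-testing chunk scores highest
```

The retrieved `context` would then be prepended to the model prompt; the retrieval-accuracy limitation mentioned above shows up exactly here, when the wrong chunk wins.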
It should be emphasized that the effectiveness of AI technologies depends not only on tool evolution but also on the enterprise’s strategic willingness, technical confidence, digital infrastructure, and data governance. Only when AI tools evolve synergistically with the organizational digital foundation can these alternative paths truly unleash their potential.
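Returning to the first of the two basic capabilities above, calling local CLI tools from an agent can be sketched as a guarded command runner. The whitelist and helper below are illustrative assumptions, not how OpenClaw or any specific product is implemented.

```python
# Minimal sketch of the "call local tools" capability: an agent runs a
# whitelisted CLI command and returns its output as observation text.
# The whitelist and helper are illustrative assumptions, not how
# OpenClaw or any specific product works.
import shlex
import subprocess

ALLOWED_COMMANDS = {"echo", "ls", "grep", "cat"}  # guardrail against arbitrary execution

def run_tool(command_line: str, timeout: float = 10.0) -> str:
    """Execute a whitelisted local command and capture stdout for the agent."""
    args = shlex.split(command_line)
    if not args or args[0] not in ALLOWED_COMMANDS:
        return f"refused: '{args[0] if args else ''}' is not an allowed tool"
    result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    return result.stdout if result.returncode == 0 else f"error: {result.stderr}"

print(run_tool("echo collecting test logs"))  # prints: collecting test logs
print(run_tool("rm -rf /tmp/anything"))       # refused by the whitelist
```

The whitelist is the "moderate human intervention" in miniature: humans decide which tools the agent may touch, and the agent only sees captured output.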
What it is: In most scenarios, testers receive only a goal. We need to break it down into executable steps and act. For example, a performance test requires data construction, script writing, tool debugging, execution, metric collection, result analysis, and go/no-go judgment.
Decision-making means making trade-offs under constraints. If a version is delayed by three days, performance testing shows interface response increased by 50ms, and the business side says "almost unnoticeable"—do you approve the release? This decision requires understanding technical risks, business tolerance, delivery rhythm, and organizational culture. There are no standard answers, only comprehensive judgments based on information.
Replaceable technologies in the AI era: Agentic AI has shown strong potential in problem planning. Intelligent agents such as Claude and Cursor can automatically decompose user-given goals and perform reasoning and decision-making through self-query, self-reflection, and code generation. However, AI’s performance in planning and decision-making highly depends on the integrity of information modeling and the quality of experience precipitation. If input information is inaccurate, domain models are unclear, or real decision cases are lacking, AI’s decomposition and judgment may deviate from reality.
What it is: No matter how complete information modeling or planning is, it must finally land in executable actions. Action execution is the most basic and core practical capability of testers, covering document output, tool operation, script development, and collaboration.
Main forms:
Document output & knowledge precipitation
Test script development
Local tool operation (Windows/Linux CLI/GUI)
Browser operation for test platforms
Communication & promotion (coordinating dev, ops, etc.)
Replaceable technologies in the AI era:
Document and script writing: LLMs (Large Language Models) and agents have mature capabilities to efficiently complete templated, structured output—saving testers hours of manual work.
Local tool operation: Technologies represented by OpenClaw are making automated interaction in local environments possible, reducing the need for manual tool operation.
Browser operation: Agents can already interact with pages through visual understanding and simulated clicks, automating repetitive UI testing tasks.
Communication & promotion: This is one of the areas difficult for LLMs to replace, as it involves human understanding, emotional judgment, and dynamic coordination.
Overall, similar to planning and decision-making, action execution also faces a high probability of being replaced in the foreseeable future. When knowledge precipitation, code generation, tool operation, and UI interaction can all be done by AI, many execution-layer tasks will gradually shift from human-machine collaboration to agent-led.
What it is: Information modeling, planning, decision-making, and execution all rely on a common foundation—experience. Experience is the basic building block that can be reused and inherited at the industry, project, domain, or company level. Examples include "how to write test cases", "how to review product requirements", "how to perform performance testing".
Replaceable technologies in the AI era: Prompt engineering and recent Skills features in platforms like Claude and Cursor are essentially systematic ways to precipitate experience. Technically, these tools have no obvious limitations. However, two key issues affect the effect:
How to effectively extract capabilities from human thinking. Some people lack high-level abstraction ability; others hold back experience for job security.
No clear best practices yet. For example, which content belongs in Prompts, Skills, or Skills/references? How detailed should content be to ensure AI understanding without overfitting?
Issue 1 is not a fundamental limitation—the testing industry has a strong culture of sharing. Issue 2 is a matter of practical exploration, and best practices will emerge over time.
In summary: except for information modeling, which is deeply bound to the organization's digital foundation, and local/browser operations that need further iteration, most other aspects are basically ready for implementation through "human on the loop" or "human in the loop" collaboration.
The above capabilities form an end-to-end closed-loop for humans to complete tasks, and the integrity and effectiveness of this loop define the end-to-end quality of deliverables. This loop is recursive: complex problems are broken into subproblems, each following the same closed-loop until simple enough to solve with existing experience.
Although current limitations will eventually be broken, accuracy remains a real challenge when relying on AI for tasks. For example, if a task has 4 steps, each with 90% AI accuracy, the end-to-end accuracy is only 65.61%. Therefore, humans must intervene in AI workflows through specific collaboration models to ensure quality.
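The arithmetic behind this compounding is simply the per-step accuracy raised to the number of steps, which a quick sketch makes explicit (the same formula gives the 10-node figure discussed later):

```python
# End-to-end success of an AI workflow: n independent steps, each
# succeeding with probability p, compound to p ** n overall.

def end_to_end_accuracy(p: float, n: int) -> float:
    return p ** n

print(f"{end_to_end_accuracy(0.90, 4):.4f}")   # 0.6561, i.e. the 65.61% above
print(f"{end_to_end_accuracy(0.98, 10):.4f}")  # 0.8171, i.e. about 81.7%
```

This assumes the steps fail independently; correlated failures can make the real figure better or worse, but the downward pressure on end-to-end accuracy remains.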
Currently, human-AI collaboration mainly takes three forms—each suited to different testing tasks and AI maturity levels:
Human on the loop: AI autonomously completes the workflow; humans observe key outputs and supplement or intervene when necessary. Suitable for: test tool development, test document writing, preliminary plan review—where AI capabilities are relatively mature. This mode saves time while preserving quality control.
Human in the loop: Humans check and correct in real time at each key step of AI execution to ensure relatively accurate input for the next AI module. Suitable for: test case generation (strategy design → business model creation → test case creation). AI is not 100% accurate, and human real-time modification helps both the brain and the AI model the requirements. This is currently the most common and effective mode for most testing teams.
Human-led: Humans dominate the workflow; AI only provides fragmented assistance in local links. Suitable for: core test execution. Even so, we must keep an eye on new technologies to judge the right time for introduction. This mode is appropriate for high-risk, business-critical testing tasks.
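The second mode, "human in the loop", can be sketched as a pipeline in which a human checkpoint sits between AI steps, so each stage receives reviewed input. The stage functions and the auto-approving reviewer below are stubs for illustration; their names are assumptions, not a real tool's API.

```python
# Sketch of a "human in the loop" pipeline: after each AI step, a human
# reviewer may correct the intermediate result before it feeds the next
# step. The step functions and auto-approving reviewer are stubs.
from typing import Callable

def hitl_pipeline(goal: str,
                  steps: list[Callable[[str], str]],
                  review: Callable[[str, str], str]) -> str:
    """Run AI steps sequentially, letting a human review each output."""
    artifact = goal
    for step in steps:
        draft = step(artifact)                   # AI produces a draft
        artifact = review(step.__name__, draft)  # human corrects or approves
    return artifact

# Illustrative stages of test case generation (names are assumptions):
def design_strategy(goal: str) -> str:
    return f"strategy for: {goal}"

def model_business(strategy: str) -> str:
    return f"business model from ({strategy})"

def write_cases(model: str) -> str:
    return f"test cases based on ({model})"

def human_review(step_name: str, draft: str) -> str:
    # A real reviewer would edit the draft; this stub just approves it.
    return draft

result = hitl_pipeline("login feature",
                       [design_strategy, model_business, write_cases],
                       human_review)
print(result)
```

Swapping `human_review` for an interactive prompt, or for a no-op, is exactly the difference between "human in the loop" and "human on the loop".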
In my opinion, even as AI continues to break through, basic reliability engineering still applies: for complex processes (e.g., 10 AI nodes), even if each node reaches 98% accuracy, the end-to-end success rate drops to about 81.7%. This means that before true AGI (Artificial General Intelligence) arrives, "human in the loop" will likely remain a long-term basic form of human-AI collaboration.
Therefore, how to effectively control AI Agents based on the "human in the loop" logic to complete various testing tasks with high quality and efficiency may be the core proposition for software testers from the current AI era to the AGI era.
Under "human in the loop" or "human on the loop" collaboration, an implicit challenge emerges: career insecurity brought by AI. Many testers worry that AI will replace their jobs, but this anxiety is often unfounded—if you adapt proactively.
This insecurity mainly comes from two psychological struggles:
Natural doubt about AI accuracy: Testers know AI makes mistakes. Handing tasks to AI feels like hanging in the air, not knowing when problems will occur.
Worry about personal capability degradation: a long-term habit of "reviewing" AI answers instead of building test logic from scratch may quietly erode the brain's modeling accuracy and reasoning ability for complex tasks. Worse, AI bears no responsibility; humans ultimately do.
My suggestion: Don’t over-internalize friction. The scale of decision-making should align with the organization’s current strategic rhythm.
If your organization is strongly promoting AI: Let go of the obsession with perfection, follow the company’s defined best practices. Trial and error is inevitable. Focus on rapid response and remediation—every mistake is a learning opportunity for the team.
If your organization is exploring or watching: Stay cautious. Proactively align expectations with leaders, define clear "experimental zones" and "protected zones". Boldly try AI for low-risk, failure-tolerant scenarios; rely on humans for high-risk, core business scenarios. This balanced approach minimizes risk while building AI experience.
AI is penetrating every link of software R&D at an unprecedented speed. When developers use AI to generate code and product managers use AI to write PRDs, a sharp question faces every tester: What should we do? The answer lies in a balanced approach—neither waiting nor rushing, but walking.
No one can stay untouched by the AI wave. When developers multiply code output with AI and products iterate rapidly with AI, testing that remains stagnant with manual or traditional automation will inevitably become the obvious bottleneck in the R&D chain. Waiting means giving up adaptation, and giving up adaptation means being marginalized. When the entire production line speeds up but the only quality control link stalls, the result can only be collapse or elimination. For testers, waiting is not an option.
If we cannot wait, should we immediately turn to full-speed AI transformation? For top-tier teams with sufficient computing power and algorithm talents, maybe. But for most mid-tier and below testing teams, blind "rushing" is likely a risky gamble with little gain.
On one hand, "rushing" requires huge investment: purchasing tokens, building services, assigning core members to self-development—all cost massive manpower and energy. On the other hand, "rushing" faces great uncertainty: AI develops exponentially. A tool your team spent half a year building may be overtaken by an open-source project or commercial product tomorrow, making long-term efforts worthless overnight.
For most testers and teams, the most pragmatic and resilient strategy is neither waiting nor rushing, but walking. "Walking" has two meanings that align with long-term success:
Walk with the industry wave: Stay sensitive to cutting-edge tools, not pursuing first release, but rapid follow-up. Test new AI testing tools, learn from industry best practices, and adopt what works for your team. This avoids the risk of overinvesting in unproven technologies.
Walk with the company’s AI strategy: Closely align the testing team’s efficiency goals with the company’s overall AI organizational transformation. This ensures that your AI efforts add value to the business and receive support from leadership.
For most mid-tier teams, the initial dividend from AI may only be fragmented time: A requirement review that originally took half a day is shortened to 3 hours with AI assistance, saving 1 hour. Manually written test data is generated via AI scripts, saving 30 minutes. These fragmented time slices do improve personal efficiency but are often not enough to take on an entire new project independently. They only ease the originally tight work rhythm.
So what is the value of such minor efficiency improvement? The value lies in winning mobility for the team and individuals—becoming "masters of current AI tools". When the company needs to introduce AI testing technology, the team can immediately provide valuable judgments instead of exploring from scratch. This positions testers as strategic partners in the AI transformation, not just executors.
The AI era is not a threat to software testing—it is an opportunity to elevate our role. By focusing on domain expertise, scenario definition, and human-AI collaboration, testers can remain indispensable. The key is to adapt strategically: follow the wave, align with your company’s goals, and prioritize high-value work that AI cannot replace.
As AI continues to evolve, the best testers will be those who embrace change, learn continuously, and leverage AI as a tool to deliver better quality software. The future of software testing is not about being replaced by AI—it’s about working with AI to achieve more than we ever could alone.