With the explosive development of AI technology and the in-depth upgrading of enterprises' intelligent demands, many companies are shifting from traditional RPA (Robotic Process Automation) to AI Agent product research and development — and our company is no exception. In late 2024, we made a strategic transformation, moving away from the RPA product track we had been deeply engaged in for years to focus on AI Agent innovation.
This transformation was driven by clear limitations of traditional RPA: while RPA excels at standardized, procedural repetitive tasks, it struggles with autonomous decision-making, complex scenario adaptation, and multi-round interactive collaboration. These gaps make RPA unable to meet modern customers' demands for "intelligent, autonomous, and scenario-based" solutions. In contrast, AI Agent leverages autonomous decision-making, interaction, and task execution capabilities — it can proactively decompose tasks, call tools, and adapt to complex scenarios based on objectives, becoming a core breakthrough for enterprise digital transformation.
After this transformation, the core value and technical architecture of our products changed fundamentally, and our approach to performance testing had to evolve with them. Unlike traditional RPA testing, which focuses on process execution efficiency and stability, AI Agent performance testing covers both basic metrics (response speed, concurrency) and Agent-specific dimensions (thinking efficiency, decision accuracy, tool call rationality). The core goal? Verify that your AI Agent runs stably under different pressure scenarios while maintaining its intelligent decision-making capabilities, ensuring it truly adapts to production environment requirements.
Evaluating AI Agent performance requires more than just traditional service metrics — you must also assess its intelligent features. Both are critical to ensuring usability and reliability.
These metrics are the foundation of an AI Agent’s normal operation, aligned with conventional microservice testing but tailored to Agent operating characteristics:
Response Performance: Focus on average response time and P90/P95/P99 percentile response times. Crucially, distinguish between "pure thinking time", "total time for thinking + tool calls", and "single-round response time in multi-round interactions" — evaluate time standards for each scenario separately.
Concurrency and Throughput: Key metrics include maximum concurrent users supported without lag, the concurrency threshold that triggers performance degradation, and tasks/interactions processed per unit time (TPS/QPS).
Resource Utilization: Monitor CPU usage, resident/peak memory, disk I/O (log/cache writing), and network I/O (large model calls, tool integration, multi-agent communication). Watch closely for memory leaks and long-term high CPU load.
Stability and Scalability: Track error rates (interface errors, tool call failures) during long-term operation, mean time between failures (MTBF), and automatic recovery after exceptions. Verify that throughput and concurrency scale linearly with horizontal expansion to avoid ineffective scaling.
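The percentile metrics above can be computed directly from raw timing samples. A minimal sketch, assuming each request is logged with separate thinking and tool-call durations (the field names are illustrative, not a real Agent API):

```python
# Minimal sketch: compute percentile response times from per-request
# timings, keeping "pure thinking" and "thinking + tools" separate.
# Field names ("thinking_ms", "tool_ms") are invented for illustration.

def percentile(samples, p):
    """Nearest-rank percentile of a list of numbers (p in 0..100)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

records = [
    {"thinking_ms": 1200, "tool_ms": 300},
    {"thinking_ms": 1500, "tool_ms": 0},
    {"thinking_ms": 900,  "tool_ms": 2100},
    {"thinking_ms": 4200, "tool_ms": 800},
]

pure = [r["thinking_ms"] for r in records]
total = [r["thinking_ms"] + r["tool_ms"] for r in records]

for label, series in (("pure thinking", pure), ("thinking + tools", total)):
    print(label, {f"P{p}": percentile(series, p) for p in (90, 95, 99)})
```

In practice the series would come from pressure-test logs rather than a hard-coded list, and P90/P95/P99 would be tracked per concurrency level.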
These are the key differentiators for judging AI Agent usability. Design detection standards around real business scenarios (single task, multi-task decomposition, tool calls, multi-round interaction, multi-agent collaboration):
Thinking Efficiency: Measure time per thinking step, number of steps to complete a goal (fewer = more efficient), and presence of ineffective thinking (detours, repeated reasoning).
Tool Call Performance: Evaluate tool call success rate, average time consumption (request + response + result parsing), and unnecessary call rate (invalid tool calls). For serial multi-tool calls, verify total time and success rate meet standards.
Decision Accuracy: Compare decision accuracy under pressure to low-concurrency baseline values — ensure no elementary errors occur at high pressure. Task completion rate is critical: define clear "completion standards" (e.g., goal achieved, results meet expectations) and track instruction understanding errors or task decomposition failures.
Multi-Round Interaction Capability: Check for context loss in multi-round conversations, controllable cumulative response time, and ability to complete complex tasks end-to-end.
Multi-Agent Collaboration (for multi-Agent scenarios): Monitor inter-Agent communication latency, total collaborative task time, conflict resolution speed, and resource contention during concurrent collaboration.
Context Window Adaptation: Verify response speed and resource utilization stability across different window sizes (4k/8k/32k) — ensure thinking/decision accuracy doesn’t drop with large windows.
Exception Handling Capability: Evaluate retry strategy effectiveness, rapid recovery, and ability to resume tasks after tool call failures, large model timeouts, or task interruptions.
A reliable test environment is critical for credible results. It must be standardized, isolated, and simulate production dependency links (large models, tool services, databases) — otherwise, test results are irrelevant for production readiness. Here’s how to build it:
Baseline Test Environment: Single Agent instance with dedicated dependencies (large models, tool services) — no external pressure interference. Purpose: Obtain "clean baseline metrics" (response time, decision accuracy under low concurrency) to use as a benchmark for pressure tests.
Pressure Test Environment: Fully aligned with production configurations (Agent deployment method, instance count, server specs, dependency versions). Simulate production status for dependent services (e.g., add large model delays, set tool service concurrency limits). Never run pressure tests in production.
| Component | Configuration Key Points |
|---|---|
| Agent Deployment | Consistent with production (container/virtual machine, instance count, running parameters, resource limits); no arbitrary adjustments. |
| Server | Record CPU, memory, disk, and network specs; monitor resource changes in real time during pressure tests. |
| Dependent Services | Match production large model vendor, model, and temperature; align tool service API address, authentication, and concurrency limits; use production-level database/cache data volume. |
| Middleware | For multi-Agent collaboration, align message queue (Kafka/RabbitMQ) and distributed lock configurations with production. |
| Monitoring & Pressure Testing Tools | Deploy full-link monitoring (Prometheus+Grafana, SkyWalking); use tools that support custom requests, multi-round interactions, concurrency control, and result assertion. |
Physically isolate the pressure test environment from development/testing environments to avoid resource contention. Deploy dedicated pressure test instances for dependent services — don’t share with other environments — to ensure pressure is applied only to the test object.
Test cases must be rooted in real business scenarios, with clear objectives, input conditions, metric thresholds, and judgment criteria. Progress in this order: single-Agent basic scenarios → complex scenarios → multi-Agent collaboration scenarios.
Every test case should include: a clear scenario (e.g., "single-Agent tool call", "multi-round interactive Q&A"), specific inputs (user instructions/task objectives covering simple/medium/complex levels), concurrency model (concurrent users, test duration, pressure mode: step/continuous/burst), baseline metrics (low-concurrency reference values), threshold requirements (qualification standards), and metrics to collect.
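As a concrete illustration, a test case with these fields can be captured as a simple record plus a pass/fail check. The schema, field names, and threshold values below are examples drawn from the scenarios that follow, not a required format:

```python
# Illustrative test-case record; every name and value here is an example.
test_case = {
    "scenario": "single-Agent tool call",
    "inputs": ["query today's Beijing temperature"],            # simple/medium/complex instructions
    "concurrency_model": {"mode": "continuous", "users": 50, "duration_min": 30},
    "baseline": {"p95_total_s": 6.0},                           # low-concurrency reference value
    "thresholds": {                                             # qualification standards
        "tool_call_success_rate": 0.99,
        "p95_total_s": 15.0,
        "invalid_call_rate": 0.01,
    },
    "metrics": ["tool_call_success_rate", "total_time", "invalid_call_rate"],
}

def qualifies(results: dict, case: dict) -> bool:
    """Pass only if every measured metric meets its threshold
    (>= for success rates, <= for times and error rates)."""
    t = case["thresholds"]
    return (
        results["tool_call_success_rate"] >= t["tool_call_success_rate"]
        and results["p95_total_s"] <= t["p95_total_s"]
        and results["invalid_call_rate"] <= t["invalid_call_rate"]
    )

print(qualifies(
    {"tool_call_success_rate": 0.995, "p95_total_s": 12.0, "invalid_call_rate": 0.004},
    test_case,
))
```

Keeping cases in a machine-readable form like this makes the later "compare metrics to thresholds" review step mechanical.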
Input: 3 instruction types (simple: 1+2*3=?; medium: design a weekend parent-child travel plan; complex: analyze product user growth logic and propose 3 suggestions)
Pressure mode: Step pressure (10→50→100→200 concurrency, 5 minutes per level)
Key metrics: Response time, TPS, CPU/memory utilization, decision accuracy, error rate under different concurrency
Qualification standards: P95 response time ≤8s, decision accuracy ≥98%, error rate ≤0.5%, CPU utilization ≤70% at 100 concurrency
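The step-pressure pattern above (10→50→100→200 concurrency, 5 minutes per level) boils down to a schedule function. A framework-independent sketch; the same logic could back a Locust `LoadTestShape.tick()` (returning `(users, spawn_rate)`) or a k6 `stages` list:

```python
# Step-pressure schedule: 10 -> 50 -> 100 -> 200 concurrency,
# 5 minutes per level, then stop.

LEVELS = [10, 50, 100, 200]
STEP_SECONDS = 5 * 60

def target_users(elapsed_seconds: float):
    """Concurrency target at a given elapsed time; None means the test is over."""
    step = int(elapsed_seconds // STEP_SECONDS)
    if step >= len(LEVELS):
        return None
    return LEVELS[step]

# Spot-check the level boundaries.
for t in (0, 299, 300, 900, 1200):
    print(t, target_users(t))
```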
Input: Single tool call (query today’s Beijing temperature) and multi-tool serial calls (query latest stock price → calculate price change rate → generate simple analysis)
Pressure mode: Continuous pressure (50 concurrency for 30 minutes)
Key metrics: Tool call success rate, total time consumption, invalid call rate
Qualification standards: Call success rate ≥99%, P95 total time consumption ≤15s, invalid call rate ≤1%
Input: Multi-round context dialogue (recommend sci-fi movies → introduce directors → recommend 3 similar movies)
Pressure mode: Burst pressure (0→100 concurrency for 10 minutes)
Key metrics: Single-round response time, context retention rate, final task completion rate
Qualification standards: No context loss, task completion rate ≥95%, single-round P95 response time ≤10s
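A multi-round probe like this can be scripted by replaying turns against the Agent and flagging context loss with a crude marker check. A sketch in which `call_agent` is a hypothetical stand-in for the real Agent endpoint:

```python
import time

def run_dialogue(call_agent, turns, context_markers):
    """Replay scripted turns; per turn, record latency and whether the reply
    still reflects earlier context (simple keyword check).
    call_agent(history) -> reply is a hypothetical stand-in."""
    history, report = [], []
    for user_msg, markers in zip(turns, context_markers):
        history.append(("user", user_msg))
        start = time.perf_counter()
        reply = call_agent(history)
        latency = time.perf_counter() - start
        history.append(("agent", reply))
        retained = all(m.lower() in reply.lower() for m in markers)
        report.append({"latency_s": latency, "context_retained": retained})
    return report

# Stubbed agent for demonstration; a real run would hit the Agent service.
def fake_agent(history):
    return "Since you liked Interstellar, here are similar sci-fi picks."

turns = [
    "Recommend a sci-fi movie",
    "Who directed it?",
    "Recommend 3 similar movies",
]
# Markers expected to survive into each reply ([] = no check for that turn).
markers = [[], [], ["sci-fi"]]
report = run_dialogue(fake_agent, turns, markers)
print(report)
```

Under pressure, many such dialogues would run concurrently (e.g., as Locust user tasks), and the per-turn latencies and retention flags feed the metrics above.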
Input: Collaborative task (A collects industry data → B analyzes → C generates report → summarizes for users)
Pressure mode: Multi-batch pressure (10 collaborative tasks per batch, 10 concurrent batches total)
Key metrics: Total collaborative time, communication latency, overall completion rate, resource contention
Qualification standards: Total time ≤30s, completion rate ≥90%, no resource deadlock
Pressure mode: Mixed low/medium concurrency (50 concurrency for 24 hours, burst 100 concurrency every 2 hours)
Key metrics: Resource utilization trend (CPU/memory stability), cumulative error count, task completion rate fluctuation
Qualification standards: Memory fluctuation ≤10%, cumulative error rate ≤0.3%, TPS fluctuation ≤15%
AI Agent’s unique characteristics mean general tools alone are insufficient. Use a combination of "general tools for basics, custom development for gaps" to cover all metrics.
| Tool | Applicable Scenarios | Advantages | Notes |
|---|---|---|---|
| JMeter | Single-Agent HTTP/HTTPS interface pressure testing, multi-round interactions, tool calls | Full-featured, supports custom Groovy scripts/step pressure, plug-in expandable | Requires custom scripts for multi-Agent collaboration; secondary development for decision accuracy assertion |
| Locust | Distributed pressure testing, custom business scenarios | Python-based, easy to write pressure logic (multi-round interactions, tool links), supports distributed deployment | Weak visualization; pair with Prometheus monitoring |
| k6 | Lightweight pressure testing, cloud-native environments | Concise syntax, supports CI/CD integration, ideal for containerized Agent deployment | High customization cost for complex scenarios |
| Postman+Newman | Low-concurrency baseline testing, interface verification | Easy to use, perfect for early baseline metric collection | Does not support high-concurrency pressure testing |
General tools can’t measure thinking steps, decision accuracy, or other Agent-specific metrics — use these targeted solutions:
Count Thinking/Decision Metrics: Parse Agent logs or link tracing data to extract thinking steps, tool call times, and decision results. Compare with baseline results to calculate accuracy and invalid call rates.
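As a sketch of this log-parsing approach, the snippet below extracts thinking steps and tool-call statistics with regexes. The log format here is invented for illustration; adapt the patterns to your Agent's actual log schema:

```python
import re
from collections import Counter

# Sample log in an invented format: THINK lines for reasoning steps,
# TOOL lines for tool calls with status and elapsed time.
LOG = """\
2025-01-10 10:00:01 task=42 THINK step=1 plan decomposition
2025-01-10 10:00:02 task=42 TOOL name=weather_api status=ok elapsed_ms=340
2025-01-10 10:00:03 task=42 THINK step=2 synthesize answer
2025-01-10 10:00:03 task=42 TOOL name=weather_api status=error elapsed_ms=1200
2025-01-10 10:00:05 task=42 RESULT status=success
"""

think_steps = len(re.findall(r"\bTHINK step=\d+", LOG))
tool_calls = re.findall(r"TOOL name=(\w+) status=(\w+) elapsed_ms=(\d+)", LOG)

status_counts = Counter(status for _, status, _ in tool_calls)
success_rate = status_counts["ok"] / len(tool_calls)
avg_tool_ms = sum(int(ms) for *_, ms in tool_calls) / len(tool_calls)

print(f"thinking steps: {think_steps}")
print(f"tool success rate: {success_rate:.0%}, avg tool time: {avg_tool_ms:.0f} ms")
```

The same counters, aggregated per concurrency level, yield the thinking-efficiency and invalid-call metrics described earlier.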
Simulate Multi-Round/Collaboration Scenarios: Write Python/Java custom scripts to simulate user multi-round input and multi-Agent communication logic. Implement end-to-end task execution and count completion rates/total time.
Large Model Dependency Monitoring: Use large model platform built-in tools (OpenAI Dashboard, Alibaba Cloud Bailian monitoring) to collect interaction latency, success rate, and Token consumption.
Automated Assertion: Develop a "result validator" — send Agent execution results and baseline correct results to a large model to judge if task objectives are met. Solves the assertion problem for open-ended tasks.
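A minimal sketch of such a validator: package the objective, the baseline result, and the Agent's output into a judge prompt, then parse a PASS/FAIL verdict. `judge_llm` is a hypothetical callable standing in for any chat-model client:

```python
# "LLM-as-judge" result validator sketch. The prompt wording and the
# judge_llm interface are assumptions, not a specific vendor API.

JUDGE_TEMPLATE = """You are a strict test oracle.
Task objective: {objective}
Reference (known-good) result: {reference}
Agent result under test: {candidate}
Answer with exactly PASS if the agent result fulfils the objective
to the same standard as the reference, otherwise FAIL, then one reason."""

def validate(objective, reference, candidate, judge_llm):
    prompt = JUDGE_TEMPLATE.format(
        objective=objective, reference=reference, candidate=candidate
    )
    verdict = judge_llm(prompt)
    return verdict.strip().upper().startswith("PASS")

# Stub judge for demonstration; in practice this is a model call.
ok = validate(
    "design a weekend parent-child travel plan",
    "two-day plan with transport, meals, kid-friendly activities",
    "Day 1: zoo + picnic; Day 2: science museum with lunch nearby",
    judge_llm=lambda prompt: "PASS - covers both days and activities",
)
print(ok)
```

Pinning the judge model and using temperature 0 keeps verdicts reproducible enough to use as pressure-test assertions.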
Cover the Agent itself, dependent services, and servers — collect metrics in real time and visualize results:
Resource Monitoring: Prometheus+Grafana (mainstream), Zabbix — monitor CPU/memory/disk/network.
Link Tracing: SkyWalking, Jaeger — locate slow nodes in thinking, tool calls, and large model interactions.
Log Analysis: ELK, Loki — parse errors, thinking processes, and tool call records.
Custom Monitoring Panel: Use Grafana to aggregate basic metrics (RT/TPS/CPU) and specific metrics (thinking steps/tool success rate) for one-stop viewing.
Follow this standardized process to ensure test results are reliable and reproducible:
Baseline Testing: Run 1-concurrency pressure tests in the baseline environment. Collect baseline metrics for all indicators, confirm Agent functions and decision accuracy are normal — use as a comparison for subsequent tests.
Script Verification: Test pressure scripts under low concurrency (e.g., 10 users) to ensure metrics are collected completely and assertion logic is correct.
Graded Pressure Testing: Increase concurrency from low to high, run each level for a fixed time, record metrics, and identify the performance inflection point (concurrency threshold).
Specialized Pressure Testing: Focus on core scenarios (tool calls, multi-Agent collaboration) and specific metrics for in-depth testing.
Stability Testing: Run long-term mixed low/medium concurrency pressure tests to check for memory leaks and resource exhaustion.
Scaling Testing: Increase Agent instances to verify linear throughput growth and effective load balancing.
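The linearity check in the scaling step reduces to a quick efficiency calculation. A sketch with illustrative TPS figures; the 80% threshold is a common rule of thumb, not a standard:

```python
# Check whether throughput scales roughly linearly with instance count.
# TPS figures are invented measurements for illustration.

measured_tps = {1: 120, 2: 232, 4: 430, 8: 610}   # instances -> observed TPS

def scaling_efficiency(measured, base_instances=1):
    """Observed TPS divided by the ideal linear projection from the base run."""
    base = measured[base_instances] / base_instances
    return {n: tps / (base * n) for n, tps in measured.items()}

eff = scaling_efficiency(measured_tps)
for n, e in sorted(eff.items()):
    flag = "ok" if e >= 0.8 else "ineffective scaling"
    print(f"{n} instances: efficiency {e:.0%} ({flag})")
```

In this sample, efficiency collapses at 8 instances, which is exactly the "ineffective scaling" signal the step is meant to catch.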
Result Review: Compare metrics to thresholds, judge performance qualification, and identify bottlenecks.
Key Reminder: Clean up the environment after each pressure test (restart Agent, clear cache and redundant database data) to avoid residual impacts on subsequent results — reproducibility is critical for reliable testing.
AI Agent performance bottlenecks typically fall into a handful of recurring categories; use monitoring and logs to locate them quickly:
Long-term high CPU utilization (inefficient thinking logic or scripts)
Continuous memory growth (memory leaks, e.g., unreleased large text/context caches)
Insufficient network bandwidth (tool calls/large model interactions consume bandwidth)
No load balancing (single instance can’t handle high concurrency)
Slow dependent services (database delays, message queue blockages)
Slow large model response (accounts for 80%+ of total time — the most common issue)
Excessive/ineffective thinking steps
Low context processing efficiency (slow parsing with large windows)
Insufficient asynchronous tool calls (long serial multi-tool call time)
Cumbersome communication protocols/resource contention in multi-Agent collaboration
No context cropping strategy (slow response due to window expansion)
Optimize by following these principles: "resolve core bottlenecks first, then fine-tune details; balance performance and intelligence without losing decision accuracy". Optimize from top to bottom:
Use lightweight/local models for simple tasks; reserve cloud-based large-parameter models for complex tasks. Enable streaming responses and batch requests, streamline prompts to reduce Token consumption, and cache repeated request results. Adjust temperature and maximum generation length to balance speed and accuracy.
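Caching repeated request results can be sketched as a lookup keyed on a normalized prompt plus the parameters that affect the output. `call_model` is a hypothetical stand-in for the real model client, and only deterministic (temperature 0) requests are cached:

```python
import hashlib

_cache = {}

def cached_completion(prompt, call_model, temperature=0.0, model="small-local"):
    """Return a cached result for repeated deterministic requests.
    Sampling at temperature > 0 varies, so those requests bypass the cache."""
    if temperature > 0:
        return call_model(prompt)
    key = hashlib.sha256(
        f"{model}|{temperature}|{prompt.strip().lower()}".encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)
    return _cache[key]

calls = []
def fake_model(prompt):          # stub model that records invocations
    calls.append(prompt)
    return f"answer to: {prompt}"

cached_completion("1+2*3=?", fake_model)
cached_completion("  1+2*3=?  ", fake_model)   # normalized -> cache hit
print(len(calls))                              # model invoked only once
```

A production version would add TTL-based expiry (or back the dict with Redis, as suggested below for deployment-level caching).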
Streamline prompts to reduce thinking steps; solidify thinking paths for common tasks. Intelligently crop context (retain only key information) and cache core context. Execute thinking, tool calls, and result parsing asynchronously (parallel multi-tool calls). Set reasonable retry times and timeouts; degrade gracefully after failures (return default results/skip steps). Perform simple calculations/parsing locally (no large model/tool dependency).
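Parallel multi-tool execution with timeouts and graceful degradation can be sketched with asyncio; the tools here are simulated with sleeps:

```python
import asyncio

async def call_with_retry(tool, timeout_s=2.0, retries=1):
    """Run one tool with a per-call timeout and a bounded retry count.
    Returns None on final failure so the caller can degrade gracefully."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(tool(), timeout_s)
        except (asyncio.TimeoutError, RuntimeError):
            if attempt == retries:
                return None

# Simulated independent tools (e.g., two data lookups in a serial chain).
async def stock_price():
    await asyncio.sleep(0.05)
    return 101.3

async def exchange_rate():
    await asyncio.sleep(0.05)
    return 7.1

async def main():
    # gather() runs both calls concurrently: total time is roughly the
    # slowest call, not the sum of all calls.
    return await asyncio.gather(
        call_with_retry(stock_price),
        call_with_retry(exchange_rate),
    )

results = asyncio.run(main())
print(results)
```

Only tools without data dependencies between them can be parallelized this way; a chain like "query price → compute change rate" stays serial, but independent lookups inside each step can still overlap.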
Optimize tool API response speed (e.g., add database indexes, cache tool results). Call multiple tools in parallel and streamline call parameters. Use pooling for high-frequency tools and intercept invalid calls.
Build an Agent cluster for load balancing. Adjust server configurations to Agent characteristics (add CPU cores for CPU-intensive tasks, increase memory for memory-intensive tasks). Use Redis to cache common results and context. Implement K8s auto-scaling to adapt to concurrency fluctuations.
Use lightweight communication protocols (JSON/Protobuf) and split tasks to avoid repetition. Adopt asynchronous collaboration to reduce waiting. Resolve resource contention with distributed locks and cache intermediate results.
A useful test report isn’t just a data dump — it should support actionable decisions. Include these core sections:
Test Overview: Objectives, environment, test cases, tools used.
Baseline Metrics: Low-concurrency reference values for comparison.
Scenario Test Results: Display metrics by scenario (tables/charts), compare to thresholds, and mark qualification status.
Performance Inflection Points: Maximum concurrency, throughput peak — clarify the Agent’s maximum support capability.
Bottleneck Localization: List core bottlenecks with monitoring screenshots/log snippets; explain impact scope.
Optimization Suggestions: Actionable solutions for each bottleneck, with clear priorities.
Test Conclusion: Judge if the Agent meets online requirements; provide online suggestions (e.g., maximum concurrency limit, instance count).
Follow-Up Plan: Regression test scenarios and retest focus after optimization.
The key to AI Agent performance testing is balancing "basic metrics for availability" and "specific features for usability". Unlike traditional testing, it focuses heavily on intelligent characteristics like thinking efficiency and decision accuracy. Large model dependency is the most common bottleneck — optimize from the top-level large model down. In practice, design test cases around business scenarios and follow the "baseline → graded → specialized → stability" process to fully verify your AI Agent’s production readiness.
Traditional RPA testing focuses on process execution efficiency and stability. AI Agent testing adds assessments of intelligent features (thinking efficiency, decision accuracy, tool call rationality) — critical for verifying the Agent’s core value.
Slow large model response — it often accounts for 80%+ of total time consumption. Optimize by using lightweight models for simple tasks, streamlining prompts, and caching results.
Build a test environment fully aligned with production, isolate environments to avoid interference, follow a standardized execution process, and clean up the environment after each test to ensure reproducibility.
No. General tools handle basic metrics (response time, concurrency) but can’t measure Agent-specific metrics (thinking steps, decision accuracy). Use a combination of general tools and custom development for full coverage.