Customer Cases
Pricing

How to Test AI Products: A Complete Guide to Evaluating LLMs, Agents, RAG, and Computer Vision Models

A comprehensive guide to AI product testing covering binary classification, object detection, LLM evaluation, RAG systems, AI agents, and document parsing. Includes metrics, code examples, and testing methodologies for real-world AI applications.
 

Source: TesterHome Community


 

Table of Contents

 


 

Part 1: Why We Rarely Talk About “Bugs” in AI Models

The biggest difference between traditional software testing and AI testing is:

Traditional Software

AI Systems

Pursues logical correctness

Pursues statistical sufficiency and continuous optimization

The core competency of a professional AI tester isn’t just finding functional defects; it’s translating the uncertain behavior of a model into an observable, comparable, and trackable system of quality metrics.

Why “Bug” Is the Wrong Concept for AI

In the AI field, we rarely judge a model by saying it has a “bug.” Instead, we evaluate:

  • Whether its effectiveness is good
  • Whether metrics meet targets
  • Whether it satisfies business goals

This isn’t lowering the bar; it’s because the nature of models is fundamentally different:

  1. Models learn statistical patterns from historical data, not complete rule sets covering all world knowledge
  2. The world’s data is dynamic — new knowledge, expressions, and scenarios emerge daily
  3. No model can be 100% accurate, especially in open-ended scenarios

The Statistical Mindset for AI Testing

Evaluating a model based on 1-2 samples is unscientific:

  • A single hit might just be a lucky match with the training distribution
  • A single failure might just hit a known weakness of the model
  • A small sample cannot represent the whole or reflect the true boundaries of the model’s capability

AI testing must adhere to a statistical mindset:

  • Evaluate overall effectiveness with a sufficiently large dataset
  • Evaluate performance on different sub-scenarios using stratified data
  • Use consistent metrics for horizontal and vertical comparisons

Key Principle: AI testing isn’t about finding one wrong example to prove a model is bad, but about using large amounts of data to prove whether the model is reliable for the business.

The AI Testing Process

 

 

1. Deconstruct Business Scenarios → Identify model type
2. Define Metrics → Based on model type and business goals
3. Collect Data → Large, diverse datasets, stratified by user personas
4. Label Data → Create ground truth (strict isolation from dev teams!)
5. Calculate Metrics → Automated scripts using labeled data

 

Critical Note: Test data must be strictly isolated from algorithm and development teams. Developers could overfit the model to your specific test data, making it perform well in testing but terribly in production.

 

Part 2: A High-Level Look at Metrics – Match the Scenario, Then Choose the Metric

Since evaluation relies on large datasets, you must choose metrics that match the scenario. Choosing the wrong metric leads to inaccurate conclusions.

2.1 Classification Scenarios (Binary & Multi-class)

Definition: The output is a discrete, enumerable category.

Typical Scenarios:

  • Fraud detection (credit card fraud)
  • Spam email detection
  • Intent recognition (weather, order, small talk, knowledge retrieval)
  • Facial identity recognition

Core Metrics:

  • Precision – Of samples predicted positive, how many were truly positive?
  • Recall – Of all actual positives, how many were correctly identified?
  • F1 Score – Harmonic mean of precision and recall
  • AUC – Especially suitable for binary classification with probability output

2.2 Regression Scenarios

Definition: The output is a continuous numerical prediction.

Typical Scenarios:

  • Housing price prediction
  • Sales forecasting
  • Dynamic pricing
  • Duration/risk score prediction

Common Metrics:

Metric

Description

MAE

Mean Absolute Error – intuitive and easy to interpret

MSE

Mean Squared Error – penalizes larger errors more heavily

RMSE

Root Mean Squared Error – heavy penalty on large errors, interpretable units

MAPE

Mean Absolute Percentage Error – good for relative error

SMAPE

Symmetric MAPE – improves MAPE’s instability near zero

Coefficient of Determination – explains variance

Median AE

Median Absolute Error – robust to outliers

Practical Advice: Use at least one absolute error metric + one relative error metric + one robustness metric.

2.3 Composite Scenarios (Classification + Regression)

Definition: Tasks involving both “category determination” and “numerical localization/regression.”

Typical Scenario: Object detection in computer vision

  • Classification: Is there an object? What class?
  • Regression: Bounding box coordinates (x, y, w, h)

Core Metrics: Combine classification and localization metrics, such as Precision/Recall paired with IoU.

2.4 Text Scenarios

Definition: Tasks focused on text consistency and correctness.

Typical Scenarios:

  • OCR (Optical Character Recognition)
  • ASR (Automatic Speech Recognition) transcription
  • Text translation
  • Document parsing

Common Metrics:

  • CER (Character Error Rate)
  • WER (Word Error Rate)
  • Levenshtein Distance (Edit Distance)
  • Text Similarity (character-level, word-level, semantic)

2.5 Generative Model Scenarios

Definition: Outputs are open-ended with large answer spaces.

Typical Scenarios:

  • Single-modal large model Q&A
  • Multi-modal Q&A (images, documents, video)
  • Agent task execution

Complexities:

  • No single standard answer
  • Need to distinguish objective correctness from subjective quality
  • Must incorporate safety, compliance, and stability evaluations

 

Part 3: Deep Dive into Binary Classification – Why Accuracy Isn’t Enough

3.1 The Value and Limitations of Accuracy

Accuracy is simple but can be misleading in imbalanced class scenarios.

Classic Example: Disease Screening

Metric

Value

Total samples

10,000 people

Healthy

9,500

Sick

500

Model prediction

Everyone as “healthy”

Accuracy

95% (9,500 correct predictions out of 10,000)

But the model missed ALL sick individuals — a complete business failure.

Conclusion: Accuracy can be observed, but it shouldn’t be the sole core decision-making metric.

3.2 Confusion Matrix: The Foundation of Classification Testing

The confusion matrix compares predictions against ground truth:

 

Predicted Positive

Predicted Negative

Actual Positive

TP (True Positive)

FN (False Negative)

Actual Negative

FP (False Positive)

TN (True Negative)

 

Example: Disease Diagnosis

 

Predicted Sick

Predicted Healthy

Total

Actually Sick

25 (TP)

5 (FN)

30

Actually Healthy

3 (FP)

67 (TN)

70

Total

28

72

100

 

3.3 Precision

Formula:

 

\text{Precision} = \frac{TP}{TP + FP}

Meaning: Of samples predicted as positive, how many were truly positive?

Example Calculation:

 

\text{Precision} = \frac{25}{25 + 3} = \frac{25}{28} = 0.893 = 89.3%

Interpretation: Of 28 cases predicted as ‘sick’, 25 were actually sick.

Applicable Scenarios:

  • High cost of false positives (e.g., wrongly banning a user)
  • “Better to miss some than to be wrong too often”

3.4 Recall

Formula:

 

\text{Recall} = \frac{TP}{TP + FN}

Meaning: Of all actual positives, how many were correctly identified?

Example Calculation:

 

\text{Recall} = \frac{25}{25 + 5} = \frac{25}{30} = 0.833 = 83.3%

Interpretation: Of 30 truly sick people, the model found 25. 5 were missed.

Applicable Scenarios:

  • High cost of false negatives (e.g., missed disease diagnosis)
  • “Better to flag more than to miss a critical risk”

3.5 F1 Score: Balancing Precision and Recall

Formula:

 

F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

F1 is the harmonic mean of Precision and Recall, penalizing models skewed toward one metric.

Applicable Scenarios:

  • When both false positives and false negatives matter
  • When a balanced overall performance is needed
  • As a comprehensive metric for comparing model versions

3.6 Thresholds and Business Goals

For the same model, Precision and Recall trade off against each other as the decision threshold changes.

Testers should work with business stakeholders to determine:

  • Which type of error has higher cost?
  • Which metric is primary?
  • Is F1 the balanced metric or a minimum acceptance criterion?

 

Part 4: Testing Object Detection Systems

Object detection is a classic composite evaluation task in computer vision.

4.1 The Nature of the Task

Object detection answers three questions:

  1. Is there an object in the image?
  2. What class is the object?
  3. Where is it located? (bounding box coordinates)

Evaluation Must Cover:

  • Classification: Is the object class correct?
  • Localization: Is the bounding box accurate?

4.2 IoU: The Core Metric for Localization

IoU (Intersection over Union) measures overlap between predicted and ground truth bounding boxes:

 

\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}

  • More overlap = higher IoU
  • IoU ≥ threshold (e.g., 0.5) + correct class = correct detection
  • Common thresholds: 0.5, 0.75, or evaluated in steps from 0.5 to 0.95

4.3 Data Collection Considerations

Instead of user personas, differentiate based on business-relevant attributes:

Dimension

Examples

By Size

Large objects vs. small objects

By Scenario

Daytime vs. nighttime, occluded vs. clear, backlit vs. well-lit

By Distractors

Test with similar objects (e.g., bicycles when detecting scooters)

 

4.4 Integration with Peripheral Systems

An AI product is rarely just a model. The model is the core, but it’s supported by a whole system.

Example: Vehicle Illegal Parking Detection

 

 

Camera Stream → Frame Extraction → Object Detection → ROI/AOU Logic → Alert

 

Key System Components:

Component

Description

ROI (Region of Interest)

Manually defined parking zones

AOU (Area Overlap)

Minimum overlap to count as “parked”

Temporal AOU

Track position across frames to distinguish parked vs. passing vehicles

Critical Insight: We test the AI product, not just the model. The model must be tested, but so must the effectiveness of the integrated business system.

4.5 End-to-End Testing Strategy

Data Collection Using Pre-labeling:

Step

Description

1

Use ffmpeg to pull RTSP streams and extract frames

2

Use a model with lowered confidence threshold (e.g., 0.4 vs. business threshold 0.8) to filter candidates

3

Human testers perform secondary screening and labeling

 

Simulating Video Streams:

 

 

ffmpeg -re -stream_loop 100 -i video.mp4 \
  -rtsp_transport tcp -c:v libx264 \
  -f rtsp rtsp://localhost:8554/mystream

 

This pushes a video file as an RTSP stream to a media server, allowing end-to-end testing without physical cameras.

 

Part 5: Testing Recommendation Systems

5.1 Reframing the Problem

Recommendation systems are often reframed as ranking problems:

Step

Description

1

Candidate Generation — Filter items based on user profile

2

Binary Classification — Predict click probability for each candidate

3

Scoring & Ranking — Rank by probability, present Top N

 

5.2 Key Metrics for Retrieval

Metric

Description

Top N Recall

Proportion of items the user liked that appear in Top N recommendations

mAP (Mean Average Precision)

Considers both relevance and order of retrieved items

 

5.3 The Self-Learning Flywheel

Recommendation systems require high-frequency self-learning cycles:

 

 

Data Collection → Model Training → A/B Test → Deployment → Data Reflow

 

Why This Matters for Testing:

Factor

Implication

Automatic labeling

A click is a label — no manual annotation

Rapidly changing interests

User preferences shift with news and trends

Time sensitivity

No time for offline testing — models become stale quickly

Mitigation strategy

A/B testing and real-time monitoring become critical

Real-World Example: A classmate working on ByteDance’s ad system (annual compensation >1.35M RMB) focuses entirely on A/B testing, data quality monitoring, and real-time metric alerting — there’s no traditional testing role.

 

Part 6: OCR and Document Parsing Evaluation

6.1 What Is Document Parsing?

Document parsing converts various document formats (PDF, Word, Excel, images) into structured text. This is the first and most critical step in building a knowledge base for RAG (Retrieval-Augmented Generation).

6.2 Why Document Parsing Matters

Reason

Impact

Data Entry Point

Over 90% of enterprise knowledge resides in documents

Foundation of Accuracy

If parsing is wrong, everything downstream is wrong

Cost Savings

Automates extraction, reducing up to 80% of manual entry costs

Efficiency Gains

Batch processes thousands of pages in minutes

 

6.3 Common Document Type Challenges

Document Type

Difficulty

Key Challenges

Word

⭐⭐

Nested tables, complex styles

PDF

⭐⭐⭐⭐

Scanned copies, multi-column layouts, watermarks

Excel

⭐⭐⭐

Formulas, merged cells, multiple sheets

Image

⭐⭐⭐⭐⭐

Noise, handwriting, skew, blur

 

6.4 Testing Metrics for Document Parsing

Metric

Description

Target

Text Similarity

How close parsed text is to ground truth

>95%

Table Accuracy

Completeness and correctness of table recognition

>90%

Formula Accuracy

Correctness of mathematical formula recognition

>85%

Layout Fidelity

How well document structure is preserved

>90%

 

6.5 Text Similarity Calculation (Levenshtein Distance)

 

 

def levenshtein_distance(s1, s2):
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1]) + 1
    return dp[m][n]

def text_similarity(text1, text2):
    distance = levenshtein_distance(text1, text2)
    max_len = max(len(text1), len(text2))
    return 1 - (distance / max_len) if max_len > 0 else 1.0

 

 

Part 7: Large Language Model Evaluation

10 Core Evaluation Categories for LLMs

#

Category

Description

1

Instruction Following & Execution

Can the model follow basic and complex instructions?

2

Knowledge Q&A & Understanding

Factual accuracy across domains

3

Logical Reasoning & Mathematics

Multi-step reasoning and math problem solving

4

Code Generation & Understanding

Writing, debugging, and explaining code

5

Role-Playing & Emotional Intelligence

Maintaining character and handling emotional contexts

6

Creative Writing & Generation

Stories, poems, marketing copy

7

Multi-turn Dialogue & Context

Maintaining context across conversations

8

Multi-modal Understanding

Images, charts, video comprehension

9

Safety & Ethics

Refusing harmful content, bias detection

10

Professional Domain Applications

Legal, medical, financial expertise

 

7.1 Recommended Datasets by Category

Category

Key Datasets

Instruction Following

Alpaca, FLAN, Super-NaturalInstructions

Knowledge Q&A

MMLU, C-Eval, CMMLU, TriviaQA

Reasoning & Math

GSM8K, MATH, HellaSwag, LogiQA

Code Generation

HumanEval, MBPP, CodeXGLUE

Role-Playing

EmpatheticDialogues, PersonaChat

Multi-modal

VQA v2, COCO, TextVQA, ChartQA

Safety & Ethics

RealToxicityPrompts, BOLD, TruthfulQA

 

7.2 Prompt Injection Attack Testing

Prompt injection is a core security testing area for LLMs. Attackers craft inputs to make models forget safety rules.

Attack Vector Categories:

Attack Type

Description

Basic Jailbreak (DAN)

Directly ask the model to ignore restrictions

Role-Playing

Act as a character with no restrictions

System Prompt Override

Forge system messages to overwrite original instructions

Context Poisoning

Embed malicious instructions into retrieved documents

Multi-Turn Attacks

Gradually lower the model’s guard through conversation

Encoding Obfuscation

Use Base64, ROT13, etc., to hide prompts

Indirect Injection

Hide instructions in external data sources (websites, APIs)

 

Defense Evaluation Checklist:

Check

Description

☐ Recognition

Does the model identify malicious input?

Refusal

Does it explicitly refuse to execute?

Explanation

Does it explain why it’s refusing?

No Information Leakage

Does it avoid leaking sensitive info?

Consistency

Is defense consistent across multi-turn dialogues?

 

7.3 Using an AI Judge for Automated Evaluation

An AI Judge (another LLM) can automate evaluation against predefined rules:

 

 

{
  "system": "You are an expert AI evaluator...",
  "user": "User Question: {query}\n\nAI Answer: {model_output}",
  "output_format": {
    "dimension_scores": {
      "accuracy": 4,
      "relevance": 5,
      "completeness": 4,
      "fluency": 5,
      "usefulness": 4
    },
    "overall_score": 4.4,
    "strengths": [...],
    "weaknesses": [...]
  }
}

 

 

Part 8: AI Agent Testing

8.1 What Is an AI Agent?

An AI Agent is a system that can perceive its environment, make decisions, and take actions to achieve goals:

Capability

Description

Tool Calling

Search engines, calculators, databases, APIs

Task Execution

Book flights, query databases, analyze data

Multi-step Reasoning

Break down complex problems

Autonomous Planning

Create and execute action plans

 

8.2 Core Components of an Agent System

 

 

┌─────────────────────────────────────────────┐
│              AI Agent Architecture           │
├─────────────────────────────────────────────┤
│  ┌─────────┐    ┌──────────┐                │
│  │  User   │───▶│ Dialogue │                │
│  └─────────┘    │ Manager  │                │
│                 └──────────┘                │
│       ┌────────────┴────────────┐           │
│       ▼                         ▼           │
│  ┌─────────┐               ┌─────────┐      │
│  │   LLM   │◀─────────────▶│Knowledge│      │
│  │         │               │   Base  │      │
│  └─────────┘               └─────────┘      │
│       │                         │           │
│       ▼                         ▼           │
│  ┌─────────┐               ┌─────────┐      │
│  │  Tools  │               │ Vector  │      │
│  │         │               │ Retrieval│     │
│  └─────────┘               └─────────┘      │
└─────────────────────────────────────────────┘

 

8.3 Key Testing Areas for Agents

Area

Description

Intent Recognition

Correctly understand user goals and choose the right tool

Tool Testing

Test each tool’s integration, input/output, and call correctness

Context Engineering

Manage long-term memory and avoid context confusion across turns

Knowledge Base (RAG)

Document parsing, chunking, embedding, retrieval accuracy

 

8.4 RAG Deep Dive

RAG = Retrieval + Generation — allowing LLMs to “look up” information in real-time.

Why RAG Matters:

Problem

How RAG Solves It

Outdated knowledge

Retrieves current information from knowledge base

Lack of private data

Accesses enterprise documents and internal sources

Hallucinations

Grounds answers in retrieved context

The RAG Pipeline:

 

 

User Question → Retrieve Relevant Documents → Build Prompt → LLM → Answer

 

8.5 Embedding & Semantic Search

Embedding converts text into dense vectors (digital fingerprints) that capture semantic meaning. Similar texts have similar vectors.

Cosine Similarity measures vector similarity:

 

\text{Cosine Similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{|\mathbf{A}| \times |\mathbf{B}|}

 

Testing the Retrieval Pipeline:

Aspect

Details

Key Metric

Top N Recall — of pre-labeled relevant chunks, how many appear in Top N retrieved results?

Tooling

Automated scripts using scikit-learn to compute cosine similarity and mAP

 

Summary & Key Takeaways

The AI Testing Mindset

Traditional Testing

AI Testing

Find bugs

Evaluate statistical effectiveness

Single test cases

Large, diverse datasets

Binary pass/fail

Metrics-based evaluation

Logic verification

Continuous optimization

 

Core Principles

  1. Match metrics to scenarios — classification, regression, generation each need different metrics
  2. Think statistically — evaluate on large datasets, not individual examples
  3. Isolate test data — prevent overfitting by keeping test data from development teams
  4. Test the system, not just the model — peripheral systems are critical
  5. Automate where possible — use pre-labeling, AI judges, and monitoring systems

Career Progression for AI Testers

Level

Skills

Entry

Understand metrics and evaluation

Intermediate

Data collection, cleaning, statistical analysis

Advanced

Deep AI principles, performance testing, monitoring systems

Expert

System architecture, A/B testing frameworks, real-time alerting

Real-World Insight: Top AI testers at companies like ByteDance focus entirely on A/B testing frameworks and real-time monitoring systems, with compensation exceeding 1.35M RMB annually.

 

 

 

Latest Posts
1How to Test AI Products: A Complete Guide to Evaluating LLMs, Agents, RAG, and Computer Vision Models A comprehensive guide to AI product testing covering binary classification, object detection, LLM evaluation, RAG systems, AI agents, and document parsing. Includes metrics, code examples, and testing methodologies for real-world AI applications.
2How to Enhance Your Performance Testing with PerfDog Custom Data Extension Discover how to integrate PerfDog Custom Data Extension into your project for more accurate and convenient performance testing and analysis.
3Mobile Game Performance Testing in 2026: Complete Guide with PerfDog Insights from Tencent’s Founding Developer Master mobile game optimization with insights from PerfDog’s founding developer. Learn to analyze 200+ metrics including Jank, Smooth Index, and FPower. The definitive 2026 guide for Unity & Unreal Engine developers to achieve 120FPS and reduce battery drain.
4Hybrid Remote Device Management: UDT Automated Testing Implementation at Tencent Learn how Tencent’s UDT platform scales hybrid remote device management. This case study details a 73% increase in device utilization and WebRTC-based automated testing workflows for global teams.
5How AI Is Reshaping Software Testing Processes and Professional Ecosystems in 2026 Discover how AI is reshaping software testing processes and careers in 2026. Learn key trends, emerging roles, and essential skills to thrive in the AI-driven QA landscape.