
Comprehensive Guide to LLM Performance Testing and Inference Acceleration

Learn how to perform professional performance testing on Large Language Models (LLM). This guide covers Token calculation, TTFT, QPM, and advanced acceleration strategies like P/D separation and KV Cache optimization.

1. Fundamentals of Large Language Model (LLM) Processing

What is a Token and How is it Calculated?

In the ecosystem of Large Language Models, a Token is the atomic unit of processing. Unlike human readers who process characters, LLMs interpret language through a "Tokenizer".

  • Tokenization Mechanism: The system breaks sentences into words or sub-words based on a predefined vocabulary list, where each entry has a unique ID.

  • Composition: A token can be a word, a punctuation mark, or even a single letter within a word (e.g., "R", "A", and "P" in "RAP").

  • Efficiency: Looking up sub-words in a fixed vocabulary keeps the token inventory compact, much like the brain retrieving familiar chunks, reducing memory and computational costs.

  • Billing and Limits: Commercial LLMs often bill based on tokens (e.g., 0.01 yuan per 1k tokens). Model parameters like "72B 32K/16K" specify the model's scale (72 billion parameters) and its maximum input/output token capacity.
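
The vocabulary-lookup mechanism above can be sketched as a greedy longest-match tokenizer. The vocabulary and IDs below are invented for illustration; real tokenizers (BPE, SentencePiece) learn far larger vocabularies from data.

```python
# Toy greedy longest-match tokenizer. VOCAB is an invented example
# vocabulary -- real tokenizers have tens of thousands of entries.
VOCAB = {"rap": 1, "ra": 2, "r": 3, "a": 4, "p": 5, "music": 6, " ": 7}

def tokenize(text, vocab=VOCAB):
    """Greedily match the longest vocabulary entry at each position."""
    text = text.lower()
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try longest match first
            piece = text[i:j]
            if piece in vocab:
                tokens.append((piece, vocab[piece]))
                i = j
                break
        else:
            i += 1                          # skip unknown characters
    return tokens

print(tokenize("RAP music"))
# [('rap', 1), (' ', 7), ('music', 6)]
```

Here "rap" is in the vocabulary, so it becomes one token; an out-of-vocabulary word would fall back to sub-word or single-letter pieces such as "r", "a", "p".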

Testing Implications for Tokens

  • Truncation and Errors: Exceeding token limits leads to system errors or automatic content truncation.

  • Performance Correlation: Latency grows with token count; a 100K-token prompt takes significantly longer to process than a 1K-token one.

  • Tooling: Testers should use libraries like transformers to pre-calculate the token length of test datasets.
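
A minimal sketch of that pre-calculation step: counting each prompt's tokens and grouping prompts into the input-size buckets used later for performance runs. The function accepts any object exposing an `encode(text) -> list` method, such as a Hugging Face tokenizer loaded with `AutoTokenizer.from_pretrained(...)` for the model under test.

```python
# Sketch: pre-computing token lengths of a test dataset and grouping
# prompts into size buckets (e.g. 16k-32k, 32k-48k). Pass any
# tokenizer with an .encode(text) -> list method, e.g. a Hugging Face
# tokenizer:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("<model under test>")

def bucket_by_tokens(prompts, tokenizer, bucket_size=16_000):
    """Map (lo, hi) token-count ranges to the prompts that fall in them."""
    buckets = {}
    for prompt in prompts:
        n_tokens = len(tokenizer.encode(prompt))
        lo = (n_tokens // bucket_size) * bucket_size
        buckets.setdefault((lo, lo + bucket_size), []).append(prompt)
    return buckets
```

Bucketing up front also catches prompts that would exceed the model's context limit before they trigger truncation errors mid-run.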

2. LLM Performance Testing Indicators vs. Traditional Web Products

Traditional performance testing focuses on QPS and Average Response Time. However, LLMs require specific metrics due to their streaming output nature.

Key Performance Metrics

  1. Time to First Token (TTFT): The latency between the user's request and the arrival of the first token. This is the most critical metric for user experience.

  2. Token Generation Rate (Token Throughput): Often called the "articulation rate," this measures tokens returned per second. High-performance deployments typically need to sustain at least 20 tokens/s under concurrency.

  3. QPM (Queries Per Minute): Because LLM responses take seconds to minutes, throughput is counted per minute rather than per second (QPS).

  4. Input/Output Token Magnitude: Performance must be aligned with data scale (e.g., grouping data into 16k-32k, 32k-48k buckets).
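
The first two metrics can be derived from the arrival timestamps of streamed chunks. A minimal sketch, assuming we record one `(timestamp, n_tokens)` pair per chunk relative to the request send time:

```python
# Sketch: computing TTFT and token generation rate for one streamed
# reply. `t_request` is when the request was sent; `arrivals` is one
# (timestamp, n_tokens) pair per received chunk, in arrival order.

def stream_metrics(t_request, arrivals):
    """Return (ttft_seconds, tokens_per_second) for one streamed reply."""
    if not arrivals:
        return None, None
    ttft = arrivals[0][0] - t_request
    total_tokens = sum(n for _, n in arrivals)
    duration = arrivals[-1][0] - t_request
    rate = total_tokens / duration if duration > 0 else float("inf")
    return ttft, rate

ttft, rate = stream_metrics(0.0, [(0.8, 1), (1.0, 4), (3.0, 55)])
# ttft = 0.8 s; 60 tokens over 3.0 s -> 20.0 tokens/s
```

QPM then follows by counting completed queries per 60-second window across all concurrent users.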

3. Stress Testing Methods and Tooling

Interface Protocols

Stress testing LLMs involves analyzing streaming interfaces such as SSE (Server-Sent Events), WebSockets, or the OpenAI SDK.

  • Packet Analysis: Testers must distinguish between different types of packets, including "thinking" packets (for reasoning models), answer packets, statistical packets, and heartbeats.

Reasoning Models (e.g., DeepSeek-R1)

  • Thinking Process: Reasoning models output their internal logic before providing an answer.

  • Metrics Adjustment: For these models, TTFT should be measured to the arrival of the first "thinking" packet rather than the first answer packet.
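
The packet classification described above can be sketched for OpenAI-style SSE streams. The field names follow the DeepSeek-R1 API convention (`reasoning_content` for thinking tokens, `content` for answer tokens); other vendors' schemas differ, so treat the exact keys as an assumption to verify per API.

```python
import json

# Sketch: classifying SSE lines from an OpenAI-compatible reasoning
# model. Assumes the DeepSeek-style `reasoning_content` field for the
# thinking stream; verify field names against the API under test.

def classify_chunk(line: str) -> str:
    if not line.startswith("data:"):
        return "heartbeat"                  # SSE comments / keep-alives
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return "done"
    chunk = json.loads(payload)
    if chunk.get("usage"):
        return "stats"                      # final token-usage packet
    delta = chunk["choices"][0]["delta"]
    if delta.get("reasoning_content"):
        return "thinking"                   # TTFT clock stops here
    if delta.get("content"):
        return "answer"
    return "other"
```

A stress-test client would run each incoming line through a classifier like this, stopping the TTFT timer on the first "thinking" or "answer" packet and accumulating token counts per category.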

Performance Tools: Locust and Boomer

  • Locust: Commonly used for stress testing, but requires custom functions to report TTFT and token rates.

  • Boomer: A Go-based worker for Locust, capable of generating very high load (e.g., 100,000 QPM) that the pure-Python Locust worker may struggle to achieve.

4. Deep Dive: Inference Acceleration and Optimization

The Prefill and Decode Stages

  1. Prefill Stage: Responsible for calculating the K/V matrices and generating the first token. Performance here defines the TTFT.

  2. Decode Stage: Pulls data from the KV Cache to output subsequent tokens. Performance here defines the Token Generation Rate.

Common Optimization Strategies

  • KV Cache Optimization: Storing cached data in GPU memory or system memory to speed up responses for identical or similar prefixes.

  • P/D Separation (Prefill/Decode Separation): Decoupling the prefill and decode stages into different instances to optimize them independently.

  • Model Quantization: Reducing storage precision (e.g., FP32 to FP16, INT8, or INT4) to decrease model size and increase speed, though this may impact accuracy.

  • Multi-Token Prediction (MTP): Using a small auxiliary model (or extra prediction heads) to propose several tokens per step, significantly increasing the token generation rate.
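
The quantization trade-off above is easy to quantify with back-of-envelope arithmetic. The figures below cover model weights only; real deployments add KV cache, activations, and framework overhead on top, so these are lower bounds rather than sizing advice.

```python
# Back-of-envelope weight memory for a 72B-parameter model at each
# storage precision. Weights only -- KV cache and activations come
# on top of these figures.

BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

def weight_gb(n_params, precision):
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for p in ("FP32", "FP16", "INT8", "INT4"):
    print(f"{p}: {weight_gb(72e9, p):.0f} GB")
# FP32: 288 GB, FP16: 144 GB, INT8: 72 GB, INT4: 36 GB
```

Halving precision halves the weight footprint, which is why INT8/INT4 can move a model from multi-node to single-node serving, at a potential cost in accuracy that must be re-benchmarked.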

5. Advanced Parallelism and Expert Systems

Parallel Computing Strategies

  • TP (Tensor Parallelism): Splitting individual weight matrices across multiple GPUs, each computing a slice of every layer's output.

  • DP (Data Parallelism): Replicating the full model and splitting incoming requests across the replicas.

  • PP (Pipeline Parallelism): Cutting the model into sequential stages, each placed on a different GPU.

  • EP (Expert Parallelism): Used in MoE (Mixture of Experts) architectures, where different "experts" (sub-networks) are assigned to different GPUs to handle specific domains.
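
Tensor parallelism can be illustrated with a toy matrix-vector product: shard a layer's weight rows across N simulated "GPUs," let each compute its output slice, and concatenate. This pure-Python sketch models none of the communication or overlap a real TP framework handles.

```python
# Toy tensor parallelism: shard a weight matrix's rows (output
# neurons) across n simulated "GPUs"; each computes its slice of the
# output, and the slices are concatenated -- the gather step a real
# TP implementation performs over NVLink/network.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def split_rows(W, n):
    """Shard W's rows across n devices as evenly as possible."""
    k, r = divmod(len(W), n)
    shards, i = [], 0
    for g in range(n):
        size = k + (1 if g < r else 0)
        shards.append(W[i:i + size])
        i += size
    return shards

def tp_matvec(W, x, n_gpus):
    """Compute W @ x with each 'GPU' handling one row shard."""
    out = []
    for shard in split_rows(W, n_gpus):     # runs in parallel on real HW
        out.extend(matvec(shard, x))
    return out
```

The sharded result matches the single-device product exactly; the engineering cost of TP lies in the per-layer communication, which this sketch omits.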

6. Practical Testing Scenarios and Best Practices

Gradual Pressure Increase (Step Load)

To simulate real-world traffic and avoid artificial TTFT spikes caused by all tasks queuing at once, ramp pressure up gradually (e.g., spawning 1-2 concurrent users per second via Locust's spawn-rate setting).
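
The ramp schedule itself is simple to make explicit. Locust realizes the same idea through its spawn-rate setting (or a custom LoadTestShape); this helper just shows how many users should be active at each point in the ramp.

```python
# Sketch of a step-load schedule: how many concurrent users are
# active t seconds into the run, ramping by `rate` users/second up to
# a cap. The parameter values are illustrative, not recommendations.

def users_at(t, start=0, rate=2, max_users=100):
    return min(start + int(rate * t), max_users)

schedule = [users_at(t) for t in range(0, 61, 10)]
# ramps to the cap and holds: [0, 20, 40, 60, 80, 100, 100]
```

Comparing TTFT during the ramp against TTFT at the plateau separates genuine capacity limits from queuing artifacts of an instantaneous start.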

Verification Requirements

  • Accuracy Testing: Every performance optimization must be validated against accuracy benchmarks (e.g., Math500) to ensure optimizations haven't degraded model quality.

  • Success Rate: Monitor for empty answer packets or truncated thinking processes.

  • Bypass (Shadow) Testing: Mirroring real online traffic to the test environment to verify performance under authentic user behavior.
