Backend Automated Testing & CI/CD: A Complete Guide

Learning Hub 2026-05-27 11:46

Learn backend automated testing and CI/CD practices from a real project. Improve testability, write effective tests, and achieve continuous deployment.

Source: TesterHome Community

Introduction
1. Improving Testability
2. Automated Testing
3. Continuous Integration and Continuous Deployment
4. Summary
Terminology

Introduction

As DevOps practices spread, shift-left testing and developers owning quality have taken hold in many engineering teams. This article walks through the DevOps journey of a real-world project — LogReplay — to show how improving testability, embracing automated testing, and building a solid CI/CD pipeline lead to high-quality, continuous, and fully automated deployment of backend microservices.

Shift-left testing is a key part of developers taking genuine ownership of quality under DevOps. One effective tactic is to run meaningful automated tests early and often throughout development, so that issues are caught and feedback arrives as soon as possible. Section 2 covers this in detail.

Software testability is the foundation of high-quality, high-efficiency delivery. Poor testability raises testing costs, makes results harder to verify, and discourages developers from testing — or pushes testing later in the cycle. Improving testability must come before any serious automation effort. See Section 1.

With thorough automated testing, slow and error-prone manual validation becomes unnecessary. By plugging tests directly into a CI/CD pipeline, teams trigger builds and tests immediately after code commits, promote artifacts through environments only when all tests pass, and release to production automatically. Section 3 covers CI/CD.

1. Improving Testability

1.1. What Is Testability?

Testability measures how easy it is to test a software system. Poor testability drives up costs, makes results hard to verify, and leads engineers to skip testing or push it later in the cycle.

Common Testability Issues

At the API test level:

Lack of detailed design documentation — Without a contract defining expected behavior, teams waste time on communication, argue over results, and struggle with validation. Even when documentation exists, it must stay up-to-date or it becomes misleading.
High cost of building mocks — In a microservice architecture, difficult and expensive mock creation makes testing impossible or prohibitively costly.
Difficult result validation — Even when an API call succeeds, obtaining verification points to confirm expected behavior remains hard.
No idempotency — Internal logic that depends on uncontrolled factors (time, unpredictable inputs, background jobs, random variables) makes identical requests return different results, which breaks idempotency and makes assertions unreliable.
Overly complex parameters exposing internal details — Many internal parameters should never appear in an API signature. Good API design keeps the surface simple.
Custom proprietary protocols — Non-standard private protocols make testing harder because general-purpose tools cannot interact directly.

At the code-under-test level:

Calling private functions — In code-level tests, private functions cannot be invoked directly.
Accessing private variables — Without read or modify access to private variables, result validation becomes impossible.
Functions that do too much — A function implementing several features at once is harder to test: more parameters mean harder setup, more validation points, and risk of combinatorial explosion.
Complex dependencies — Code depends on external systems or uncontrollable components — third-party services, network calls, databases.
Poor readability — Clever tricks and obscure constructs, especially without comments, make code hard to read and test.
Code duplication — Duplicate logic means duplicate test burden. Changes require testing in every location the logic appears.
High cyclomatic complexity — High complexity generally means high testing cost.
Missing hooks or injection points — Lack of reserved hooks or injection points makes debugging and extending harder later.

1.2. How to Improve Testability

1.2.1. Improving Observability

Observability means how easily a program’s behavior, inputs, and outputs can be observed — how easily external systems obtain important state and information.

Every operation or input should produce a clear, predictable response or output. That output must be both visible and queryable. Invisible or unqueryable means undiscoverable, harming observability and therefore testability.

Visibility starts with output. Improve observability by emitting more — structured event logs, distributed tracing information, aggregated metrics. Provide testability interfaces to expose internal state and report system self-checks. When something goes wrong, output should be easy to recognize through automated log analysis or UI highlighting.

In our project, we focused on:

1) Converging API return status codes

More downstream dependencies mean more potential failure points. Direct dependencies add failure points linearly. Indirect dependencies multiply them. Passing every downstream error verbatim to the client is impractical — clients rarely understand all errors or know how to react differently. Status codes must be converged.
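One way to picture status code convergence is a small mapping layer at the service boundary. The sketch below is illustrative only: the code values and the `Converge` function are hypothetical names introduced here, not part of the LogReplay project or of tRPC.

```go
package main

import "fmt"

// Client-facing codes. The values are hypothetical, chosen for this sketch.
const (
	CodeOK           = 0
	CodeInvalidInput = 10001
	CodeUnavailable  = 10002
	CodeInternal     = 10003
)

// downstreamToClient maps hypothetical downstream codes to the converged set.
var downstreamToClient = map[int]int{
	20017: CodeInvalidInput, // e.g. a validation failure in one dependency
	20042: CodeInvalidInput, // a different validation failure, same client action
	30005: CodeUnavailable,  // e.g. an overloaded storage dependency
	30009: CodeUnavailable,
}

// Converge returns the client-facing code for a downstream code.
// Unknown codes collapse into CodeInternal so raw downstream codes never leak.
func Converge(downstream int) int {
	if downstream == CodeOK {
		return CodeOK
	}
	if c, ok := downstreamToClient[downstream]; ok {
		return c
	}
	return CodeInternal
}

func main() {
	for _, c := range []int{0, 20017, 30009, 77777} {
		fmt.Printf("downstream %d -> client %d\n", c, Converge(c))
	}
}
```

The original downstream code is still worth logging before it is converged, since it is what pinpoints the failure during diagnosis.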

2) Always propagating failures upstream

The upstream caller does not need the exact failure point — end-to-end return information may lack precision. But it must receive the failure. Swallowing failures internally leaves callers unsure if the request succeeded or what action to take.

In tRPC services, an error consists of a code and a msg string. Use the framework’s errs.New to return both. If a downstream service returns an error without errs.New, the upstream receives code 999.

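// errCodeHelloFailed is a placeholder business error code (business codes are typically > 10000).
const errCodeHelloFailed = 10001

func (s *helloServerImpl) SayHello(ctx context.Context, req *pb.HelloRequest, rsp *pb.HelloReply) error {
    if failed { // placeholder condition: the business logic failed
        // Return both a code and a message so the upstream caller sees the failure.
        return errs.New(errCodeHelloFailed, "your business error message")
    }
    return nil // success
}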

3) Integrating distributed log collection

Finding the exact failure point requires logs. Record failure points with logging, different error messages under the same error code, or distributed tracing. Distributed log collection maximizes diagnostic information retention. For example, configure tRPC services to report logs to a centralized system like Kibana.

4) Integrating a distributed tracing system

Status codes and messages are client-oriented. They may lack precision for failure location. Distributed tracing is immensely valuable. Any modern backend system should implement OpenTelemetry — its universal protocol ensures wide tool support. Every serious developer should understand tracing. When debugging a tough production issue, you will appreciate it.

After integrating tracing with an OpenTelemetry backend, print the Trace ID to test logs during API and end-to-end tests. When a test fails, use that Trace ID to pinpoint the root cause quickly.
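As a rough illustration of getting a Trace ID into test output, the sketch below builds and parses a W3C `traceparent` header (the propagation format OpenTelemetry uses by default) with the standard library only. The helper names are hypothetical, and a real test would normally take the ID from the OpenTelemetry SDK span rather than generate it by hand.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"strings"
)

// newTraceparent returns a header value of the form "00-<trace id>-<parent id>-01".
func newTraceparent() (string, error) {
	buf := make([]byte, 24) // 16-byte trace ID followed by an 8-byte parent (span) ID
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return fmt.Sprintf("00-%s-%s-01",
		hex.EncodeToString(buf[:16]), hex.EncodeToString(buf[16:])), nil
}

// traceIDFromTraceparent extracts the 32-hex-character trace ID from the header.
func traceIDFromTraceparent(h string) (string, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return "", fmt.Errorf("malformed traceparent %q", h)
	}
	return parts[1], nil
}

func main() {
	tp, err := newTraceparent()
	if err != nil {
		panic(err)
	}
	id, err := traceIDFromTraceparent(tp)
	if err != nil {
		panic(err)
	}
	// In a test, this line would go to the test log next to the failing assertion.
	fmt.Println("trace_id=" + id)
}
```

A test would send the header with its request and log the trace ID, so a failed run can be looked up directly in the tracing backend.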

1.2.2. Improving Understandability

Understandability means how easily information about the system-under-test can be obtained, how complete that information is, and how easy it is to comprehend. For example, does the system have documentation, and is that documentation readable and up-to-date?

Key aspects include:

User documentation (manuals), engineering documentation (design docs), source code, comments, and quality information (test reports)
Documentation, processes, code, comments, and messages that are easy to understand
Whether the system has a single, clearly defined task (separation of concerns)
Whether behavior is deterministic and predictable
Whether design patterns are well-understood and follow industry conventions

Our practical experience in this area remains limited.

1.2.3. Improving Controllability

Controllability means how easy it is to control a program’s behavior, inputs, and outputs — whether the system-under-test can be set to a desired state for testing. Highly controllable systems are easier to test and automate.

Controllability includes:

Business level: Processes and scenarios should break down easily, allowing segmented control and validation. Define reasonable decomposition points for complex flows.
Architecture level: Modular design allows independent deployment and testing of modules with good isolation for easy mocking.
Data level: Test data itself must be controllable — building diverse datasets at low cost for different scenarios.
Technical implementation level: Provide ways to directly or indirectly control state and variables externally, facilitate API calls, access private functions and variables, enable runtime injection and lightweight instrumentation, and use techniques like AOP or framework filters (e.g., tRPC-filter).

To improve middleware isolation and test data construction, we implemented:

1) Using naming services for addressing

In a microservice architecture, fixed ip:port addressing for middleware is inflexible — it cannot handle scaling or cluster management. Use naming services with uniform addressing via namespace + env, eliminating per-environment ip:port configuration.
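The idea of addressing by namespace + env instead of ip:port can be sketched with an in-memory registry. `Registry`, `Register`, and `Resolve` are hypothetical names for this illustration; a real naming service also handles health checks, load balancing, and dynamic updates.

```go
package main

import "fmt"

// key identifies a set of instances by namespace, environment, and service name.
type key struct{ Namespace, Env, Service string }

// Registry is an in-memory stand-in for a naming service.
type Registry struct{ endpoints map[key][]string }

// NewRegistry returns an empty registry.
func NewRegistry() *Registry { return &Registry{endpoints: map[key][]string{}} }

// Register adds one instance address for the service.
func (r *Registry) Register(ns, env, svc, addr string) {
	k := key{ns, env, svc}
	r.endpoints[k] = append(r.endpoints[k], addr)
}

// Resolve returns the addresses registered for the service in the given namespace and env.
func (r *Registry) Resolve(ns, env, svc string) ([]string, error) {
	addrs := r.endpoints[key{ns, env, svc}]
	if len(addrs) == 0 {
		return nil, fmt.Errorf("no instances for %s/%s/%s", ns, env, svc)
	}
	return addrs, nil
}

func main() {
	r := NewRegistry()
	r.Register("Development", "test", "redis.cache", "10.0.0.5:6379")
	r.Register("Production", "formal", "redis.cache", "10.1.0.7:6379")
	r.Register("Production", "formal", "redis.cache", "10.1.0.8:6379")

	// The caller's code is identical in every environment; only namespace and env differ.
	fmt.Println(r.Resolve("Development", "test", "redis.cache"))
	fmt.Println(r.Resolve("Production", "formal", "redis.cache"))
}
```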

2) Standardizing access clients

Use a consistent internal middleware client module (e.g., trpc-database). Benefits include covering most middleware types, reducing bugs from feature/usage variations across community implementations, providing built-in observability (monitoring, tracing), and allowing filters for flexible traffic manipulation like route modification.

3) Physically isolating middleware instances between production and test environments

Strictly separate middleware used in production (Production), baseline development (Development), and automated testing environments. Physical isolation is the only reliable way to prevent test behavior from affecting production.

We have further work to do on controllability and will share more as we gain experience.

2. Automated Testing

2.1. Overview

In a microservice architecture, testing typically has three levels:

End-to-End (E2E) Testing — Covers the whole system, integrating multiple services, usually simulating user actions through the access layer.
API Testing — Tests service interfaces in isolation.
Unit Testing — Tests individual code units.

Ease of implementation increases from E2E tests down to unit tests, but the confidence a passing test gives about the whole system decreases. E2E tests are the most expensive but provide the highest confidence when they pass. Unit tests are the easiest and fastest but cannot guarantee that the whole system works correctly.

No silver bullet exists. All three levels must be combined.

The real question: when should each type be written, and how many?

2.2. Writing Tests

Our practice suggests:

Write E2E tests for core functional scenarios defined as:
- Main flows (failure would block users)
- Certain critical flows (failure would cause significant loss)
Write API tests for all externally exposed service interfaces:
- For existing services: interfaces in the top 60% of call volume
- For new services: every external interface
Write unit tests for exported functions:
- For existing code: exported functions in packages being refactored or heavily changed
- For new code: all exported functions
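The "top 60% of call volume" rule for existing services can be turned into a small selection routine. The sketch below assumes one particular reading: take the busiest interfaces until they account for at least 60% of all calls. `Endpoint` and `TopByVolume` are hypothetical names, and the numbers are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// Endpoint is one service interface with its observed call count.
type Endpoint struct {
	Name  string
	Calls int
}

// TopByVolume returns the busiest endpoints whose cumulative calls
// reach at least the given share (0..1) of total traffic.
func TopByVolume(eps []Endpoint, share float64) []string {
	sorted := append([]Endpoint(nil), eps...) // copy so the caller's slice is untouched
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Calls > sorted[j].Calls })

	total := 0
	for _, e := range sorted {
		total += e.Calls
	}

	var picked []string
	cum := 0
	for _, e := range sorted {
		if total == 0 || float64(cum) >= share*float64(total) {
			break
		}
		picked = append(picked, e.Name)
		cum += e.Calls
	}
	return picked
}

func main() {
	eps := []Endpoint{
		{"GetLog", 500}, {"ListLogs", 300}, {"DeleteLog", 150}, {"ExportLog", 50},
	}
	fmt.Println(TopByVolume(eps, 0.6)) // [GetLog ListLogs]
}
```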

2.2.1 Writing Unit Tests

We write unit tests both by hand and with tools such as TestOne that auto-generate test cases. Manual techniques are well covered elsewhere, so we only note the five principles we follow from the PCG Testability certification: focus on behavior, explicit dependencies, encapsulation, single responsibility, and readability.

For legacy codebases with few or no unit tests — code lacking regression safety nets when logic changes — we use tools like TestOne to improve unit test efficiency, quality, and automation coverage.

1) New code scenarios

For incremental new code, scaffolding tools generate unit test templates. Compared to basic generators like gotests, these provide dependency analysis, call chain analysis, mock generation, and pointer type assertion analysis. This simplifies test data, improves test effectiveness and readability, and boosts overall efficiency and quality.

Example: For business code adding a user, the generated scaffolding expands test data fields one by one for manual filling, analyzes dependencies and prompts assertions for input parameters that are written to, auto-generates mock frameworks for detected tRPC calls, and adds //FIXME comments to remind developers to verify test logic.

2) Legacy code scenarios

For legacy codebases where unit tests are scarce, auto-generation quickly builds a quality safety net. This provides basic protection when code changes later. LogReplay’s unit tests now cover most lines of code and run both locally and in CI pipelines daily.

2.2.2 Writing API Tests

Key lessons from our practice:

Test code must follow the same language standards as production code.
Structure: setup (prepare protocol data) → invoke (send request) → assert (check return code and protocol data) → teardown (restore/release data).
Each test case should have independent test data in a separate file — not shared across cases.
Cases involving accounts should obtain test accounts from a test data management system, never by hardcoding.
For write APIs, either tag the request as test traffic (traffic coloring) so that its writes can be identified and isolated, or run tests in an isolated environment with isolated middleware instances.
Limit the scope of API tests to the service’s own correctness and availability. Mock downstream services and middleware dependencies. Leave realistic validation for integration or end-to-end tests.
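The setup → invoke → assert → teardown structure above can be sketched with the standard library alone. Everything here is hypothetical: `store` stands in for the data layer, `helloHandler` for the API under test, and `runCase` for a test function that would normally take a `*testing.T`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// store is a hypothetical in-memory data layer behind the service under test.
var store = map[string]string{}

// helloHandler is the hypothetical API under test.
func helloHandler(w http.ResponseWriter, r *http.Request) {
	name, ok := store[r.URL.Query().Get("id")]
	if !ok {
		http.Error(w, "not found", http.StatusNotFound)
		return
	}
	fmt.Fprintf(w, "hello %s", name)
}

// runCase follows setup -> invoke -> assert -> teardown and returns an error on failure.
func runCase() error {
	// setup: create data that belongs to this case only
	store["case-1"] = "alice"
	// teardown: deferred so state is restored even when an assertion fails
	defer delete(store, "case-1")

	// invoke: send the request to the API under test
	rec := httptest.NewRecorder()
	helloHandler(rec, httptest.NewRequest(http.MethodGet, "/hello?id=case-1", nil))

	// assert: check the return code and the payload, not the code alone
	if rec.Code != http.StatusOK {
		return fmt.Errorf("status = %d, want 200", rec.Code)
	}
	if got := rec.Body.String(); got != "hello alice" {
		return fmt.Errorf("body = %q, want %q", got, "hello alice")
	}
	return nil
}

func main() {
	fmt.Println("case error:", runCase())
	fmt.Println("rows left after teardown:", len(store))
}
```

Registering the teardown immediately after setup keeps the two paired, which helps with the independence between cases that the list asks for.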

Getting started

Here is a simple API test example in Go using TestOne SDK to bridge internal network restrictions:

func TestDemo(t *testing.T) {
    // client options (opts) omitted
    request := &pb.HelloRequest{Msg: "my test message"}
    rsp, err := pb.NewHelloClientProxy().SayHello(context.Background(), request, opts...)
    require.NoError(t, err) // testify's require stops the test here; rsp may be nil on error
    assert.NotEmpty(t, rsp.Msg)
}

Using mocks for stability

When running API tests in MR stages — where runs are frequent and failures highly visible — and when dependencies are under development or unstable, we encountered problems:

MR failures amplified
Dependencies often not ready
Concurrent test runs causing data conflicts

Solutions: improve test case quality, use sandboxed test environments (e.g., TestOne Sandbox), leverage mocking capabilities from the TestOne API Test SDK, and apply middleware governance.

Mocking an HTTP downstream:

m := mock.NewHTTP("hello.world.com", env)  
err := m.URI("/path/hello").  
        Rule(mock.Any()).  
        Return(`{"status": "ok", "token": 1, "value": "2"}`)
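For readers without access to the TestOne SDK, a comparable effect can be approximated with Go's standard `net/http/httptest` package. This is a substitute technique, not the SDK's implementation; `newMockDownstream` and `fetchHello` are hypothetical names, and the client takes its base URL as a parameter so a test can point it at the mock.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// helloResp matches the canned JSON body used in the mock rule.
type helloResp struct {
	Status string `json:"status"`
	Token  int    `json:"token"`
	Value  string `json:"value"`
}

// newMockDownstream starts a local server that answers /path/hello with a fixed body.
func newMockDownstream() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/path/hello", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"status": "ok", "token": 1, "value": "2"}`)
	})
	return httptest.NewServer(mux)
}

// fetchHello is a hypothetical client of the downstream service.
func fetchHello(baseURL string) (helloResp, error) {
	var out helloResp
	resp, err := http.Get(baseURL + "/path/hello")
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}

func main() {
	srv := newMockDownstream()
	defer srv.Close()

	got, err := fetchHello(srv.URL)
	fmt.Println(got, err)
}
```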

Mocking a tRPC downstream: Configure mock rules so the downstream service interface always returns needed data, avoiding issues from unready or changing dependencies.

Mocking middleware (e.g., MySQL): When the test environment’s MySQL is unstable, data is frequently modified, or specific data (like a large count) is hard to trigger, mock the middleware (e.g., making count(*) return 9).

Sandboxed environments dramatically improve stability for high-frequency MR runs. Mocking solves the dependency-not-ready problem and enables earlier test writing.

Improving efficiency with auto-generation

Convert traffic to test cases — Record production traffic and automatically generate API test cases.
Generate cases from API debug tools — Use backend API debug tools to debug new interfaces and auto-generate test cases from successful debug data. This improves efficiency, increases coverage for new APIs, and makes constructed test data reusable.

Using API coverage to set strategy

Use API coverage metrics to set goals: prioritize high-call-volume interfaces, use traffic-to-case tools, and mock downstream dependencies for stability. Results: high API coverage, over half of cases using mocks or sandbox environments, significantly better stability for cases with mocks.

2.2.3 Writing End-to-End Tests

Writing E2E tests is similar to writing API tests, with a few differences:

Test data acquisition — May require applying for test data (e.g., test user accounts) through the TestOne API Test SDK.
Realistic simulation — Send requests to the access layer, not directly to the tRPC service, to mimic real user behavior.
Multiple requests per test case — A single E2E case often chains several API calls, using data from one response as input to the next.

Challenges faced:

Difficult failure localization — E2E tests span multiple services.
Unstable staging environments causing random failures.
Erosion of confidence — Accumulated failures unrelated to test code or actual service bugs.
Finger-pointing — Each service owner blames others; test case authors stop maintaining tests.

Solutions:

Integrate distributed tracing for better observability and easier localization.
Govern environments and improve stability.
Use techniques to enhance E2E test reliability (e.g., flakiness detection).

The bottom line: Do not write too many E2E tests. Cover only the most critical core scenarios. Replace everything else with simpler, more maintainable API tests. After adopting this principle, our E2E tests remain highly stable while covering most core scenarios.

2.3. Debugging and Execution

2.3.1 Direct go test Execution

All test types run directly with go test.

2.3.2 Using a CLI (e.g., TestOne Guitar CLI)

For API testing, a CLI automatically creates a stable sandbox environment, runs tests, destroys the environment, and generates a report. Define a TESTPLAN file (suite name, case path, plan details like type, sandbox config, app info, build method), then run:

guitar test -p //TESTPLAN -n api_test

2.3.3 Using an IDE Plugin (e.g., TestOne Guitar IDE Plugin)

Run tests directly from the IDE while writing code without commands. The plugin displays test reports automatically after execution.

2.4. Failure Localization

When a test fails, first check logs. If the error originated downstream, use distributed tracing to find the last service returning an error. For frequent errors over time, aggregate error codes. For failures after refactoring, use request/response diffing.

2.4.1 Log Localization

Test execution logs show three error types:

Assertion errors — Raised by assertions on err and return codes. The error information (e.g., code 10002) points to the source: search the service code for that error code to find the failing logic branch.
Non-timeout panics — Provide a stack trace pinpointing the test line.
Timeouts — More complex. Check for infinite loops, overly long requests, or test cases with too many steps requiring longer timeout.

2.4.2 Common Framework Error Localization

In tRPC, business errors typically use codes > 10000. Framework errors use 1–200 and 999.

Error Code	Meaning	Common Cause
141	tcp client transport ReadFrame...	Protocol mismatch — client using tRPC to talk to an HTTP endpoint
111	service timeout	Service timeout, client timeout, or upstream context exhausted timeout
999	Generic error	Downstream returned errors.New(msg) without status code instead of errs.New(code, msg)
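The code ranges above lend themselves to a first-pass triage helper. The sketch below only encodes the ranges stated in this section; `Classify` is a hypothetical name, and treating 0 as success is an assumption.

```go
package main

import "fmt"

// Classify gives a first hint about where to look for a failure,
// based on the code ranges described in this section.
func Classify(code int) string {
	switch {
	case code == 0:
		return "success" // assumption: 0 means no error
	case code == 999:
		return "framework: generic error, check for uncoded downstream errors"
	case code >= 1 && code <= 200:
		return "framework: check protocol, timeouts, and connectivity"
	case code > 10000:
		return "business: search the service code for this error code"
	default:
		return "unclassified"
	}
}

func main() {
	for _, c := range []int{0, 141, 999, 10002, 5000} {
		fmt.Printf("%d: %s\n", c, Classify(c))
	}
}
```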

2.4.3 Distributed Tracing Localization

With tracing integrated, the Trace ID prints to test logs. On failure, find the Trace ID in the report, click to jump to the tracing UI, and quickly locate the cause — for example, the last service returning an error, often an environment issue or version mismatch.

2.4.4 Error Code Aggregation

For frequent errors over a period, aggregate downstream errors by upstream calling interface to identify recurring downstream problems.
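A minimal version of this aggregation is a count keyed by upstream interface, downstream service, and error code. The record shape and names below are hypothetical; in practice the records would come from logs or tracing data.

```go
package main

import (
	"fmt"
	"sort"
)

// CallError is one observed downstream failure.
type CallError struct {
	UpstreamAPI string // the interface that made the call
	Downstream  string // the service that returned the error
	Code        int
}

// Aggregate counts identical (upstream, downstream, code) combinations.
func Aggregate(errs []CallError) map[CallError]int {
	counts := make(map[CallError]int)
	for _, e := range errs {
		counts[e]++
	}
	return counts
}

func main() {
	counts := Aggregate([]CallError{
		{"Replay.Start", "storage", 111},
		{"Replay.Start", "storage", 111},
		{"Replay.Start", "auth", 10002},
		{"Replay.Stop", "storage", 111},
		{"Replay.Start", "storage", 111},
	})

	// Print the most frequent combinations first.
	keys := make([]CallError, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return counts[keys[i]] > counts[keys[j]] })
	for _, k := range keys {
		fmt.Printf("%-13s -> %-8s code=%-6d count=%d\n", k.UpstreamAPI, k.Downstream, k.Code, counts[k])
	}
}
```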

2.4.5 Data Diffing

When a test passes before a refactor but fails after, use a diff tool to compare protocol requests/responses field by field across two runs. This often reveals subtle changes like an extra comma in a returned message.
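A field-by-field comparison can be sketched for JSON responses as follows. This is a simplified illustration that compares top-level fields only; `Diff` and `fields` are hypothetical helpers, and a real diff tool would also descend into nested objects and handle other protocols.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// fields decodes a JSON object and re-encodes each top-level value,
// which normalizes whitespace and key order inside nested values.
func fields(raw []byte) (map[string]string, error) {
	var m map[string]any
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, err
	}
	out := make(map[string]string, len(m))
	for k, v := range m {
		b, err := json.Marshal(v)
		if err != nil {
			return nil, err
		}
		out[k] = string(b)
	}
	return out, nil
}

// Diff lists the top-level fields that were added, removed, or changed.
func Diff(before, after []byte) ([]string, error) {
	a, err := fields(before)
	if err != nil {
		return nil, err
	}
	b, err := fields(after)
	if err != nil {
		return nil, err
	}
	keys := make([]string, 0, len(a)+len(b))
	for k := range a {
		keys = append(keys, k)
	}
	for k := range b {
		if _, dup := a[k]; !dup {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)

	var out []string
	for _, k := range keys {
		va, inA := a[k]
		vb, inB := b[k]
		switch {
		case !inA:
			out = append(out, fmt.Sprintf("%s: added %s", k, vb))
		case !inB:
			out = append(out, fmt.Sprintf("%s: removed %s", k, va))
		case va != vb:
			out = append(out, fmt.Sprintf("%s: %s -> %s", k, va, vb))
		}
	}
	return out, nil
}

func main() {
	before := []byte(`{"code": 0, "msg": "ok"}`)
	after := []byte(`{"code": 0, "msg": "ok,"}`)
	fmt.Println(Diff(before, after)) // [msg: "ok" -> "ok,"] <nil>
}
```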

2.5. Improving Test Effectiveness

Despite high coverage, some logic bugs still escaped — even when covered by automation. A review revealed:

Cases with no assertions
Cases with ineffective assertions (e.g., only checking return codes)
Cases written but never run in the pipeline
Cases that failed and were simply commented out instead of fixed

Solutions:

2.5.1 Strengthen Code Review (CR)

Test code needs as rigorous review as production code. Require CR approval before merging. Review rules include:

Does the case have assertions, and are they sufficient?
Is removal or commenting out of a test case justified?
Are exported functions covered by unit tests?
Does the test cover enough branch conditions?
Are test cases independent of each other?
Are there obvious performance issues (e.g., sleep calls)?

2.5.2 Post-Mortems for Production Defects

Review production defects and on-call tickets. Ask why detection did not happen earlier and why automated tests did not catch the issue. Then supplement or update test cases accordingly.

2.5.3 Use Effectiveness Scanning Tools

Use tools that detect ineffective tests upfront:

Static scanning — Fast, catches simple issues like missing assertions, compilation errors, or incomplete assertions.
Dynamic code injection (mutation testing) — Slower but more thorough. Modifies code during test execution to simulate errors, revealing missing boundary checks or uncovered condition branches.

Run static scans in MR pipelines for quick feedback on incremental changes. Run scheduled dynamic injection for continuous improvement.
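To make the idea of dynamic injection concrete, the toy example below applies one mutation by hand (changing `>=` to `>`) and checks which test suites notice. It illustrates the principle only: the function, the mutant, and both suites are invented here, and real mutation tools generate and run mutants automatically.

```go
package main

import "fmt"

// testCase pairs an input with the expected result.
type testCase struct {
	age  int
	want bool
}

// isAdult is the original code under test.
func isAdult(age int) bool { return age >= 18 }

// isAdultMutant simulates an injected fault: the boundary operator is changed.
func isAdultMutant(age int) bool { return age > 18 }

// passes reports whether impl satisfies every case in the suite.
func passes(impl func(int) bool, suite []testCase) bool {
	for _, tc := range suite {
		if impl(tc.age) != tc.want {
			return false
		}
	}
	return true
}

// kills reports whether the suite passes on the original but fails on the mutant.
func kills(suite []testCase) bool {
	return passes(isAdult, suite) && !passes(isAdultMutant, suite)
}

func main() {
	weak := []testCase{{30, true}, {10, false}}               // no boundary case
	strong := []testCase{{30, true}, {10, false}, {18, true}} // adds the boundary

	fmt.Println("weak suite kills mutant:", kills(weak))     // false
	fmt.Println("strong suite kills mutant:", kills(strong)) // true
}
```

Both suites execute every line of `isAdult`, yet only the one with the boundary case detects the fault, which is the gap that line coverage alone cannot show.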

2.5.4 Track Test Execution Metrics

Work with your test platform to provide execution statistics: rates, counts, failure distribution. Review data regularly and optimize.

3. Continuous Integration and Continuous Deployment

3.1. Preparation

3.1.1 Improving System Stability

Unstable microservices cause random test failures that block CI/CD.

Steps taken:

Map service dependencies, remove unnecessary ones, switch common capabilities (gateway, authentication) to stable, unified PaaS services.
Integrate per-second monitoring (CPU, memory, disk I/O, network, QPS, latency, failure rate) with alerts (e.g., on-call tickets) when thresholds are exceeded.

We continuously optimize based on monitoring data, and have achieved and sustained >99.99% stability.

3.1.2 Improving Test Stability

Unit test stability:

Avoid sleep
Minimize mocks (use real implementations when fast and deterministic)
Do not modify or depend on system environment (e.g., clock)
Avoid random number inputs
No database, network, or cross-process calls

API test stability:

Mock downstream services and external HTTP dependencies where possible
Initialize data in setup; do not rely on data that already exists in the database
Restore modified test data after test finishes
Use isolated test environments

Handling flaky tests (E2E/API)

Flaky tests — sometimes passing, sometimes failing for the same code — destroy confidence. Use a flakiness mitigation scheme (e.g., TestOne Flakiness): monitor each test’s reliability score. If below a threshold, automatically remove the test from the critical path (stop running it or stop treating its result as a gate). This boosted critical-path E2E test stability to over 99%.
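A very simple form of such a scheme can be sketched as a pass-rate check over recent runs. This is an assumption-laden illustration: the article does not say how TestOne Flakiness computes its reliability score, so the pass-rate definition, the threshold, and the minimum run count below are all hypothetical.

```go
package main

import "fmt"

// Reliability returns the share of passing runs in the history (true = pass).
// An empty history is treated as fully reliable.
func Reliability(history []bool) float64 {
	if len(history) == 0 {
		return 1
	}
	passed := 0
	for _, ok := range history {
		if ok {
			passed++
		}
	}
	return float64(passed) / float64(len(history))
}

// ShouldQuarantine reports whether a test should leave the critical path:
// it needs enough runs to judge, and a score below the threshold.
func ShouldQuarantine(history []bool, threshold float64, minRuns int) bool {
	return len(history) >= minRuns && Reliability(history) < threshold
}

func main() {
	flaky := []bool{true, true, false, true, true, true, false, true, true, true}
	stable := []bool{true, true, true, true, true, true, true, true, true, true}

	fmt.Println(Reliability(flaky), ShouldQuarantine(flaky, 0.95, 5))
	fmt.Println(Reliability(stable), ShouldQuarantine(stable, 0.95, 5))
}
```

A production scheme would also need to separate flakiness from genuine regressions, for example by only counting runs against unchanged code, and to bring a test back once it stabilizes.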

3.1.3 Improving Environment Stability

Standardize environments:

Sandbox — Isolated, created per test run
Test — Baseline development environment
Staging — Pre-release environment for internal trial use
Canary — Small-traffic pre-production
Production — Live environment

Define strict entry and exit criteria for promoting changes between environments.

Environment	Entry Criteria	Exit Criteria
Sandbox	Build succeeds, 100% unit tests pass	100% API tests pass
Test	Code merged to trunk, 100% API tests pass	100% API tests pass (regression)
Staging	100% integration/E2E tests pass, on-call integrated	100% integration/E2E tests pass, sufficient duration/traffic (e.g., 6h/100 accesses), no on-call tickets
Canary	Performance tests pass, on-call integrated	100% integration/E2E tests pass, sufficient duration/traffic (e.g., 6h/100 accesses), no on-call tickets

Following these criteria and sequential promotion (Test → Staging → Canary → Prod) keeps environments synchronized and prevents inconsistencies.
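Exit criteria like these are easy to encode as an automated gate. The sketch below checks the Staging exit row, using the example thresholds from the table (6 hours, 100 accesses) and assuming both must be met. `StagingReport` and `ExitBlockers` are hypothetical names.

```go
package main

import "fmt"

// StagingReport summarizes what was observed while a build sat in Staging.
type StagingReport struct {
	PassedTests   int     // integration/E2E tests that passed
	TotalTests    int     // integration/E2E tests that ran
	SoakHours     float64 // time spent in Staging
	Accesses      int     // requests served in Staging
	OnCallTickets int     // tickets raised during the soak
}

// ExitBlockers returns the reasons a build may not leave Staging.
// An empty result means the build can be promoted.
func ExitBlockers(r StagingReport) []string {
	var blockers []string
	if r.TotalTests == 0 || r.PassedTests != r.TotalTests {
		blockers = append(blockers, "integration/E2E pass rate is not 100%")
	}
	if r.SoakHours < 6 {
		blockers = append(blockers, "less than 6h in Staging")
	}
	if r.Accesses < 100 {
		blockers = append(blockers, "fewer than 100 accesses")
	}
	if r.OnCallTickets > 0 {
		blockers = append(blockers, "open on-call tickets")
	}
	return blockers
}

func main() {
	ready := StagingReport{PassedTests: 42, TotalTests: 42, SoakHours: 7.5, Accesses: 350}
	early := StagingReport{PassedTests: 41, TotalTests: 42, SoakHours: 2, Accesses: 350, OnCallTickets: 1}

	fmt.Println(ExitBlockers(ready))
	fmt.Println(ExitBlockers(early))
}
```

Returning the list of blockers rather than a bare boolean makes a failed promotion self-explanatory in the pipeline log.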

3.2. CI Pipeline Configuration

CI continuously merges code to trunk and uses builds plus automated tests to enforce quality.

Pre-merge: Trigger code review, unit tests, code scans, security scans, test effectiveness scans, API tests. All must pass for merge.
Post-merge: Trigger unit tests, code scans, security scans, mutation tests, API tests for regression.

Pipelines use a consistent CLI tool (e.g., TestOne Guitar), keeping configuration minimal — specify the testplan file after checkout. Our CI process is stable.

3.3. CD Pipeline Configuration

CD extends CI, continuously and automatically deploying microservices to test and production without manual intervention.

Trigger: Code merged to trunk
Process: Auto-build, then auto-deploy sequentially to Test → Staging → Canary → Production
Gates: Entry and exit criteria at each environment
Auto-rollback: On failure according to grayscale strategy

Grayscale strategy for Production:

Node Count	Deployment Progression
< 10 nodes	1-2 → 3-5 → 6-9 nodes
≥ 10 nodes	10% → 30% → 60% → remaining nodes
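For clusters of ten or more nodes, the progression can be computed rather than configured by hand. The sketch below assumes the percentages are cumulative and rounds each stage up to a whole node; both are assumptions, as is the `Batches` name. Smaller clusters are left to the fixed progression in the table.

```go
package main

import "fmt"

// Batches returns the cumulative number of upgraded nodes after each grayscale
// stage (10% -> 30% -> 60% -> all) for clusters of 10 or more nodes.
// It returns nil for smaller clusters, which use the fixed progression instead.
func Batches(total int) []int {
	if total < 10 {
		return nil
	}
	var out []int
	for _, pct := range []int{10, 30, 60} {
		out = append(out, (total*pct+99)/100) // integer ceiling of total*pct/100
	}
	return append(out, total)
}

func main() {
	fmt.Println(Batches(20)) // [2 6 12 20]
	fmt.Println(Batches(15)) // [2 5 9 15]
	fmt.Println(Batches(8))  // []
}
```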

Monitoring during grayscale:

Traffic — Ensure grayscale nodes receive enough traffic for validation
Exceptions — Monitor exception counts (e.g., via 007 metrics)
Resources — Compare CPU/memory curves before and after

Targeted testing: Run API tests safe for production data on grayscale nodes to verify service works correctly with production configurations and data.

Grayscale outcome:

No anomalies or anomalies below threshold → complete full release
Anomalies exceed threshold → auto-rollback (revert grayscale nodes to original image)

Our CD process is stable. Past rollbacks were caused by deployment order issues (e.g., service A deploying before its dependency service B) or configuration changes requiring new production data.

4. Summary

With the LogReplay project, we have largely achieved continuous deployment for microservice code changes. After a code MR merges to trunk, the process runs fully automatically — extensive automated tests, a robust CI/CD pipeline, and auto-rollback when issues occur.

Work remains. While code changes are fully automated, configuration and database changes still require manual steps. We plan to explore continuous, automated deployment for those as well.

Different businesses and scenarios have different needs. Our practices may not apply universally. But the shared goal — higher quality and faster delivery — is universal, and both depend heavily on automation. We hope more teams explore, practice, and share their experiences with backend automated testing and continuous deployment.

Testing tools used — Most tools mentioned are proprietary internal Tencent products (e.g., TestOne: one-stop testing platform).

Terminology

Term	Definition
CI	Continuous Integration
CD	Continuous Deployment
Mock server	A service that implements mocking behavior for other services
Sandbox / Test / Staging / Canary / Production environments	Isolated, baseline, pre-release, canary, and production environments
Flaky test	A test with non-deterministic outcomes — for the same code, sometimes passes, sometimes fails
