As business scale continues to grow and demands for stability and scalability increase, backend architecture has been driven to innovate constantly. With increasingly complex requirements, distributed systems have gradually become integral to backend development. In 2013, I began exploring distributed systems during the mobile gaming boom. Over nearly three years of trial and error, I encountered numerous challenges and learned from the industry's technological advances, ultimately shaping my current design philosophy. I would like to share some of these insights; while I cannot claim they hold immense reference value, I hope they can catalyze further discussion.
The initial motivation for using distributed systems was quite straightforward: to address the issues of disaster recovery and capacity expansion challenges caused by a single-point structure in client game development. A simplistic idea was to consider processes with identical functionality as a unified entity providing external services. Below is a brief overview of the basic framework:
This architecture consists of three basic components:
Client API, service requester API:
- Retrieves the service provider address from the Cluster Center Server
- Registers with all instances within the Server cluster; if registration is successful, it is considered available
- Selects a Server instance for communication using a load-balancing algorithm
- Monitors the operational status of each instance within the Server cluster
Server API, service provider API:
- Reports its status, access address, etc., to the Cluster Center Server
- Accepts registration from the Client API and provides services
- Periodically reports the status to the successfully registered Clients
Cluster Center Server, cluster center process:
- Receives reports from the Server Cluster, determines the structure of the service cluster, and the status of each instance
- Receives requests from the Client Cluster and returns a list of available service clusters
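The registration and lookup flow of these three components can be sketched in a few lines. This is a minimal illustration only; all names here (`ClusterCenter`, `report`, `pick_instance`) are ours, not from any real framework:

```python
import random

class ClusterCenter:
    """Single-point registry: Servers report in, Clients query for addresses."""
    def __init__(self):
        self.instances = {}          # addr -> status

    def report(self, addr, status):  # Server API: report status and address
        self.instances[addr] = status

    def available(self):             # Client API: fetch available providers
        return [a for a, s in self.instances.items() if s == "up"]

def pick_instance(addrs):
    """Trivial load-balancing choice; real code would also track health."""
    return random.choice(addrs) if addrs else None

center = ClusterCenter()
center.report("10.0.0.1:9000", "up")
center.report("10.0.0.2:9000", "down")
print(pick_instance(center.available()))   # -> 10.0.0.1:9000
```

Even this toy version makes the single point of failure visible: if `center` dies, every lookup fails, which is exactly the first problem discussed below.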
This architecture has the basic prototype of a cluster and can meet the fundamental needs of disaster recovery and capacity expansion. However, there are several issues, and I will summarize a few points here:
1. Inefficient service discovery implementation
The implementation of the Cluster Center Server is a single point, and Client requests will fail when a fault occurs. There is no monitoring mechanism, and the Client can only obtain the latest status of the service through periodic requests.
2. Inflexibility of the CS's Request/Response communication method
In real-world applications, services often involve mutual requests, and a one-to-one response is far from sufficient. Full-duplex communication must be supported.
3. Flawed keep-alive mechanism
The Server's unilateral periodic heartbeat to the Client has two problems. First, different Clients may have different keep-alive requirements, some 5 seconds, some perhaps 1 second; if the Server initiates every heartbeat, these differentiated requirements cannot be met. Second, it is unwise for the service side, as the passive party, to bear the responsibility of monitoring the requester's liveness.
4. Unclear hierarchy in architectural design
The architecture's layering and module division were not well planned: interfaces for the communication layer, service discovery, cluster detection, and keep-alive were never clearly defined, which leads to mutual coupling and makes replacement and maintenance difficult.
The problems above stem mainly from a limited perspective: by building everything ourselves, we failed to keep pace with the industry's technological development. In recent years, microservices architecture has developed rapidly. Compared with traditional service-oriented architecture, it no longer overemphasizes the enterprise service bus, focusing instead on componentization within individual business systems. Below I introduce my research findings.
Service collaboration is a core part of distributed systems. It can be summarized as: multiple process nodes provide services to the outside world as a whole, services can discover each other, and followers of a service are promptly informed of changes in the services they follow, so that collaboration can proceed. The process consists of service registration and service discovery, and implementations involve the following aspects:
- Unified naming: Centralized, unified naming for services and their nodes, making it easier to distinguish and access each other.
- Monitoring: Determine the availability and status of services. When the service status changes, followers should have a way to be informed.
- Access policy: Services usually consist of multiple nodes and exist in a cluster form. Clients need a strategy to determine the communication node each time they make a request. The policy objectives may vary, such as load balancing, stable mapping, etc.
- Availability: Disaster recovery handling and dynamic capacity expansion.
Some mature implementations in the industry are shown in the following table:
| Component | Description |
| --- | --- |
| ZooKeeper | Uses the Zab protocol to ensure node consistency, offering a centralized framework for user-maintained configuration information, naming, distributed synchronization, and group services. |
| etcd | A highly available key-value store built on the Raft consensus protocol; provides an HTTP+JSON API and supports reads from non-leader nodes. |
| Consul | Forms a dynamic cluster via gossip and offers hierarchical key/value storage, capable of storing data and registering events and tasks. |
Also known as message queues, these are widely used in distributed systems to establish channels between nodes requiring network communication, efficiently and reliably enabling platform-independent data exchange. Architecturally, there are mainly two types: Broker-Based (agent) and Brokerless (agentless). The former requires deploying an intermediate layer for message forwarding, providing secondary processing and reliability guarantees. The latter is lightweight and directly embedded in communication nodes. Some mature implementations in the industry are shown in the following table:
| Component | Description |
| --- | --- |
| RabbitMQ | Written in Erlang; a broker-based architecture and a leading implementation of AMQP, suitable for scenarios with high requirements for data consistency, stability, and reliability, but with average performance and throughput. |
| ZeroMQ | Written in C++; a brokerless architecture that is more like a low-level network API, offering high performance and flexibility, but lacking support for high-level requirements and not guaranteeing message reliability. |
| ActiveMQ | Written in Java; can be deployed as a broker or in peer-to-peer mode, providing a wealth of features that make advanced application scenarios easy to implement. |
For communication between services, data structures/objects must be converted to and from binary streams for transmission, generally referred to as serialization/deserialization. Different programming languages and application scenarios define and implement data structures/objects differently. When choosing, consider the following aspects:
- Universality: Whether it supports cross-platform, cross-language; whether it is widely popular or supported in the industry
- Readability: Text streams have a natural advantage, and pure binary streams without convenient visualization tools will make debugging extremely challenging
- Performance: Space overhead - storage space occupation; time overhead - speed of serialization/deserialization
- Scalability: The unchanging principle of business is that it is always changing, so it must have the ability to handle compatibility between new and old data
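The scalability point is easy to demonstrate: a decoder that supplies defaults for missing fields and ignores unknown ones stays compatible in both directions. A minimal sketch using JSON; the field names (`name`, `level`, `vip`) are invented for illustration:

```python
import json

def decode_profile(raw: bytes) -> dict:
    d = json.loads(raw)
    # New fields get defaults so messages from old writers still decode;
    # unknown fields from newer writers are simply ignored.
    return {"name": d.get("name", ""), "level": d.get("level", 1)}

old_msg = json.dumps({"name": "alice"}).encode()                 # written before "level" existed
new_msg = json.dumps({"name": "bob", "level": 7, "vip": True}).encode()
print(decode_profile(old_msg))   # {'name': 'alice', 'level': 1}
print(decode_profile(new_msg))   # {'name': 'bob', 'level': 7}
```

Schema-based protocols such as ProtoBuf and Avro bake exactly this default-and-ignore behavior into their generated code.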
Components for implementing serialization/deserialization generally include: IDL (Interface Description Language), IDL Compiler, Stub/Skeleton. Currently popular serialization protocols in the industry include: XML, JSON, ProtoBuf, Thrift, Avro, etc. For the implementation and comparison of these protocols, you can refer to the article "Serialization and Deserialization". Here, I will excerpt the selection conclusions from the original text for everyone:
- For businesses that allow high latency, such as 100ms or more, and have frequent and complex content changes, consider the SOAP protocol based on XML.
- For Ajax-based web browsers and communication between mobile apps and servers; for scenarios with not too high performance requirements or mainly dynamic type languages, JSON can be considered.
- For scenarios with extremely high requirements for performance and simplicity, Protobuf, Thrift, and Avro are all similar.
- For Terabyte-level data persistence application scenarios, Protobuf and Avro are the primary choices. If the persisted data is stored in a Hadoop subproject, or mainly uses dynamic type languages, Avro would be a better choice; for non-Hadoop projects, with mainly static type languages, Protobuf is the first choice.
- If you don't want to reinvent the wheel for RPC, consider Thrift.
- If different transport layer protocols need to be supported after serialization, or high-performance scenarios requiring cross-firewall access, Protobuf can be considered first.
After researching the surrounding environment, in 2015, we began working on our second mobile game. Learning from previous lessons, the basic principles of this design are:
- System decomposition and decoupling, clearly defining interfaces between systems, and hiding internal implementation
- The overall framework should be as general as possible, with subsystems that can be replaced in different scenarios
First, let's define the services, then introduce the overall framework and the internal division of services.
Using a mobile game as an example, let's explain it with a diagram:
Service Cluster refers to a group of instances with the same function, serving as a whole and providing external services. For example, Lobby provides lobby services, Battle provides battle services, Club provides guild services, and Trade provides trading services.
Service Instance is the smallest unit providing a specific service function, existing in the form of a process. For example, in the Club cluster, there are two instances 184.108.40.206 and 220.127.116.11 with the same functionality.
Service Node is the basic unit managed by the service discovery component, which can be a cluster, instance, hierarchical relationship, or a business-related concept.
Service Key is the globally unique identifier for service nodes. The design of the key needs to reflect the hierarchical relationship, at least showing the containment relationship between Cluster and Instance. Both etcd and zookeeper support hierarchical organization of keys, similar to the tree structure of the file system. Etcd has mkdir to directly create directories, while zookeeper uses paths to describe parent-child relationships. However, the conceptual path structure can be used in any case.
In the above diagram, the complete path of the Service Instance can be described as: /AppID/Area/Platform/WorldID/GroupID/ClusterName/InstanceName. It has the following characteristics:
- The cluster path must be the parent path of each instance.
- From a functional completeness perspective, the cluster is the basic granularity of the service.
- Clusters with the same function have different meanings under different prefix paths, and their service targets can also be different. For example:
/Example/wechat/android/w_1/g_1/Lobby and /Example/wechat/android/w_3/g_2/Lobby both represent lobby services in terms of function, but one serves group 1 of world 1 while the other serves group 2 of world 3.
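The path scheme above can be captured in a pair of small helpers, following the /AppID/Area/Platform/WorldID/GroupID/ClusterName/InstanceName layout from the text. This is a sketch; the helper and field names are ours:

```python
# Hierarchy fields in path order, mirroring the layout described in the text.
FIELDS = ["app", "area", "platform", "world", "group", "cluster", "instance"]

def make_path(**kw):
    """Build a service key from the fields that are present, in order."""
    return "/" + "/".join(kw[f] for f in FIELDS if f in kw)

def parse_path(path):
    """Split a service key back into its named components."""
    return dict(zip(FIELDS, path.strip("/").split("/")))

inst = make_path(app="Example", area="wechat", platform="android",
                 world="w_1", group="g_1", cluster="Lobby", instance="i_1")
# The cluster path is always the parent path of its instances:
cluster = inst.rsplit("/", 1)[0]
print(inst)      # /Example/wechat/android/w_1/g_1/Lobby/i_1
print(cluster)   # /Example/wechat/android/w_1/g_1/Lobby
```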
First, let's abstract a few basic operations. The APIs of different service discovery components may vary slightly, but they should have corresponding functions:
- Create: Create a Service Node in the service discovery component with a globally unique identifier corresponding to the Key.
- Delete: Delete the node corresponding to the Key in the service discovery component.
- Set: Set the Value corresponding to the Key, such as security access policies or node basic properties.
- Get: Obtain the data of the corresponding node based on the Key. If it is a parent node, you can get its list of child nodes.
- Watch: Set a monitor for the node. When the node itself and the nested child node data change, the service discovery component will actively notify the change event to the monitor.
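These five operations can be expressed as a small backend-neutral interface. The in-memory version below is purely illustrative; real wrappers would sit on top of zookeeper or etcd, and all class and method names here are ours:

```python
class ServiceDiscovery:
    """Toy in-memory stand-in for a service discovery component."""
    def __init__(self):
        self.nodes = {}      # key -> value
        self.watchers = {}   # watched prefix -> list of callbacks

    def create(self, key, value=None):
        self.nodes[key] = value
        self._fire(key)

    def delete(self, key):
        self.nodes.pop(key, None)
        self._fire(key)

    def set(self, key, value):
        self.nodes[key] = value
        self._fire(key)

    def get(self, key):
        if key in self.nodes:
            return self.nodes[key]
        # Parent path: return the list of child keys under it.
        return [k for k in self.nodes if k.startswith(key + "/")]

    def watch(self, prefix, callback):
        self.watchers.setdefault(prefix, []).append(callback)

    def _fire(self, key):
        # Notify watchers of the node itself and of any parent prefix.
        for prefix, cbs in self.watchers.items():
            if key.startswith(prefix):
                for cb in cbs:
                    cb(key)

sd = ServiceDiscovery()
events = []
sd.watch("/app/Lobby", events.append)            # watch the cluster path
sd.create("/app/Lobby/i_1", "10.0.0.1:9000")     # instance registers itself
print(sd.get("/app/Lobby"))                      # ['/app/Lobby/i_1']
print(events)                                    # ['/app/Lobby/i_1']
```

Decoupling the business from zookeeper/etcd specifics behind this kind of interface is exactly the replaceability goal stated later in the design principles.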
When a Service Instance starts, it follows the following process:
- Generate its own Service Path, which is the path of the service instance.
- Create a node with the Service Path as the key, and Set the data: the externally accessible address, security access policy, etc.
- Generate the Service Path of the service cluster to be accessed, and obtain the cluster data through the Get method. If it is not found, it means the service does not exist. If it is found, there are two situations:
- There are no child nodes under the path. This means there are currently no available service instances. Set a watcher for the cluster path and wait for new available instances.
- There are child nodes under the path. Get the list of all child nodes, and further Get the access method and other data of the child nodes. Also, set a watcher on the cluster path to detect changes in the cluster, such as adding or reducing instances.
When a Service Instance closes, it follows the following process:
- Delete its corresponding node using the Delete method. Some service discovery components delete the node automatically when the instance's lifecycle ends, such as ZooKeeper's ephemeral nodes. Note that etcd directories and zookeeper parent paths cannot be deleted while they are non-empty.
Based on the above abstraction, we can define the basic interface of service discovery. The specific implementation of the interface can be developed for different components with different wrappers, but it can be decoupled from the business.
Ultimately, all architectures need to be implemented at the process level. Currently, our project's distributed architecture component is called DMS (Distributed Messaging System), which is provided in the form of a DMS Library. Integrating this library can achieve service-oriented distributed communication. The following is the overall structure of the DMS design:
- Message Middleware: implements inter-process communication, encapsulating IPC mechanisms such as sockets and shared memory behind a unified interface.
- DMS Protocol: built on the message middleware, implements distributed cluster message forwarding, node state control, business message forwarding, and other functions through a few simple, clear underlying protocols.
- DMS Kernel: the core logic of DMS, implementing service collaboration, message routing, disaster recovery and capacity expansion, hiding implementation details and providing a simple interface to the application layer.
- DMS Interface: defines the mutual access interface between DMS and the APP.
- Serialize/Deserialize: converts between network data and in-memory data as messages are sent and received through DMS.
- APP: contains the business logic, which sends and receives business messages through DMS to implement the basic functions of the service.
For Serialize/Deserialize, APP services have a high degree of freedom of choice. The specific implementation of other layers is described below:
3.3.1 Message Middleware

There are many options for message middleware; DMS uses ZeroMQ. The rationale: it is lightweight, powerful, and low-level, and therefore flexible and controllable. The cost is that advanced application scenarios require substantial secondary development, and its 80+ pages of documentation take real time to digest. Much has already been written about ZeroMQ, so rather than an introduction, the design scheme is given directly.

Choice of communication mode: ZeroMQ sockets come in many types, and different combinations form different communication modes. Some common ones:
- REQ/REP: lockstep request/response; each request must wait for its response
- PUB/SUB: publish/subscribe
- PUSH/PULL: pipeline processing; upstream pushes data, downstream pulls it
- DEALER/ROUTER: full-duplex asynchronous communication
Seeing this, you might expect PUB/SUB plus DEALER/ROUTER to cover most application scenarios. In fact, DMS uses only one socket type, ROUTER, and only one communication mode, ROUTER/ROUTER. One socket type and one communication mode sounds simple, but can it really meet the requirements?
DEALER/ROUTER is the traditional asynchronous mode: one side connects and the other binds. If the front end wants to reach multiple back ends, it must create multiple sockets. In the cluster service model described earlier, a node acts as both client and server, with multiple incoming edges (passively accepted connections) and outgoing edges (actively initiated connections). This is exactly the concept of routing: a single ROUTER socket can maintain multiple paths and send or receive messages on each of them.
PUB/SUB focuses on scalability and scale. According to ZeroMQ's author, you should consider PUB/SUB when you need to broadcast millions of messages per second to thousands of nodes. Our business will not reach that scale in the foreseeable future, so for now simplicity comes first.
3.3.2 DMS Protocol
The DMS protocol implements the basic functions of cluster management, message forwarding, and so on. ZeroMQ messages are composed of frames: a frame can be empty or a byte stream, and a complete message can contain multiple frames (Multipart Messages). Building on this feature, the DMS protocol divides content into basic units, each described by one frame, with different meanings expressed by combining units. This is more flexible than the traditional approach, where each protocol is a structure and every combination of units must be defined as its own structure.
Let's look at the basic layout of the DMS Protocol. The first frame must be the peer ID, so on receipt the peer also learns the sender's identity. The second frame carries DMS control information. The third and fourth frames are service-defined payload, valid only for REQ/REP:
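The multipart layout can be sketched as follows. The frame contents here, such as encoding the control frame as JSON, are illustrative assumptions for the sketch, not the actual DMS wire format:

```python
import json

def pack(peer_id: bytes, control: dict, meta: bytes = b"", body: bytes = b""):
    # Frame 1: peer ID; frame 2: DMS control info (command word, PIDF);
    # frames 3-4: service-defined payload, meaningful only for REQ/REP.
    return [peer_id, json.dumps(control).encode(), meta, body]

def unpack(frames):
    return frames[0], json.loads(frames[1]), frames[2], frames[3]

msg = pack(b"node-7", {"cmd": "REQ", "pidf": "/app/w_1/g_1/Lobby/i_1"},
           meta=b"battle.join", body=b"\x01\x02")
peer, ctrl, meta, body = unpack(msg)
print(ctrl["cmd"], meta)   # REQ b'battle.join'
```

With a real ZeroMQ socket, the frame list would travel as one multipart message (e.g. pyzmq's `send_multipart`), which is what keeps the unit-per-frame scheme cheap.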
PIDF carries two tags: the tag of the service cluster and the instance's own tag, consistent with the node key definitions in Service Discovery. It has two forms: a string, which is readable and easy to understand, and an integer, a hash of the string used to improve transmission efficiency.
Protocol command word
The DMS protocol command word is carried in the second frame of each message, the Control Frame. The commands are defined as:
| Command | Description |
| --- | --- |
| PING/PONG | PING actively probes the peer, and PONG answers the probe. This is a request-style heartbeat initiated by the active party, whose interval is determined by the requester. |
| READY | Notifies the peer that you are ready and can provide services. Sent after probing confirms the peer is alive; no reply is needed. |
| DISCONNECT | Unconditionally notifies the peer to disconnect. Applicable in any mode, to any endpoint with an existing connection. |
| REQ | Actively sends a message to the peer: either a service request expecting a response, or just a notification. Always actively sent. |
| REP | Passively responds to the peer's request with a message. |
Communication process - establish connection
After a server is found through Service Discovery, do not connect immediately; send a probe packet first. The reason: although service discovery reflects whether a node is alive, it generally does so with a delay, so nodes obtained from service discovery are only candidates.
Moreover, the underlying network mechanisms differ widely: some are connection-based, such as raw sockets, and some are connectionless, such as shared memory. It is better to settle whether the connection works in the high-level protocol. Like sonar, the probe is a stone thrown to ask for directions: a response means the peer can be reached; no response means the connection is currently unavailable.
Communication process - business message sending
- If the PIDF indicates the peer instance is directly connected to the current process, send the message directly.
- If the PIDF indicates the peer instance is not directly connected, the message can be routed through directly connected instances; the routing mechanism is introduced later.
- If the PIDF InstanceID is negative, broadcast the message to all instances in the specified cluster.
Routing and broadcasting can be mixed. DMS completes the process above automatically; the business does not have to participate, though it can intercept and intervene.
Communication process - keep-alive mechanism
After the connection is established, the requester keeps sending probe packets to the server at its own interval. If the requester misses the server's PONG several consecutive times, it considers the connection to the server broken.
If the server receives any packet from the requester, it considers the requester alive; if nothing arrives (including PING) within a certain time, it considers the requester disconnected. This timeout is carried in the READY protocol, through which the requester informs the server.
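The two-sided rule can be sketched as a pair of small state trackers. The names and structure are ours; real code would drive these from timers and the socket loop:

```python
class RequesterView:
    """The requester counts missed PONGs at its own chosen interval."""
    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self.missed = 0

    def on_ping_sent(self):
        self.missed += 1       # each unanswered PING counts against the peer

    def on_pong(self):
        self.missed = 0        # any PONG resets the counter

    def peer_alive(self):
        return self.missed < self.max_missed

class ServerView:
    """The server treats any inbound packet as proof of life; the timeout
    is the one the requester supplied in its READY message."""
    def __init__(self, timeout_from_ready):
        self.timeout = timeout_from_ready
        self.last_seen = 0.0

    def on_any_packet(self, now):
        self.last_seen = now

    def peer_alive(self, now):
        return now - self.last_seen <= self.timeout

req = RequesterView(max_missed=3)
for _ in range(3):
    req.on_ping_sent()            # three PINGs go out, no PONG comes back
print(req.peer_alive())           # False

srv = ServerView(timeout_from_ready=5.0)
srv.on_any_packet(now=10.0)
print(srv.peer_alive(now=14.0))   # True
print(srv.peer_alive(now=16.0))   # False
```

Note how this fixes both flaws of the original design: each requester picks its own interval, and the passive side only has to remember a timestamp.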
Communication process - disconnection
After receiving DISCONNECT, either party considers the other to have actively disconnected and will no longer initiate any form of communication with it.
3.3.3 DMS Kernel
The following describes how the DMS Kernel implements the relevant logic according to the DMS Protocol and how it interacts with the business.
The Service Manager covers three concerns:
- Self: determines the instance's own service path, performs service registration, and records established communication links for use by the routing table
- Targets: obtains and monitors the data and running status of the target services
- ACL: access control management
It also encapsulates the service discovery layer interface, since different SERVICE DISCOVERY components offer different functions.
After each service instance actively and successfully connects to a peer service, it writes the connection to SERVICE DISCOVERY as an edge via the SERVICE MANAGER, so a complete graph structure, the routing table, accumulates in the form of adjacency edges. For example, if Service 1 connects to Service 2, Service 3, and Service 4, the edges (1,2), (1,3), and (1,4) are recorded. The routing adjacency list can live under a shared key, such as /AppID/Area/Platform/routing_table, so all service instances can update and read that path to obtain a consistent routing table. There are two basic roles. The Updater adds edges to the routing table, deletes edges, sets edge attributes (such as weight), and monitors edge changes.
The Calculator computes routes over the graph formed by the adjacency edges. The starting point is the current instance; given a target, it determines whether the target is reachable and, if so, picks the path and hands the message to the next node for forwarding. Dijkstra's algorithm is the default, and services can customize it.
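A minimal version of such a Calculator over the adjacency-edge list might look like this. It is a sketch, not the DMS implementation: `next_hop` is our name, the example uses unit edge weights, and edges are treated as undirected:

```python
import heapq

def next_hop(edges, start, target):
    """Dijkstra over (a, b, weight) edges; returns the first hop on the
    shortest path from start to target, or None if unreachable."""
    adj = {}
    for a, b, w in edges:
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))   # undirected for this sketch
    dist = {start: 0}
    pq = [(0, start, None)]                    # (distance, node, first hop)
    while pq:
        d, node, hop = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue                           # stale queue entry
        if node == target:
            return hop
        for nxt, w in adj.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                # The first hop is fixed when we leave the start node.
                heapq.heappush(pq, (nd, nxt, nxt if hop is None else hop))
    return None

# The example from the text, plus an edge (3,4) to force a forwarded route.
edges = [(1, 2, 1), (1, 3, 1), (3, 4, 1)]
print(next_hop(edges, 1, 4))   # 3: reach instance 4 by forwarding via 3
print(next_hop(edges, 1, 5))   # None: instance 5 is unreachable
```

The returned hop is exactly what the forwarding step needs: the current instance sends the message to that directly connected neighbor, which repeats the calculation.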
Connection management covers Frontends, the connections that front-end requesters open toward this instance, and Backends, the connections this instance actively initiates; Backend targets come from the Service Manager.
The Sentinel learns each front-end connection's liveness criteria through the READY protocol and judges whether an incoming connection is alive from the front end's active packets. A connection judged dead is marked disconnected, and no further active packets are sent to that front end.
The Prober establishes and maintains connections to back-end services.
The Dispatcher determines the peer instance for each outgoing message. Connections are per-instance, but services are generally addressed as clusters, so the Dispatcher needs an allocation mechanism to forward messages to a specific instance within the target cluster. Note that only directly connected unicast occurs here. Allocation defaults to consistent-hash load balancing, and services can customize it for specific application scenarios.
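The default policy can be sketched as a small consistent-hash ring. This is an illustration, not the DMS implementation; `HashRing` and its parameters are ours:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing: the same routing key keeps mapping to the same
    instance, and adding/removing an instance only remaps a small share
    of keys. vnodes smooths the distribution across instances."""
    def __init__(self, instances, vnodes=64):
        self.ring = sorted(
            (self._h(f"{inst}#{v}"), inst)
            for inst in instances for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def pick(self, routing_key):
        if not self.ring:
            return None
        # First virtual node clockwise from the key's hash position.
        i = bisect.bisect(self.keys, self._h(routing_key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["Lobby/i_1", "Lobby/i_2", "Lobby/i_3"])
chosen = ring.pick("player:1001")
assert chosen == ring.pick("player:1001")      # stable mapping for one key
# Removing i_3 leaves most keys on their previous instance, which is
# precisely why consistent hashing suits disaster recovery and expansion.
smaller = HashRing(["Lobby/i_1", "Lobby/i_2"])
```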
3.3.4 DMS Interface
The DMS API is the interface DMS provides to the business, covering service management, communication, and other basic functions. The DMS APP Interface is the interface DMS requires services to implement, such as the Dispatcher's load-balancing policy, notifications of peer service status changes, and service-customized routing algorithms.
Below are three typical application scenarios of DMS; other scenarios can be realized by combining these three.
Brokerless communication
The most basic communication mode: full interconnection of instances between two clusters, suitable for simple businesses with few services and uncomplicated logic.
Broker communication
An internally cohesive subsystem may contain N services that interact heavily with one another. In brokerless mode this raises problems: there are too many links, so the communication layer consumes a large amount of memory and operations become difficult; and the services are not decoupled, each depending directly on its peer's existence. Here a broker cluster can take on message forwarding and some centralized logic. Note that Broker is just a role name; it can be implemented directly with the DMS Library.
Broker-level interconnection
When multiple subsystems communicate with each other, no designer is willing to expose internal details completely. Two broker clusters then act as portals: internally, each subsystem's services can still intercommunicate and centralize logic; externally, the broker serves as the subsystem's interface and shields the details. Different subsystems then provide services to one another only through their respective broker clusters.
This article introduced the basic building blocks of DMS: service discovery, message middleware, and the communication architecture. The core idea is a layered framework with clearly defined interfaces between layers, so that different concrete implementations can be swapped in for different scenarios. ZooKeeper and ZeroMQ are merely our current choices; in other scenarios, any component that satisfies the interfaces can be used.