How to Make Your Go Program Run Faster?

Testing Tech Hub 2023-09-19 16:55 165

This article is about the Go language. It primarily focuses on runtime speed, rather than development speed – these two types of speed are distinct.

I have worked with many intelligent people in the past. Many of us were fascinated by performance issues, and what we used to do was try to push the boundaries of what could be expected in terms of performance. Application engines have some very strict performance requirements, which is why we made changes. Since using the Go language, we have learned a lot about improving performance and making Go work smoothly in system programming.

The simplicity and native concurrency of Go make it an attractive language for backend development, but a bigger concern is how it handles latency-sensitive application scenarios. Is it worth sacrificing the language's simplicity to make it faster? Let's take a look at several aspects of Go language performance optimization: language features, memory management, and concurrency, and make appropriate optimization decisions based on these. All the test codes introduced here can be found at this link.

Channels

Channels in the Go language have received a lot of attention because they are a convenient tool for concurrency, but it is also important to understand their impact on performance. In most scenarios, their performance is "good enough," but they can become a bottleneck in some latency-sensitive situations. Channels are not some kind of black magic. In the underlying implementation of channels, locks are still used. They work well in single-threaded applications without lock contention, but performance drops dramatically in multi-threaded scenarios. We can easily replace the functionality of channels with lock-free queues, such as ring buffers.

The first performance test compares the single-threaded buffer channel and ring buffer (one producer and one consumer). Let's take a look at the single core (GOMAXPROCS=1)

BenchmarkChannel 3000000 512 ns/op

BenchmarkRingBuffer 20000000 80.9 ns/op

As you can see, the ring buffer is about 6 times faster (if you are not familiar with Go's performance testing tool, the number in the middle represents the number of executions, and the last array represents the time spent on each execution).

Next, let's look at the adjustment of GOMAXPROCS=8.

BenchmarkChannel-8 3000000 542 ns/op

BenchmarkRingBuffer-8 10000000 182 ns/op

The ring buffer is nearly three times faster.

Channel is usually used to assign tasks to workers. In the following test, we compare the situation in which multiple readers read the same channel or ring buffer. The test result of setting GOMAXPROCS=1 shows that the channel performs particularly well in single-core applications.

BenchmarkChannelReadContention 10000000 148 ns/op

BenchmarkRingBufferReadContention 10000 390195 ns/op

However, ring buffer is faster in the case of multi-core:

BenchmarkChannelReadContention-8 1000000 3105 ns/op

BenchmarkRingBufferReadContention-8 3000000 411 ns/op

Finally, let's take a look at the scenario of multiple readers and multiple writers. The following comparison also shows that the ring buffer is better in multi-core.

BenchmarkChannelContention 10000 160892 ns/op

BenchmarkRingBufferContention 2 806834344 ns/op

BenchmarkChannelContention-8 5000 314428 ns/op

BenchmarkRingBufferContention-8 10000 182557 ns/op

The ring buffer only uses CAS operations to achieve thread safety. We can see that the choice of channel or ring buffer depends largely on the number of cores in the system. For most systems, GOMAXPROCS>1, so the ring buffer without a lock is often a better choice. Channel is a poor choice in multi-core systems.

defer

Defer is a very useful keyword to improve readability and avoid resources not being released. For example, when we open a file for reading, we need to close it at the end of reading. Without the defer keyword, we must ensure that the file is closed before each return point of the function.

This is easy to make mistakes because it is easy to forget to close the file before any return statement. Defer solves this problem with a line of code.

At first glance, people would think deferring might be completely optimized by the compiler. If I only use the defer statement at the beginning of the function, the compiler can implement it by inserting defer content before each return statement. But the actual situation is often more complicated. For example, we can add defer to conditional statements or loops. The first case may require the compiler to find the conditional branch where the defer statement is applied The compiler also needs to check the situation of panic because this is also a situation where the function exits execution. It seems (at least on the surface) impossible to provide this function (defer) through static compilation.

derfer is not a zero-cost keyword. We can take a look through the performance test. In the following test, we compared the direct unlocking and the use of defer statements to unlock a mutex lock after it is locked in the loop body.

BenchmarkMutexDeferUnlock-8 20000000 96.6 ns/op

BenchmarkMutexUnlock-8 100000000 19.5 ns/op

Using defer is almost five times slower. To be fair, 77ns may not be so important, but it does affect performance in a cycle. It is usually up to the developer to make a trade-off between performance and code readability. Optimization always requires cost.

Section and JSON

Reflection is usually slow and should be avoided in delay-sensitive services. JSON is a common data exchange format, but Go's encoding/json library relies on reflection to serialize and deserialize json. With ffjson, we can avoid the use of reflection by using code generation. Here is a performance comparison:

BenchmarkJSONReflectionMarshal-8 200000 7063 ns/op

BenchmarkJSONMarshal-8 500000 3981 ns/opBenchmarkJSONReflectionUnmarshal-8 200000 9362 ns/op

BenchmarkJSONUnmarshal-8 300000 5839 ns/op

(ffjson) The generated JSON serialization and deserialization is about 38% faster than the reflection-based standard library. Of course, we should avoid using JSON if we want to improve the performance of encoding and decoding.

MessagePack is a better choice for serialization code. In this test, we used the msgplibrary to compare with JSON.

BenchmarkMsgpackMarshal-8 3000000 555 ns/op

BenchmarkJSONReflectionMarshal-8 200000 7063 ns/op

BenchmarkJSONMarshal-8 500000 3981 ns/opBenchmarkMsgpackUnmarshal-8 20000000 94.6 ns/op

BenchmarkJSONReflectionUnmarshal-8 200000 9362 ns/op

BenchmarkJSONUnmarshal-8 300000 5839 ns/op

The difference here is significant. Even compared with the code generated by (JSON), MessagePack is still much faster.

If we care about minor optimizations, we should also avoid using interface types, which need to do some additional processing during serialization and deserialization. In some dynamic invocation scenarios, runtime invocation will also add some additional overhead. The compiler cannot replace these calls with inline calls.

BenchmarkJSONReflectionUnmarshal-8 200000 9362 ns/op

BenchmarkJSONReflectionUnmarshalIface-8 200000 10099 ns/op

Let's take a look at calling lookup, that is, converting an interface variable to its real type. This test calls the same method of the same struct. The difference is that the second variable is a pointer to the structure.

BenchmarkStructMethodCall-8 2000000000 0.44 ns/op

BenchmarkIfaceMethodCall-8 1000000000 2.97 ns/op

Sorting is a more practical example, which shows the performance difference well. In this test, we compared 1000000 structures and 1000000 interfaces pointing to the same structure. Sorting the structure is 92% faster than sorting the interface.

BenchmarkSortStruct-8 10 105276994 ns/op

BenchmarkSortIface-8 5 286123558 ns/op

In conclusion, avoid using JSON if possible. If you need to use JSON, generate serialization and deserialization code. In general, it is better to avoid relying on reflection and interface, and instead write specific types for use. Unfortunately, this often leads to a lot of duplicate code, so it's better to generate abstract code. Third, weigh the gains and losses.

Memory Management

Go does not actually expose the heap or directly allocate the stack to users. The words "heap" and "stack" do not appear anywhere in the Go language specification. This means that stack and heap are only technically relevant. Each goroutine does have its heap and stack. The compiler can't escape analysis to determine whether objects are allocated on the stack or in the heap.

As expected, avoiding heap allocation can become the main direction of optimization. By allocating space in the stack (that is, create objects by using A {} instead of new (A)), we avoid expensive malloc calls, as shown in the test below.

BenchmarkAllocateHeap-8 20000000 62.3 ns/op 96 B/op 1 allocs/op

BenchmarkAllocateStack-8 100000000 11.6 ns/op 0 B/op 0 allocs/op

Naturally, it is faster to pass values through pointers than through objects, because the former needs to copy a unique pointer, while the latter needs to copy the entire object. The difference in the following test results is almost negligible because this difference largely depends on the type of objects being copied. Note that some compilers may optimize the compilation of this test.

BenchmarkPassByReference-8 1000000000 2.35 ns/op

BenchmarkPassByValue-8 200000000 6.36 ns/op

However, the biggest problem with heap space allocation is GC (garbage collection). If we generate many objects with short life cycles, we will trigger GC work. In this scenario, object pooling is useful. In the following test, we compared heap allocation with sync. Pool. Object pooling improves performance by 5x.

BenchmarkConcurrentStructAllocate-8 5000000 337 ns/op

BenchmarkConcurrentStructPool-8 20000000 65.5 ns/op

It should be noted that Go's sysc.The pool will also be recycled during garbage collection. The function of sync. Pool is to reuse the memory between garbage collection operations. We can also maintain our free object list so that objects will not be recycled, but this may make garbage collection lose its due role. Go's approval tool is very useful for analyzing memory usage. It must be used for analysis before blindly doing memory optimization.

False Sharing

When performance is really important, you have to start thinking at the hardware level. The famous Formula One driver Jackie Stewart once said, "You don't need to be an engineer to become a racing driver, but you must have mechanical knowledge." A deep understanding of the internal working principle of a car can make you a better driver. Similarly, understanding how computers work can make you a better programmer. For example, how is the memory laid out? How does CPU caching work? How does the hard disk work?

Memory bandwidth is still a limited resource for modern CPUs, so caching is extremely important to prevent performance bottlenecks. Today's multi-core processors cache data in a cache line, usually 64 bytes in size, to reduce the main memory access with high overhead. To ensure cache consistency, a small write to memory will make the cache line obsolete. Read operations on adjacent addresses cannot hit the corresponding cache line. This phenomenon is called false sharing. This problem becomes obvious when multiple threads access different data in the same cache line.

Imagine how a struct in the Go language is stored in memory. Let's use the previous buffer as an example. The structure may be as follows:

The queue and dequeue fields are used to determine the location of producers and consumers, respectively. The size of these fields is 8 bytes, and they are simultaneously accessed and modified by multiple threads to implement queue insertion and deletion operations. Because these fields are stored continuously in memory, they only use 16 bytes of memory, and they are likely to be stored in the same cache line. Therefore, modifying any of these fields will lead to the elimination of other field caches, which means that the next read operation will be slower. In other words, adding and deleting elements in the ring buffer will result in many CPU cache failures.

We can add padding directly to the fields of the structure. Each padding is the same size as a CPU cache line, which can ensure that the ring buffer fields are cached in different cache lines. The following is the modified structure:

How much difference will it make in actual operation? Like other optimizations, the optimization effect depends on the actual scene. It is related to the number of CPU cores, the number of resource competitions, and the layout of memory. Although there are many factors to consider, we still need to use data. We can compare ring buffers with and without padding.

First, we test the situation of a producer and a consumer. They run in the same gorouting. In this test, the difference between the two is very small, with less than 15% performance improvement:

BenchmarkRingBufferSPSC-8 10000000 156 ns/op

BenchmarkRingBufferPaddedSPSC-8 10000000 132 ns/op

However, when we have multiple producers and multiple consumers, such as 100 each, the difference will be more obvious. In this case, the populated version is about 36% faster.

BenchmarkRingBufferMPMC-8 100000 27763 ns/op

BenchmarkRingBufferPaddedMPMC-8 100000 17860 ns/op

False sharing is a very realistic problem. Add Padding to mitigate the impact of concurrency and memory contention. These numbers may seem insignificant, but they have played an optimization role, especially when considering the clock cycles.

Lock-free Sharing

Lockless data structures are very important for making full use of multi-core. Considering that Go is committed to high concurrency scenarios, it does not encourage the use of locks. It encourages more use of channels than mutexes.

This means that the standard library does provide common memory-level atomic operations, such as the atomic package. It provides atomic comparison and exchange, and atomic pointer access. However, the use of atomic packets is greatly discouraged:

We generally don’t want sync/atomic to be used at all…Experience has shown us again and again that very very few people are capable of writing correct code that uses atomic operations…If we had thought of internal packages when we added the sync/atomic package, perhaps we would have used that. Now we can’t remove the package because of the Go 1 guarantee.

How difficult is it to achieve lock-free? Is it OK to just use some CAS implementation? After learning enough knowledge, I realized that this is a double-edged sword. Lockless code can be very complex to implement. The atomic and unsafe packages are not easy to use. Moreover, writing thread-safe lock-free code is very tricky and error-prone. A simple lockless data structure like a ring buffer is relatively simple to maintain, but it is easy to cause problems in other scenarios.

Ctrie is a lock-free data structure implementation I wrote. It is detailed here. Although it is easy to understand in theory, it is very complex to implement. The main reason for complex implementation is the lack of dual CAS, which can help us automatically compare nodes (to detect node mutations on the tree), and also help us generate node snapshots. Because there is no hardware to provide such an operation, we need to simulate it ourselves.

The implementation of the first version of Ctree was very unsuccessful, not because I mistakenly used the Go synchronization mechanism, but because I made wrong assumptions about the Go language. Each node in the Ctree has a peer node associated with it. When taking a snapshot, the root node will be copied to a new node. When a node in the tree is accessed, it will also be inert and copied to a new node (persistent data structure). This snapshot operation is constant and time-consuming.

To avoid integer overflow, we use allocating objects on the heap to distinguish between old and new nodes. In the Go language, we use an empty struct. In Java, the two newly generated empty objects are different. Because their memory addresses are different, I assume that the Go rules are the same. However, the result is cruel. Please refer to the following documents:

A struct or array type has a size of zero if it contains no fields (or elements, respectively) that have a size greater than zero. Two distinct zero-size variables may have the same address in memory.So the tragedy happens. The two newly generated nodes are equal when compared, so the dual CAS always succeeds. This bug is very interesting, but tracking this bug in a high-concurrency, lock-free environment is hell. If these methods are not used correctly the first time, it will take a lot of time to solve the hidden problems later, and it is not the first time that you do it right, but it will always be right later.

But obviously, it makes sense to write complex lock-free algorithms. Otherwise, why would anyone do so? Compared with a synchronous map or jump table, the insertion operation of Ctree is more time-consuming because there are more addressing operations. The real advantage of Ctree is memory consumption. Unlike most Hash tables, it always has a series of keys in the tree. Another performance advantage is that it can complete linear snapshots in constant time. We compared the snapshot of the synchronized map and Ctree under 100 concurrent conditions:

BenchmarkConcurrentSnapshotMap-8 1000 9941784 ns/op

BenchmarkConcurrentSnapshotCtrie-8 20000 90412 ns/opUnder specific access modes, lockless data structures can provide better performance in multithreaded systems. For example, NATS Message Queuing uses a synchronized map-based data structure to complete subscription matching. If the lock-free Ctree is used, the throughput will be greatly improved. In the following figure, the blue line indicates the implementation of lock-based data structures, and the red line indicates the implementation of lock-free data structures:

Avoiding locks in specific scenarios can lead to good performance improvements. From the comparison between the ring buffer and channel, we can see the obvious advantages of a lock-free structure. However, we need to balance the complexity of coding with the benefits gained. Sometimes a lockless structure does not provide any real benefits.

Precautions for Optimization

As we can see from the above discussion, performance optimization always has costs. Understanding and understanding optimization methods is only the first step. It is more important to understand when and where they should be used. Quote a famous saying of C. A. R. Hoare, which has become a classic maxim applicable to all programmers:

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.However, the view of this sentence is not against optimization, but let us learn to balance speed - algorithm speed, response speed, maintenance speed, and system speed. This is a very subjective topic, and there is no simple standard. Is premature optimization the root of the error? Should I implement the function first and then optimize it? Or does it not need optimization at all? There is no standard answer. Sometimes it is OK to implement the function first and then increase the speed.

However, my suggestion is to optimize only critical paths. The farther you go on the critical path, the lower your optimization return will be, which is almost a waste of time. It is important to be able to make a correct judgment on whether the performance meets the standard. Don't waste time outside. It should be data-driven - speak with experience, not out of whim. The other thing is to pay attention to reality. It is meaningless to optimize tens of nanoseconds of code that is not very sensitive for some time. There are more areas to be optimized than this.

Summary

If you have read this, congratulations, but you may have made some mistakes. We have learned that in software we have two speeds - response speed and execution speed. Users want the first, developers want the second, and CTOs want both. At present, the first speed is the most important, as long as you want users to use your products. The second speed requires you to schedule and iterate. They often conflict with each other.

Perhaps more thought-provoking, we discussed some ways to improve Go's performance and make it more available in low-latency systems. Go language is born for conciseness, but this conciseness sometimes has a price. Just like the trade-off between the previous two speeds, there is also a trade-off between code maintainability and code performance. Speed often means sacrificing code simplicity, more development time, and later maintenance costs. Make a wise choice.

About WeTest

WeTest Quality Open Platform is the official one-stop testing service platform for game developers. We are a dedicated team of experts with more than ten years of experience in quality management. We are committed to the highest quality standards of game development and product quality and tested over 1,000 games.

WeTest integrates cutting-edge tools such as automated testing, compatibility testing, functionality testing, remote device, performance testing and security testing, covering all testing stages of games throughout their entire life cycle.

perfdog performance-testing

Read Previous Post >>

Understanding External Storage Features and Implementation Across Various Android Versions