In mobile performance testing, the CPU and GPU usage of the phone is always a concern, and in game performance testing the GPU's running status receives the most attention. With this release, PerfDog becomes the first tool in the industry to collect detailed GPU information (Phase 1 supports Mali devices). Previously, only GPU usage and GPU frequency could be recorded; the new version of PerfDog adds Mali GPU Utilization, Mali Pixels Info, and Mali Memory & Bus Bandwidth. It exposes the detailed running state of the GPU, providing more substantial data for both targeted GPU optimization and game performance evaluation.
Below, we systematically explain what each newly added GPU performance index means and how to use these indices to analyze and optimize GPU performance.
Mali GPU Utilization comprises two performance indices: Non Fragment Utilization and Fragment Utilization. Non Fragment Utilization is the percentage of total GPU processing time spent on non-fragment processing. Fragment Utilization is the percentage of total GPU processing time spent on fragment processing.
The figure above shows the basic pipeline data paths the GPU uses to process the various types of workloads, together with the performance indices of each processing module at every level of the hierarchy. Workloads running on a Mali GPU are coordinated by the job manager, which is responsible for scheduling them onto the processing units inside the GPU. It exposes two FIFO work queues (referred to as job slots) to the graphics driver. One slot is used for Non Fragment workloads, including vertex shading, tiling, geometry shading, tessellation shading, and compute shading. The other slot is used for Fragment workloads, which mainly include rasterization, EarlyZ, FPK, fragment shading, blending, and tile write.
These two indices are used to determine whether the GPU bottleneck lies in the Non Fragment processing phase or the Fragment processing phase, and so guide the direction of program optimization. When the GPU is the bottleneck, at least one of Non Fragment Utilization and Fragment Utilization normally approaches 100%. If both remain below 100%, there may be a data dependency between the Non Fragment and Fragment workloads.
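The decision rule described above can be sketched as a small helper. This is only an illustration: the function name, the returned labels, and the 95% threshold used to approximate "approaches 100%" are assumptions, not part of PerfDog.

```python
def classify_gpu_bottleneck(non_fragment_util, fragment_util, threshold=95.0):
    """Classify the GPU bottleneck from the two utilization percentages.

    The threshold is a hypothetical stand-in for "approaches 100%".
    """
    non_fragment_bound = non_fragment_util >= threshold
    fragment_bound = fragment_util >= threshold

    if non_fragment_bound and fragment_bound:
        return "both"
    if non_fragment_bound:
        return "non-fragment"
    if fragment_bound:
        return "fragment"
    # Neither queue is saturated: if the GPU is still the limiting factor,
    # a data dependency between the two workload types is one possible cause.
    return "none"


if __name__ == "__main__":
    print(classify_gpu_bottleneck(40.0, 98.5))
    print(classify_gpu_bottleneck(70.0, 75.0))
```

A result of "none" does not prove a dependency exists; it only signals that neither job slot is saturated and the dependency explanation is worth checking.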
Causes of high Non Fragment Utilization and optimization suggestions:
a) Too many vertices:
b) Large amounts of vertex attribute data. Optimization suggestions: use medium-precision attributes and remove unused attributes.
c) Overly complex vertex shader. Optimization suggestions: avoid sampling textures in the vertex shader and prefer low-precision variables for calculations.
d) Use of complex compute shaders, geometry shaders, or tessellation shaders.
Causes of high Fragment Utilization and optimization suggestions:
a) Too many fragments:
b) Too complex fragment Shader:
a) PixelThroughput is the average number of GPU cycles consumed per shaded pixel, including both Non Fragment and Fragment processing cycles. For example, assume that the GPU maximum frequency is 800 MHz, GPU usage is 100%, the game runs at a resolution of 1080 * 2340, and FPS is 60.
b) Shaded pixels per second (ignoring OverDraw) = 1080 * 2340 * 60 ≈ 151.6M.
c) PixelThroughput = 800M / 151.6M ≈ 5.28 cycles.
d) That is, under these conditions it takes about 5.28 cycles on average to render each pixel.
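The arithmetic in the example above can be reproduced with a few lines of Python. This is a minimal sketch of the calculation as described here, not PerfDog's implementation; the function and parameter names are illustrative, and it assumes no OverDraw, as the example does.

```python
def pixel_throughput(gpu_freq_hz, gpu_usage, width, height, fps):
    """Average GPU cycles spent per shaded pixel.

    gpu_usage is a fraction in [0, 1]. OverDraw is ignored, so the number
    of shaded pixels per second is simply width * height * fps.
    """
    shaded_pixels_per_second = width * height * fps
    busy_cycles_per_second = gpu_freq_hz * gpu_usage
    return busy_cycles_per_second / shaded_pixels_per_second


if __name__ == "__main__":
    # Values from the example: 800 MHz, 100% usage, 1080 * 2340 at 60 FPS.
    cycles = pixel_throughput(800e6, 1.0, 1080, 2340, 60)
    print(f"{cycles:.2f} cycles per pixel")
```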
Because this index measures the average number of GPU cycles consumed per shaded pixel, it is usually related to the complexity of the vertex shader or fragment shader. Non Fragment Utilization and Fragment Utilization can be used to determine which part is the bottleneck. If Fragment processing is the bottleneck, the fragment shader in the current scene is complex, so each pixel takes many cycles to process.
a) Test case: keeping the vertex count and fragment count unchanged, use a loop statement and its iteration count to control the amount of computation in the VS and FS, in order to verify the effect on PixelThroughput.
b) Test model: Oppo R15 (GPU model: Mali G72 MP3).
c) Test data:
d) Test conclusion: the higher the loop count in the shader, the greater the amount of computation and the higher the PixelThroughput.
a) OverDraw is the number of times the same pixel is drawn repeatedly within one frame. The OverDraw reported by PerfDog is the average OverDraw per second, i.e.
b) OverDraw = Shaded Fragments per second / Screen Pixels per second.
c) Assuming that the game runs at 60 FPS, the resolution is 1080 * 2340, and the number of shaded fragments per second is 273M, then OverDraw = 273,000,000 / (1080 * 2340 * 60) ≈ 1.8.
d) The equation shows that, at a fixed resolution and frame rate, the higher the OverDraw, the more fragments are processed per frame and the higher the load. Once the load exceeds the GPU's maximum processing capacity, the frame rate drops.
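The OverDraw formula above is easy to check numerically. The following is a minimal sketch using the figures from the example; the function name and parameters are illustrative and not a PerfDog API.

```python
def overdraw(shaded_fragments_per_second, width, height, fps):
    """Average OverDraw: shaded fragments divided by screen pixels, per second."""
    screen_pixels_per_second = width * height * fps
    return shaded_fragments_per_second / screen_pixels_per_second


if __name__ == "__main__":
    # Values from the example: 273M shaded fragments/s, 1080 * 2340 at 60 FPS.
    print(f"OverDraw = {overdraw(273e6, 1080, 2340, 60):.1f}")
```

An OverDraw of exactly 1.0 would mean every screen pixel is shaded once per frame; anything above that is repeated shading of the same pixel.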
a) When analyzing GPU performance, first locate the performance bottleneck. If Fragment Utilization rises suddenly during a certain period, there are two possible causes:
b) Causes of high OverDraw and optimization suggestions:
c) Suggestions for reducing OverDraw caused by AlphaTest objects:
d) Suggestions for reducing OverDraw caused by AlphaBlend objects:
Test case 1: verify the effect on OverDraw of rendering overlapping AlphaBlend objects.
a) Draw 20 full-screen quads stacked from near to far in front of the camera. A slider sets N (0 < N <= 20): the N layers nearest the camera are translucent full-screen quads and the remaining (20 - N) layers are opaque full-screen quads, in order to verify whether the Fragments/pixel value matches the definition of OverDraw.
b) Test model: Oppo R15 (GPU model: Mali G72 MP3).
c) Test data:
Test conclusion: because the GPU implements the EarlyZ and FPK mechanisms, occluded opaque triangles are culled and therefore do not increase OverDraw. Translucent triangles, however, do not write depth, so an upper translucent layer does not occlude the layers beneath it and each translucent layer adds to OverDraw. As a result, OverDraw is not proportional to the number of opaque triangle layers.
Test case 2: verify the effect of the rendering and sorting mode of AlphaTest objects on OverDraw.
a) Place 50 2D billboard grass objects in the scene, then adjust their rendering and sorting mode and measure OverDraw.
b) Test model: Oppo R15 (GPU model: Mali G72 MP3).
c) Test data:
BusRead and BusWrite denote the number of bytes per second that the GPU reads from and writes to external shared memory over the system bus. Reading from and writing to external DDR memory is very power-hungry for the GPU: as a rule of thumb, each GB/s of bandwidth consumes about 100 mW. In addition, accessing external memory has higher latency than accessing the GPU's internal caches.
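To get a feel for what the rule of thumb above implies, the power cost of a measured bandwidth can be estimated as follows. This is a rough sketch only: the function name is hypothetical, 1 GB is taken as 10^9 bytes, and the 100 mW per GB/s figure is the general estimate quoted above rather than a measured value for any specific device.

```python
def estimated_bus_power_mw(bus_read_bytes_per_s, bus_write_bytes_per_s,
                           mw_per_gb_per_s=100.0):
    """Rough DDR access power estimate from BusRead/BusWrite bandwidth.

    Assumes 1 GB = 1e9 bytes and a flat cost per GB/s (rule of thumb).
    """
    total_gb_per_s = (bus_read_bytes_per_s + bus_write_bytes_per_s) / 1e9
    return total_gb_per_s * mw_per_gb_per_s


if __name__ == "__main__":
    # Hypothetical sample: 1.5 GB/s read plus 0.5 GB/s write.
    print(estimated_bus_power_mw(1.5e9, 0.5e9), "mW")
```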
BusRead bandwidth is mainly generated by three GPU processing units: the Load/Store Unit, the Texture Unit, and the Tile Unit. They read vertex input attribute data, uniform data, TileList data, texture data, and color/depth data. The BusRead size depends on the amount of data these units read per second and on the hit rates of the L1 and L2 caches. For a fixed total amount of data, the higher the cache hit rate, the smaller the BusRead.
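The relationship between cache hit rate and BusRead can be illustrated with a deliberately simplified model. It assumes that every request first goes to L1, that L1 misses go to L2, and that only L2 misses reach external memory; real hardware behaviour (cache line granularity, prefetching, per-unit caches) is more complicated, so treat the function below as a hypothetical teaching aid rather than a description of the Mali memory system.

```python
def estimated_bus_read(requested_bytes_per_s, l1_hit_rate, l2_hit_rate):
    """Toy model: only requests that miss both L1 and L2 reach the bus.

    Hit rates are fractions in [0, 1].
    """
    l1_misses = requested_bytes_per_s * (1.0 - l1_hit_rate)
    l2_misses = l1_misses * (1.0 - l2_hit_rate)
    return l2_misses


if __name__ == "__main__":
    # Same requested data, two different L2 hit rates.
    print(estimated_bus_read(1000.0, 0.5, 0.5))
    print(estimated_bus_read(1000.0, 0.5, 0.75))
```

With the total requested data held constant, raising either hit rate lowers the estimated BusRead, which is the trend described above.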
BusWrite bandwidth is mainly generated by two GPU processing units: the Load/Store Unit and the Tile Unit. They write out vertex output attribute data, TileList data, and color/depth data.
Causes of high BusRead bandwidth and optimization suggestions:
a) Vertex attribute bandwidth:
b) Texture bandwidth
c) Color/depth buffer bandwidth
Causes of high BusWrite bandwidth and optimization suggestions:
a) Vertex output attribute bandwidth
b) Tile List bandwidth
c) Color/depth buffer bandwidth
Test case 1: verify the effect of the FrameBuffer color buffer format on memory bandwidth:
a) Render one full-screen quad and change only the color buffer format, to verify the effect on memory bandwidth.
b) Test model: Oppo R15 (GPU model: Mali G72 MP3).
c) Test data:
Test conclusion: when the color buffer uses the 24-bit format instead of the 16-bit format, both BusRead and BusWrite bandwidth increase noticeably: BusRead rises by 24% and BusWrite by 28%.
Test case 2: verify the effect on memory bandwidth of attaching a depth buffer to the FrameBuffer:
a) Render one full-screen quad and change only whether a depth buffer is attached, to verify the effect on memory bandwidth.
b) Test model: Oppo R15 (GPU model: Mali G72 MP3).
c) Test data:
Test conclusion: compared with having no depth buffer attached, both BusRead and BusWrite bandwidth increase, because the attached depth buffer introduces depth reads and writes.
Test case 3: verify the effect of texture filtering method on memory bandwidth
a) Render one full-screen quad and only modify the texture filtering method to verify the effect on memory bandwidth.
b) Test model: Oppo R15 (GPU model: Mali G72 MP3).
c) Test data:
Test conclusion: because trilinear filtering samples two mipmap levels, both L2Texture bandwidth and BusRead bandwidth increase.