In the world of computer graphics, GPUs are these little black boxes that we always talk to through layers and layers of software abstraction. In this regard GPUs are nothing like CPUs - the latter are exposed down to their clockwork - instructions, clocks, and even errata. Not so with GPUs - they're locked boxes keeping a wealth of arcane IP (intellectual property). Getting close and intimate with a GPU is not unlike reading a black box.
So it's a bit of fortune's smile in the world of GPUs (let alone SoC GPUs) when we get something like the famous (or notorious, depending on how you look at it) AMD_performance_monitor extension in an actually usable form. Or a form close to usable, anyway.
AMD_performance_monitor from the GL stack of our Z430 offers some 807 performance counters, arranged in 14 groups by category, or sub-system of the GPU. A great deal of those counters carry little useful information to anybody but a driver programmer, yet a select few of the remaining counters can be of great use to us - for profiling purposes, or plain black-box reading. Unfortunately for us, we also have to figure out which those counters are. For that we have to rely on the string names of the individual counters and their respective groups.
Here I'll take the opportunity to express my gratitude to the authors of the current iMX515 GL stack, for not deliberately obfuscating the names of the AMD_performance_monitor counters. *shakes fist in the general direction of a few desktop GL stacks*.
Ok, enough ado. Let's get to business.
The last, 14th group of performance counters available in the 2010.07.11 stack is dubbed 'RB_SAMPLES', which stands for 'Render-Block Samples'. Here's its human-readable query:
group 13: RB_SAMPLES, id: 13
number of counters: 4
max number of active counters: 4
counter 0: RB_TOTAL_SAMPLES, id: 0, type: UNSIGNED_INT64_AMD, range: [0, 2^64 - 1]
counter 1: RB_ZPASS_SAMPLES, id: 1, type: UNSIGNED_INT64_AMD, range: [0, 2^64 - 1]
counter 2: RB_ZFAIL_SAMPLES, id: 2, type: UNSIGNED_INT64_AMD, range: [0, 2^64 - 1]
counter 3: RB_SFAIL_SAMPLES, id: 3, type: UNSIGNED_INT64_AMD, range: [0, 2^64 - 1]
Remember last time when we meant to count some pixels for the sake of measuring fillrate workloads, and we had to use basic shapes to make their pixel area easily computable? Well, the above counters give us the exact pixel stats from the POV of the GPU. Hello there, definitive pixel stats!
So, in order of appearance:
- RB_TOTAL_SAMPLES: pixels output by the rasterizer block, for the duration of the monitor session
- RB_ZPASS_SAMPLES: of the above, pixels that passed the depth-test
- RB_ZFAIL_SAMPLES: the complement of the above
- RB_SFAIL_SAMPLES: pixels that failed the stencil test
In a scenario where we have both depth- and stencil-testing in place, our visible pixels would be RB_ZPASS_SAMPLES - RB_SFAIL_SAMPLES. Without stencil testing, our visible pixels would be RB_ZPASS_SAMPLES. In that case the sum of RB_ZPASS_SAMPLES and RB_ZFAIL_SAMPLES should be equal to RB_TOTAL_SAMPLES, while RB_SFAIL_SAMPLES should stay at a constant zero.
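The bookkeeping above can be written down as a small check. This is a minimal sketch in Python, not something taken from the test code; the function name `visible_samples` and its arguments are hypothetical, and the formulas are only the interpretation of the counter names given above, not anything confirmed by vendor documentation.

```python
def visible_samples(total, zpass, zfail, sfail, stencil_enabled):
    """Estimate visible samples from the four RB_SAMPLES counter values.

    Follows the interpretation given in the text; the counter semantics
    are inferred from their names, not from vendor documentation.
    """
    if not stencil_enabled:
        # Without stencil testing every sample either passes or fails the
        # depth test, and the stencil-fail counter should stay at zero.
        if zpass + zfail != total:
            raise ValueError("ZPASS + ZFAIL does not add up to TOTAL")
        if sfail != 0:
            raise ValueError("SFAIL is non-zero with stencil testing off")
        return zpass
    # With both tests in place, the estimate from the text is ZPASS - SFAIL.
    return zpass - sfail


# Counter values as reported by the overdraw trace in this article
# (no depth test, no stencil test).
print(visible_samples(262144000, 262144000, 0, 0, stencil_enabled=False))
```

Raising on a broken invariant, rather than quietly returning a number, seems the safer choice here: if the sums ever stop adding up, that is exactly the kind of black-box behaviour we would want to hear about.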
So, let's see what our new favorite pixel-counting facility would show under some basic scenarios. First, the little draw-call analysis primer from last time - the one that does N full-viewport draws per frame, in a viewport of, say, 256x256:
tracing AMD_performance_monitor counters:
group id: 13, counter id: 0
group id: 13, counter id: 1
group id: 13, counter id: 2
group id: 13, counter id: 3
total frames rendered: 1000
draw calls per frame: 4
traced AMD_performance_monitor counters:
group id: 13, counter id: 0, value: 262144000
group id: 13, counter id: 1, value: 262144000
group id: 13, counter id: 2, value: 0
group id: 13, counter id: 3, value: 0
What do you know - it works as advertised: 256^2 * 4 * 1000 = 262144000.
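If you would rather not eyeball such traces, the comparison is easy to script. Below is a minimal Python sketch that parses trace lines of the form shown above and checks the total against the expected pixel count. The regular expression and the `parse_trace` helper are my own assumptions about the log format, based only on the lines printed in this article.

```python
import re

# Trace output copied from the run above.
TRACE = """\
group id: 13, counter id: 0, value: 262144000
group id: 13, counter id: 1, value: 262144000
group id: 13, counter id: 2, value: 0
group id: 13, counter id: 3, value: 0
"""

# Matches only the lines that carry a value; the "tracing ..." header
# lines (which have no value field) are skipped.
LINE = re.compile(r"group id: (\d+), counter id: (\d+), value: (\d+)")


def parse_trace(text):
    """Map (group id, counter id) -> counter value for each traced line."""
    return {(int(g), int(c)): int(v) for g, c, v in LINE.findall(text)}


counters = parse_trace(TRACE)

# Expected total: 256x256 viewport, 4 full-viewport draws per frame, 1000 frames.
expected = 256 * 256 * 4 * 1000

print(counters[(13, 0)] == expected)
```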
Let's now try something a tad more complex - our original complex-shader iMX515 test - the one that draws a rotating sphere with some fancy and overly-expensive bump mapping:
tracing AMD_performance_monitor counters:
group id: 13, counter id: 0
group id: 13, counter id: 1
group id: 13, counter id: 2
group id: 13, counter id: 3
total frames rendered: 100
traced AMD_performance_monitor counters:
group id: 13, counter id: 0, value: 46766800
group id: 13, counter id: 1, value: 46766800
group id: 13, counter id: 2, value: 0
group id: 13, counter id: 3, value: 0
For a viewport of 512x512, an orthographically-projected unit-radius sphere spanning NDC space should have a pixel area of R^2 * Pi = 256^2 * Pi = 205887 pixels, or 20588700 pixels over 100 frames. Even if we consider that an approximate number, due to the non-trivial shape and the employed edge rasterization rules, our pixel counter still reports 2.27 times as much. Something is well off here. But how come the counter works down to the single pixel in the first test, and is so off in this second test? The differences between those two tests are negligible outside of the shaders - neither primer does depth- or stencil-testing. But one of them (the viewport overdraw test) also does not clear the color buffer, while the spinning-sphere one clears its color buffer every frame for the sake of the animation. Could it be?.. Hmm.
For a viewport of 512x512, a clear-viewport operation covers exactly 262144 pixels. If we add those to our expected sphere area, we get 468031, over 100 frames - 46803100 pixels. That's suspiciously close to what the performance counter reports as total pixels coming out of the rasterizer (and passing depth-testing) - 46766800. Time to return to our easy-to-calculate-pixel-area viewport overdraw test.
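For those who like to see the arithmetic run, here is the sphere estimate as a short Python sketch. The constant names are mine, and the only inputs are the numbers quoted above: a 512x512 viewport, 100 frames, and the reported RB_TOTAL_SAMPLES value.

```python
import math

VIEWPORT = 512        # viewport side, in pixels
FRAMES = 100
REPORTED = 46766800   # RB_TOTAL_SAMPLES from the sphere trace

# A unit-radius sphere spanning NDC space under orthographic projection
# covers a disc whose radius is half the viewport side.
sphere_px = round(math.pi * (VIEWPORT / 2) ** 2)
clear_px = VIEWPORT * VIEWPORT

without_clear = sphere_px * FRAMES
with_clear = (sphere_px + clear_px) * FRAMES

print(sphere_px, without_clear, with_clear)
print(f"reported / without_clear = {REPORTED / without_clear:.2f}")
print(f"relative gap with clear  = {abs(with_clear - REPORTED) / REPORTED:.2%}")
```

The leftover gap once the clear is added in is well under a tenth of a percent, which is about what one would expect from approximating a rasterized disc by its ideal area.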
Let's tweak it so that each of its draws covers 1/4 of a 512x512 viewport (i.e. 256x256 pixels), but this time clear the color buffer too.
tracing AMD_performance_monitor counters:
group id: 13, counter id: 0
group id: 13, counter id: 1
group id: 13, counter id: 2
group id: 13, counter id: 3
total frames rendered: 1000
draw calls per frame: 4
traced AMD_performance_monitor counters:
group id: 13, counter id: 0, value: 524288000
group id: 13, counter id: 1, value: 524288000
group id: 13, counter id: 2, value: 0
group id: 13, counter id: 3, value: 0
Surprise surprise. 512x512 + 256x256 * 4 = 524288 pixels per frame, over 1000 frames - 524288000 pixels.
So what did we just observe? That the pixel counters from the RB_SAMPLES group also count our clear-buffer operation toward the total count, and also toward the depth-test-passed samples.
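Put as a tiny model, the observation looks like this. It is a Python sketch under the assumption that a clear contributes exactly one full viewport of samples per frame, which is what the two overdraw runs suggest but which I have not verified beyond them; the function name and parameters are hypothetical.

```python
def expected_total_samples(viewport_w, viewport_h, draw_w, draw_h,
                           draws_per_frame, frames, clear_per_frame):
    """Predict RB_TOTAL_SAMPLES for axis-aligned rectangular draws.

    clear_per_frame: True if the color buffer is cleared once per frame,
    in which case the clear is assumed to add one full viewport of samples.
    """
    per_frame = draw_w * draw_h * draws_per_frame
    if clear_per_frame:
        per_frame += viewport_w * viewport_h
    return per_frame * frames


# First overdraw run: 256x256 viewport, full-viewport draws, no clear.
run_a = expected_total_samples(256, 256, 256, 256, 4, 1000,
                               clear_per_frame=False)
# Tweaked run: 512x512 viewport, quarter-viewport draws, clear each frame.
run_b = expected_total_samples(512, 512, 256, 256, 4, 1000,
                               clear_per_frame=True)

print(run_a == 262144000, run_b == 524288000)
```

Both predictions match the traced RB_TOTAL_SAMPLES values to the pixel, so when using these counters for fillrate measurements it is probably wise to subtract the clear, or to keep it outside of the monitor session.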
Wild, unsubstantiated speculation:
Z430 does not have dedicated buffer-clear logic. I would go even further and suggest that the part does not have dedicated ROPs either, but everything is done in the shader ALUs. Something that is not unheard of among mobile SoC GPUs.
/wild, unsubstantiated speculation
So, did our superficial touch to the AMD_performance_monitor extension on the iMX515 pay off? I'd say yes. Imagine what it would be if we actually knew how to use even a tenth of the remaining 803 performance counters ; )
For example code on using the AMD_performance_monitor extension, an exhaustive query of all counters, and perhaps a stepping stone for running some tests on your own - see the test-es primer.