Wednesday, August 14, 2013

Z-Prepass Considered Irrelevant

The G3D Innovation Engine directly supports easy switching between forward, forward+, and deferred rendering. Most programs, including the starter sample, begin rendering with a z-prepass regardless of which mode they are operating in. This ensures that (except for translucent surfaces and partial coverage) shading occurs once per sample. The z-prepass operation is a depth-only render pass that bypasses the fragment shader (except when a surface has an alpha mask), and it processes fragment-bound scenes in about 1/3 the time of a regular shading pass on NVIDIA GPUs, which have special fixed-function support for depth-only rendering.



The problem with a z-prepass is that it requires submitting the entire scene an extra time to the GPU. This brings the z-prepass into question as a performance optimization. Specifically:

Is doubling the cost of transformation, tessellation, and rasterizer setup less than the cost of overshading?

The prepass involves traversal of the CPU surface array, which is nontrivial in G3D because there are many surface subclasses with their own rendering strategies and shaders. It also can involve fairly heavy vertex transformations for skinning, geometry shading, and a lot of rasterizer setup in the case where a tessellation shader is enabled. Finally, along the way are an awful lot of state changes for passing shader parameters (bindless or uniform blocks are ways of minimizing these). These are all common concerns in rendering engines on multiple platforms and APIs.
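The trade-off in the question above can be written down as a toy cost model. This is only a sketch to make the inequality explicit, not a measurement of G3D: the function name, its parameters, and every number fed into it are hypothetical, and it assumes the costs simply add (no CPU/GPU overlap).

```python
def prepass_pays_off(front_end_ms, depth_raster_ms, shade_ms, overshade_fraction):
    """Predict whether a z-prepass lowers frame time under an additive cost model.

    All parameters are hypothetical inputs, not values measured in the post.
    front_end_ms:       cost of submitting the scene once (CPU traversal, state
                        changes, vertex work, tessellation, rasterizer setup)
    depth_raster_ms:    extra depth-only rasterization cost of the prepass
    shade_ms:           cost of shading every visible sample exactly once
    overshade_fraction: extra shading done without a prepass, as a fraction
                        of shade_ms (near 0 for well-sorted scenes)
    """
    # With a prepass the front end runs twice, but nothing is overshaded.
    with_prepass = 2 * front_end_ms + depth_raster_ms + shade_ms
    # Without it the front end runs once, but hidden fragments may be shaded.
    without_prepass = front_end_ms + shade_ms * (1 + overshade_fraction)
    return with_prepass < without_prepass
```

In this model the prepass wins only when `front_end_ms + depth_raster_ms` is smaller than `shade_ms * overshade_fraction`, which is why a heavy front end and a well-sorted scene both push against it.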

G3D sorts opaque surfaces for front-to-back rendering. That generates much of the value of a z-prepass. G3D also spends a substantial amount of its per-pixel shading (even for forward rendering) in full-screen post-processing passes such as ambient obscurance, bloom, color grading, depth of field, and motion blur that are not improved by a z-prepass.
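As a rough illustration of front-to-back ordering, the sketch below sorts surfaces by the distance of a bounding-sphere center along the camera's view direction. The `Surface` class and `sort_front_to_back` function are hypothetical stand-ins and are not G3D's API; G3D's actual sort key may differ.

```python
from dataclasses import dataclass


@dataclass
class Surface:
    # Hypothetical stand-in for an engine surface; not a G3D class.
    name: str
    center: tuple  # world-space bounding-sphere center (x, y, z)


def sort_front_to_back(surfaces, camera_position, camera_forward):
    """Return surfaces ordered nearest-first along the view direction.

    camera_forward is assumed to be a unit vector in world space.
    """
    def view_depth(surface):
        # Project the camera-to-surface vector onto the view direction.
        return sum(
            (c - p) * f
            for c, p, f in zip(surface.center, camera_position, camera_forward)
        )

    return sorted(surfaces, key=view_depth)
```

Drawing opaque surfaces in this order lets the depth test reject most hidden fragments before they are shaded, which is the same benefit a z-prepass aims for.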

I performed a quick test by simply removing the z-prepass from the system in forward+ mode. This means that the first rendering pass is a G-buffer pass that writes to multiple render targets simultaneously, forgoing the depth-only z-prepass. I tested on Windows 7 64-bit with a low-end GeForce 650M (in a 2012 MacBook Pro) on four scenes at 720p + 64-pixel guard band: Crytek Sponza, the ATCS Quake3 map from Tremulous, a Minecraft model, and the G3D smoke test that uses all G3D model subclasses and many draw calls. ATCS and the Minecraft model have relatively high depth complexity (and thus stand to benefit the most from a z-prepass), but they also have a lot of materials and thus many draw calls. Sponza and the smoke test have low depth complexity. All had ambient occlusion off and few skinned characters, to test the worst case for disabling z-prepass. I measured performance by looking at the full-frame rendering time ("1/fps") with vsync off.


For each of these scenes, there was no significant performance difference with or without a z-prepass. Sponza and ATCS rendered in 16 ms (61 fps). Minecraft was 38 ms with a prepass and 37 ms without it. The smoke test took 26 ms in both cases.
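To put the one measurable difference in perspective, the arithmetic can be sketched as below. The helper names are my own; the only inputs taken from the post are the 38 ms and 37 ms Minecraft timings.

```python
def frames_per_second(frame_ms):
    """Convert a full-frame time in milliseconds to frames per second."""
    return 1000.0 / frame_ms


def relative_saving(with_prepass_ms, without_prepass_ms):
    """Fraction of frame time saved by dropping the prepass."""
    return (with_prepass_ms - without_prepass_ms) / with_prepass_ms


# Minecraft scene timings reported in the post.
minecraft_saving = relative_saving(38.0, 37.0)  # roughly 0.026, i.e. under 3%
```

A change of under 3% on a single scene is within what I would treat as noise for this kind of quick test.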

This left me with the conclusion that the complexity of the z-prepass in my system was not justified: the small amount of overshading that it eliminated was almost exactly offset by the cost of submitting the scene a second time. In other words, the z-prepass may be irrelevant in modern rendering systems that submit many draw calls for well-sorted objects, and is potentially harmful as tessellation (and thus rasterizer setup) and skinning workloads increase. For a renderer that doesn't perform particularly good front-to-back sorting (because it uses large meshes, for example), has a lot of alpha-testing, or in which the front half of the pipeline is relatively lightweight, z-prepass may still be important.

One caveat is that G3D actually uses two guard bands: a 64-pixel one for depth (used to provide samples for SAO) and a 16-pixel one for color (for screen-space refraction, motion blur, and depth of field), as shown below:


When rendering the forward+ G-buffer without a prepass, it would be wasteful to compute per-pixel properties other than depth in the purple "trim band." So, I made the G-buffer shaders return immediately (but not discard) if the fragment coordinate is inside that band. This substantially reduces bandwidth in those regions and slightly reduces the total amount of computation. The test results in this post were measured before this optimization, but I expect it to now make running without the z-prepass actually faster.
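The early-out test can be sketched on the CPU as below. This is an illustration rather than the actual G-buffer shader code: the function names are hypothetical, and it assumes that the framebuffer dimensions already include the 64-pixel depth guard band on every side and that the trim band is the part of that band lying outside the 16-pixel color guard band.

```python
def in_trim_band(x, y, width, height, depth_guard=64, color_guard=16):
    """Return True if pixel (x, y) needs depth only, not color properties.

    width and height are assumed to include the depth guard band on all sides.
    """
    # The trim band is the outer ring covered by the depth guard band
    # but not by the narrower color guard band.
    trim = depth_guard - color_guard
    return x < trim or y < trim or x >= width - trim or y >= height - trim


def trim_band_pixel_count(width, height, depth_guard=64, color_guard=16):
    """Count the pixels that fall inside the trim band."""
    trim = depth_guard - color_guard
    inner = (width - 2 * trim) * (height - 2 * trim)
    return width * height - inner
```

Under those assumptions, a 720p image with a 64-pixel guard band is 1408 × 848 pixels, and roughly 17% of them lie in the trim band, so skipping the color outputs there is a meaningful bandwidth saving.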



Morgan McGuire is a professor of Computer Science at Williams College and a professional game developer. He is the author of The Graphics Codex, an essential reference for computer graphics that runs on iPhone, iPad, and iPod Touch.