I'm Morgan McGuire (@CasualEffects). I've been working on computer graphics and games for 20 years at great places including NVIDIA, Williams College, Brown University, and Activision.

See my home page for a full index of my blog posts, books, research, and projects.

Saturday, August 9, 2014

Screen Space Ray Tracing

Pretty good reflections computed by our
screen space ray tracer in ripply water
This post describes a screen space ray tracing implementation that I wrote with Mike Mara for use in our indie games and graphics research. A full paper is freely available at the Journal of Computer Graphics Techniques. The properties of this method match those sketched out by other developers for some of the latest game effects. We found it tricky to understand and get exactly right from their presentations, so we're sharing our full code and the derivation in the paper to help others implement it.

Games march 3D rays across the height field defined by a depth buffer to create very approximate screen-space reflections. You can see this technique in action in Crysis 3 and Just Cause 2. When the point where the 3D ray hits the 3D surface in camera space is discovered, the game projects that point back to 2D and uses the color of the pixel for the color of the reflection. This is obviously limited to reflections of objects that are on the screen, so most games fall back to an environment cube map when no hit point is found in the depth buffer because the ray went off screen or behind an object (see Bart Wronski's detailed explanation of failure cases).

Although it is primarily used for reflections today, this technique can do much more. Many computer graphics algorithms are based on casting rays. For example, we experimented with screen space rays for approximate reflection, refraction, glossy reflection, ambient occlusion, and even global illumination (much more efficient approximations are available for the last two).

Different effects computed by our screen-space ray tracer. The rightmost two are not real-time.

The Problem with 3D Marching

In current games, the ray march is usually limited to a small number of iterations to ensure real-time performance. Due to perspective, marching in regular steps along a 3D ray doesn't generally give 2D steps across the screen. This means that the limited number of samples are distributed poorly--especially in the case of water reflections like the ones shown in the dragon picture above. For that kind of camera and reflective surface, the ray will skip away from the surface too quickly, missing near-by objects. It will then bunch up the remainder of the samples on a few pixels, limiting how far away reflecting objects can be and wasting computation sampling the same pixels repeatedly.

The diagram below visualizes the problem. The over-sampled pixels are colored red for the 3D ray march. The under-sampled pixels are obvious gaps.

This limitation of the ray march technique is why the water reflections of distant mountains fade out in my terrain prototype video:

The 2D Marching Solution

Fast, exact thin-line and conservative 2D rasterization algorithms have been known for over 20 years. We built our ray caster by trivially extending the "Digital Differential Analyzer" (DDA) line algorithm with perspective-correct interpolation (i.e., how hardware rasterization handles interpolators) and then optimizing the implementation for the particular concerns of GPU programming, such as minimizing divergence, memory bandwidth, and peak register count. The optimized implementation is below. Our upcoming paper explains the optimizations in more detail includes extension to deep G-buffers. It also includes code for operating directly on a depth buffer instead of requiring a linear-z buffer, although we don't advise it for performance. This kind of approach has already proven reliable in production: Assassin's Creed 4 (see page 80) and Killzone 4 (page 90) presentations both describe it.

// By Morgan McGuire and Michael Mara at Williams College 2014
// Released as open source under the BSD 2-Clause License
// http://opensource.org/licenses/BSD-2-Clause
#define point2 vec2
#define point3 vec3

float distanceSquared(vec2 a, vec2 b) { a -= b; return dot(a, a); }

// Returns true if the ray hit something
bool traceScreenSpaceRay1(
 // Camera-space ray origin, which must be within the view volume
 point3 csOrig, 

 // Unit length camera-space ray direction
 vec3 csDir,

 // A projection matrix that maps to pixel coordinates (not [-1, +1]
 // normalized device coordinates)
 mat4x4 proj, 

 // The camera-space Z buffer (all negative values)
 sampler2D csZBuffer,

 // Dimensions of csZBuffer
 vec2 csZBufferSize,

 // Camera space thickness to ascribe to each pixel in the depth buffer
 float zThickness, 

 // (Negative number)
 float nearPlaneZ, 

 // Step in horizontal or vertical pixels between samples. This is a float
 // because integer math is slow on GPUs, but should be set to an integer >= 1
 float stride,

 // Number between 0 and 1 for how far to bump the ray in stride units
 // to conceal banding artifacts
 float jitter,

 // Maximum number of iterations. Higher gives better images but may be slow
 const float maxSteps, 

 // Maximum camera-space distance to trace before returning a miss
 float maxDistance, 

 // Pixel coordinates of the first intersection with the scene
 out point2 hitPixel, 

 // Camera space location of the ray hit
 out point3 hitPoint) {

    // Clip to the near plane    
    float rayLength = ((csOrig.z + csDir.z * maxDistance) > nearPlaneZ) ?
        (nearPlaneZ - csOrig.z) / csDir.z : maxDistance;
    point3 csEndPoint = csOrig + csDir * rayLength;

    // Project into homogeneous clip space
    vec4 H0 = proj * vec4(csOrig, 1.0);
    vec4 H1 = proj * vec4(csEndPoint, 1.0);
    float k0 = 1.0 / H0.w, k1 = 1.0 / H1.w;

    // The interpolated homogeneous version of the camera-space points  
    point3 Q0 = csOrig * k0, Q1 = csEndPoint * k1;

    // Screen-space endpoints
    point2 P0 = H0.xy * k0, P1 = H1.xy * k1;

    // If the line is degenerate, make it cover at least one pixel
    // to avoid handling zero-pixel extent as a special case later
    P1 += vec2((distanceSquared(P0, P1) < 0.0001) ? 0.01 : 0.0);
    vec2 delta = P1 - P0;

    // Permute so that the primary iteration is in x to collapse
    // all quadrant-specific DDA cases later
    bool permute = false;
    if (abs(delta.x) < abs(delta.y)) { 
        // This is a more-vertical line
        permute = true; delta = delta.yx; P0 = P0.yx; P1 = P1.yx; 

    float stepDir = sign(delta.x);
    float invdx = stepDir / delta.x;

    // Track the derivatives of Q and k
    vec3  dQ = (Q1 - Q0) * invdx;
    float dk = (k1 - k0) * invdx;
    vec2  dP = vec2(stepDir, delta.y * invdx);

    // Scale derivatives by the desired pixel stride and then
    // offset the starting values by the jitter fraction
    dP *= stride; dQ *= stride; dk *= stride;
    P0 += dP * jitter; Q0 += dQ * jitter; k0 += dk * jitter;

    // Slide P from P0 to P1, (now-homogeneous) Q from Q0 to Q1, k from k0 to k1
    point3 Q = Q0; 

    // Adjust end condition for iteration direction
    float  end = P1.x * stepDir;

    float k = k0, stepCount = 0.0, prevZMaxEstimate = csOrig.z;
    float rayZMin = prevZMaxEstimate, rayZMax = prevZMaxEstimate;
    float sceneZMax = rayZMax + 100;
    for (point2 P = P0; 
         ((P.x * stepDir) <= end) && (stepCount < maxSteps) &&
         ((rayZMax < sceneZMax - zThickness) || (rayZMin > sceneZMax)) &&
          (sceneZMax != 0); 
         P += dP, Q.z += dQ.z, k += dk, ++stepCount) {
        rayZMin = prevZMaxEstimate;
        rayZMax = (dQ.z * 0.5 + Q.z) / (dk * 0.5 + k);
        prevZMaxEstimate = rayZMax;
        if (rayZMin > rayZMax) { 
           float t = rayZMin; rayZMin = rayZMax; rayZMax = t;

        hitPixel = permute ? P.yx : P;
        // You may need hitPixel.y = csZBufferSize.y - hitPixel.y; here if your vertical axis
        // is different than ours in screen space
        sceneZMax = texelFetch(csZBuffer, int2(hitPixel), 0);
    // Advance Q based on the number of steps
    Q.xy += dQ.xy * stepCount;
    hitPoint = Q * (1.0 / k);
    return (rayZMax >= sceneZMax - zThickness) && (rayZMin < sceneZMax);

All of the images on this page used the new method, except for the terrain video that I used to demonstrate the old ray marching method.

Reflecting teapots in water
Our code can execute 25 iterations for every pixel on the screen at 1920x1080 in less than 1 ms on a high-end GPU like NVIDIA GeForce Titan and in about 5 ms on a low-end GPU like NVIDIA GeForce 650M (i.e., on my two-year old Macbook Pro). If you don't expect every single pixel on the screen to require a ray cast and use stride to stretch those samples over many pixels, this can provide large, high-quality reflection and refraction. The full variant of the code on this page will ship in the open source G3D Innovation Engine version 10, which we use for most of our research and indie game projects and was the framework for these experiments.

You can eliminate the division from the inner loop by performing the depth comparison in homogeneous clip space. The problem with doing so is that it is impractical to apply a depth interval to the depth buffer, since any constant interval in homogeneous clip space grows hyperbolically away from the near plane. We believe that developers who use this approach simply make the interval infinite, i.e., assume that the depth buffer is a continuous sheet. The drawback to that approach is that it makes objects that should have finite z extent appear infinite. An example is below.

The long streaks are artifacts caused by assigning infinite depth extent objects. This is much faster, but may be an unacceptable approximation for some scenes, like this one. Our recommended approach of finite depth gives the better quality result shown at the top of this article.
There are some natural extensions which can further improve quality and utility. MIP-mapping the color buffer is an inexpensive way to approximate cone-traced glossy reflections: just read from higher MIP levels when farther from the reflective object. Bilinear interpolation when sampling the color buffer smooths out jaggies (as does post-processed antialiasing, which we used for our results). For the ray cast itself, when using a large stride the final interval can be refined using binary search, as is done for linear 3D ray marching and parallax mapping. For really long ray casts a min-max MIPped depth buffer allows iterations in large steps across areas of the screen that are entirely far or near (like an oct-tree). Chris Oat and I implemented this min-max method in 2008 using Amanatides and Woo/Musgraves's heightfield algorithm for the conservative iteration. We found it too slow at the time, but I think that it should be revisited on a modern GPU.

A lot of game engines are moving to temporal filtering to reduce the cost of computing effects. This kind of filtering is hard to apply to reflection and refraction because they are view dependent. Mike and I briefly experimented with the filters in Unreal Engine 4 for screen-space reflections and did not like the quality, so we don't yet have a recommendation on how to do this well. However, we implemented some nice temporal filters for our deep G-buffer paper and include the source on that page to help others explore the problem.

Morgan McGuire (@morgan3d) is a professor of Computer Science at Williams College, visiting professor at NVIDIA Research, and a professional game developer. His most recent games are Rocket Golfing and work on the Skylanders series. He is the author of the Graphics Codex, an essential reference for computer graphics now available in iOS and Web Editions.