Tuesday, March 14, 2017

Stingray Renderer Walkthrough #8: stingray-renderer & mini-renderer

Stingray Renderer Walkthrough #8: stingray-renderer & mini-renderer

Introduction

In the last post we looked at our systems for doing data-driven rendering in Stingray. Today I will go through the two default rendering pipes we ship as templates with Stingray. Both are entirely described in data using two render_config files and a bunch of shader_source files.

We call them the “stingray renderer” and the “mini renderer”

Stingray Renderer

The “stingray renderer” is the default rendering pipe and is used in almost all template and sample projects. It’s a fairly standard “high-end” real-time rendering pipe and supports the regular buzzword features.

The render_config file is approx 1500 lines of sjson. While 1500 might sound a bit massive it’s important to remember that this configuration is highly configurable, pretty much all features can be dynamically switched on/off. It also run on a broad variety of different platforms (mobile -> consoles -> high-end PC), supports a bunch of different debug visualization modes, and features four different stereo rendering paths in addition to the default mono path.

If you are interested in taking a closer look at the actual implementation you can download stingray and you’ll find it under core/stingray_renderer/renderer.render_config.

Going through the entire file and all the implementation details would require multiple blog posts, instead I will try to do a high-level break down of the default layer_configuration and talk a bit about the feature set. Before we begin, please keep in mind that this rendering pipe is designed to handle lots of different content and run on lots of different platforms. A game project would typically use it as a base and then extend, optimize and simplify it based on the project specific knowledge of the content and target platforms.

Here’s a somewhat simplified dump of the contents of the layer_configs/default array found in core/stingray_renderer/renderer.render_config in Stingray v1.8:

// run any render_config_extensions that have requested to insert work at the insertion point named "first"
{ extension_insertion_point = "first" }

// kick resource generator for rendering all shadow maps
{ resource_generator="shadow_mapping" profiling_scope="shadow mapping" }

// kick resource generator for assigning light sources to clustered shading structure
{ resource_generator="clustered_shading" profiling_scope="clustered shading" }

// special layer, only responsible for clearing hdr0, gbuffer2 and the depth_stencil_buffer
{ render_targets=["hdr0", "gbuffer2"] depth_stencil_target="depth_stencil_buffer" 
    clear_flags=["SURFACE", "DEPTH", "STENCIL"] profiling_scope="clears" }      

// if vr is supported kick a resource generator laying down a stencil mask to reject pixels outside of the lens shape
{ type="static_branch" platforms=["win"] render_settings={ vr_supported=true }
    pass = [
        { resource_generator="vr_mask" profiling_scope="vr_mask" }
    ]
}

// g-buffer layer, bulk of all materials renders into this
{ name="gbuffer" render_targets=["gbuffer0", "gbuffer1", "gbuffer2", "gbuffer3"] 
    depth_stencil_target="depth_stencil_buffer" sort="FRONT_BACK" profiling_scope="gbuffer" }

{ extension_insertion_point = "gbuffer" }

// linearize depth into a R32F surface
{ resource_generator="stabilize_and_linearize_depth" profiling_scope="linearize_depth" }

// layer for blending decals into the gbuffer0 and gbuffer1
{ name="decals" render_targets=["gbuffer0" "gbuffer1"] depth_stencil_target="depth_stencil_buffer" 
    profiling_scope="decal" sort="EXPLICIT" }

{ extension_insertion_point = "decals" }

// generate and merge motion vectors for non written pixels with motion vectors in gbuffer
{ type="static_branch" platforms=["win", "xb1", "ps4", "web", "linux"]
    pass = [
        { resource_generator="generate_motion_vectors" profiling_scope="motion vectors" }
    ]
}

// render localized reflection probes into hdr1
{ name="reflections" render_targets=["hdr1"] depth_stencil_target="depth_stencil_buffer" 
    sort="FRONT_BACK" profiling_scope="reflections probes" }

{ extension_insertion_point = "reflections" }

// kick resource generator for screen space reflections
{ type="static_branch" platforms=["win", "xb1", "ps4"]
    pass = [
        { resource_generator="ssr_reflections" profiling_scope="ssr" }
    ]
}

// kick resource generator for main scene lighting
{ resource_generator="lighting" profiling_scope="lighting" }
{ extension_insertion_point = "lighting" }

// layer for emissive materials
{ name="emissive" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer" 
    sort="FRONT_BACK" profiling_scope="emissive" }

// kick debug visualization
{ type="static_branch" render_caps={ development=true }
    pass=[
        { resource_generator="debug_visualization" profiling_scope="debug_visualization" }
    ]
}

// kick resource generator for laying down fog 
{ resource_generator="fog" profiling_scope="fog" }

// layer for skydome rendering
{ name="skydome" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer" 
    sort="BACK_FRONT" profiling_scope="skydome" }
{ extension_insertion_point = "skydome" }

// layer for transparent materials 
{ name="hdr_transparent" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer" 
    sort="BACK_FRONT" profiling_scope="hdr_transparent" }
{ extension_insertion_point = "hdr_transparent" }

// kick resource generator for reading back any requested render targets / buffers to the CPU
{ resource_generator="stream_capture_buffers" profiling_scope="stream_capture" }

// kick resource generator for capturing reflection probes
{ type="static_branch" platform=["win"] render_caps={ development=true }
    pass = [
        { resource_generator="cubemap_capture" }
    ]
}

// layer for rendering object selections from the editor
{ type="static_branch" platforms=["win", "ps4", "xb1"]
    pass = [
        { type = "static_branch" render_settings={ selection_enabled=true }
            pass = [
                { name="selection" render_targets=["gbuffer0" "ldr1_dev_r"] 
                    depth_stencil_target="depth_stencil_buffer_selection" sort="BACK_FRONT" 
                    clear_flags=["SURFACE" "DEPTH"] profiling_scope="selection"}
            ]
        }
    ]
}

// kick resource generators for AA resolve and post processing
{ resource_generator="post_processing" profiling_scope="post_processing" }
{ extension_insertion_point = "post_processing" }

// layer for rendering LDR materials, primarily used for rendering HUD and debug rendering
{ name="transparent" render_targets=["output_target"] depth_stencil_target="stable_depth_stencil_buffer_alias" 
    sort="BACK_FRONT" profiling_scope="transparent" }

// kick resource generator for rendering shadow map debug overlay
{ type="static_branch" render_caps={ development=true }
    pass = [
        { resource_generator="debug_shadows" profiling_scope="debug_shadows" }
    ]
}

// kick resource generator for compositing left/right eye
{ type="static_branch" platforms=["win"] render_settings={ vr_supported=true }
    pass = [
        { resource_generator="vr_present" profiling_scope="present" }
    ]
}

{ extension_insertion_point = "last" }

So what we have above is a fairly standard breakdown of a rendered frame, if you have worked with real-time rendering before there shouldn’t be much surprises in there. Something that is kind of cool with having the frame flow in this representation and pairing that with the hot-reloading functionality of render_configs, is that it really encourages experimentations: move things around, comment stuff out, inject new resource generators, etc.

Let’s go through the frame in a bit more detail:

Extension insertion points

First of all there are a bunch of extension_insertion_point at various locations during the frame, these are used by render_config_extensions to be able to schedule work into an existing render_config. You could argue that an extensions system to the render_configs is a bit superfluous, and for an in-house game engine targeting a specific industry that might very well be the case. But for us the extension system allows building features a bit more modular, it also encourages sharing of various rendering features across teams.

Shadows

// kick resource generator for rendering all shadow maps
{ resource_generator="shadow_mapping" profiling_scope="shadow mapping" }

We start off by rendering shadow maps. As we want to handle shadow receiving on alpha blended geometry there’s no simple way to reuse our shadow maps by interleaving the rendering of them into the lighting code. Instead we simply gather all shadow casting lights, try to prioritize them based on screen coverage, intensity, etc. and then render all shadows into two shadow maps.

One shadow map is dedicated to handle a single directional light which uses a cascaded shadow map approach, rendering each cascade into a region of a larger shadow map atlas. The other shadow map is an atlas for all local light sources, such as spot and point lights (interpreted as 6 spot lights).

Clustered shading

// kick resource generator for assigning light sources to clustered shading structure
{ resource_generator="clustered_shading" profiling_scope="clustered shading" }

We separate local light sources into two kinds: “simple” and “custom”. Simple lights are either spot lights or point lights that don’t have a custom material graph assigned. Simple light sources, which tend to be the bulk of all visible light sources in a frame, get inserted into a clustered shading acceleration structure.

While simple lights will affect both opaque and transparent materials, custom lights will only affect opaque geometry as they run a more traditional deferred shading path. We will touch on the lighting a bit more soon.

Clearing & VR mask

// special layer, only responsible for clearing hdr0, gbuffer2 and the depth_stencil_buffer
{ render_targets=["hdr0", "gbuffer2"] depth_stencil_target="depth_stencil_buffer" 
    clear_flags=["SURFACE", "DEPTH", "STENCIL"] profiling_scope="clears" }      

// if vr is supported kick a resource generator laying down a stencil mask to reject pixels outside of the lens shape
{ type="static_branch" platforms=["win"] render_settings={ vr_supported=true }
    pass = [
        { resource_generator="vr_mask" profiling_scope="vr_mask" }
    ]
}

Here we use the layer system to record a bind and a clear for a few render targets into a RenderContext generated by the LayerManager.

Then, depending on if the vr_supported render setting is true or not we kick a resource generator that marks in the stencil buffer any pixels falling outside of the lens region. This resource generator only does something if the renderer is running in stereo mode. Also note that the branch above is a static_branch so if vr_supported is set to false the execution of the vr_mask resource generator will get eliminated completely during boot up of the renderer.

G-buffer

// g-buffer layer, bulk of all materials renders into this
{ name="gbuffer" render_targets=["gbuffer0", "gbuffer1", "gbuffer2", "gbuffer3"] 
    depth_stencil_target="depth_stencil_buffer" sort="FRONT_BACK" profiling_scope="gbuffer" }

{ extension_insertion_point = "gbuffer" }

// linearize depth into a R32F surface
{ resource_generator="stabilize_and_linearize_depth" profiling_scope="linearize_depth" }

// layer for blending decals into the gbuffer0 and gbuffer1
{ name="decals" render_targets=["gbuffer0" "gbuffer1"] depth_stencil_target="depth_stencil_buffer" 
    profiling_scope="decal" sort="EXPLICIT" }

{ extension_insertion_point = "decals" }

// generate and merge motion vectors for non written pixels with motion vectors in gbuffer
{ type="static_branch" platforms=["win", "xb1", "ps4", "web", "linux"]
    pass = [
        { resource_generator="generate_motion_vectors" profiling_scope="motion vectors" }
    ]
}

Next we lay down the gbuffer. We are using a fairly fat “floating” gbuffer representation. By floating I mean that we interpret the gbuffer channels differently depending on material. I won’t go into details of the gbuffer layout in this post but everything builds upon a standard metallic PBR material model, same as most modern engines runs today. We also stash high precision motion vectors to be able to do accurate reprojection for TAA, RGBM encoded irradiance from light maps (if present, else irradiance is looked up from an IBL probe), high precision normals, AO, etc. Things quickly add up, in the default configuration on PC we are looking at 192 bpp for the color targets (i.e not counting depth/stencil). The gbuffer layout could use some love, I think we should be able to shrink it somewhat without losing any features.

We then kick a resource generator called stabilize_and_linerize_depth, this resource generator does two things:

  1. It linearizes the depth buffer and stores the result in an R32F target using a fullscreen_pass.
  2. It does a hacky TAA resolve pass for depth in an attempt to remove some intersection flickering for materials rendering after TAA resolve. We call the output of this pass stable_depth and use it when rendering editor selections, gizmos, debug lines, etc. We also use this buffer during post processing for any effects that depends on depth (e.g. depth of field) as those runs after AA resolve.

After that we have another more minimalistic gbuffer layer for splatting deferred decals.

Last but not least we kick another resource generator that calculates per pixel velocity for any pixels that haven’t been rendered to during the gbuffer pass (i.e skydome).

Reflections & Lighting

// render localized reflection probes into hdr1
{ name="reflections" render_targets=["hdr1"] depth_stencil_target="depth_stencil_buffer" 
    sort="FRONT_BACK" profiling_scope="reflections probes" }

{ extension_insertion_point = "reflections" }

// kick resource generator for screen space reflections
{ type="static_branch" platforms=["win", "xb1", "ps4"]
    pass = [
        { resource_generator="ssr_reflections" profiling_scope="ssr" }
    ]
}

// kick resource generator for main scene lighting
{ resource_generator="lighting" profiling_scope="lighting" }
{ extension_insertion_point = "lighting" }

At this point we are fully done with the gbuffer population and are ready to do some lighting. We start by laying down the indirect specular / reflections into a separate buffer. We use a rather standard three-step fallback scheme for our reflections: screen-space reflections, falling back to localized parallax corrected pre-convoluted radiance cubemaps, falling back to a global pre-convoluted radiance cubemap.

The reflections layer is the target layer for all cubemap based reflections. We are naively rendering the cubemap reflections by treating each reflection probe as a light source with a custom material. These lights gets picked up by a resource generator performing traditional deferred shading - i.e it renders proxy volumes for each light. One thing that some people struggle to wrap their heads around is that the resource generator responsible for running the deferred shading modifier isn’t kicked until a few lines down (in the lighting resource generator). If you’ve paid attention in my previous posts this shouldn’t come as a surprise for you, as what we describe here is the GPU scheduling of a frame, nothing else.

When the reflection probes are laid down we move on and run a resource generator for doing Screen-Space Reflections. As SSR typically runs in half-res we store the result in a separate render target.

We then finally kick the lighting resource generator, which is responsible for the following:

  1. Build a screen space mask for sun shadows, this is done by running multiple fullscreen_passes. The fullscreen_passes transform the pixels into cascaded shadow map space and perform PCF. Stencil culling makes sure the shader only runs for pixels within a certain cascade.
  2. SSAO with a bunch of different quality settings.
  3. A fullscreen pass we refer to as the “global lighting” pass. This is the pass that does most of the heavy lifting when it comes to the lighting. It handles mixing SSR with probe reflections, mixing of SSAO with material AO, lighting from all simple lights looked up from the clustered shading structure as well as calculates sun lighting masked with the result from sun shadow mask (step 1).
  4. Run a traditional deferred shading modifier for all light sources that has a material graph assigned. If the shader doesn’t target a specific layer the lights proxy volume will be rendered at this point, else it will be scheduled to render into whatever layer the shader has specified.

At this point we have a fully lit HDR output for all of our opaque materials.

Various stuff

// layer for emissive materials
{ name="emissive" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer" 
    sort="FRONT_BACK" profiling_scope="emissive" }

// kick debug visualization
{ type="static_branch" render_caps={ development=true }
    pass=[
        { resource_generator="debug_visualization" profiling_scope="debug_visualization" }
    ]
}

// kick resource generator for laying down fog 
{ resource_generator="fog" profiling_scope="fog" }

// layer for skydome rendering
{ name="skydome" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer" 
    sort="BACK_FRONT" profiling_scope="skydome" }
{ extension_insertion_point = "skydome" }

// layer for transparent materials 
{ name="hdr_transparent" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer" 
    sort="BACK_FRONT" profiling_scope="hdr_transparent" }
{ extension_insertion_point = "hdr_transparent" }

// kick resource generator for reading back any requested render targets / buffers to the CPU
{ resource_generator="stream_capture_buffers" profiling_scope="stream_capture" }

// kick resource generator for capturing reflection probes
{ type="static_branch" platform=["win"] render_caps={ development=true }
    pass = [
        { resource_generator="cubemap_capture" }
    ]
}

// layer for rendering object selections from the editor
{ type="static_branch" platforms=["win", "ps4", "xb1"]
    pass = [
        { type = "static_branch" render_settings={ selection_enabled=true }
            pass = [
                { name="selection" render_targets=["gbuffer0" "ldr1_dev_r"] 
                    depth_stencil_target="depth_stencil_buffer_selection" sort="BACK_FRONT" 
                    clear_flags=["SURFACE" "DEPTH"] profiling_scope="selection"}
            ]
        }
    ]
}

Next follows a bunch of layers for doing various stuff, most of this is straightforward:

  • emissive - Layer for adding any emissive material influences to the light accumulation target (hdr0)
  • debug_visualization - Kick of a resource generator for doing debug rendering. When debug rendering is enabled, the post processing pipe is disabled so we can render straight to the output target / back buffer here. Note: This doesn’t need to be scheduled exactly here, it could be moved later down the pipe.
  • fog - Kick of a resource generator for blending fog into the accumulation target.
  • skydome - Layer for rendering anything skydome related.
  • hdr_transparent - Layer for rendering transparent materials, traditional forward shading using the clustered shading acceleration structure for lighting. VFX with blending usually also goes into this layer.
  • stream_capture_buffer - Arbitrary location for capturing various render targets and dumping them into system memory.
  • cubemap_capture - Capturing point for reflection cubemap probes.
  • selection - Layer for rendering selection outlines.

So basically a bunch of miscellaneous stuff that needs to happen before we enter post processing…

Post Processing

// kick resource generators for AA resolve and post processing
{ resource_generator="post_processing" profiling_scope="post_processing" }
{ extension_insertion_point = "post_processing" }

Up until this point we’ve been in linear color space accumulating lighting into a 4xf16 render target (hdr0). Now its time to take that buffer and push it through the post processing resource generator.

The post processing pipe in the Stingray Renderer does:

  1. Temporal AA resolve
  2. Depth of Field
  3. Motion Blur
  4. Lens Effects (chromatic aberration, distortion)
  5. Bloom
  6. Auto exposure
  7. Scene Combine (exposure, tone map, sRGB, LUT color grading)
  8. Debug rendering

All steps of the post processing pipe can dynamically be enabled/disabled (not entirely true, we will always have to run some variation of step 7 as we need to output our result to the back buffer).

Final touches

// layer for rendering LDR materials, primarily used for rendering HUD and debug rendering
{ name="transparent" render_targets=["output_target"] depth_stencil_target="stable_depth_stencil_buffer_alias" 
    sort="BACK_FRONT" profiling_scope="transparent" }

// kick resource generator for rendering shadow map debug overlay
{ type="static_branch" render_caps={ development=true }
    pass = [
        { resource_generator="debug_shadows" profiling_scope="debug_shadows" }
    ]
}

// kick resource generator for compositing left/right eye
{ type="static_branch" platforms=["win"] render_settings={ vr_supported=true }
    pass = [
        { resource_generator="vr_present" profiling_scope="present" }
    ]
}

Before we present we allow rendering of unlit geometry in LDR (mainly used for HUDs and debug rendering), potentially do some more debug rendering and if we’re in VR mode we kick a resource generator that handles left/right eye combining (if needed).

That’s it - a very high-level breakdown of a rendered frame when running Stingray with the default “Stingray Renderer” render_config file.

Mini Renderer

We also have a second rendering pipe that we ship with Stingray called the “Mini Renderer” - mini as in minimalistic. It is not as broadly used as the Stingray Renderer so I won’t walk you through it, just wanted to mention it’s there and say a few words about it.

The main design goal behind the mini renderer was to build a rendering pipe with as little overhead from advanced lighting effects and post processing as possible. It’s primarily used for doing mobile VR rendering. High-resolution, high-performance rendering on mobile devices is hard! You pretty much need to avoid all kinds of fullscreen effects to hit target frame rate. Therefore the mini renderer has a very limited feature set:

  • It’s a forward renderer. While it’s capable of doing per pixel lighting through clustered shading it rarely gets used, instead most applications tend to bake their lighting completely or run with only a single directional light source.
  • No post processing.
  • While all lighting is done in linear color space we don’t store anything in HDR, instead we expose, tonemap and output sRGB directly into an LDR target (usually directly to the back buffer).

The mini_renderer.render_config file is ~400 lines, i.e. less than 1/3 of the stingray renderer. It is still in a somewhat experimental state but is the fastest way to get up and running doing mobile VR. I also feel that it makes sense for us to ship an example of a more lightweight rendering pipe; it is simpler to follow than the render_config for the full stingray renderer, and it makes it easy to grasp the benefits of data-driven rendering compared to a more static hard-coded rendering pipe (especially if you don’t have source access to the full engine as then the hard-coded rendering pipe would likely be a complete black box for the user).

Wrap up

I realize that some of you might have hoped for a more complete walkthrough of the various lighting and post processing techniques we use in the Stingray renderer. Unfortunately that would have become a very long post and also it feels a bit out of context as my goal with this blog series has been to focus on the architecture of the stingray rendering pipe rather than specific rendering techniques. Most of the techniques we use can probably be considered “industry standard” within real-time rendering nowadays. If you are interested in learning more there are lots of excellent information available, to name a few:

In the next and final post of this series we will take a look at the shader and material system we have in Stingray.

Thursday, March 9, 2017

Stingray Renderer Walkthrough #7: Data-driven rendering

Stingray Renderer Walkthrough #7: Data-driven rendering

Introduction

With all the low-level stuff in place it’s time to take a look at how we drive rendering in Stingray, i.e how a final frame comes together. I’ve covered this in various presentations over the years but will try do go through everything again to give a more complete picture of how things fit together.

Stingray features what we call a data-driven rendering pipe, basically what we mean by that is that all shaders, GPU resource creation and manipulation, as well as the entire flow of a rendered frame is defined in data. In our case the data is a set of different json files.

These json-files are hot-reloadable on all platforms, providing a nice workflow with fast iteration times when experimenting with various rendering techniques. It also makes it easy for a project to optimize the renderer for its specific needs (in terms of platforms, features, etc.) and/or to push it in other directions to better suit the art direction of the project.

There are four different types of json-files driving the Stingray renderer:

  • .render_config - the heart of a rendering pipe.
  • .render_config_extension - extensions to an existing .render_config file.
  • .shader_source - shader source and meta data for compiling statically declared shaders.
  • .shader_node - shader source and meta data used by the graph based shader system.

Today we will be looking at the render_config, both from a user’s perspective as well as how it works on the engine side.

Meet the render_config

The render_config is a sjson file describing everything from which render settings to expose to the user to the flow of an entire rendered frame. It can be broken down into four parts: render settings, resource sets, layer configurations and resource generators. All of which are fairly simple and minimalistic systems on the engine side.

Render Settings & Misc

Render settings is a simple key:value map exposed globally to the entire rendering pipe as well as an interface for the end user to peek and poke at. Here’s an example of how it might look in the render_config file:

render_settings = {
    sun_shadows = true
    sun_shadow_map_size = [ 2048, 2048 ]
    sun_shadow_map_filter_quality = "high"  
    local_lights_shadow_atlas_size = [ 2048, 2048 ]
    local_lights_shadow_map_filter_quality = "high"

    particles_local_lighting = true
    particles_receive_shadows = true

    debug_rendering = false
    gbuffer_albedo_visualization = false
    gbuffer_normal_visualization = false
    gbuffer_roughness_visualization = false
    gbuffer_specular_visualization = false
    gbuffer_metallic_visualization = false
    bloom_visualization = false
    ssr_visualization = false
}

As you will see we have branching logics for most systems in the render_config which allows the renderer to take different paths depending on the state of properties in the render_settings. There is also a block called render_caps which is very similar to the render_settings block except that it is read only and contains knowledge of the capabilities of the hardware (GPU) running the engine.

On the engine side there’s not that much to cover about the render_settings and render_caps, keys are always strings getting murmur hashed to 32 bits and the value can be a bool, float, array of floats or another hashed string.

When booting the renderer we populate the render_settings by first reading them from the render_config file, then looking in the project specific settings.ini file for potential overrides or additions, and last allowing to override certain properties again from the user’s configuration file (if loaded).

The render_caps block usually gets populated when the RenderDevice is booted and we’re in a state where we can enumerate all device capabilities. This makes the keys and values of the render_caps block somewhat of a black box with different contents depending on platform, typically they aren’t that many though.

So that covers the render_settings and render_caps blocks, we will look at how they are actually used for branching in later sections of this post.

There are also a few other miscellaneous blocks in the render_config, most important being:

  • shader_pass_flags - Array of strings building up a bit flag that can be used to dynamically turn on/off various shader passes.
  • shader_libraries - Array of what shader_source files to load when booting the renderer. The shader_source files are libraries with pre-compiled shader libraries mainly used by the resource generators.

Resource Sets

We have the concept of a RenderResourceSet on the engine side, it simply maps a hashed string to a GPU resource. RenderResourceSets can be locally allocated during rendering, creating a form of scoping mechanism. The resources are either allocated by the engine and inserted into a RenderResourceSet or allocated through the global_resources block in a render_config file.

The RenderInterface owns a global RenderResourceSet populated by the global_resources array from the render_config used to boot the renderer.

Here’s an example of a global_resources array:

global_resources = [
    { type="static_branch" platforms=["ios", "android", "web", "linux"]
        pass = [
            { name="output_target" type="render_target" depends_on="back_buffer" 
                    format="R8G8B8A8" }
        ]
        fail = [
            { name="output_target" type="alias" aliased_resource="back_buffer" }
        ]
    }

    { name="depth_stencil_buffer" type="render_target" depends_on="output_target" 
            w_scale=1 h_scale=1 format="DEPTH_STENCIL" }
    { name="gbuffer0" type="render_target" depends_on="output_target" 
            w_scale=1 h_scale=1 format="R8G8B8A8" }
    { name="gbuffer1" type="render_target" depends_on="output_target" 
            w_scale=1 h_scale=1 format="R8G8B8A8" } 
    { name="gbuffer2" type="render_target" depends_on="output_target" 
            w_scale=1 h_scale=1 format="R16G16B16A16F" }

    { type="static_branch" render_settings={ sun_shadows = true }
        pass = [
            { name="sun_shadow_map" type="render_target" size_from_render_setting="sun_shadow_map_size" 
                format="DEPTH_STENCIL" }
        ]
    }
    
    { name="hdr0" type="render_target" depends_on="output_target" w_scale=1 h_scale=1 
        format="R16G16B16A16F" }
]

So while the above example mainly shows how to create what we call DependentRenderTargets (i.e render targets that inherit its properties from another render target and then allow overriding properties locally), it can also create other buffers of various kinds.

We’ve also introduced the concept of a static_branch, there are two types of branching in the render_config file: static_branch and dynamic_branch. In the global_resource block only static branching is allowed as it only runs once, during set up of the renderer. (Note: The branch syntax is far from nice and we nowadays have come up with a much cleaner syntax that we use in the shader system, unfortunately it hasn’t made its way back to the render_config yet.)

So basically what this example boils down to is the creation of a set of render targets. The output_target is a bit special though, on PC and consoles we simply just setup an alias for an already created render target - the back buffer, while on gl based platforms we create a new separate render target. (This is because we render the scene up-side-down on gl-platforms to get consistent UV coordinate systems between all platforms.)

The other special case from the example above is the sun_shadow_map which grabs the resolution from a render_setting called sun_shadow_map_size. This is done because we want to expose the ability to tweak the shadow map resolution to the user.

When rendering a frame we typically pipe the global RenderResourceSet owned by the RenderInterface down to the various rendering systems. Any resource declared in the RenderResourceSet is accessible from the shader system by name. Each rendering system can at any point decide to create its own local version of a RenderResourceSet making it possible to scope shader resource access.

Worth pointing out is that the resources declared in the global_resource block of the render_config used when booting the engine are all allocated in the set up phase of the renderer and not released until the renderer is closed.

Layer Configurations

A render_config can have multiple layer_configurations. A Layer Configuration is essentially a description of the flow of a rendered frame, it is responsible for triggering rendering sub-systems and scheduling the GPU work for a frame. Here’s a simple example of a deferred rendering pipe:


layer_configs = {
    simple_deferred = [
        { name="gbuffer" render_targets=["gbuffer0", "gbuffer1", "gbuffer2"] 
            depth_stencil_target="depth_stencil_buffer" sort="FRONT_BACK" profiling_scope="gbuffer" }

        { resource_generator="lighting" profiling_scope="lighting" }

        { name="emissive" render_targets=["hdr0"] 
            depth_stencil_target="depth_stencil_buffer" sort="FRONT_BACK" profiling_scope="emissive" }

        { name="skydome" render_targets=["hdr0"] 
            depth_stencil_target="depth_stencil_buffer" sort="BACK_FRONT" profiling_scope="skydome" }

        { name="hdr_transparent" render_targets=["hdr0"] 
            depth_stencil_target="depth_stencil_buffer" sort="BACK_FRONT" profiling_scope="hdr_transparent" }

        { resource_generator="post_processing" profiling_scope="post_processing" }

        { name="ldr_transparent" render_targets=["output_target"] 
            depth_stencil_target="depth_stencil_buffer" sort="BACK_FRONT" profiling_scope="transparent" }
    ]
}


Each line in the simple_deferred array specifies either a named layer that the shader system can reference to direct rendering into (i.e a renderable object, like e.g. a mesh, has shaders assigned and the shaders know into which layer they want to render - e.g gbuffer), or it can trigger a resource_generator.

The order of execution is top->down and the way the GPU scheduling works is that each line increments a bit in the “Layer System” bit range covered in the post about sorting.

On the engine side the layer configurations are managed by a system called the LayerManager, owned by the RenderInterface. It is a tiny system that basically just maps the named layer_config to an array of “Layers”:

struct Layer {
    uint64_t sort_key;

    IdString32 name;
    render_sorting::DepthSort depth_sort;
    IdString32 render_targets[MAX_RENDER_TARGETS];
    IdString32 depth_stencil_target;
    IdString32 resource_generator;
    uint32_t clear_flags;   

    #if defined(DEVELOPMENT)
        const char *profiling_scope;
    #endif  
};

  • sort_key - As mentioned above and in the post about how we do sorting, each layer gets a sort_key assigned from the “Layer System” bit range. By looking up the layer’s sort_key and using that when recording Commands to RenderContexts we get a simple way to reason about overall ordering of a rendered frame.
  • name - the shader system can use this name to look up the layer’s sort_key to group draw calls into layers.
  • depth_sort - describes how to encode the depth range bits of the sort key when recording a RenderJobPackage to a RenderContext. depth_sort is an enum that indicates if sorting should be done front-to-back or back-to-front.
  • render_targets - array of named render target resources to bind for this layer
  • depth_stencil_target - named render target resource to bind for this layer
  • resource_generator -
  • clear_flags - bit flag hinting if color, depth or stencil should be cleared for this layer
  • profiling_scope - used to record markers on the RenderContext that later can be queried for GPU timings and statistics.

When rendering a World (see: RenderInterface) the user passes a viewport to the render_world function, the viewport knows which layer_config to use. We look up the array of Layersfrom the LayerManager and record a RenderContext with state commands for binding and clearing render targets using the sort_keys from the Layer. We do this dynamically each time the user calls render_world but in theory we could cache the RenderContext between render_world calls.

The name Layer is a bit misleading as a layer also can be responsible for making sure that a ResourceGenerator runs, in practice a Layer is either a target for the shader system to render into or it is the execution point for a ResourceGenerator. It can in theory be both but we never use it that way.

Resource Generators

The Resource Generators is a minimalistic framework for manipulating GPU resources and triggering various rendering sub-systems. Similar to a layer configuration a resource generator is described as an array of “modifiers”. Modifiers get executed in the order they were declared. Here’s an example:

auto_exposure = {
    modifiers = [
        { type="dynamic_branch" render_settings={ auto_exposure_enabled=true } profiling_scope="auto_exposure"
            pass = [
                { type="fullscreen_pass" shader="quantize_luma" inputs=["hdr0"] 
                    outputs=["quantized_luma"]  profiling_scope="quantize_luma" }

                { type="compute_kernel" shader="compute_histogram" thread_count=[40 1 1] inputs=["quantized_luma"] 
                    uavs=["histogram"] profiling_scope="compute_histogram" }

                { type="compute_kernel" shader="adapt_exposure" thread_count=[1 1 1] inputs=["quantized_luma"] 
                    uavs=["current_exposure" "current_exposure_pos" "target_exposure_pos"] profiling_scope="adapt_exposure" }
            ]
        }
    ]   
}

First modifier in the above example is a dynamic_branch. In contrast to a static_branch which gets evaluated during loading of the render_config, a dynamic_branch is evaluated each time the resource generator runs making it possible to take different paths through the rendering pipeline based on settings and other game context that might change over time. Dynamic branching is also supported in the layer_config block.

If the branch is taken (i.e if auto_exposure_enabled is true) the modifiers in the pass array will run.

The first modifier is of the type fullscreen_pass and is by far the most commonly used modifier type. It simply renders a single triangle covering the entire viewport using the named shader. Any resource listed in the inputs array is exposed to the shader. Any resource(s) listed in the outputs array are bound as a render target(s).

The second and third modifiers are of the type compute_kernel and will dispatch a compute shader. inputs array is the same as for the fullscreen_pass and uavs lists resources to bind as UAVs.

This is obviously a very basic example, but the idea is the same for more complex resource generators. By chaining a bunch of modifiers together you can create interesting rendering effects entirely in data.

Stingray ships with a toolbox of various modifiers, and the user can also extend it with their own modifiers if needed. Here’s a list of some of the other modifiers we ship with:

  • cascaded_shadow_mapping - Renders a cascaded shadow map from a directional light.
  • atlased_shadow_mapping - Renders a shadow map atlas from a set of spot and omni lights.
  • generate_mips - Renders a mip chain for a resource by interleaving a resource generator that samples from sub-resource n-1 while rendering into sub-resource n.
  • clustered_shading - Assign a set of light sources to a clustered shading structure (on CPU at the moment).
  • deferred_shading - Renders proxy volumes for a set of light sources with specified shaders (i.e. traditional deferred shading).
  • stream_capture - Reads back the specified resource to CPU (usually multi-buffered to avoid stalls).
  • fence - Synchronization of graphics and compute queues.
  • copy_resource - Copies a resource from one GPU to another.

In Stingray we encourage building all lighting and post processing using resource generators. So far it has proved very successful for us as it gives great per project flexibility. To make sharing of various rendering effects easier we also have a system called render_config_extension that we rolled out last year, which is essentially a plugin system to the render_config files.

I won’t go into much detail how the resource generator system works on the engine side, it’s fairly simple though; There’s a ResourceGeneratorManager that knows about all the generators, each time the user calls render_world we ask the manager to execute all generators referenced in the layer_config using the layers sort key. We don’t restrain modifiers in any way, they can be implemented to do whatever and have full access to the engine. E.g they are free to create their own ResourceContexts, spawn worker threads, etc. When the modifiers for all generators are done executing we are handed all RenderContexts they’ve created and can dispatch them together with the contexts from the regular scene rendering. To get scheduling between modifiers in a resource generators correct we use the 32-bit “user defined” range in the sort key.

Future improvements

Before we wrap up I’d like to cover some ideas for future improvements.

The Stingray engine has had a data-driven renderer from day one, so it has been around for quite some time by now. And while the render_config has served us good so far there are a few things that we’ve discovered that could use some attention moving forward.

Scalability

The complexity of the default rendering pipe continues to increase as the demand for new rendering features targeting different industries (games, design visualization, film, etc.) increases. While the data-driven approach we have addresses the feature set scalability needs decently well, there is also an increasing demand to have feature parity across lots of different hardware. This tends to result in lots of branching in render_config making it a bit hard to follow.

In addition to that we also start seeing the need for managing multiple paths through the rendering pipe on the same platform, this is especially true when dealing with stereo rendering. On PC we currently we have 5 different paths through the default rendering pipe:

  • Mono - Traditional mono rendering.
  • Stereo - Old school stereo rendering, one render_world call per eye. Almost identical to the mono path but still there are some stereo specific work for assembling the final image that needs to happen.
  • Instanced Stereo - Using “hardware instancing” to do stereo propagation to left/right eye. Single scene traversal pass, culling using a uber-frustum. A bunch of shader patch up work and some branching in the render_config.
  • Nvidia Single Pass Stereo (SPS) - Somewhat similar to instanced stereo but using nvidia specific hardware for doing multicasting to left/right eye.
  • Nvidia VRSLI - DX11 path for rendering left/right eye on separate GPUs.

We estimate that the number of paths through the rendering pipe will continue to increase also for mono rendering, we’ve already seen that when we’ve experimented with explicit multi-GPU stuff under DX12. Things quickly becomes hairy when you aren’t running on a known platform. Also, depending on hardware it’s likely that you want to do different scheduling of the rendered frame - i.e its not as simple as saying: here are our 4 different paths we select from based on if the user has 1-4 GPUs in their systems, as that breaks down as soon as you don’t have the exact same GPUs in the system.

In the future I think we might want to move to an even higher level of abstraction of the rendering pipe that makes it easier to reason about different paths through it. Something that decouples the strict flow through the rendering pipe and instead only reasons about various “jobs” that needs to be executed by the GPUs and what their dependencies are. The engine could then dynamically re-schedule the frame load depending on hardware automatically… at least in theory, in practice I think it’s more likely that we would end up with a few different “frame scheduling configurations” and then select one of them based on benchmarking / hardware setup.

Memory

As mentioned earlier our system for dealing with GPU resources is very static, resources declared in the global_resource set are allocated as the renderer boots up and not released until the renderer is closed. On last gen consoles we had support for aliasing memory of resources of different types but we removed that when deprecating those platforms. With the rise of DX12/Vulkan and the move to 4K rendering this static resource system is in need of an overhaul. While we can (and do) try to recycle temporary render targets and buffers throughout the a frame it is easy to break some code path without noticing.

We’ve been toying with similar ideas to the “Transient Resource System” described in Yuriy O’Donnell’s excellent GDC2017 presentation: FrameGraph: Extensible Rendering Architecture in Frostbite but have so far not got around to test it out in practice.

DX12 improvements

Today our system implicitly deals with binding of input resources to shader stages. We expose pretty much everything to the shader system by name and if a shader stage binds a resource for reading we don’t know about it until we create the RenderJobPackage. This puts us in a somewhat bad situation when it comes to dealing with resource transitions as we end up having to do some rather complicated tracking to inject resource barriers at the right places during the dispatch stage of the RenderContexts (See: RenderDevice).

We could instead enforce declaration of all writable GPU resources when they get bound as input to a layer or resource generator. As we already have explicit knowledge of when a GPU resource gets written to by a layer or resource generator, adding the explicit knowledge of when we read from one would complete the circle and we would have all the needed information to setup barriers without complicated tracking.

Wrap up

Last week at GDC 2017 there were a few presentations (and a lot of discussions) around the concepts of having more high-level representations of a rendered frame and what benefits that brings. If you haven’t already I highly encourage you to check out both Yuriy O’Donnell’s presentation “FrameGraph: Extensible Rendering Architecture in Frostbite” and Aras Pranckevičius’s presentation: “Scriptable Render Pipeline”.

In the next post I will briefly cover the feature set of the two render_configs that we ship as template rendering pipes with Stingray.

Wednesday, February 22, 2017

Stingray Renderer Walkthrough #6: RenderInterface

Stingray Renderer Walkthrough #6: RenderInterface

Today we will be looking at the RenderInterface. I’ve struggled a bit with deciding if it is worth covering this piece of the code or not, as most of the stuff described will likely feel kind of obvious. In the end I still decided to keep it to give a more complete picture of how everything fits together. Feel free to skim through it or sit tight and wait for the coming two posts that will dive into the data-driven aspects of the Stingray renderer.

The glue layer

The RenderInterface is responsible for tying together a bunch of rendering sub-systems. Some of which we have covered in earlier posts (like e.g the RenderDevice) and a bunch of other, more high-level, systems that forms the foundation of our data-driven rendering architecture.

The RenderInterface has a bunch of various responsibilities, including:

  • Tracking of windows and swap chains.

    While windows are managed by the simulation thread, swap chains are managed by the render thread. The RenderInterface is responsible for creating the swap chains and keep track of the mapping between a window and a swap chain. It is also responsible for signaling resizing and other state information from the window to the renderer.

  • Managing of RenderWorlds.

    As mentioned in the Overview post, the renderer has its own representation of game Worlds called RenderWorlds. The RenderInterface is responsible for creating, updating and destroying the RenderWorlds.

  • Owner of the four main building blocks of our data-driven rendering architecture: LayerManager, ResourceGeneratorManager, RenderResourceSet, RenderSettings

    Will be covered in the next post (I’ve talked about them in various presentations before [1] [2]).

  • Owner of the shader manager.

    Centralized repository for all available/loaded shaders. Controls scheduling for loading, unload and hot-reloading of shaders.

  • Owner of the render resource streamer.

    While all resource loading is asynchronous in Stingray (See [3]), the resource streamer I’m referring to in this context is responsible for dynamically loading in/out mip-levels of textures based on their screen coverage. Since this streaming system piggybacks on the view frustum culling system, it is owned and updated by the RenderInterface.

The interface

In addition to being the glue layer, the RenderInterface is also the interface to communicate with the renderer from other threads (simulation, resource streaming, etc.). The renderer operates under its own “controller thread” (as covered in the Overview post), and exposes two different types of functions: blocking and non-blocking.

Blocking functions

Blocking functions will enforce a flush of all outstanding rendering work (i.e. synchronize the calling thread with the rendering thread), allowing the caller to operate directly on the state of the renderer. This is mainly a convenience path when doing bigger state changes / reconfiguring the entire renderer, and should typically not be used during game simulation as it might cause stuttering in the frame rate.

Typical operations that are blocking:

  • Opening and closing of the RenderDevice.

    Sets up / shuts down the graphics API by calling the appropriate functions on the RenderDevice.

  • Creation and destruction of the swap chains.

    Creating and destroying swap chains associated to a Window. Done by forwarding the calls to the RenderDevice.

  • Loading of the render_config / configuring the data-driven rendering pipe.

    The render_config is a configuration file describing how the renderer should work for a specific project. It describes the entire flow of a rendered frame and without it the renderer won’t know what to do. It is the RenderInterface responsibility to make sure that all the different sub-systems (LayerManager, ResourceGeneratorManager, RenderResourceSet, RenderSettings) are set up correctly from the loaded render_config. More on this topic in the next post.

  • Loading, unloading and reloading of shaders.

    The shader system doesn’t have a thread safe interface and is only meant to be accessed from the rendering thread. Therefor any loading, unloading and reloading of shaders needs to synchronize with the rendering thread.

  • Registering and unregistering of Worlds

    Creates or destroys a corresponding RenderWorld and sets up mapping information to go from World* to RenderWorld*.

Non-blocking functions

Non-blocking functions communicates by posting messages to a ring-buffer that the rendering thread consumes. Since the renderer has its own representation of a “World” there is not much communication over this ring-buffer, in a normal frame we usually don’t have more than 10-20 messages posted.

Typical operations that are non-blocking:

  • Rendering of a World.

    void render_world(World &world, const Camera &camera, const Viewport &viewport, 
        const ShadingEnvironment &shading_env, uint32_t swap_chain);
    

    Main interface for rendering of a world viewed from a certain Camera into a certain Viewport. The ShadingEnvironment is basically just a set of shader constants and resources defined in data (usually containing a description of the lighting environment, post effects and similar). swap_chain is a handle referencing which window that will present the final result.

    When the user calls this function a RenderWorldMsg will be created and posted to the ring buffer holding handles to the rendering representations for the world, camera, viewport and shading environment. When the message is consumed by rendering thread it will enter the first of the three stages described in the Overview post - Culling.

  • Reflection of state from a World to the RenderWorld.

    Reflects the “state delta” (from the last frame) for all objects on the simulation thread over to the render thread. For more details see [4].

  • Synchronization.

    uint32_t create_fence();
    void wait_for_fence(uint32_t fence);
    

    Synchronization methods for making sure the renderer is finished processing up to a certain point. Used to handle blocking calls and to make sure the simulation doesn’t run more than one frame ahead of the renderer.

  • Presenting a swap chain.

    void present_frame(uint32_t swap_chain = 0);
    

    When the user is done with all rendering for a frame (i.e has no more render_world calls to do), the application will present the result by looping over all swap chains touched (i.e referenced in a previous call to render_world) and posting one or many PresentFrameMsg messages to the renderer.

  • Providing statistics from the RenderDevice.

    As mentioned in the RenderContext post, we gather various statistics and (if possible) GPU timings in the RenderDevice. Exactly what is gathered depends on the implementation of the RenderDevice. The RenderInterface is responsible for providing a non blocking interface for retrieving the statistics. Note: the statistics returned will be 2 frames old as we update them after the rendering thread is done processing a frame (GPU timings are even older). This typically doesn’t matter though as usually they don’t fluctuate much from one frame to another.

  • Executing user callbacks.

    typedef void (*Callback)(void *user_data);
    void run_callback(Callback callback, void *user, uint32_t user_data_size);
    

    Generic callback mechanics to easily inject code to be executed by the rendering thread.

  • Creation, dispatching and releasing of RenderContexts and RenderResourceContexts.

    While most systems tends to create, dispatch and release RenderContexts and RenderResourceContexts from the rendering thread there can be use cases for doing it from another thread (e.g. the resource thread creates RenderResourceContexts). The RenderInterface provides the necessary functions for doing so in a thread-safe way without having to block the rendering thread.

Wrap up

The RenderInterface in itself doesn’t get more interesting than that. Something needs to be responsible for coupling of various rendering systems and manage the interface for communicating with the controlling thread of the renderer - the RenderInterface is that something.

In the next post we will walk through the various components building the foundation of the data-driven rendering architecture and go through some examples of how to configure them to do something fun from the render_config file.

Stay tuned.

Friday, February 17, 2017

Stingray Renderer Walkthrough #5: RenderDevice

Stingray Renderer Walkthrough #5: RenderDevice

Overview

The RenderDevice is essentially our abstraction layer for platform specific rendering APIs. It is implemented as an abstract base class that various rendering back-ends (D3D11, D3D12, OGL, Metal, GNM, etc.) implement.

The RenderDevice has a bunch of helper functions for initializing/shutting down the graphics APIs, creating/destroying swap chains, etc. All of which are fairly straightforward so I won’t cover them in this post, instead I will put my focus on the two dispatch functions consuming RenderResourceContexts and RenderContexts:


class RenderDevice {
public: 
    virtual void dispatch(uint32_t n_contexts, RenderResourceContext **rrc, 
        uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;

    virtual void dispatch(uint32_t n_contexts, RenderContext **rc, 
        uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
};

Resource Management

As covered in the post about RenderResourceContexts, they provide a free-threaded interface for allocating and deallocating GPU resources. However, it is not until the user has called RenderDevice::dispatch() handing over the RenderResourceContexts as their representation gets created on the RenderDevice side.

All implementations of a RenderDevice have some form of resource management that deals with creating, updating and destroying of the graphics API specific representations of resources. Typically we track the state of all various types of resources in a single struct, here’s a stripped down example from the DX12 RenderDevice implementation called D3D12ResourceContext:


struct D3D12VertexBuffer
{
    D3D12_VERTEX_BUFFER_VIEW view;
    uint32_t allocation_index;
    int32_t size;
};

struct D3D12IndexBuffer
{
    D3D12_INDEX_BUFFER_VIEW view;
    uint32_t allocation_index;
    int32_t size;
};

struct D3D12ResourceContext 
{
    Array<D3D12VertexBuffer> vertex_buffers;
    Array<uint32_t> unused_vertex_buffers;

    Array<D3D12IndexBuffer> index_buffers;
    Array<uint32_t> unused_index_buffers;

    // .. lots of other resources

    Array<uint32_t> resource_lut;
};

As you might remember, the linking between the engine representation and the RenderDevice representation is done using the RenderResource::render_resource_handle. It encodes both the type of the resource as well as a handle. The resource_lut is an indirection to go from the engine handle to a local index for a specific type (e.g vertex_buffers or index_buffers in the sample above). We also track freed indices for each type (e.g. unused_vertex_buffers) to simplify recycling of slots.

The implementation of the dispatch function is fairly straight forward. We simply iterate over all the RenderResourceContexts and for each context iterate over its commands and either allocate or deallocate resources in the D3D12ResourceContext. It is important to note that this is a synchronous operation, nothing else is peeking or poking on the D3D12ResourceContext when the dispatch of RenderResourceContexts is happening, which makes our life a lot easier.

Unfortunately that isn’t the case when we dispatch RenderContexts as in that case we want to go wide (i.e. forking the workload and process it using multiple worker threads) when translating the commands into API calls. While we don’t allow allocating and deallocating new resources from the RenderContexts we do allow updating them which mutates the state of the RenderDevice representations (e.g. a D3D12VertexBuffer).

At the moment our solution for this isn’t very nice, basically we don’t allow asynchronous updates for anything else than DYNAMIC buffers. UPDATABLE buffers are always updated serially before we kick the worker threads no matter what their sort_key is. All worker threads access resources through their own copy of something we call a ResourceAccessor, it is responsible for tracking the worker threads state of dynamic buffers (among other things). In the future I think we probably should generalize this and treat UPDATABLE buffers in a similar way.

(Note: this limitation doesn’t mean you can’t update an UPDATABLE buffer more than once per frame, it simply means you cannot update it more than once per dispatch).

Shaders

Resources in the D3D12ResourceContext are typically buffers. One exception that stands out is the RenderDevice representation of a “shader”. A “shader” on the RenderDevice side maps to a ShaderTemplate::Context on the engine side, or what I guess we could call a multi-pass shader. Here’s some pseudo code:


struct ShaderPass
{
    struct ShaderProgram
    {
        Array<uint8_t> bytecode;
        struct ConstantBufferBindInfo;
        struct ResourceBindInfo;
        struct SamplerBindInfo;
    };
    ShaderProgram vertex_shader;
    ShaderProgram domain_shader;
    ShaderProgram hull_shader;
    ShaderProgram geometry_shader;
    ShaderProgram pixel_shader;
    ShaderProgram compute_shader;

    struct RenderStates;
};

struct Shader
{
    Vector<ShaderPass> passes;
    enum SortMode { IMMADIATE, DEFERRED };
    uint32_t sort_mode;
};

The pseudo code above is essentially the RenderDevice representation of a shader that we serialize to disk during data compilation. From that we can create all the necessary graphics API specific objects expressing an executable shader together with its various state blocks (Rasterizer, Depth Stencil, Blend, etc.).

As discussed in the last post the sort_key encodes the shader pass index. Using Shader::sort_mode, we know which bit range to extract from the sort_key as pass index, which we then use to look up the ShaderPass from Shader::passes. A ShaderPass contains one ShaderProgram per active shader stage and each ShaderProgram contains the byte code for the shader to compile as well as “bind info” for various resources that the shader wants as input.

We will look at this in a bit more detail in the post about “Shaders & Materials”, for now I just wanted to familiarize you with the concept.

Render Context translation

Let’s move on and look at the dispatch for translating RenderContexts into graphics API calls:

class RenderDevice {
public: 
    virtual void dispatch(uint32_t n_contexts, RenderContext **rc, 
        uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
};

The first thing all RenderDevice implementation do when receiving a bunch of RenderContexts is to merge and sort their Commands. All implementations share the same code for doing this:

void prepare_command_list(RenderContext::Commands &output, unsigned n_contexts, RenderContext **contexts);

This function basically just takes the RenderContext::Commands from all RenderContexts and merges them into a new array, runs a stable radix sort, and returns the sorted commands in output. To avoid memory allocations the RenderDevice implementation owns the memory of the output buffer.

Now we have all the commands nicely sorted based on their sort_key. Next step is to do the actual translation of the data referenced by the commands into graphics API calls. I will explain this process with the assumption that we are running on a graphics API that allows us to build graphics API command lists in parallel (e.g. DX12, GNM, Vulkan, Metal), as that feels most relevant in 2017.

Before we start figuring out our per thread workloads for going wide, we have one more thing to do; “instance merging”.

Instance Merging

I’ve mentioned the idea behind instance merging before [1,2], basically we want to try to reduce the number of RenderJobPackages (i.e. draw calls) by identifying packages that are similar enough to be merged. In Stingray “similar enough” basically means that they must have identical inputs to the input assembler as well as identical resources bound to all shader stages, the only thing that is allowed to differ are constant buffer variables. (Note: by todays standards this can be considered a bit old school, new graphics APIs and hardware allows to tackle this problem more aggressively using “bindless” concepts. )

The way it works is by filtering out ranges of RenderContexts::Commands where the “instance bit” of the sort_key is set and all bits above the instance bit are identical. Then for each of those ranges we fork and go wide to analyze the actual RenderJobPackage data to see if the instance_hash and the shader are the same, and if so we know its safe to merge them.

The actual merge is done by extracting the instance specific constants (these are tagged by the shader author) from the constant buffers and propagating them into a dynamic RawBuffer that gets bound as input to the vertex shader.

Depending on how the scene is constructed, instance merging can significantly reduce the number of draw calls needed to render the final scene. The instance merger in itself is not graphics API specific and is isolated in its own system, it just happens to be the responsibility of the RenderDevice to call it. The interface looks like this:

namespace instance_merger {

struct ProcessMergedCommandsResult
{
    uint32_t n_instances;
    uint32_t instanced_batches;
    uint32_t instance_buffer_size;
};

ProcessMergedCommandsResult process_merged_commands(Merger &instance_merger, 
    RenderContext::Commands &merged_commands);

}

Pass in a reference to the sorted RenderContext::Commands in merged_commands and after the instance merger is done running you hopefully have fewer commands in the array. :)

You could argue that merging, sorting and instance merging should all happen before we enter the world of the RenderDevice. I wouldn’t argue against that.

Prepare workloads

Last step before we can start translating our commands into state / draw / dispatch calls is to split the workload into reasonable chunks and prepare the execution contexts for our worker threads.

Typically we just divide the number of RenderContext::Commands we have to process with the number of worker threads we have available. We don’t care about the type of different commands we will be processing and trying to load balance differently. The reasoning behind this is that we anticipate that draw calls will always represent the bulk of the commands and the rest of the commands can be considered as unavoidable “noise”. We do, however, make sure that we don’t do less than x-number of commands per worker threads, where x can differ a bit depending on platform but is usually ~128.

For each execution context we create a ResourceAccessors (described above) as well as make sure we have the correct state setup in terms of bound render targets and similar. To do this we are stuck with having to do a synchronous serial sweep over all the commands to find bigger state changing commands (such as RenderContext::set_render_target).

This is where the Command::command_flags bit-flag comes into play, instead of having to jump around in memory to figure out what type of command the Command::head points to, we put some hinting about the type in the Command::command_flags, like for example if it is a “state command”. This way the serial sweep doesn’t become very costly even when dealing with large number of commands. During this sweep we also deal with updating of UPDATABLE resources, and on newer graphics APIs we track fences (discussed in the post about Render Contexts).

The last thing we do is to set up the execution contexts with create graphics API specific representations of command lists (e.g. ID3D12GraphicsCommandList in DX12),

Translation

When getting to this point doing the actual translation is fairly straight forward. Within each worker thread we simply loop over its dedicated range of commands, fetch its data from Command::head and generate any number of API specific commands necessary based on the type of command.

For a RenderJobPackage representing a draw call it involves:

  • Look up the correct shader pass and, unless already bound, bind all active shader stages
  • Look up the state blocks (Rasterizer, Depth stencil, Blending, etc.) from the shader and bind them unless already bound
  • Look up and bind the resources for each shader stage using the RenderResource::render_resource_handle translated through the D3D12ResourceAccessor
  • Setup the input assembler by looping over the RenderResource::render_resource_handles pointed to by the RenderJobPackage::resource_offset and translated through the D3D12ResourceAccessor
  • Bind and potentially update constant buffers
  • Issue the draw call

The execution contexts also holds most-recently-used caches to avoid unnecessary binds of resources/shaders/states etc.

Note: In DX12 we also track where resource barriers are needed during this stage. After all worker threads are done we might also end up having to inject further resource barriers between the command lists generated by the worker threads. We have ideas on how to improve on this by doing at least parts of this tracking when building the RenderContexts but haven’t gotten around looking into it yet.

Execute

When the translation is done we pass the resulting command lists to the correct queues for execution.

Note: In DX12 this is a bit more complicated as we have to interleave signaling / waiting on fences between command list execution (ExecuteCommandList).

Next up

I’ve deliberately not dived into too much details in this post to make it a bit easier to digest. I think I’ve manage to cover the overall design of a RenderDevice though, enough to make it easier for people diving into the code for the first time.

With this post we’ve reached half-way through this series, we have covered the “low-level” aspects of the Stingray rendering architecture. As of next post we will start looking at more high-level stuff, starting with the RenderInterface which is the main interface for other threads to talk with the renderer.

Tuesday, February 14, 2017

Stingray Renderer Walkthrough #4: Sorting

Stingray Renderer Walkthrough #4: Sorting

Introduction

This post will focus on ordering of the commands in the RenderContexts. I briefly touched on this subject in the last post and if you’ve implemented a rendering engine before you’re probably not new to this problem. Basically we need a way to make sure our RenderJobPackages (draw calls) end up on the screen in the correct order, both from a visual point of view as well as from a performance point of view. Some concrete examples,

  1. Make sure g-buffers and shadow maps are rendered before any lighting happens.
  2. Make sure opaque geometry is rendered front to back to reduce overdraw.
  3. Make sure transparent geometry is rendered back to front for alpha blending to generate correct results.
  4. Make sure the sky dome is rendered after all opaque geometry but before any transparent geometry.
  5. All of the above but also strive to reduce state switches as much as possible.
  6. All of the above but depending on GPU architecture maybe shift some work around to better utilize the hardware.

There are many ways of tackling this problem and it’s not uncommon that engines uses multiple sorting systems and spend quite a lot of frame time getting this right.

Personally I’m a big fan of explicit ordering with a single stable sort. What I mean by explicit ordering is that every command that gets recorded to a RenderContext already has the knowledge of when it will be executed relative to other commands. For us this knowledge is in the form of a 64 bit sort_key, in the case where we get two commands with the exact same sort_key we rely on the sort being stable to not introduce any kind of temporal instabilities in the final output.

The reasons I like this approach are many,

  1. It’s trivial to implement compared to various bucketing schemes and sorting of those buckets.
  2. We only need to visit renderable objects once per view (when calling their render() function), no additional pre-visits for sorting are needed.
  3. The sort is typically fast, and cost is isolated and easy to profile.
  4. Parallel rendering works out of the box, we can just take all the Command arrays of all the RenderContexts and merge them before sorting.

To make this work each command needs to know its absolute sort_key. Let’s breakdown the sort_key we use when working with our data-driven rendering pipe in Stingray. (Note: if the user doesn’t care about playing nicely together with our system for data-driven rendering it is fine to completely ignore the bit allocation patterns described below and roll their own.)

sort_key breakdown

Most significant bit on the left, here are our bit ranges:

MSB [ 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ] LSB
      ^ ^       ^  ^                                   ^^                 ^
      | |       |  |                                   ||                 |- 3 bits - Shader System (Pass Immediate)
      | |       |  |                                   ||- 16 bits - Depth
      | |       |  |                                   |- 1 bit - Instance bit
      | |       |  |- 32 bits - User defined
      | |       |- 3 bits - Shader System (Pass Deferred)
      | - 7 bits - Layer System
      |- 2 bits - Unused

2 bits - Unused

Nothing to see here, moving on… (Not really sure why these 2 bits are unused, I guess they weren’t at some point but for the moment they are always zero) :)

7 bits - Layer System

This 7-bits range is managed by the “Layer system”. The Layer system is responsible for controlling the overall scheduling of a frame and is set up in the render_config file. It’s a central part of the data-driven rendering architecture in Stingray. It allows you to configure what layers to expose to the shader system and in which order these layers should be drawn. We will look closer at the implementation of the layer system in a later post but in the interest of clarifying how it interops with the sort_key here’s a small example:


default = [
  // sort_key = [ 00000000 10000000 00000000 00000000 00000000 00000000 00000000 00000000 ]
  { name="gbuffer" render_targets=["gbuffer0", "gbuffer1", "gbuffer2", "gbuffer3"]
     depth_stencil_target="depth_stencil_buffer" sort="FRONT_BACK" profiling_scope="gbuffer" }

  // sort_key = [ 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ]
  { name="decals" render_targets=["gbuffer0" "gbuffer1"] depth_stencil_target="depth_stencil_buffer"
     profiling_scope="decal" sort="EXPLICIT" }

  // sort_key = [ 00000001 10000000 00000000 00000000 00000000 00000000 00000000 00000000 ]
  { resource_generator="lighting" profiling_scope="lighting" }

  // sort_key = [ 00000010 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ] LSB
  { name="emissive" render_targets=["hdr0"] depth_stencil_target="depth_stencil_buffer"
    sort="FRONT_BACK" profiling_scope="emissive" }
]

Above we have three layers exposed to the shader system and one kick of a resource_generator called lighting (more about resource_generators in a later post). The layers are rendered in the order they are declared, this is handled by letting each new layer increment the 7 bits range belonging to the Layer System with 1 (as can be seen in the sort_key comments above).

The shader author dictates into which layer(s) it wants to render. When a RenderJobPackage is recorded to the RenderContext (as described in the last post) the correct layer sort_keys are looked up from the layer system and the result is bitwise ORed together with the sort_key value piped as argument to RenderContext::render().

3 bits - Shader System (Pass Deferred)

The next 3 bits are controlled by the Shader System. These three bits encode the shader pass index within a layer. When I say shader in this context I refer to our ShaderTemplate::Context which is basically a wrapper around multiple linked shaders rendering into one or many layers. (Nathan Reed recently blogged about “The Many Meanings of “Shader””, in his analogy our ShaderTemplate is the same as an “Effect”)

Since we can have a multi-pass shader rendering into the same layer we need to encode the pass index into the sort_key, that is what this 3 bit range is used for.

32 bits - User defined

We then have 32 user defined bits, these bits are primarily used by our “Resource Generator” system (I will be covering this system in the post about render_config & data-driven rendering later), but the user is free to use them anyway they like and still maintain compatibility with the data-driven rendering system.

1 bit - Instance bit

This single bit also comes from the Shader System and is set if the shader implements support for “Instance Merging”. I will be covering this in a bit more detail in my next post about the RenderDevice but essentially this bit allows us to scan through all commands and find ranges of commands that potentially can be merged together to fewer draw calls.

16 bits - Depth

One of the arguments piped to RenderContext::render() is an unsigned normalized depth value (0.0-1.0). This value gets quantized into these 16 bits and is what drives the front-to-back vs back-to-front sorting of RenderJobPackages. If the sorting criteria for the layer (see layer example above) is set to back-to-front we simply flip the bits in this range.

3 bits - Shader System (Pass Immediate)

A shader can be configured to run in “Immediate Mode” instead of “Deferred Mode” (default). This forces passes in a multi-pass shader to run immediately after each other and is achieved by moving the pass index bits into the least significant bits of the sort_key. The concept is probably easiest to explain with an artificial example and some pseudo code:

Take a simple scene with a few instances of the same mesh, each mesh recording one RenderJobPackages to one or many RenderContexts and all RenderJobPackages are being rendered with the same multi-pass shader.

In “Deferred Mode” (i.e pass indices encoded in the “Shader System (Pass Deferred)” range) you would get something like this:

foreach (pass in multi-pass-shader)
  foreach (render-job in render-job-packages)
    render (render-job)
  end
end

If shader is configured to run in “Immediate Mode” you would instead get something like this:

foreach (render-job in render-job-packages)
  foreach (pass in multi-pass-shader)
    render (render-job)
  end
end

As you probably can imagine the latter results in more shader / state switches but can sometimes be necessary to guarantee correctly rendered results. A typical example is when using multi-pass shaders that does alpha blending.

Wrap up

The actual sort is implemented using a standard stable radix sort and happens immediately after the user has called RenderDevice::dispatch() handing over n-number of RenderContexts to the RenderDevice for translation into graphics API calls.

Next post will cover this and give an overview of what a typical rendering back-end (RenderDevice) looks like in Stingray. Stay tuned.

Friday, February 10, 2017

Stingray Renderer Walkthrough #3: Render Contexts

Stingray Renderer Walkthrough #3: Render Contexts

Render Contexts Overview

In the last post we covered how to create and destroy various GPU resources. In this post we will go through the system we have for recording a stream of rendering commands/packages that later gets consumed by the render backend (RenderDevice) where they are translated into actual graphics API calls. We call this interface RenderContext and similar to RenderResourceContext we can have multiple RenderContexts in flight at the same time to achieve data parallelism.

Let’s back up and reiterate a bit what was said in the Overview post. Typically in a frame we take the result of the view frustum culling, split it up into a number of chunks, allocate one RenderContext per chunk and then kick one worker thread per chunk. Each worker thread then sequentially iterates over its range of renderable objects and calls their render() function. The render() function takes the chunk’s RenderContext as one of its argument and is responsible for populating it with commands. When all worker threads are done the resulting RenderContexts gets “dispatched” to the RenderDevice.

So essentially the RenderContext is the output data structure for the second stage Render as discussed in the Overview post.

The RenderContext is very similar to the RenderResourceContext in the sense that it’s a fairly simple helper class for populating a command buffer. There is one significant difference though; the RenderContext also has a mechanics for reasoning about the ordering of the commands in the buffer before they get translated into graphics API calls by the RenderDevice.

Ordering & Buffers

We need a way to reorder commands in one or many RenderContexts to make sure triangles end up on the screen in the right order, or more generally speaking; to schedule our GPU work.

There are many ways of dealing with this but my favorite approach is to just associate one or many commands with a 64 bit sort key and when all commands have been recorded simply sort them on this key before translating them into actual graphics API calls. The approach we are using in Stingray is heavily inspired by Christer Ericsson’s blog post “Order your graphics draw calls around!”. I will be covering our sorting system in more details in my next post, for now the only thing important to grasp is that while the RenderContext records commands it does so by populating two buffers. One is a simple array of a POD struct called Command:

struct Command
{
    uint64_t sort_key;
    void *head;
    uint32_t command_flags;
};
  • sort_key - 64 bit sort key used for reordering commands before being consumed by the RenderDevice, more on this later.
  • head - Pointer to the actual data for this command.
  • command_flags - A bit flag encoding some hinting about what kind of command head is actually pointing to. This is simply an optimization to reduce pointer chasing in the RenderDevice, it will be covered in more detail in a later post.

Render Package Stream

The other buffer is what we call a RenderPackageStream and is what holds the actual command data. The RenderPackageStream class is essentially just a few helper functions to put arbitrary length commands into memory. The memory backing system for RenderPackageStreams is somewhat more complex than a simple array though, this is because we need a way to keep its memory footprint under control. For efficiency, we want to recycle the memory instead of reallocating it every frame, but depending on workload we are likely to get some RenderContexts becoming much larger than others. This creates a problem when using simple arrays to store the commands as the workload will shift slightly over time causing all arrays having to grow to fit the worst case scenario, resulting in lots of wasted memory.

To combat this we allocate and return fixed size blocks of memory from a pool. As we know the size of each command before writing them to the buffer we can make sure that a command doesn’t end up spanning multiple blocks; if we detect that we are about to run out of memory in the active block we simply allocate a new block and move on. If we detect that a single command will span multiple blocks we make sure to allocate them sequentially in memory. We return a block to the pool when we are certain that the consumer of the data (in this case the RenderDevice) is done with it. (This memory allocation approach is well described in Christian Gyrling’s excellent GDC 2015 presentation Parallelizing the Naughty Dog Engine Using Fibers)

You might be wondering why we put the sort_key in a separate array instead of putting it directly into the header data of the packages written to the RenderPackageStream, there are a number of reasons for that:

  1. The actual package data can become fairly large even for regular draw calls. Since we want to make the packages self contained we have to put all data needed to translate the command into an graphics API call inside the package. This includes handles to all resources, constant buffer reflections and similar. I don’t know of any way to efficiently sort an array with elements of varying sizes.

  2. Since we allocate the memory in blocks, as described above, we would need to introduce some form of “jump label” and insert that into the buffer to know how and when to jump into the next memory block. This would further complicate the sorting and traversal of the buffers.

  3. It allows us to recycle the actual package data from one draw call to another when rendering multi-pass shaders as we simply can inject multiple Commands pointing to the same package data. (Which shader pass to use when translating the package into graphic API calls can later be extracted from the sort_key.)

  4. We can reduce pointer chasing by encoding hints in the Command about the contents of the package data. This is what we do in command_flags mentioned earlier.

Render Context interface

With the low-level concepts of the RenderContext covered let’s move on and look at how it is used from a users perspective.

If we break down the API there are essentially three different types of commands that populates a RenderContext:

  1. State commands - Commands affecting the state of the rendering pipeline (e.g render target bindings, viewports, scissoring, etc) + some miscellaneous commands.
  2. Rendering commands - Commands used to trigger draw calls and compute work on the GPU.
  3. Resource update commands - Commands for updating GPU resources.

1. State Commands

“State commands” are a series of commands getting executed in sequence for a specific sort_key. The interface for starting/stopping the recording looks like this:

class RenderContext
{
    void begin_state_command(uint64_t sort_key, uint32_t gpu_affinity_mask = GPU_DEFAULT);
    void end_state_command();
};
  • sort_key - the 64 bit sort key.
  • gpu_affinity_mask - I will cover this towards the end of this post but, for now just think of it as a bit mask for addressing one or many GPUs.

Here’s a small example showing what the recording of a few state commands might look like:

rc.begin_state_command(sort_key);
for (uint32_t i=0; i!=MAX_RENDER_TARGETS; ++i)
    rc.set_render_target(i, nullptr);
rc.set_depth_stencil_target(depth_shadow_map);
rc.clear(RenderContext::CLEAR_DEPTH);
rc.set_viewports(1, &viewport);
rc.set_scissor_rects(1, &scissor_rect);
rc.end_state_command();

While state commands primarily are used for doing bigger graphics pipeline state changes (like e.g. changing render targets) they are also used for some miscellaneous things like clearing of bound render targets, pushing/poping timer markers, and some other stuff. There is no obvious reasoning for grouping these things together under the name “state commands”, it’s just something that has happened over time. Keep that in mind as we go through the list of commands below.

Common commands

  • set_render_target(uint32_t slot, RenderTarget *target, const SurfaceInfo& surface_info);

    • slot - Which index of the “Multiple Render Target” (MRT) chain to bind
    • target - What RenderTarget to bind
    • surface_info - SurfaceInfo is a struct describing which surface of the RenderTarget to bind.
    struct SurfaceInfo {
        uint32_t array_index; // 0 in all cases except if binding a texture array
        uint32_t slice;       // 0 for 2D textures, 0-5 for cube maps, 0-n for volume textures
        uint32_t mip_level;   // 0-n depending on wanted mip level
    };
    
  • set_depth_stencil_target(RenderTarget *target, const SurfaceInfo& surface_info); - Same as above but for depth stencil.

  • clear(RenderContext::ClearFlags flags); - Clears currently bound render targets.

    • flags - enum bit flag describing what parts of the bound render targets to clear.
    enum ClearFlags {
        CLEAR_SURFACE   = 0x1,
        CLEAR_DEPTH     = 0x2,
        CLEAR_STENCIL   = 0x4
    };
    
  • set_viewports(uint32_t n_viewports, const Viewport *viewports);

    • n_viewports - Number of viewports to bind.
    • viewports - Pointer to first Viewport to bind. Viewport is a struct describing the dimensions of the viewport:
    struct Viewport {
        float x, y, width, height;
        float min_depth, max_depth;
    };
    

    Note that x, y, width and height are in unsigned normalized [0-1] coordinates to decouple render target resolution from the viewport.

  • set_scissor_rects(uint32_t n_scissor_rects, const ScissorRect *scissor_rects);

    • n_scissor_rects - Number of scissor rectangles to bind
    • scissor_rects - Pointer to the first ScissorRect to bind.
    struct ScissorRect {
        float x, y, width, height;
    };
    

    Note that x, y, width and height are in unsigned normalized [0-1] coordinates to decouple render target resolution from the scissor rectangle.

A bit more exotic commands

  • set_stream_out_target(uint32_t slot, RenderResource *resource, uint32_t offset);
    • slot - Which index of the stream out buffers to bind
    • resource - Which RenderResource to bind to that slot (has to point to a VertexStream)
    • offset - A byte offset describing where to begin writing in the buffer pointed to by resource.
  • set_instance_multiplier(uint32_t multiplier);
    Allows the user to scale the number instances to render for each render() call (described below). This is a convenience function to make it easier to implement things like Instanced Stereo Rendering.

Markers

  • push_marker(const char *name)
    Starts a new marker scope named name. Marker scopes are both used for gathering RenderDevice statistics (number of draw calls, state switches and similar) as well as for creating GPU timing events. The user is free to nestle markers if they want to better group statistics. More on this in a later post.
  • pop_marker(const char *name)
    Stops an existing marker scope named name.

2. Rendering

With most state commands covered let’s move on and look at how to record commands for triggering draw calls and compute work to a RenderContext.

For that we have a single function called render():

class RenderContext
{
    RenderJobPackage *render(const RenderJobPackage* job,
        const ShaderTemplate::Context& shader_context, uint64_t interleave_sort_key = 0,
        uint64_t shader_pass_branch_key = 0, float job_sort_depth = 0.f,
        uint32_t gpu_affinity_mask = GPU_DEFAULT);
};

job

First argument piped to render() is a pointer to a RenderJobPackage, and as you can see the function also returns a pointer to a RenderJobPackage. What is going on here is that the RenderJobPackage piped as argument to render() gets copied to the RenderPackageStream, the copy gets patched up a bit and then a pointer to the modified copy is returned to allow the caller to do further tweaks to it. Ok, this probably needs some further explanation…

The RenderJobPackage is basically a header followed by an arbitrary length of data that together contains everything needed to make it possible for the RenderDevice to later translate it into either a draw call or a compute shader dispatch. In practice this means that after the RenderJobPackage header we also pack RenderResource::render_resource_handle for all resources to bind to all different shader stages as well as full representations of all non-global shader constant buffers.

Since we are building multiple RenderContexts in parallel and might be visiting the same renderable object (mesh, particle system, etc) simultaneously from multiple worker threads, we cannot mutate any state of the renderable when calling its render() function.

Typically all renderable objects have static prototypes of all RenderJobPackages they need to be drawn correctly (e.g. a mesh with three materials might have three RenderJobPackages - one per material). Naturally though, the renderable objects don’t know anything about in which context they will be drawn (e.g. from what camera or in what kind of lighting environment) up until the point where their render() function gets called and the information is provided. At that point their static RenderJobPackages prototypes somehow needs to be patched up with this information (which typically is in the form of shader constants and/or resources).

One way to handle that would be to create a copy of the prototype RenderJobPackage on the stack, patch up the stack copy and then pipe that as argument to RenderContext::render(). That is a fully valid approach and would work just fine, but since RenderContext::render() needs to create a copy of the RenderJobPackage anyway it is more efficient to patch up that copy directly instead. This is the reason for RenderContext::render() returning a pointer to the RenderJobPackage on the RenderPackageStream.

Before diving into the RenderJobPackage struct let’s go through the other arguments of RenderContext::render():

shader_context

We will go through this in more detail in the post about our shader system but essentially we have an engine representation called ShaderTemplate, each ShaderTemplate has a number of Contexts.

A Context is basically a description of any rendering passes that needs to run for the RenderJobPackage to be drawn correctly when rendered in a certain “context”. E.g. a simple shader might declare two contexts: “default” and “shadow”. The “default” context would be used for regular rendering from a player camera, while the “shadow” context would be used when rendering into a shadow map.

What I call a “rendering pass” in this scenario is basically all shader stages (vertex, pixel, etc) together with any state blocks (rasterizer, depth stencil, blend, etc) needed to issue a draw call / dispatch a compute shader in the RenderDevice.

interleave_sort_key

RenderContext::render() automatically figures out what sort keys / Commands it needs to create on it’s command array. Simple shaders usually only render into one layer in a single pass. In those scenarios RenderContext::render() will create a single Command on the command array. When using a more complex shader that renders into multiple layers and/or needs to render in multiple passes; more than one Command will be created, each command referencing the same RenderJobPackage in its Command::head pointer.

This can feel a bit abstract and is hard to explain without giving you the full picture of how the shader system works together with the data-driven rendering system which in turn dictates the bit allocation patterns of the sort keys, for now it’s enough to understand that the shader system somehow knows what Commands to create on the command array.

The shader author can also decide to bypass the data-driven rendering system and put the scheduling responsibility entirely in the hands of the caller of RenderContext::render(), in this case the sort key of all Commands created will simply become 0. This is where the interleave_sort_key comes into play, this variable will be bitwise ORed with the sort key before being stored in the Command.

shader_pass_branch_key

The shader system has a feature for allowing users to dynamically turn on/off certain rendering passes. Again this becomes somewhat abstract without providing the full picture but basically this system works by letting the shader author flag certain passes with a “tag”. A tag is simply a string that gets mapped to a bit within a 64 bit bit-mask. By bitwise ORing together multiple of these tags and piping the result in shader_pass_branch_key the user can control what passes to activate/deactivate when rendering the RenderJobPackage.

job_sort_depth

A signed normalized [0-1] floating point value used for controlling depth sorting between RenderJobPackages. As you will see in the next post this value simply gets mapped into a bit range of the sort key, removing the need for doing any kind of special trickery to manage things like back-to-front / front-to-back sorting of RenderJobPackages.

gpu_affinity_mask

Same as the gpu_affinity_mask parameter piped to begin_state_command().

RenderJobPackage

Let’s take a look at the actual RenderJobPackage struct:

struct RenderJobPackage
{
    BatchInfo batch_info;
    #if defined(COMPUTE_SUPPORTED)
        ComputeInfo compute_info;
    #endif

    uint32_t size;                          // size of entire package including extra data

    uint32_t n_resources;                   // number of resources assigned to job.
    uint32_t resource_offset;               // offset from start of RenderJobPackage to first RenderResource.

    uint32_t shader_resource_data_offset;   // offset to shader resource data
    RenderResource::Handle shader;          // shader used to execute job

    uint64_t instance_hash;                 // unique hash used for instance merging

    #if defined(DEVELOPMENT)
        ResourceID resource_tag;            // debug tag associating job to a resource on disc
        IdString32 object_tag;              // debug tag associating job to an object
        IdString32 batch_tag;               // debug tag associating job to a sub-batch of an object
    #endif
};

batch_info & compute_info

First two members are two nestled POD structs mainly containing the parameters needed for doing any kind of drawing or dispatching of compute work in the RenderDevice:

struct BatchInfo
{
    enum PrimitiveType {
        TRIANGLE_LIST,
        LINE_LIST
        // ...
    };
    enum FrontFace {
        COUNTER_CLOCK_WISE = 0,
        CLOCK_WISE = 1
    };

    PrimitiveType primitive_type;
    uint32_t vertex_offset;         // Offset to first vertex to read from vertex buffer.
    uint32_t primitives;            // Number of primitives to draw
    uint32_t index_offset;          // Offset to the first index to read from the index buffer
    uint32_t vertices;              // Number of vertices in batch (used if batch isn't indexed)
    uint32_t instances;             // Number of instances of this batch to draw
    FrontFace front_face;           // Defines which triangle winding order
};

Most of these are self explanatory, I think the only thing worth pointing out is the front_face enum. This is here to dynamically handle flipping of the primitive winding order when dealing with objects that are negatively scaled on an uneven number of axes. For typical game content it’s rare that we see content creators using mesh mirroring when modeling, for other industries however it is a normal workflow.

struct ComputeInfo
{
    uint32_t thread_count[3];
    bool async;
};

So while BatchInfo mostly holds the parameters needed to render something, ComputeInfo hold the parameters to dispatch a compute shader. The three element array thread_count containing the thread group count for x, y, z. If async is true the graphics API’s “compute queue” will be used instead of the “graphics queue”.

resource_offset

Byte offset from start of RenderJobPackage to an array of n_resources with RenderResource::Handle. Resources found in this array can be of the type VertexStream, IndexStream or VertexDeclaration. Based on the their type and order in the array they get bound to the input assembler stage in the RenderDevice.

shader_resource_data_offset

Byte offset from start of RenderJobPackage to a data block holding handles to all RenderResources as well as all constant buffer data needed by all the shader stages. The layout of this data blob will be covered in the post about the shader system.

instance_hash

We have a system for doing what we call “instance merging”, this system figures out if two RenderJobPackages only differ on certain shader constants and if so merges them into the same draw call. The shader author is responsible but not required to implement support for this feature. If the shader supports “instance merging” the system will use the instance_hash to figure out if two RenderJobPackages can be merged or not. Typically the instance_hash is simply a hash of all RenderResource::Handle that the shader takes as input.

resource_tag & object_tag & batch_tag

Three levels of debug information to make it easier to back track errors/warning inside the RenderDevice to the offending content.

3. Resource updates

The last type of commands are for dynamically updating various RenderResources (Vertex/Index/Raw buffers, Textures, etc).

The interface for updating a buffer with new data looks like this:

class RenderContext
{
    void *map_write(RenderResource *resource, render_sorting::SortKey sort_key,
        const ShaderTemplate::Context* shader_context = 0,
        shader_pass_branching::Flags shader_pass_branch_key = 0,
        uint32_t gpu_affinity_mask = GPU_DEFAULT);
};

resource

This function basically returns a pointer to the first byte of the buffer that will replace the contents of the resource. map_write() figures out the size of the buffer by casting the resource to the correct type (using the type information encoded in the RenderResource::render_resource_handle). It then allocates memory for the buffer and a small header on the RenderPackageStream and returns a pointer to the buffer.

sort_key & shader_context & shader_pass_branch_key

In some rare situations you might need to update the same buffer with different data multiple times within a frame. A typical example could be the vertex buffer of a particle system implementing some kind of level-of-detail system causing the buffers to change depending on e.g camera position. To support that the user can provide a bunch of extra parameters to make sure the contents of the GPU representation of the buffer is updated right before the graphics API draw calls are triggered for the different rendering passes. This works in a similar way how RenderContext::render() can create multiple Commands on the command array referencing the same data.

Unless you need to update the buffer multiple times within the frame it is safe to just set all of the above mentioned parameters to 0, making it very simple to update a buffer:

void *buf = rc.map_write(resource, 0);
// .. fill bits in buffer ..

Note: To shorten the length of this post I’ve left out a few other flavors of updating resources, but map_write is the most important one to grasp.

GPU Queues, Fences & Explicit MGPU programming

Before wrapping up I’d like to touch on a few recent additions to the Stingray renderer, namely how we’ve exposed control for dealing with different GPU Queues, how to synchronize between them and how to control, communicate and synchronize between multiple GPUs.

New graphics APIs such as DX12 and Vulkan exposes three different types of command queues: Graphics, Compute and Copy. There’s plenty of information on the web about this so I won’t cover it here, the only thing important to understand is that these queues can execute asynchronously on the GPU; hence we need to have a way to synchronize between them.

To handle that we have exposed a simple fence API that looks like this:

class RenderContext
{
    struct FenceMessage
    {
        enum Operation { SIGNAL, WAIT };
        Operation operation;
        IdString32 fence_name;
    };
    void signal_fence(IdString32 fence_name, render_sorting::SortKey sort_key,
        uint32_t queue = GRAPHICS_QUEUE, uint32_t gpu_affinity_mask = GPU_DEFAULT);
    void wait_fence(IdString32 fence_name, render_sorting::SortKey sort_key,
        uint32_t queue = GRAPHICS_QUEUE, uint32_t gpu_affinity_mask = GPU_DEFAULT);
};

Here’s a pseudo code snippet showing how to synchronize between the graphics queue and the compute queue:

uint64_t sort_key = 0;

// record a draw call
rc.render(graphics_job, graphics_shader, sort_key++);

// record an asynchronous compute job
// (ComputeInfo::async bool in async_compute_job is set to true to target the graphics APIs compute queue)
rc.render(async_compute_job, compute_shader, sort_key++);

// now lets assume the graphics queue wants to use the result of the async_compute_job,
// for that we need to make sure that the compute shader is done running
rc.wait_fence(IdString32("compute_done"), sort_key++, GRAPHICS_QUEUE);
rc.signal_fence(IdString32("compute_done"), sort_key++, COMPUTE_QUEUE);

rc.render(graphics_job_using_result_from_compute, graphics_shader2, sort_key++);

As you might have noticed all methods for populating a RenderContext described in this post also takes an extra parameter called gpu_affinity_mask. This is a bit-mask used for directing commands to one or many GPUs. The idea is simple, when we boot up the renderer we enumerate all GPUs present in the system and decide which one to use as our default GPU (GPU_DEFAULT) and assign that to bit 1. We also let the user decide if there are other GPUs present in the system that should be available to Stingray and if so assign them bit 2, 3, 4, and so on. By doing so we can explicitly direct control of all commands put on the RenderContext to one or many GPUs in a simple way.

As you can see that is also true for the fence API described above, on top of that there’s also a need for a copy interface to copying resources between GPUs:

class RenderContext
{
    void copy(RenderResource *dst_resource, RenderResource *src_resource,
        render_sorting::SortKey sort_key, Box *src_box = 0, uint32_t dst_offsets[3] = 0,
        uint32_t queue = GRAPHICS_QUEUE, uint32_t gpu_affinity_mask = GPU_DEFAULT,
        uint32_t gpu_source = GPU_DEFAULT, uint32_t gpu_destination = GPU_DEFAULT);
};

Even though this work isn’t fully completed I still wanted to share the high-level idea of what we are working towards for exposing explicit MGPU control to the Stingray renderer. We are actively working on this right now and with some luck I might be able to revisit this with more concrete examples when getting to the post about the render_config & data-driven rendering.

Next up

With that I think I’ve covered the most important aspects of the RenderContext. Next post will dive a bit deeper into bit allocation ranges of the sort keys and the system for sorting in general, hopefully that post will become a bit shorter.