Current state of global illumination

Dust Engine currently supports multi-bounce diffuse lighting. Specular materials and emissive voxels are not yet supported.

Lighting passes

Test case

[Screenshot: test scene, captured 2023-11-05]

All timings were obtained on an NVIDIA RTX 3060 Ti. This was captured on this commit. Total frame time is 9.03ms.

Primary Rays (GBuffer): 2.36ms

We cast rays from the camera directly into the scene. We don't have jittering or TAA just yet. At 2.36ms, there isn't much point in rasterizing anything at all. Virtualized geometry techniques like Nanite would likely offer little to no gain, and you'd have to keep two copies of the same geometry in memory.

Our GBuffer contains:

  • Radiance + Secondary Ray Hit Dist: RGBA16Float. YCoCg in RGB, HitDist in A.
  • Albedo: RGB10A2
  • Normal: RGB10A2
  • Motion: RGBA16Float. Stores world-space motion in RGB; the A channel is unused.
  • Depth: R32
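The YCoCg encoding used for the radiance target can be sketched as below. This is a minimal CPU-side sketch of the standard YCoCg transform; the exact variant and scaling the shader uses may differ.

```rust
// RGB <-> YCoCg transform for the radiance target (sketch; the
// engine's exact variant and scaling may differ).
fn rgb_to_ycocg(r: f32, g: f32, b: f32) -> (f32, f32, f32) {
    let y = 0.25 * r + 0.5 * g + 0.25 * b;   // luma
    let co = 0.5 * r - 0.5 * b;              // orange chroma
    let cg = -0.25 * r + 0.5 * g - 0.25 * b; // green chroma
    (y, co, cg)
}

fn ycocg_to_rgb(y: f32, co: f32, cg: f32) -> (f32, f32, f32) {
    let tmp = y - cg;
    (tmp + co, y + cg, tmp - co)
}
```

The transform is exactly invertible in floating point, which is why it pairs well with an RGBA16Float target.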

[Image: albedo buffer]

Final Gather: 2.23ms (ambient occlusion) + 0.94ms (final gather)

The final gather pass happens at full resolution, and it's done in two stages.

Ambient Occlusion

First, we trace ambient occlusion rays from the primary ray hit points with a cosine distribution. These rays traverse the geometry at the finest LOD to capture ambient occlusion caused by small-scale geometric detail. They have a maximum traversal distance of 8.0, meaning that they trigger the miss shader if they don't hit anything within 8 voxels.
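The cosine distribution mentioned above can be generated with the standard concentric/Malley-style construction. A minimal sketch (the engine's shader-side sampler may differ):

```rust
// Cosine-weighted hemisphere sample around +Z (Malley's method).
// u1, u2 are uniform random numbers in [0, 1); the caller rotates the
// result into the surface's tangent frame.
fn cosine_sample_hemisphere(u1: f32, u2: f32) -> [f32; 3] {
    let r = u1.sqrt();
    let phi = 2.0 * std::f32::consts::PI * u2;
    // z = sqrt(1 - r^2) keeps the sample on the unit hemisphere;
    // the resulting density is proportional to cos(theta) = z.
    [r * phi.cos(), r * phi.sin(), (1.0 - u1).max(0.0).sqrt()]
}
```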

We do not run a closest-hit shader for ambient occlusion rays at this time, which biases the rendered image darker. In the future, we could try to retrieve the irradiance at secondary hit points from the previous frame's image.

Profiling shows this pass to be slower than expected, so doing some screen-space tracing before falling back to the world-space acceleration structure should help here. As an optimization, we can also sample the previous frame's denoised image where it is visible.

Final Gather

For ambient occlusion rays that missed the geometry, we continue their paths as the actual final gather rays. In this pass, the rays instead intersect the rough geometry, where each block represents 4x4x4 voxels in world space. Notably, when an ambient occlusion ray terminates, we do not launch its final gather ray unless all voxels in the current 4x4x4 grid are empty. This prevents light-leaking problems.
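The "all voxels empty" check is cheap if each 4x4x4 block's occupancy is available as a 64-bit mask, one bit per voxel. Note that this layout is an assumption for illustration; the engine's actual occupancy storage may differ.

```rust
// One 4x4x4 block's occupancy as a 64-bit mask, one bit per voxel
// (hypothetical layout; the engine's actual storage may differ).
fn voxel_bit(x: u32, y: u32, z: u32) -> u64 {
    debug_assert!(x < 4 && y < 4 && z < 4);
    1u64 << (x + y * 4 + z * 16)
}

// A final gather ray is only launched when the block the ambient
// occlusion ray terminated in contains no voxels at all.
fn may_launch_final_gather(occupancy: u64) -> bool {
    occupancy == 0
}
```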

Rough Geometry

When final gather rays miss the rough geometry, we sample the sky dome using the Hosek-Wilkie sky model. When they hit something, we sample the spatial hash and then stochastically schedule the hit surfel patch to be sampled.

Surfels: 0.73ms

We schedule surfel patches to be sampled by writing the following entry into a buffer of SURFEL_RAY_BUDGET entries. SURFEL_RAY_BUDGET is a constant we can adjust based on the performance of the GPU. The timing shown above is for SURFEL_RAY_BUDGET = 720*480.

struct SurfelEntry {
    position: Vec3, // The coordinates of the center of the box.
    direction: u32, // [0, 6) indicating one of the six faces of the cube
}
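On the GPU this write would go through an atomic counter in the shader; the CPU-side sketch below shows the idea. The overflow policy (dropping entries past the budget for the frame) is an assumption, not necessarily what the engine does.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const SURFEL_RAY_BUDGET: usize = 720 * 480;

#[derive(Clone, Copy, Default)]
struct SurfelEntry {
    position: [f32; 3], // coordinates of the center of the box
    direction: u32,     // [0, 6): one of the six faces of the cube
}

// An atomic counter hands out slots in the fixed-size buffer; entries
// past the budget are dropped for this frame (hypothetical policy).
fn schedule_surfel(
    buffer: &mut [SurfelEntry],
    counter: &AtomicU32,
    entry: SurfelEntry,
) -> bool {
    let slot = counter.fetch_add(1, Ordering::Relaxed) as usize;
    if slot < SURFEL_RAY_BUDGET.min(buffer.len()) {
        buffer[slot] = entry;
        true
    } else {
        false
    }
}
```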

Each frame, we launch SURFEL_RAY_BUDGET surfel rays from the surfel patches into the scene with a cosine distribution. The surfel rays intersect the rough geometry; they sample the spatial hash on hit and the sky dome on miss. The sample is injected into the spatial hash of the origin surfel. Additionally, the surfel patch that was hit can itself be scheduled stochastically. This is how we get multiple bounces.
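One way to inject a sample is an incremental mean over the entry's sample count. This is a guess at the accumulation scheme; the engine stores radiance LogLuv-encoded in a single u32, and the decode/encode is omitted here.

```rust
// Incremental-mean injection of one radiance sample into a spatial
// hash entry (hypothetical accumulation scheme; LogLuv decode/encode
// of the stored u32 omitted).
fn inject_sample(
    mean: [f32; 3],
    sample_count: u16,
    sample: [f32; 3],
) -> ([f32; 3], u16) {
    let n = sample_count.saturating_add(1);
    let w = 1.0 / n as f32;
    let mut out = mean;
    for i in 0..3 {
        // mean + (sample - mean) / n is the running average update.
        out[i] = mean[i] + (sample[i] - mean[i]) * w;
    }
    (out, n)
}
```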

One potential optimization here is sampling the previous frame's denoised image when the surfel patches happen to be on screen.

Spatial Hash

The spatial hash is keyed based on the surfel patch position and orientation.

struct SpatialHashKey {
    ivec3 position;
    uint8_t direction; // [0, 6) indicating one of the six faces of the cube
};

Like GI1.0, we leverage the study by Jarzynski and Olano 2020 and pick two fast hash functions, pcg and xxhash32, which were found to produce little to no correlated collisions between one another. One hash is used to index into the buffer and the other serves as the fingerprint. We store the fingerprint (instead of the whole key) in the spatial hash entry.

struct SpatialHashEntry {
    fingerprint: u32,
    radiance: u32, // Encode as LogLuv
    last_accessed_frame: u16,
    sample_count: u16,
}
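The two hash functions are standard and small enough to reproduce here. How the key's fields are folded into a single seed word before hashing is an assumption; the engine may mix the struct fields differently.

```rust
// pcg and xxhash32 single-word hashes from Jarzynski & Olano 2020.
fn pcg(v: u32) -> u32 {
    let state = v.wrapping_mul(747796405).wrapping_add(2891336453);
    let word = ((state >> ((state >> 28).wrapping_add(4))) ^ state)
        .wrapping_mul(277803737);
    (word >> 22) ^ word
}

fn xxhash32(v: u32) -> u32 {
    const P2: u32 = 2246822519;
    const P3: u32 = 3266489917;
    const P4: u32 = 668265263;
    const P5: u32 = 374761393;
    let mut h = v.wrapping_add(P5);
    h = P4.wrapping_mul(h.rotate_left(17));
    h = P2.wrapping_mul(h ^ (h >> 15));
    h = P3.wrapping_mul(h ^ (h >> 13));
    h ^ (h >> 16)
}

// Folding the key into one seed word is hypothetical; the engine may
// hash the fields of SpatialHashKey differently.
fn hash_key(position: [i32; 3], direction: u8) -> (u32, u32) {
    let seed = (position[0] as u32)
        .wrapping_mul(0x9E3779B9)
        .wrapping_add((position[1] as u32).wrapping_mul(0x85EBCA6B))
        .wrapping_add((position[2] as u32).wrapping_mul(0xC2B2AE35))
        .wrapping_add(direction as u32);
    (pcg(seed), xxhash32(seed)) // (bucket index source, fingerprint)
}
```

Using two uncorrelated hashes means a fingerprint match in the right bucket almost certainly identifies the right key, without storing the key itself.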

When reading the hash map, we perform linear probing over up to 3 entries, comparing fingerprints. When writing, we evict the entry that was least recently used. Note that we do not perform synchronization on this write; we basically say "hehe" to those hazards. As long as the radiance value is a single u32, we should be mostly fine.
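The probe-and-evict logic can be sketched as follows. This is a single-threaded sketch that leaves out the unsynchronized-write behavior described above, and the probe-window eviction policy is my reading of "evict the least recently used".

```rust
#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct SpatialHashEntry {
    fingerprint: u32,
    radiance: u32, // LogLuv-encoded
    last_accessed_frame: u16,
    sample_count: u16,
}

const PROBE_LEN: usize = 3;

// Read: linear probing over up to 3 slots, matching on fingerprint.
fn find(table: &[SpatialHashEntry], index: u32, fingerprint: u32) -> Option<usize> {
    (0..PROBE_LEN)
        .map(|i| (index as usize + i) % table.len())
        .find(|&slot| table[slot].fingerprint == fingerprint)
}

// Write: evict the least recently used slot in the probe window.
fn insert(table: &mut [SpatialHashEntry], index: u32, fingerprint: u32, frame: u16) -> usize {
    let slot = (0..PROBE_LEN)
        .map(|i| (index as usize + i) % table.len())
        .min_by_key(|&s| table[s].last_accessed_frame)
        .unwrap();
    table[slot] = SpatialHashEntry {
        fingerprint,
        last_accessed_frame: frame,
        ..Default::default()
    };
    slot
}
```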

Performance Analysis

Nsight Tracing

[Screenshot: Nsight trace]

Trace file: tracing.zip

Analysis and Next Steps

  • It's surprising to me that the ambient occlusion pass took longer than the final gather rays. Incoherent rays spend a long time in the space immediately adjacent to the geometry. Ray marching in screen space before tracing rays in world space should help a bit.
  • Tracing shadow rays really isn't that expensive. We can probably afford to do more once we have emissive voxels.
  • 11% of the frame time is spent in ambient occlusion intersection shaders, versus 5% in primary ray intersection shaders. This really indicates the importance of optimizing the ambient occlusion pass.
  • A better sampling strategy for scheduling surfel ray launches would be preferable.