Volumetric Data Structure

When it comes to rendering volumetric content on screens, the storage method employed makes all the difference. This article delves deep into the various volumetric data structures relevant to our use case, describes the data structure we use in Dust Engine, and tries to shed some light on the advantages of using hardware ray tracing for voxel content.

What makes a good volumetric data structure

  • Leverage sparsity and coherency: Scenes in video games often consist of vast regions of similar values. For example, everything above ground would likely be nothing but air. Everything underground would likely be just dirt. We shouldn't have to store "dirt" millions of times.
  • Maximize self-similarity: Geometry usually contains self-similar features. For instance, cars of the same model should share common structural components in a scene. Our data structure should be able to easily capture that.
  • Simplicity in modification: Voxel-based games enable intuitive geometry modifications. This feature should remain unaffected by our choice of data structure.
  • Render-readiness: The data structure should be effortlessly consumable by the GPU. Any preliminary transformation required should be fast and easy.

Categorizing volumetric data structures

There are different trade-offs to make when selecting a data structure for your volumetric dataset. Depending on the nature of your dataset and what you want to do with it, the trade-offs will be different. Let us first review a few data structures commonly employed.

We can roughly divide volumetric data structures into two classes based on whether the coordinates of the samples are stored explicitly.

Implicit data structures

For implicit spatial data structures, the spatial coordinates of the nodes are implicit, usually derived from their index or access path within the data structure.

Uniform Grid

Basically a 3D grid with samples stored at each location. Depending on what you store inside this 3D array, you would have a few different methods to visualize it:

  • Ray Marching: Store the distance to the geometry surface in the grid. Cast a ray, sample the grid where it lands, then "jump forward" a certain distance based on the sampled value.
  • Ray Tracing: Use a DDA algorithm to traverse through all cells intersected by the ray. Terminate the ray on the first hit point.
  • Isosurface extraction: Assuming that you have a Signed Distance Field (SDF), you may create a mesh from the SDF, and then rasterize the mesh. Examples: Marching Cubes, Dual Contouring. The Greedy Meshing algorithm used by Minecraft would also be in this category.
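For intuition, here is what the ray-marching variant looks like as a minimal CPU-side sketch in Rust. The `Grid` type, its nearest-neighbor sampling, and the iteration/epsilon constants are all illustrative assumptions, not engine code:

```rust
// A dense SDF grid: each cell stores the distance to the nearest surface,
// in cell units. (Hypothetical type for illustration.)
struct Grid {
    size: usize,    // cells per axis
    data: Vec<f32>, // distance values, z-major layout
}

impl Grid {
    fn sample(&self, p: [f32; 3]) -> f32 {
        // Nearest-neighbor sampling; a real renderer would interpolate.
        let idx = |v: f32| (v.max(0.0) as usize).min(self.size - 1);
        self.data[(idx(p[2]) * self.size + idx(p[1])) * self.size + idx(p[0])]
    }
}

/// March a ray until the sampled distance falls below a small epsilon,
/// jumping forward by the sampled distance on each iteration.
fn ray_march(grid: &Grid, origin: [f32; 3], dir: [f32; 3]) -> Option<[f32; 3]> {
    let mut t = 0.0_f32;
    for _ in 0..128 {
        let p = [
            origin[0] + dir[0] * t,
            origin[1] + dir[1] * t,
            origin[2] + dir[2] * t,
        ];
        let d = grid.sample(p);
        if d < 1e-3 {
            return Some(p); // close enough to the surface: report a hit
        }
        t += d; // the SDF guarantees no surface within distance d
        if t > grid.size as f32 * 2.0 {
            return None; // left the volume
        }
    }
    None
}
```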

Uniform grids are great when your data is dense. However, when you have a teapot in a stadium problem, a uniform grid won't be able to help you much. Let's say you're rendering a scene containing a small teapot at the center of a large stadium. The teapot is very detailed, with many small features, while the stadium is relatively simple in shape. You will have a hard time selecting an appropriate resolution for your uniform grid without running out of memory. Solving problems like this usually requires some kind of hierarchical data structure with an adaptive level of detail.

Sparse Voxel Octree

A Sparse Voxel Octree (SVO) is one such type of hierarchical data structure. At each level of the tree, it divides a cube into 8 smaller cubes, or voxels. The traversal path from the root to the leaf would then determine the spatial bounding box of the leaf. When building the octree, you can inspect the eight children of a node, and truncate all nodes under the same branch if they contain the same value. This makes it a very efficient way to represent scenes with a lot of empty space or coherent areas with the same values.
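The truncation rule is simple enough to sketch. The following Rust snippet builds an octree bottom-up from an occupancy predicate and collapses any branch whose eight children are identical uniform leaves; the types and the callback are hypothetical, not Dust Engine code:

```rust
// Minimal SVO over a boolean occupancy volume, pruning uniform subtrees.
enum SvoNode {
    Leaf(bool),                // uniform region: all solid or all empty
    Branch(Box<[SvoNode; 8]>), // 8 children, octant order
}

/// Build the octree for the cube at `origin` with side length `size`
/// (a power of two), collapsing branches whose children agree.
fn build(
    occupied: &dyn Fn(u32, u32, u32) -> bool,
    origin: (u32, u32, u32),
    size: u32,
) -> SvoNode {
    if size == 1 {
        return SvoNode::Leaf(occupied(origin.0, origin.1, origin.2));
    }
    let h = size / 2;
    let mut children = Vec::with_capacity(8);
    for i in 0..8u32 {
        let o = (
            origin.0 + (i & 1) * h,
            origin.1 + ((i >> 1) & 1) * h,
            origin.2 + ((i >> 2) & 1) * h,
        );
        children.push(build(occupied, o, h));
    }
    // Truncate: if all eight children are identical uniform leaves, merge them.
    if let SvoNode::Leaf(v) = &children[0] {
        if children.iter().all(|c| matches!(c, SvoNode::Leaf(w) if w == v)) {
            return SvoNode::Leaf(*v);
        }
    }
    let arr: [SvoNode; 8] = children.try_into().ok().unwrap();
    SvoNode::Branch(Box::new(arr))
}
```

A fully empty region of any size collapses to a single `Leaf(false)`, which is exactly how the "millions of air voxels" case becomes cheap.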

You can still do Marching Cubes on the SVO while taking advantage of the sparsity (which I have done in the past) - a divide-and-conquer algorithm runs Marching Cubes on the 3 internal planar surfaces, then recurses into the 8 children. It's also fairly easy to trace rays with your SVO - Efficient Sparse Voxel Octrees provides us with such an implementation, and you can achieve fairly high geometric complexity with this technique.

The biggest advantage of SVO is the fact that when you trace rays on it, you don't really need a dual representation on the CPU side, and you should be able to do a lot of simulations directly on the SVO. On systems with integrated GPU this means one copy of the geometry for both the CPU and the GPU - effectively doubling your memory capacity.

That being said, SVOs usually aren't as compact as they could be, mainly due to the sheer number of pointers you have to store. This, combined with the low branching factor (8) at each level, makes them suboptimal for GPU programs. Memory latency for VRAM is high, so the best acceleration structures for GPU ray tracing usually try to reduce the number of indirections (pointer dereferences) you have to follow when tracing rays. The cost of those indirections is particularly pronounced when you look far into the horizon on flat terrain. When programming the SVO ray traverser, you'd generally set a max iteration count. Rays close to the geometry surface tend to march small distances on each iteration, causing those rays to terminate early without hitting anything. This effect can be seen in this video I posted almost 2 years ago: the white edges around geometric surfaces indicate a high traversal count, and it's very typical SVO behavior.

Sparse Voxel DAG

Sparse Voxel DAG optimizes a binary SVO (an SVO storing only a boolean occupancy flag) by merging common subtrees, therefore turning it into a Directed Acyclic Graph (DAG). The compression is achieved by exploiting self-similarity, and it turns out that when you only care about occupancy, the geometry actually contains a lot of self-similarity to be exploited. This makes SVDAG suitable for tracing shadow rays or ambient occlusion rays.
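The subtree-merging step is essentially hash-consing: intern every node by its tuple of child indices, and identical subtrees collapse to a single id. A minimal Rust sketch follows; the node layout and the sentinel ids are illustrative, and the SVDAG paper's actual encoding differs:

```rust
use std::collections::HashMap;

// Hash-consed SVDAG construction sketch. A node is 8 child indices into a
// shared node table; ids 0/1 are reserved sentinels for "all empty" and
// "all solid" subtrees of any size.
type NodeId = u32;
const EMPTY: NodeId = 0;
const SOLID: NodeId = 1;

struct Dag {
    nodes: Vec<[NodeId; 8]>,
    dedup: HashMap<[NodeId; 8], NodeId>,
}

impl Dag {
    fn new() -> Self {
        Dag { nodes: Vec::new(), dedup: HashMap::new() }
    }

    /// Intern a node: identical child tuples always map to the same id,
    /// so common subtrees are stored exactly once.
    fn intern(&mut self, children: [NodeId; 8]) -> NodeId {
        if children.iter().all(|&c| c == EMPTY) { return EMPTY; }
        if children.iter().all(|&c| c == SOLID) { return SOLID; }
        if let Some(&id) = self.dedup.get(&children) { return id; }
        let id = self.nodes.len() as NodeId + 2; // offset past the sentinels
        self.nodes.push(children);
        self.dedup.insert(children, id);
        id
    }
}
```

Building the DAG bottom-up through this function is what bounds the second-lowest level to at most 256 distinct nodes, as quoted later in this article.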

Another realization I made when researching this was the importance of decoupling geometry and material. It turns out that it's still beneficial to use SVDAG even if you care about more than geometric occupancy. When tracing rays, you do many lookups into the geometry of the model, but only one into the material (color) of the hit point. A reasonable optimization is therefore "making the common case fast". Why spend time loading tons of data from VRAM when you're just comparing it with zero? With a compacted geometry representation, SVDAG is more efficient to trace - when your tree nodes are smaller, and you have fewer of them, you're also more likely to hit the cache.

With the material decoupled from the geometry, you can apply material compression techniques to save some video memory. Geometry and Attribute Compression for Voxel Scenes assigns each node an index based on the tree traversal order. The attribute data can then be stored separately in a texture, optionally with block compression like BCn/ASTC to reduce the size.

OpenVDB

I put OpenVDB here because it's essentially a generalization of implicit hierarchical volumetric data structures. It was described as a B+ tree, but really it's just a shallow, wide hierarchical grid with compile-time configurable branching factors at each level, a maximum depth fixed at compile time, and a root node that is usually a hash map. They also claim O(1) random access, which comes from the fact that the tree spans the entire UVec3 coordinate space in just a few levels.

Why is the compile-time configurable branching factor so important? As mentioned above, one of the biggest downsides of an SVO is the low branching factor of 8. When you have such a low branching factor, traversing down to the levels of individual voxels requires quite a bit of following pointers. GPUs simply hate indirections, but CPUs also suffer from memory latencies (to a smaller degree). We usually have some high-level knowledge about the model we're trying to load, so in a sense, having that compile-time configurability allows us to fit the shape of the hierarchy based on the shape of the model.
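To make the O(1) random access claim concrete, here is a sketch of how a voxel coordinate decomposes into one child index per level with pure bit arithmetic - no searching, no data-dependent loop. The per-level log2 extents of 4, 2, 2 match the MagicaVoxel-style configuration discussed later in this article; the function itself is illustrative:

```rust
/// Decompose a voxel coordinate into one (x, y, z) child index per level,
/// root first. Assumes per-axis log2 extents of [4, 2, 2] from root to
/// leaf: a 16-wide root, 4-wide internal nodes, and 4x4x4 leaves,
/// spanning 16 * 4 * 4 = 256 voxels per axis.
fn access_path(mut x: u32, mut y: u32, mut z: u32) -> [(u32, u32, u32); 3] {
    const LOG2: [u32; 3] = [4, 2, 2];
    let mut path = [(0, 0, 0); 3];
    // Peel bits off the low end: the last entry addresses within the leaf.
    for level in (0..3usize).rev() {
        let mask = (1u32 << LOG2[level]) - 1;
        path[level] = (x & mask, y & mask, z & mask);
        x >>= LOG2[level];
        y >>= LOG2[level];
        z >>= LOG2[level];
    }
    path
}
```

Because the depth and the shift amounts are compile-time constants, the whole lookup compiles down to a fixed sequence of shifts and masks, which is where the constant-time guarantee comes from.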

Explicit data structures

Explicit volumetric data structures store the coordinates of a node/vertex explicitly within the data structure itself. One such example is actually the 3D mesh we're all familiar with.

The good old mesh

Meshes are basically an array of coordinates, and what matters is how you interpret those coordinates. Graphics pipelines usually support at least one of the triangle primitive topologies - they may also support Points, Lines, and even Quads. When you interpret each coordinate in the array as a voxel, this becomes a perfectly good volumetric dataset, also known as a point cloud. This is also the approach used by the popular voxel CAD program MagicaVoxel. Each model is a list of voxels within a 256x256x256 extent, and each voxel carries just one 8-bit integer indexing into a color palette. The only difference between the MagicaVoxel format and a point cloud is that each point in MagicaVoxel is assumed to be a unit cube, and the coordinates are integers.
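As a sketch, a MagicaVoxel-style explicit voxel fits in 4 bytes. The field names here are illustrative - consult the .vox format specification for the actual chunk layout:

```rust
// One explicit voxel: integer coordinates plus a palette index.
// With a 256^3 extent, each coordinate fits in a u8, so the whole
// voxel packs into 4 bytes.
#[repr(C)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct Voxel {
    x: u8,
    y: u8,
    z: u8,
    color_index: u8, // index into a 256-entry palette
}

// A model is just a flat list of voxels - a point cloud of unit cubes.
type Model = Vec<Voxel>;
```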

Visualizing datasets in this class of format is also incredibly easy for the GPU. The GPU really likes to eat big chunks of data, and then do a bit of number crunching on them, instead of following random pointers and indirections. If you don't care about global illumination, you can totally just do instanced rendering on this list of coordinates, rasterizing 6 triangles per cube. Your GPU will be quite happy with that even with up to 30 million blocks.

Compared to an implicit data structure, explicit data structures are "sparse" by nature. When you're allowed to store the spatial coordinates of your blocks, you are no longer constrained by the topology of your data structure, so you're free to have all the teapots you want in a stadium.

On the other hand, dense volumetric datasets represented this way are less space efficient. Take the 256x256x256 MagicaVoxel model as an example. Represented as a 3D grid, it has a constant size of 16MB. As a list of explicitly represented voxels, it is 64MB. The difference becomes more dramatic when you're only storing 1 bit of occupancy data. In general, for a fully populated grid of voxels, if storing the coordinates requires X bits and the data stored at each sample location requires Y bits, a list of voxels with explicit coordinates will be (X + Y) / Y times as large as an equivalent uniform grid.
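The formula can be checked directly against the MagicaVoxel numbers with a trivial helper (for illustration only):

```rust
/// Size ratio of an explicit voxel list over a fully populated uniform
/// grid: X bits of coordinates per sample, Y bits of payload per sample.
fn explicit_to_grid_ratio(coord_bits: u32, payload_bits: u32) -> f64 {
    (coord_bits + payload_bits) as f64 / payload_bits as f64
}
```

With X = 24 coordinate bits and Y = 8 payload bits the ratio is 4, matching the 16MB vs 64MB figures above; with Y = 1 bit of occupancy it balloons to 25.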

Bounding Volume Hierarchy

Bounding Volume Hierarchy is especially important because it's the acceleration structure used by hardware RTX. Although the details of the data structure are usually opaque to programmers of the hardware RTX APIs, it's still important to understand how it works. They are explicit because they store the min/max coordinates of the AABB in each node. Different hardware vendors will have different implementations of the BVH best optimized for their architecture (for example the branching factors on each level), and I will refer you to this article for details.

Apart from accelerating ray tracing, BVH is also used for implementing Nanite. The principal idea for Nanite is that instead of walking the BVH per pixel, we instead walk the BVH once per view, then occlusion cull BVH branches using a Hierarchical Z-buffer. This is quite nice when you only need primary hit results.

Putting them all together

From the comparison between explicit and implicit spatial data structure, we can see that explicit datasets are clearly better at managing scene complexity at higher levels of the tree - levels where you're more likely to have teapots in stadiums. However, as we get closer to the leaf levels, the data becomes a lot more dense and high frequency. This hints at the possibility of transitioning to a different acceleration structure at the leaf levels.

If you've read Ray-Tracing Small Voxel Scenes - a chapter in Ray Tracing Gems 2 - or NVIDIA's Micromaps, you'll notice parallels between the techniques they describe and the one employed in the Dust Engine. We're going to discuss the rationale for adopting this particular approach and examine how our approach builds upon and refines the findings of the original author in an effort to scale it up. Additionally, we will explore strategies for the real-time modification of geometry, which is required to enable the level of interactivity promised by voxel games.

Why Hardware RTX for voxels?

I've heard people arguing that a well-written software ray tracer tracing against a non-BVH implicit data structure (Like SVDAG) will outperform hardware RTX. However, given the widespread adoption and standardization of hardware RTX, one would need a particularly persuasive reason to forgo its benefits. Importantly, hardware RTX gives you a couple of things for free:

  • Fast BVH traversal. If you're going to have to trace BVHs anyway, you'd have a really hard time beating the hardware. Depending on the GPU, AMD/Intel/NVIDIA will select the best type of BVH most suitable to their architecture.
  • Reduced Maintenance Responsibility: By leveraging Hardware RTX, the job of performance optimization shifts somewhat to the hardware manufacturer. They are in the best position to do this task because they likely know a lot better about the intricacy of the GPU architecture.
  • Two Level Acceleration Structure. This is very important for any non-trivial games because you most definitely want instancing and relative movements between game objects.
  • A standardized pipeline structure. The shaders offered by the ray tracing pipelines - RayGen, Miss, AnyHit, ClosestHit, and Intersection shader - map nicely to the concepts of geometry and material, allowing you to have game objects with different intersection algorithms/shading algorithms to coexist in the same scene.
  • Ray Sorting to solve incoherent rays. Both Intel and NVIDIA’s 40 series GPUs implement ray sorting to mitigate the performance hit from incoherent rays. Sorting rays based on their Shader Binding Table (SBT) entries can significantly improve performance in scenes with diverse intersection/shading algorithms.

Three-level acceleration structure

The ray tracing unit in your GPU already specifies the first two levels of our acceleration structure, which maps perfectly to our use case.

  • Top-level acceleration structure: At this level, we consider individual objects. That would be your teapots, stadiums, and everything in between. Self-similarity at this level is high: you're likely to have multiple instances of the same object. Sparsity is also quite high: You could easily have a teapot in a stadium problem at this level of the tree. Other data structures fail to handle this case: You won't be able to find an appropriate resolution for your uniform grid, and your octree will be heavily imbalanced due to the small teapot extending all the way down from your big root node. BVH does a great job at this level, and the TLAS that came with hardware RTX also allows you to easily apply SRT transform to each individual object to represent relative movements. NVIDIA recommends rebuilding the TLAS every frame.
  • Bottom level: BVH is still the perfect choice for this. Self-similarity is quite limited at this level, but sparsity is still quite high. Using the BLAS that came with hardware RTX will save you a lot of unnecessary work traversing a grid or SVO with software.
  • Leaf level: As we go deeper in our hierarchy, dense and high frequency data start to dominate. We should stop using BVH when it stops making sense, and that is when occupancy data becomes dense enough to be represented economically as a 3D bit array. This is the same idea as triangle micromaps, except we don’t have the help of the hardware here to make it super fast. Fortunately, in hardware RTX, you can define your own intersection shader for BVH leaves. This provides us with the right amount of flexibility to implement custom traversal algorithms to take advantage of the memory savings of an implicit data structure.

The problem then becomes: How do we design the leaf level acceleration structure, and how do we perform attribute mapping for those leaf nodes?

Our solution stores a single 4x4x4 grid represented as a 64-bit integer at the BVH leaves. Notably, unlike Ray-Tracing Small Voxel Scenes, we also took inspiration from Geometry and Attribute Compression for Voxel Scenes and stored attribute data separately in a texture. This is quite important - a compact representation of the geometry increases the cache hit ratio, greatly increasing the tracing performance.

Additionally, we do need to explicitly store an additional 32-bit pointer for attribute lookups. Each BVH Leaf may contain 0-64 attributes, so we can no longer use the traversal order to index into the attribute texture. We also store an R10G10B10A2 encoded albedo value for final gather rays where high-resolution intersection is not needed.

If you're storing floating-point attributes, you can apply all the texture compression formats invented over the past decade, and I would refer you to this article for a comparison: Texture Compression in 2020

struct BVHLeaf {
    u16vec4 position;
    uint64_t occupancy_mask;
    uint32_t material_ptr;
    uint32_t avg_albedo;
};

By shifting from the BVH to an implicit representation at the leaf levels, we sidestep the need to convert the 4x4x4 voxels into up to 64 * (4 * 3 * 2) = 1536 bytes of BVH, plus all the extra pointers. This transition results in substantial memory savings, cutting the memory footprint by at least a factor of 192 at the leaf level. While it's true that performing DDA on a 4x4x4 grid in software in an intersection shader is computationally more demanding, this cost is compensated by fewer memory accesses and reduced bandwidth requirements. Our application is usually memory-bound anyway, so trading compute time for fewer memory accesses is usually a good trade-off.
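For illustration, here is a CPU-side Rust reference of what the leaf-level traversal does: an Amanatides & Woo style DDA over the 4x4x4 cells, testing the 64-bit occupancy mask at each step. The bit layout (x + y*4 + z*16) and the exact stepping details are assumptions for this sketch; the real traversal runs in the intersection shader:

```rust
/// Bit position of cell (x, y, z) in the 64-bit occupancy mask.
/// (Assumed layout: x + y*4 + z*16.)
fn bit_index(x: i32, y: i32, z: i32) -> u32 {
    (x + y * 4 + z * 16) as u32
}

/// Return the first occupied cell hit by a ray through the 4x4x4 grid.
/// `origin` is in grid-local coordinates (0..4 per axis).
fn dda_4x4x4(mask: u64, origin: [f32; 3], dir: [f32; 3]) -> Option<(i32, i32, i32)> {
    let mut cell = [0i32; 3];
    let mut step = [0i32; 3];
    let mut t_max = [f32::INFINITY; 3];   // t at the next cell boundary
    let mut t_delta = [f32::INFINITY; 3]; // t to cross one cell
    for a in 0..3 {
        cell[a] = (origin[a].floor() as i32).clamp(0, 3);
        if dir[a] != 0.0 {
            step[a] = if dir[a] > 0.0 { 1 } else { -1 };
            let next = cell[a] as f32 + if dir[a] > 0.0 { 1.0 } else { 0.0 };
            t_max[a] = (next - origin[a]) / dir[a];
            t_delta[a] = 1.0 / dir[a].abs();
        }
    }
    loop {
        if mask >> bit_index(cell[0], cell[1], cell[2]) & 1 == 1 {
            return Some((cell[0], cell[1], cell[2]));
        }
        // Advance along the axis with the nearest cell boundary.
        let a = if t_max[0] < t_max[1] && t_max[0] < t_max[2] { 0 }
                else if t_max[1] < t_max[2] { 1 } else { 2 };
        cell[a] += step[a];
        if cell[a] < 0 || cell[a] > 3 {
            return None; // exited the leaf without a hit
        }
        t_max[a] += t_delta[a];
    }
}
```

Note that the entire geometry query is a shift and a mask on a single 64-bit register; that is the "comparing it with zero" fast path described earlier.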

How does this compare to a fully implicit data structure like SVDAG known for its memory efficiency? I would like to directly quote Geometry and Attribute Compression for Voxel Scenes here:

The high gain of the DAG stems from the compression at low levels in the tree. For example, an SVO with 17 hierarchical levels usually has billions of nodes on the second-lowest level while a DAG has at most 256 – the amount of possible unique combinations of eight child voxels having each one bit. For higher levels, the number of combinations increases, which reduces the amount of possible merging operations; this also reflects the difficulty that arises when trying to merge nodes containing attribute data. With only three different data elements (colors of leaves), the merging process already stops after the lowest level.

Given those, we can conclude that the key to an efficient data structure for voxel scenes is to compact the data at the leaf levels as much as possible. And when we're already representing the leaf level with a single 64-bit integer, there's not much point in having pointers pointing around.

Constructing the BVH

Construction and updates of the BVH are usually fully handled by the graphics driver. The application is expected to provide a list of AABBs, and the driver will build an acceleration structure on the GPU.

The fact that the BVH is opaque to the application implies the necessity to have a dual CPU representation for things like physics and CSG operations. This CPU data structure should allow efficient random access and should be easily convertible into a GPU-friendly representation. We drew inspiration from OpenVDB and built dust_vdb, a hierarchical volumetric data structure with compile-time configurable branching factors at each level. The reason I built the data structure from scratch instead of creating a C++ wrapper for the original implementation is that OpenVDB relies heavily on templates, and it's hard to take advantage of that through a C++ wrapper. By using Rust const generics to implement the foundational ideas of OpenVDB, we get significantly more ergonomic APIs to work with. Compatibility with the original OpenVDB version is less important to me.

The leaf node is always sized to be 4x4x4, same as the GPU representation, but everything else can be configured based on the high-level knowledge we have about the type of the source geometry. For example, MagicaVoxel models have a max size of 256x256x256, so we need a tree with a depth of 3 and per-level log2 branching factors of 4, 2, 2 (16 x 4 x 4 = 256 voxels per axis). We can then use a macro to define the Tree type at compile time:

pub type TreeRoot = hierarchy!(4, 2, 2);
pub type Tree = dust_vdb::Tree<TreeRoot>;

To extract the list of leaf AABBs from the tree, we mark the surface voxels as "active". When we're ready to build the BVH, we iterate over all active leaf nodes, serialize them into an array of AABBs, and copy the data to the GPU for BVH construction. While our CPU geometry isn't necessarily narrow band, this step effectively guarantees that the extracted GPU version of our geometry is narrow band - it only stores data immediately adjacent to the "isosurface".
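A sketch of that extraction pass follows. The `Leaf` type and its fields are stand-ins for the real dust_vdb types; the AABB layout follows Vulkan's six-float min/max convention (VkAabbPositionsKHR):

```rust
// One AABB per 4x4x4 leaf, in the min-then-max layout the acceleration
// structure build expects.
#[repr(C)]
struct Aabb {
    min: [f32; 3],
    max: [f32; 3],
}

// Stand-in for a dust_vdb leaf node.
struct Leaf {
    origin: [i32; 3], // voxel-space position of the leaf's min corner
    occupancy: u64,   // the 64-bit occupancy mask
    active: bool,     // set for leaves touching the isosurface
}

/// Serialize all active, non-empty leaves into an AABB array ready to be
/// copied to the GPU for BVH construction.
fn extract_aabbs(leaves: &[Leaf]) -> Vec<Aabb> {
    leaves
        .iter()
        .filter(|l| l.active && l.occupancy != 0)
        .map(|l| Aabb {
            min: [l.origin[0] as f32, l.origin[1] as f32, l.origin[2] as f32],
            max: [
                l.origin[0] as f32 + 4.0,
                l.origin[1] as f32 + 4.0,
                l.origin[2] as f32 + 4.0,
            ],
        })
        .collect()
}
```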

Attributes are separately managed in an arena allocator. Each BVH leaf may be associated with 0-64 attributes, so the allocator maintains a free list for each possible allocation size. Allocation alignments are sized to match the block boundaries of the compressed texture format.
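A minimal sketch of such a size-class allocator, with one free list per allocation size; alignment to compression-block boundaries is omitted for brevity, and the names are illustrative:

```rust
use std::collections::HashMap;

// Arena allocator for attribute storage: hands out offsets into a flat
// attribute buffer, keyed by allocation size (1..=64 attributes).
struct AttributeArena {
    top: u32,                           // bump pointer into the buffer
    free_lists: HashMap<u32, Vec<u32>>, // size -> freed offsets
}

impl AttributeArena {
    fn new() -> Self {
        AttributeArena { top: 0, free_lists: HashMap::new() }
    }

    /// Allocate room for `count` attributes, reusing a freed slot of the
    /// exact same size when one is available.
    fn alloc(&mut self, count: u32) -> u32 {
        if let Some(list) = self.free_lists.get_mut(&count) {
            if let Some(offset) = list.pop() {
                return offset;
            }
        }
        let offset = self.top;
        self.top += count;
        offset
    }

    /// Return a slot to the free list for its size class.
    fn free(&mut self, offset: u32, count: u32) {
        self.free_lists.entry(count).or_default().push(offset);
    }
}
```

Because each BVH leaf allocates at most 64 attributes, there are only 64 possible size classes, which keeps the free-list bookkeeping trivial.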

Modifications and Updates

The API imposes strict restrictions on BVH updates. Specifically, we cannot:

  • Change primitives or instances from active to inactive, or vice versa (an inactive AABB primitive is one whose minimum X coordinate is NaN).
  • Change the index or vertex formats of triangle geometry.
  • Change triangle geometry transform pointers from null to non-null or vice versa.
  • Change the number of geometries or instances in the structure.
  • Change the geometry flags for any geometry in the structure.
  • Change the number of vertices or primitives for any geometry in the structure.

TL;DR: we cannot meaningfully update the BVH, so we would have to rebuild the entire BLAS for every update... or do we?

There is actually a stealthy way to change BLAS primitives to inactive: By setting the 64-bit occupancy bit array to 0. The intersection shader will then consider the whole block as empty and will not report any intersections. Any amount of block removal can be achieved entirely on the leaf level without touching the BLAS at all, and the ray traversal after the removal will be no slower than before.
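In code, the removal path reduces to clearing bits in the leaf's occupancy mask (assuming the x + y*4 + z*16 bit layout; the field name comes from the `BVHLeaf` struct above):

```rust
/// Carve voxels out of a 4x4x4 leaf without touching the BLAS. If the
/// whole mask reaches zero, the intersection shader simply stops
/// reporting hits for this primitive - the BVH itself is untouched.
fn remove_voxels(occupancy_mask: &mut u64, voxels: &[(u32, u32, u32)]) {
    for &(x, y, z) in voxels {
        *occupancy_mask &= !(1u64 << (x + y * 4 + z * 16));
    }
}
```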

Adding new blocks to the geometry will be slightly more complicated. If the change is local enough to be covered by existing geometry, we simply update the leaf level bitvecs. When the change requires adding new primitives to the BLAS, we instead spawn temporary TLAS instances with only the local area that is being updated. This temporary TLAS instance is small compared to the rest of the geometry, so we can afford to rebuild it on every change. Once we have accumulated enough changes, we batch those changes together and finally rebuild the BLAS. This can happen asynchronously across multiple frames on a separate queue with lower priority to avoid impacting the frame rate.

The pointers we store at the BVH leaves point to 0-64 attribute values in a separate texture. We need to be able to allocate memory from this texture efficiently. To do this, we build an arena allocator on the CPU side, then "commit" the changed sections to the GPU texture upon updates. Special care must be taken to ensure alignment to block boundaries. Assuming some kind of block compression format is being used, we want to ensure that attributes in the same BVH leaf are contained within the same compression block.

Possible future extensions

Sparse attributes

We currently require all per-voxel attributes to be dense. In some cases, you might want some attributes to be sparse - for example, if only a subset of the voxels glow in a model, you'd want the emission attributes to be sparse. Depending on the nature of the data and if you need to read them on the GPU, you might want different strategies:

  • Separate tree: If you only need the data on the CPU for simulations, you could just create a separate dust_vdb tree with attribute data stored on the tree itself.
  • Clustered sparse attributes: If voxels with the assigned attribute are likely to be clustered, you want clustered sparse attributes. Each 4x4x4 BVH leaf either maps to a dense block in the attribute map, or it doesn't.
  • Hashed sparse attributes: Maintain a GPU HashMap separately for the attribute.

Contours

Our implementation has a problem very similar to Efficient Sparse Voxel Octrees. Specifically, when you shoot rays parallel to a long, flat surface, the intersection shader has to run DDA at the leaf level for many iterations before it can finally hit something.

It may be worth trying out something similar to the contour pointers described in Efficient Sparse Voxel Octrees. In the case mentioned above, a single contour test can help us skip over an entire leaf block's voxel grid.

Variable leaf node sizes, and other leaf acceleration structures

We actually do not take advantage of self-similarity at the leaf levels right now. The 64-bit occupancy mask is already compact enough, and storing a pointer pointing to a 64-bit occupancy mask won't be much better here. However, if the leaves are larger in size - for example, 8x8x8, we could probably better take advantage of self-similarity at the leaf levels by storing some pointers. I expect this to be more economical when your voxel density is ultra-high.

Apart from a uniform grid, we can also try other leaf-level acceleration structures. One idea is to store pointers into a sparse voxel DAG at the BVH leaves, thereby fusing the two structures together. I have tried this before, but the traversal algorithm from Efficient Sparse Voxel Octrees uses too many registers, and putting it in the intersection shader slows everything down considerably. Other leaf acceleration structures may be more efficient to traverse. In general, hardware RTX does not place any limitations on what you can traverse at BVH leaves, and for specific use cases, it might be worthwhile to use something other than a uniform grid.

For the type of voxel content I'm trying to render (MagicaVoxel models), I've run some performance tests and determined that a 4x4x4 uniform grid at the leaf level is the most performant. Your mileage may vary if the voxel density of your model is significantly higher.

Better intersection testing

In addition to the 64-bit occupancy mask, we currently also need to store a duplicate of the leaf AABB min/max coordinates. Hardware RTX units often operate on f16 coordinates, so when the intersection shader gets invoked for an AABB primitive, the application is expected to manually verify the hit with a higher-precision ray-AABB intersection test, which requires the high-precision coordinates of the AABB to remain available. However, since we already know that the source data is a quantized voxel dataset, we might as well quantize the leaf AABB coordinates to integer boundaries.

Unfortunately, the API does not offer us a way to query the min/max coordinates of the intersected AABB. They do offer this capability for triangle mesh through the VK_KHR_ray_tracing_position_fetch extension, so it wouldn't be completely unreasonable for them to do the same for AABBs.

Another complication is that the driver may merge some AABBs together during BVH construction. We'd likely still have to store some kind of discriminant to tell them apart, but this would likely still be more space-efficient than storing the full min/max coordinate for every leaf AABB.