# Diary of a Path Tracer – SSE optimized BVH Traversal

## Intro

In our last post, we looked at building our BVH using SAH as our guiding heuristic. Now it’s time to look at how we intersect that BVH.

## High Level

Querying the BVH is actually quite simple. We test our ray against our parent node, if it intersects, we test the ray against both of it’s children. If the ray intersects either of it’s children, we test that children’s children. Once we reach a leaf node, we add it to our list of nodes that we’ve intersected.

This sounds perfect for recursion, but recursion never scales particularly well. Instead a queue or stack is a more effective method to represent the nodes to be tested.

There are three pieces that bring the BVH traversal together. The math library, the AABB intersection code and the (roughly) branchless intersection response code.

We’ve already seen the math library in this post, as a result, let’s take a dive into the intersection routine!

## Ray/AABB Intersection Routine

There are quite a few excellent sources on the ray/AABB intersection routine [1][2][3]. We’ll do a small dive into how we get to the final implementation.

The intersection routine is based on the slab intersection method presented in [3]. We want to define our AABB as a set of 3 slabs in each axis. A slab is simply represented by a pair of planes, the slab being the space in between them.

In order to determine if we’re intersecting the AABB, we want to determine if our ray passes through the shared intersection of each slab.

In order to determine that, we want to determine if our ray enters each slab before it exits any of the other slabs.

A simple approach to this could be:

slabEntryX = intersect(ray, startPlaneX)
slabExitX = intersect(ray, endPlaneX)
slabEntryY = intersect(ray, startPlaneY)
slabExitY = intersect(ray, endPlaneY)

if(slabEntryX is greater than slabExitY or slabEntryY is greater than slabExitX)
return missed collision

return hit collision


As a result, our core approach is to find if there is any entry value that is greater than our exit values. And if so, we’re not intersecting the slab.

To intersect a ray and a plane, we need to define our ray as:

$\vec{p} = \vec{a}*t+\vec{b}$

and our plane as

$0 = \vec{n}.\vec{p}-d$

We can then plug our ray function into our plane function to get:

$0 = \vec{n}.(\vec{a}*t+\vec{b})-d$

And solving for $t$

$d = \vec{n}.\vec{a}*t+vec{n}.\vec{b}$

$d-\vec{n}.\vec{b} = vec{n}.\vec{a}*t$

$\frac{d-\vec{n}.\vec{b}}{vec{n}.\vec{a}} = t$

Now that we have $t$ we can calculate it for each plane, since our planes are axis aligned, our intersection math gets quite simple. Here is an example for our x axis.

$\frac{d-\vec{n}.\vec{b}}{vec{n}.\vec{a}} = t$

where $n = (1, 0, 0)$

$\frac{d_x-b_x}{a_x} = t$

And repeat for each axis.

Now let’s look at our intersection code:

cv3l rayOLanes = cv3l_replicate(rayO);
cv3l invD = cv3l_replicate(cv3_rcp(rayD));
cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);
cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD);

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger  = cv3l_max(t0s, t1s);

cfl rayMinLane = cfl_replicate(rayMin);
cfl rayMaxLane = cfl_replicate(rayMax);
cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);


This code looks different than our math, but it’s essence is the same.

In this code we’re testing 4 AABBs against a single ray instead of one ray against one AABB.

Notice that in this line:

cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);


We’re implementing $\frac{d_x-b_x}{a_x} = t$ where $b_x, b_y, b_z$ is rayOLanes, invD is $\frac{1}{a_x}, \frac{1}{a_y}, \frac{1}{a_z}$ and aabbMin is $d_x, d_y, d_z$.

The next few lines:

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger = cv3l_max(t0s, t1s);


Are there to account for the situation where our ray direction is negative. If our ray direction is negative on one axis, t0 and t1 will actually be flipped. As a result, we want to simply make sure that t0 is the smaller one and t1 is the larger one. [1]

And finally,

cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);


These lines are calculating to see if any of our latest entrance is earlier than our earliest exit. If so, cfl_mask will set the appropriate bit for that AABB.

That’s it for our intersection routine! Now we can take a look at what we do with these results.

## Intersection Response

The previous intersection code is nicely packaged in the caabb_does_ray_intersect_lanes routine.

In our bvh traversal, the call to it looks like this:

uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);


Where intersections hold the bits of the AABBs that intersect with our ray.

One approach to handling these bits is to simply loop through each AABB, determine if it’s bit is set and add it’s children volumes to the queue of being tested or to the final list of leaf nodes.

However, that’s no fun.

Instead it’s a lot more fun to make this code branchless using some SSE intrinsics!

Fair warning, this code can be complicated if you aren’t used to reading SSE intrinsics.

#define _MM_SHUFFLE_EPI8(i3,i2,i1,i0) _mm_set_epi8(i3*4+3,i3*4+2,i3*4+1,i3*4,i2*4+3,i2*4+2,i2*4+1,i2*4,i1*4+3,i1*4+2,i1*4+1,i1*4,i0*4+3,i0*4+2,i0*4+1,i0*4)
__m128i shuffles[16] =
{
_MM_SHUFFLE_EPI8(0,0,0,0), _MM_SHUFFLE_EPI8(0,0,0,0), _MM_SHUFFLE_EPI8(0,0,0,1), _MM_SHUFFLE_EPI8(0,0,1,0), // 0000, 0001, 0010, 0011
_MM_SHUFFLE_EPI8(0,0,0,2), _MM_SHUFFLE_EPI8(0,0,2,0), _MM_SHUFFLE_EPI8(0,0,2,1), _MM_SHUFFLE_EPI8(0,2,1,0), // 0100, 0101, 0110, 0111
_MM_SHUFFLE_EPI8(0,0,0,3), _MM_SHUFFLE_EPI8(0,0,3,0), _MM_SHUFFLE_EPI8(0,0,3,1), _MM_SHUFFLE_EPI8(0,3,1,0), // 1000, 1001, 1010, 1011
_MM_SHUFFLE_EPI8(0,0,3,2), _MM_SHUFFLE_EPI8(0,3,2,0), _MM_SHUFFLE_EPI8(0,3,2,1), _MM_SHUFFLE_EPI8(3,2,1,0)  // 1100, 1101, 1110, 1111
};

uint32_t leafLine = bvh->count - bvh->leafCount;
{
__m128i isParent = _mm_cmplt_epi32(queueIndices, _mm_set_epi32(leafLine, leafLine, leafLine, leafLine));

}

union
{
uint32_t i[4];
__m128i v;
} indexUnion;

indexUnion.v = childIndices;
for (uint32_t i = 0; i < leafCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
candidateIter[i] = bvh->jumps[nodeIndex].leaf.index;
}
candidateIter+=leafCount;

indexUnion.v = parentIndices;
for (uint32_t i = 0; i < parentCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
cran_assert(nodeIndex < bvh->count);
testQueueEnd[i*2] = bvh->jumps[nodeIndex].jumpIndices.left;
testQueueEnd[i*2 + 1] = bvh->jumps[nodeIndex].jumpIndices.right;
}
testQueueEnd += parentCount * 2;


Now this looks like a lot, so let’s take a look at it piece by piece. As we presented in the previous post, our leaf nodes are stored at the end of our array. This allows us to determine which nodes are leaf nodes simply by comparing their indices against that boundary. The following bit of code does exactly that:

__m128i isParent = _mm_cmplt_epi32(queueIndices, _mm_set_epi32(leafLine, leafLine, leafLine, leafLine));


The first line here simply compares our indices against our “leaf boundary”. The result of this tells us if our index is a parent or a leaf node.

Once we know which ones are our leaf nodes, we convert this to a bitmask representing the elements of the vector.

And finally, we mask out the indices that have no intersections associated with them.

The result of this operation is that we have two bitmasks that represents the parents that have intersected the ray and the leaves that have intersected the ray.

Next we want to turn this bit mask into a series of indices.

uint32_t leafCount = __popcnt(childIndexMask);


The first two lines are very simple, it tells us how many leaf nodes have intersections and how many parent nodes. The next pair of lines require a little more explaining.

Right now, our queueIndices vector represents the indices to the AABBs that we’ve tested against our rays.

One approach we could implement is to simply loop through every element, test their bits and read the values if those bits are set.

However, what we can do instead is pack all of our active indices to the front of our vector allowing us to read only the ones that have reported an intersection. Reducing our branching at the cost of shuffling our data. That’s what _mm_shuffle_epi8 is doing. It’s taking a lookup table (shuffles[]) and determining where our values should be moved in the vector to pack them to the front of the vector.

Since our masks only contain 4 bits, they have a maximum of 15 values making for a very manageable look up table.

Once we have our indices packed, we simply have to read them.

union
{
uint32_t i[4];
__m128i v;
} indexUnion;

indexUnion.v = childIndices;
for (uint32_t i = 0; i < leafCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
candidateIter[i] = bvh->jumps[nodeIndex].leaf.index;
}
candidateIter+=leafCount;

indexUnion.v = parentIndices;
for (uint32_t i = 0; i < parentCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
cran_assert(nodeIndex < bvh->count);
testQueueEnd[i*2] = bvh->jumps[nodeIndex].jumpIndices.left;
testQueueEnd[i*2 + 1] = bvh->jumps[nodeIndex].jumpIndices.right;
}
testQueueEnd += parentCount * 2;


These parts are relatively self explanatory. We’re reading our indices and doing the appropriate work if our values are leaves or parent nodes. To understand why we’re using a union to convert our SSE vector, I recommend looking into type punning. [5]

## Conclusion

That’s it! I don’t really have any future work for this part of cranberray, I’m actually quite happy with the end result.

Next we’ll be looking at the ray tracer’s sampling strategy. Until then, happy coding!