# Cranberry Lair

## Walking the programming parameter space

### Intro

In our last post, we looked at cranberray’s sampling strategy and it’s implementation of N-Rooks sampling. In this post, we’re going to look at cranberray’s BRDF implementations and multiple importance sampling.

Cranberray’s shading pipeling is very simple. When loading a mesh, ranges of triangles (or submesh) are tagged to use a specific material. When intersecting the geometry, cranberray will use the triangle’s index to lookup which material it should use for said triangle. Each material has an associated shader and appropriate data object. Once we know which material to use, we call the function associated to that shader and light our pixel. The heavy lifting happens inside the shader.

The primary lighting shader in cranberray is shader_microfacet. This function uses Lambert as it’s diffuse BRDF and GGX-Smith as it’s specular BRDF. It selects and blends both using multiple importance sampling.

In this post we’ll be taking a look at the magic behind the shader starting with our Lambert BRDF!

### Understanding The Lambertian BRDF

For this post, we’ll assume that there is some familiarity with the fact that the Lambert BRDF is simply the Albedo of the surface divided by $\pi$.

However, one bit I’d like to touch on is that a Lambertian reflector emits energy at a constant rate in every direction.

At least, that was how I understood a Lambertian reflector. This is incorrect. A Lambertian reflector emits energy at a constant rate in every direction per unit area.

The difference here is that the first definition describes radiant intensity and the second defines radiance. [3]

What that means is that the amount of energy emitted is proportional to the angle at which the surface is viewed. Wikipedia was quite insightful here [1].

If you imagine a rectangular Lambertian reflector of 1 meter on each side and imagine that it reflects 1 unit of radiant intensity.

Notice that in that case, our Lambertian reflector is reflecting $L = \frac{1}{1} = 1$ units of radiance.

Now, imagine that we’re looking at the reflector at an angle of 60 degrees

Notice that the perceived size of our reflector is now 0.5 meters squared. If our reflector were to emit 1 unit of radiant intensity when viewed at this angle, it would emit $L = \frac{1}{0.5} = 2$ units of radiance. Twice as much energy per unit area! Meaning that if we looked at it at an angle, it would emit the same amount of energy for less area in our field of view making it brighter.

As a result, you can imagine that when we’re looking to render a surface that looks uniform when viewed from all directions this is problematic.

A Lambertian reflector should reflect constant radiance. As a result, the energy reflected per direction by our material should be proportional to the change in size of our reflector when viewed at an angle.

I’ll save you the trigonometry and tell you that this factor is $cos(\theta)$ where $\theta$ is the zenith angle of our viewer.

As a result, you can see that to emit constant radiance, our reflector’s radiant intensity should be reduced by $cos(\theta)$ and if we do that, our reflector would emit 0.5 units of radiant intensity and as a result would emit $L = \frac{0.5}{0.5} = 1$ unit of radiance.

Now, like me, you might be thinking “If we need a cosine here to make our radiant intensity proportional to the foreshortening of our area, why don’t we multiply our emitted light by the cosine factor when rendering?”

The reason is that this cosine factor is relevant when we’re trying to calculate radiance from radiant intensity and the area being shaded. However, when we’re rendering, the result of our shading math is radiance not radiant intensity. If we wanted to instead calculate the radiant intensity, we would have to integrate across our surface.

$I = \int_A L dA$

Taking our previous surface as an example, when we’re shading it in our renderer, we’re asking “Given this irradiance, what is it’s radiance?”. In our previous example, our surface was being shaded with an irradiance of $\pi$ (conveniently skipping over why here) and as a result, our radiance was 1. If we were to ask what the radiant intensity was in any given direction, you would have to multiply the radiance by the projected area of our surface. At an angle of 60 degrees, this would be 0.5, making our radiant intensity $I = L * A = 1 * 0.5 = 0.5$.

This was a long winded way to attempt at explaining a neat and surprising (to me) property of Lambertian reflectors.

### Smith-GGX and microfacet BRFDs

Aside from our Lambert BRDF for diffuse, cranberray makes use of the Smith-GGX specular formulation. This formulation is very common and well documented. I refer the interested reader to [6] for an overview of the topic.

I won’t explain microfacet BRDFs here, but I highly recommend [6] and [9] for a derivation of the specular microfacet BRDF formula, and [8] for more info about the origins of the formula.

As presented in the heading, we’re making use of the Smith Uncorrelated function for our shadowing function and GGX for our NDF. [9]

### Importance Sampling

At this point in our rendering, we have a specular BRDF and a diffuse BRDF however we now need to decide on where to sample our ray.

The simplest method for selecting a ray is to uniformly select a point on the hemisphere centered about the normal of our surface.

This works well to get started but can converge very slowly. Especially in scenarios with sharp highlights which often happens with smooth surfaces.

If we sample only a few times, we might miss all the important details.

As a result, we want to make use of something called importance sampling. Simply put, importance sampling is the idea of trying to sample the most “important” parts of a function.

I assume the reader has some familiarity with Monte Carlo estimation method. If not, I suggested going to [15]. Their treatment of importance sampling and Monte Carlo estimators is excellent.

To sample the most important parts of our scene, we want to take samples where we expect the most contributions to our image would live. We unfortunately don’t know exactly where that is, but we can guess. And to guess, we can use an approximate function to choose what direction we want to send our ray.

Let’s imagine our basic Lambert BRDF, we know that rays that are closest to our normal will contribute the most light to our illumination. As a result, we can try to send more rays towards our normal’s direction with some falloff. [16]

To select a ray, we want to convert our PDF to a CDF and randomly select a point on our CDF using a uniform random number. [13] This allows us to select a random number for the specified distribution using a random number with a uniform distribution.

Once we’ve selected our value, we can calculate the PDF and plug it into our Monte Carlo estimator.

A potential downside of selecting an approximate function is that it might not accurately reflect the illumination in the scene, causing us to increase our variance instead of reducing it. [15]

Cranberray uses the BRDF functions as the approximate representation of the illumination in the scene. This is easy to implement and also quite effective since there tends to be quite a few rays that would effectively contribute nothing at all to the image if we sampled uniformly.

For Lambert, I use the derivation from a previous post [16].

And for GGX I use the derivation found here [13].

static cv3 hemisphere_surface_random_lambert(float r1, float r2)
{
float theta = acosf(1.0f - 2.0f*r1) * 0.5f;
float cosTheta = cosf(theta);
float sinTheta = sinf(theta);
float phi = cran_tao * r2;

return (cv3) { sinTheta*cosf(phi), sinTheta*sinf(phi), cosTheta };
}

static float lambert_pdf(cv3 d, cv3 n)
{
return cv3_dot(d, n) * cran_rpi;
}

static cv3 hemisphere_surface_random_ggx_h(float r1, float r2, float a)
{
float cosTheta = sqrtf((1.0f-r1)*cf_rcp(r1*(a*a-1.0f)+1.0f));
float sinTheta = sqrtf(1.0f - cosTheta*cosTheta);
float phi = cran_tao * r2;
return (cv3) { sinTheta*cosf(phi), sinTheta*sinf(phi), cosTheta };
}

static float ggx_pdf(float roughness, float hdotn, float vdoth)
{
float t = hdotn*hdotn*roughness*roughness - (hdotn*hdotn - 1.0f);
float D = (roughness*roughness)*cf_rcp(t*t)*cran_rpi;
return D*hdotn*cf_rcp(4.0f*fabsf(vdoth));
}


### Multiple Importance Sampling and blending our BRDFs

With these importance functions, we need a way to select between either our diffuse or our specular BRDF and blending between them. This is where Multiple Importance Sampling comes in.

Multiple Importance Sampling allows us to blend our functions using weights assigned to each PDF. [14]

In cranberray, we use a simple strategy for determining if we select from either our specular BRDF or our diffuse BRDF. We use the Fresnel factor for our geometric normal and select our specular BRDF when our random number is less than the Fresnel factor and our diffuse BRDF in the other case.

float geometricFresnel = cmi_fresnel_schlick(1.0f, microfacetData.refractiveIndex, normal, viewDir);
geometricFresnel = fmaxf(geometricFresnel, microfacetData.specularTint.r * gloss); // Force how much we can reflect at a minimum
float weights[distribution_count] =
{
[distribution_lambert] = 1.0f - geometricFresnel,
[distribution_ggx] = geometricFresnel
};

bool reflected = random01f(&context->randomSeed) < weights[distribution_ggx];


Once we’ve selected our BRDF and sampled from it, we want to blend our results with our previous sampling results. This is where multiple importance sampling comes in. (See [17] for an excellent treatment on the topic)

Notice that if we imagine our scene as a simplified function.

Let’s assume this function has an area of 1.2 units.

Imagine that our function is a combination of 2 simpler functions.

$f(X) = g(X)h(X)$

Let’s imagine that we select only one of those functions as the importance function uniformly at random and we select $g(x)$.

To get our value using importance sampling, we need to divide by our PDF and the chance of selecting that PDF as our distribution (Which is 0.5 in this example).

If we were to use this single sample as our estimate using $p_g$ as our PDF and the values $g(x) = 0.1$, $h(x) = 2$ and $p_g(x) = 0.1$ we would get $y = \frac{g(x)h(x)}{0.5*p_g(x)} = \frac{0.1*2}{0.5*0.1} = \frac{0.2}{0.05} = 4$ which is a poor estimate of our area of 1.2.

If we were to use $p_h(x)$ and where $p_h(x) = 1$ as our importance function we would get $y = \frac{g(x)h(x)}{0.5*p_h(x)} = \frac{0.1*2}{0.5*1} = \frac{0.2}{0.5} = 0.4$ which is a closer estimate of our area of 1.2.(Admittedly, this example is a bit contrived)

The issue presented here (and in [17]) is that because our sample location had a low likelihood of being selected (0.1) but our second function $h(x)$ had a relatively high value, our value gets a very high value as well since $2/0.1$ is quite large. It’s important to note that if we were to continue taking samples, our estimator would still converge to 1.2.

Multiple Importance Sampling suggests a method to reduce the overall impact of these spikes to reduce the variance of our estimates by using a set of appropriate weighting functions instead of simply using our PDF directly as we just did.

The one sample model presented in [14] is what we’ll be using for this example, as it’s simpler to think about.

Instead of only dividing by our PDF, we want to apply a weighing function to our calculation.

$F = \frac{w_I(X)f(X)}{c_Ip_I(X)}$

Where $I$ represents the index of our selected PDF, $w_I$ represents the weight assigned to our PDF, $c_I$ represents the probability that we select that PDF and $p_I$ is our PDF function.

In our previous example, we did not have $w_I(X)$. This function does all the magic.

[14] presents the balance heuristic as an option for a weighting function which has the form

$w_i(x) = \frac{c_i p_i(x)}{\sum_k c_k p_k(x)}$

Notice that this is simply a weighted average of our probability density functions.

If we add this function to our single sample model we get

$F = \frac{c_i p_i(x)}{\sum_k c_k p_k(x)} \frac{f(x)}{c_ip_i(x)}$

$F = \frac{f(X)}{\sum_k c_k p_k(x)}$

If we look at our previous example using $p_g(x)$ as our PDF, we can make use of the same values here (notice that our final formulation does not divide by $p_g(x)$ individually anymore.

$c_g = 0.5$, $c_h = 0.5$, $p_g(x) = 0.1$, $g(x) = 0.1$, $p_h(x) = 1$ and $h(x) = 2$

$y = \frac{g(x)h(x)}{0.5*p_g(x)+0.5*p_h(x)}$

$y = \frac{g(x)h(x)}{0.5*p_g(x)+0.5*p_h(x)}$

$y = \frac{0.1*2}{0.5*0.1+0.5*1}$

$y = \frac{0.2}{0.55}$

$y = 0.36$

Which although is not as good as our estimate of 0.4 from earlier, it is much better than our estimate of 4 when using $p_g(x)$

We can see that this is unbiased by noticing that

$F(x) = \frac{g(x)h(x)}{c_g p_g(x) + c_h p_h(x)}$

$E[F(x)] = c_g p_g(x) \frac{g(x)h(x)}{c_g p_g(x) + c_h p_h(x)} + c_h p_h(x) \frac{g(x)h(x)}{c_g p_g(x) + c_h p_h(x)}$

$E[F(x)] = (c_g p_g(x) + c_h p_h(x)) \frac{g(x)h(x)}{c_g p_g(x) + c_h p_h(x)}$

$E[F(x)] = g(x)h(x)$

(I took some shortcuts here, I recommend reading [14] for proofs)

An interesting property of the balance heuristic is that our values are bounded.

Notice that since

$\frac{g(x)}{c_g p_g(x) + c_h p_h(x)} <= \frac{g(x)}{c_g p_g(x)}$

then

$\frac{g(x)h(x)}{c_g p_g(x) + c_h p_h(x)} <= \frac{g(x)h(x)}{c_g p_g(x)}$

as well as

$\frac{g(x)h(x)}{c_g p_g(x) + c_h p_h(x)} <= \frac{g(x)h(x)}{c_h p_h(x)}$

Where both equations on the right were the formulas that we used originally when sampling either with $p_g(x)$ and $p_h(x)$ (Where $c_g=0.5$ and $c_h = 0.5$).

Meaning that our balanced sampling is bounded by the smaller of the 2 values generated by both of our PDFs. Neat! (Note that with uniform sampling, our value would simply be 0.2)

All that to say that this is our weighting code

float sum = 0.00001f;
for (uint32_t i = 0; i < distribution_count; i++)
{
sum += weights[i] * PDFs[i];
}
weight = cf_rcp(sum);


And that’s it! That was a lot of words for a very little chunk of code.

### Conclusion

This post is definitely a large one. But it was a lot of fun to learn enough about multiple importance sampling and the Lambertian BRDF to at least convey some sense of it in this blog post. I hope you enjoyed it! Next time we’ll be taking a look at the much simpler Obj loading code in cranberray. Until next time, happy coding!

### Future Work

I’d like to make more use of a better importance sampling function for Smith-GGX as seen in [10]. I also wonder if it’s possible to only select from half vectors that reflect on the upper half of the hemisphere instead of requiring us to discard rays that would be generated in the lower half of our sphere.

I would also like to make more effective use of multiple importance sampling in the path tracer as a whole. At this moment, it is only used to combine our lambertian importance function and our GGX-Smith importance function.

## Intro

In our last post, we looked at the SSE optimized BVH traversal of cranberray. BVH construction and traversal is a fairly complicated portion of cranberray; cranberray’s sampling strategy however is quite a bit simpler.

Sampling in cranberray is simple. We first split our image into a grid representing every pixel of our image. (Note that a rectangular pixel is a simplifying assumption [4]) Then we take our pixel box and sample it X number of times using X different sample points and average the results. Our sampling strategy determines how we spread out those X different sample points within our pixel rectangle.

The simplest strategy is simply to select a point randomly within our pixel box for every sample. However, this has the undesirable property of potentially introducing clumping causing us to miss some desirable information. [2][3]

Instead, cranberray makes use of a simple technique called N-Rooks sampling.

## N-Rooks

N-Rooks is what’s known as a low discrepancy sequence. It has the desirable property of avoiding the clumping of samples that random sampling has and avoiding the aliasing issues of regular sampling. We won’t be explaining low discrepancy sequences here, but I refer the interested reader to [2] and [3] for excellent treatments on the subject.

N-Rooks sampling is very simple.

You first generate jittered samples along the diagonal of an X by X box.

And then you shuffle the columns of said box by starting from the leftmost column and randomly swapping with the columns to the right of our current column. (Including our current column). This is known as the Fisher-Yates shuffle. [6] It has the desirable property of uniformly selecting from any of the X! permutations of our columns.

And that’s it! We now sample our scene using these sample points and average their results to get our final output.

In our next post, we’ll be looking at BRDF multiple importance sampling using the balance heuristic as well as the GGX-Smith microfacet specular BRDF. Until then, happy coding!

## Future Work

While writing this, I was considering the possibility of selecting the value of our pixel using a different strategy than the mean filter used with Monte Carlo integration. I would have to do some further digging to see if this has been presented anywhere. Something along the lines of a Gaussian filter or potentially something else entirely.

I would also like to implement other low discrepancy sequences in cranberray if ever time permits. N-Rooks was very simple, it would be interesting to be able to select between different sampling strategies.

## Intro

In our last post, we looked at building our BVH using SAH as our guiding heuristic. Now it’s time to look at how we intersect that BVH.

## High Level

Querying the BVH is actually quite simple. We test our ray against our parent node, if it intersects, we test the ray against both of it’s children. If the ray intersects either of it’s children, we test that children’s children. Once we reach a leaf node, we add it to our list of nodes that we’ve intersected.

This sounds perfect for recursion, but recursion never scales particularly well. Instead a queue or stack is a more effective method to represent the nodes to be tested.

There are three pieces that bring the BVH traversal together. The math library, the AABB intersection code and the (roughly) branchless intersection response code.

We’ve already seen the math library in this post, as a result, let’s take a dive into the intersection routine!

## Ray/AABB Intersection Routine

There are quite a few excellent sources on the ray/AABB intersection routine [1][2][3]. We’ll do a small dive into how we get to the final implementation.

The intersection routine is based on the slab intersection method presented in [3]. We want to define our AABB as a set of 3 slabs in each axis. A slab is simply represented by a pair of planes, the slab being the space in between them.

In order to determine if we’re intersecting the AABB, we want to determine if our ray passes through the shared intersection of each slab.

In order to determine that, we want to determine if our ray enters each slab before it exits any of the other slabs.

A simple approach to this could be:

slabEntryX = intersect(ray, startPlaneX)
slabExitX = intersect(ray, endPlaneX)
slabEntryY = intersect(ray, startPlaneY)
slabExitY = intersect(ray, endPlaneY)

if(slabEntryX is greater than slabExitY or slabEntryY is greater than slabExitX)
return missed collision

return hit collision


As a result, our core approach is to find if there is any entry value that is greater than our exit values. And if so, we’re not intersecting the slab.

To intersect a ray and a plane, we need to define our ray as:

$\vec{p} = \vec{a}*t+\vec{b}$

and our plane as

$0 = \vec{n}.\vec{p}-d$

We can then plug our ray function into our plane function to get:

$0 = \vec{n}.(\vec{a}*t+\vec{b})-d$

And solving for $t$

$d = \vec{n}.\vec{a}*t+vec{n}.\vec{b}$

$d-\vec{n}.\vec{b} = vec{n}.\vec{a}*t$

$\frac{d-\vec{n}.\vec{b}}{vec{n}.\vec{a}} = t$

Now that we have $t$ we can calculate it for each plane, since our planes are axis aligned, our intersection math gets quite simple. Here is an example for our x axis.

$\frac{d-\vec{n}.\vec{b}}{vec{n}.\vec{a}} = t$

where $n = (1, 0, 0)$

$\frac{d_x-b_x}{a_x} = t$

And repeat for each axis.

Now let’s look at our intersection code:

cv3l rayOLanes = cv3l_replicate(rayO);
cv3l invD = cv3l_replicate(cv3_rcp(rayD));
cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);
cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD);

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger  = cv3l_max(t0s, t1s);

cfl rayMinLane = cfl_replicate(rayMin);
cfl rayMaxLane = cfl_replicate(rayMax);
cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);


This code looks different than our math, but it’s essence is the same.

In this code we’re testing 4 AABBs against a single ray instead of one ray against one AABB.

Notice that in this line:

cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);


We’re implementing $\frac{d_x-b_x}{a_x} = t$ where $b_x, b_y, b_z$ is rayOLanes, invD is $\frac{1}{a_x}, \frac{1}{a_y}, \frac{1}{a_z}$ and aabbMin is $d_x, d_y, d_z$.

The next few lines:

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger = cv3l_max(t0s, t1s);


Are there to account for the situation where our ray direction is negative. If our ray direction is negative on one axis, t0 and t1 will actually be flipped. As a result, we want to simply make sure that t0 is the smaller one and t1 is the larger one. [1]

And finally,

cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);


These lines are calculating to see if any of our latest entrance is earlier than our earliest exit. If so, cfl_mask will set the appropriate bit for that AABB.

That’s it for our intersection routine! Now we can take a look at what we do with these results.

## Intersection Response

The previous intersection code is nicely packaged in the caabb_does_ray_intersect_lanes routine.

In our bvh traversal, the call to it looks like this:

uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);


Where intersections hold the bits of the AABBs that intersect with our ray.

One approach to handling these bits is to simply loop through each AABB, determine if it’s bit is set and add it’s children volumes to the queue of being tested or to the final list of leaf nodes.

However, that’s no fun.

Instead it’s a lot more fun to make this code branchless using some SSE intrinsics!

Fair warning, this code can be complicated if you aren’t used to reading SSE intrinsics.

#define _MM_SHUFFLE_EPI8(i3,i2,i1,i0) _mm_set_epi8(i3*4+3,i3*4+2,i3*4+1,i3*4,i2*4+3,i2*4+2,i2*4+1,i2*4,i1*4+3,i1*4+2,i1*4+1,i1*4,i0*4+3,i0*4+2,i0*4+1,i0*4)
__m128i shuffles[16] =
{
_MM_SHUFFLE_EPI8(0,0,0,0), _MM_SHUFFLE_EPI8(0,0,0,0), _MM_SHUFFLE_EPI8(0,0,0,1), _MM_SHUFFLE_EPI8(0,0,1,0), // 0000, 0001, 0010, 0011
_MM_SHUFFLE_EPI8(0,0,0,2), _MM_SHUFFLE_EPI8(0,0,2,0), _MM_SHUFFLE_EPI8(0,0,2,1), _MM_SHUFFLE_EPI8(0,2,1,0), // 0100, 0101, 0110, 0111
_MM_SHUFFLE_EPI8(0,0,0,3), _MM_SHUFFLE_EPI8(0,0,3,0), _MM_SHUFFLE_EPI8(0,0,3,1), _MM_SHUFFLE_EPI8(0,3,1,0), // 1000, 1001, 1010, 1011
_MM_SHUFFLE_EPI8(0,0,3,2), _MM_SHUFFLE_EPI8(0,3,2,0), _MM_SHUFFLE_EPI8(0,3,2,1), _MM_SHUFFLE_EPI8(3,2,1,0)  // 1100, 1101, 1110, 1111
};

uint32_t leafLine = bvh->count - bvh->leafCount;
{
__m128i isParent = _mm_cmplt_epi32(queueIndices, _mm_set_epi32(leafLine, leafLine, leafLine, leafLine));

}

union
{
uint32_t i[4];
__m128i v;
} indexUnion;

indexUnion.v = childIndices;
for (uint32_t i = 0; i < leafCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
candidateIter[i] = bvh->jumps[nodeIndex].leaf.index;
}
candidateIter+=leafCount;

indexUnion.v = parentIndices;
for (uint32_t i = 0; i < parentCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
cran_assert(nodeIndex < bvh->count);
testQueueEnd[i*2] = bvh->jumps[nodeIndex].jumpIndices.left;
testQueueEnd[i*2 + 1] = bvh->jumps[nodeIndex].jumpIndices.right;
}
testQueueEnd += parentCount * 2;


Now this looks like a lot, so let’s take a look at it piece by piece. As we presented in the previous post, our leaf nodes are stored at the end of our array. This allows us to determine which nodes are leaf nodes simply by comparing their indices against that boundary. The following bit of code does exactly that:

__m128i isParent = _mm_cmplt_epi32(queueIndices, _mm_set_epi32(leafLine, leafLine, leafLine, leafLine));


The first line here simply compares our indices against our “leaf boundary”. The result of this tells us if our index is a parent or a leaf node.

Once we know which ones are our leaf nodes, we convert this to a bitmask representing the elements of the vector.

And finally, we mask out the indices that have no intersections associated with them.

The result of this operation is that we have two bitmasks that represents the parents that have intersected the ray and the leaves that have intersected the ray.

Next we want to turn this bit mask into a series of indices.

uint32_t leafCount = __popcnt(childIndexMask);


The first two lines are very simple, it tells us how many leaf nodes have intersections and how many parent nodes. The next pair of lines require a little more explaining.

Right now, our queueIndices vector represents the indices to the AABBs that we’ve tested against our rays.

One approach we could implement is to simply loop through every element, test their bits and read the values if those bits are set.

However, what we can do instead is pack all of our active indices to the front of our vector allowing us to read only the ones that have reported an intersection. Reducing our branching at the cost of shuffling our data. That’s what _mm_shuffle_epi8 is doing. It’s taking a lookup table (shuffles[]) and determining where our values should be moved in the vector to pack them to the front of the vector.

Since our masks only contain 4 bits, they have a maximum of 15 values making for a very manageable look up table.

Once we have our indices packed, we simply have to read them.

union
{
uint32_t i[4];
__m128i v;
} indexUnion;

indexUnion.v = childIndices;
for (uint32_t i = 0; i < leafCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
candidateIter[i] = bvh->jumps[nodeIndex].leaf.index;
}
candidateIter+=leafCount;

indexUnion.v = parentIndices;
for (uint32_t i = 0; i < parentCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
cran_assert(nodeIndex < bvh->count);
testQueueEnd[i*2] = bvh->jumps[nodeIndex].jumpIndices.left;
testQueueEnd[i*2 + 1] = bvh->jumps[nodeIndex].jumpIndices.right;
}
testQueueEnd += parentCount * 2;


These parts are relatively self explanatory. We’re reading our indices and doing the appropriate work if our values are leaves or parent nodes. To understand why we’re using a union to convert our SSE vector, I recommend looking into type punning. [5]

## Conclusion

That’s it! I don’t really have any future work for this part of cranberray, I’m actually quite happy with the end result.

Next we’ll be looking at the ray tracer’s sampling strategy. Until then, happy coding!

## Preamble

Jumping into a large project that compiles in 30 minutes, you might be faced with 2 thoughts. Either “Wow! This project compiles in only 30 minutes!” Or “What?!? This project takes 30 minutes to compile?” Either way, you likely don’t want to be iterating on your changes at a rate of 30 minutes per compilation. So here’s a few things you can do to improve your iteration times.

## Hack Your Prototypes – Short Iterations

This is likely where you spend 90% of your time when solving a problem. In this phase you can be searching for a bug, testing different solutions or even just investigating the behaviour of the program. In any of these scenarios, you likely want to iterate quickly without much down time. Here’s a few things you can do:

Modifying your headers is a sure fire way to trigger multiple files to recompile. (Or if you’re unlucky, the whole project) as a result, do anything in your power to avoid having to modify a header while you’re in this phase.

Work in a single cpp file

Make ugly globals. (It’s fine, I won’t tell)

If you need some sort of static state for a class, consider making it a global in the cpp file instead. This holds even outside of the prototyping phase, make sure to mark it static, we don’t need pesky external linkage for these variables.

If you need to have a function available to multiple cpp files, consider simply adding the declaration to the necessary cpp files. This wouldn’t fly for a final submission but we’re prototyping here.

Compile a single file

Out of habit, you might build the whole solution, or the specific project when working to see if you’ve introduced any compile errors. Instead, consider if you can afford to simply compile the single file you’re working in. You can typically afford this when making modifications local to a cpp file. This will allow you to avoid having to deal with the linker until you absolutely need to. It might even be worthwhile to make the shortcut convenient.

Use a smaller project

Is there a smaller project you can test in instead of the primary projects? If so, consider using that instead. Your iteration time will improve simply by the fact that your project has a lot less to handle.

In the same vein, consider If you can simply compile the project your files are in without compiling the larger projects that are likely to include a large number of files.

Disable optimizations locally

If you’re running with optimizations and optimizations are removing useful variables you’d like to investigate, consider disabling optimizations in your file instead of changing the project to a debug configuration. This will only require recompiling the single file instead of the whole project.

## Be Nice To The Compiler And Yourself – Long Term Changes

At this point, we’re done iterating, we’ve solidified our solution and now we need to clean it up for submission. At this stage, you’re probably not compiling as frequently and a lot of your time is writing all the necessary bits to link it all together. Decisions made here will impact not only you, but all users of the project. We want our compile times to be short. Long compile times impact everyone, and that’s no fun. Here are some things to consider at this stage.

Does this change really need a new file?

If you’re considering creating a new file, consider if It’s really worth it. A new file implies a new header, a new header implies new includes, new includes means more dependencies, more dependencies means more compilation.

This is a balancing act. Adding a new cpp/h pair might actually allow you to reduce compile times by only including the relevant bits without needing to modify a highly used header. But it might also spawn into a monstrous chain of includes. It’s a matter of tradeoffs, consider them carefully.

New class != new file

Creating a new class does not imply that a new file should be created. Does your class really need to have a header? Is it used by any other class or is it more of a utility for a single file? If your class is only in use by a single cpp file, consider putting it in the cpp file. If it needs to become public, we can implement that change later. (Your coding standard might disagree here in some contexts, if it does, follow the standard)

Headers are not your friend. Do everything you can to reduce the number of includes that you have to add to a header. The more headers required, the bigger the compilation chain grows. Consider if you can forward declare the required types, or if you’re simply including the header for convenience. A simple rule is to only include the header if you absolutely need it to compile. Avoid “general purpose” headers that include a large chunk of what you need. They’ll eat you alive (and your compile times).

Do you really need this function to be a private function?

When adding a private function to an object, consider if it has better use as a utility function embedded in the cpp file. Adding private methods to a header means that other files have to recompile if you modify functionality that they had no access to in the first place. It’s bad encapsulation. This is especially true for private static functions and private static variables.

Can you remove some includes from that header?

Are you working in a header and notice some unnecessary includes? See if you can remove them! Every little bit helps.

Standard headers are a good target to remove from inclusion. They tend to be heavily templated and can reduce compile times significantly.

If you can forward declare a type instead of including this file in your header, consider doing that.

Use the appropriate include type

Using a relative include “../../File.h” will reduce the load on the preprocessor if the path to the file is correct. Use this format when you reference a file from the same project. Use #include <Project/File.h> when the included file is part of a different project than our current one. Avoid #include “Project/File.h” or #include “../../File.h” if your file is not located at this specific location. This causes the preprocessor to search from the relative location of all the included files which can be quite a bit of work.

Does this need to be a template?

Be very wary of making a template that is included in your header. Templates suffer from a quantity of problems. Is your template needed by anyone else but your cpp? If not, consider embedding it. Does your type really need to be a template? Consider if you have an actual use case for this template. If you absolutely need this template. Consider de-templatifying the functionality that doesn’t require the generic type and putting that functionality either in a base class or a set of utility functions.

What’s problematic about templates? Templates require you to write their implementations in your header so that your compiler can instantiate it correctly. This means that any change to the internal functionality of your template will require a recompilation of all the files that include it. Templates can also be tremendously complex, causing a significant amount of work for your compiler. (Template metaprogramming anyone?) If you need your functionality to be expressed at compile time, consider constexpr, compilers are more efficient at processing these than processing template recursion. Also consider that templates require your compiler to create a new instance for every cpp that includes it. (If one cpp instantiates vector and another also instantiates vector the compiler has to do the work twice!) you might consider extern templates in this situation. But I would strongly recommend dropping the template altogether if you can.

## Tools

There are some tools that can help you improve your compile times in a variety of ways. Here’s a few of them.

Ccache

Ccache is a compilation cache tool that maintains a cache of compilations of your cpp files. This works accross full rebuilds, different project version (Maybe you have 2 similar branches) and even across users.

https://ccache.dev/

Compiler Time Tracking Tools

Compilers are starting to incorporate tools to track where time is being spent in compilation. Clang/GCC have “-ftime-report” and MSVC has build insights https://devblogs.microsoft.com/cppblog/introducing-c-build-insights/

An excellent blog post from Aras Pranckevičius details these more thoroughly https://aras-p.info/blog/2019/01/12/Investigating-compile-times-and-Clang-ftime-report/

Show Include List

MSVC has an option to show what’s included in a file. Consider using this to know what’s being included and what doesn’t need to be included.

https://docs.microsoft.com/en-us/cpp/build/reference/showincludes-list-include-files?view=vs-2019

Clang/GCC have the -M and -H options. https://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Preprocessor-Options.html#Preprocessor-Options

Include What You Use

There is also an interesting tool that determines what includes can be removed and what could be forward declared.

https://include-what-you-use.org/

https://stackoverflow.com/questions/8130602/using-extern-template-c11

https://en.cppreference.com/w/cpp/preprocessor/include

http://virtuallyrandom.com/c-compilation-fixing-it/

https://aras-p.info/blog/2019/01/12/Investigating-compile-times-and-Clang-ftime-report/

https://docs.microsoft.com/en-us/cpp/build/reference/showincludes-list-include-files?view=vs-2019

https://include-what-you-use.org/

## Intro

In our last post, we looked at our SIMD-friendly math library. In this post, we’ll be taking a look at how cranberray builds its BVH. For a primer to BVHs, I recommend going to PBRT’s treatment on the topic here.

Canberray’s BVH is built recursively as a binary tree. It starts with the top level node containing all nodes and recursively splits it into 2 children at each level. Originally, the BVH was split simply by selecting the left and right halves of the children nodes. The next approach was to split according to the median of a particular axis (typically the largest one), and now it makes use of the surface area heuristic.

## SAH

The Surface Area Heuristic or SAH is a well known strategy to build an efficient BVH that tries to minimize the number of intersection tests applied on a tree.

PBRT has an excellent treatment on the topic [2], and I’ve linked one of the original papers referencing the idea [1]. I will try to instead provide an intuitive picture of why and how SAH works.

The idea behind SAH is to discover an efficient partition of a collection that will try to minimize the cost of traversing the tree for any ray.

The SAH function is presented as:

$C_o = C_{trav} + P_A\sum C_{isect}(a_i) + P_B\sum C_{isect}(b_i)$

This cost model is based on probabilities. $C_{trav}$ is the cost of traversing the parent node to determine if any of the two children nodes intersect the ray. In cranberray, this would be the cost of testing the ray against both children AABBs and determining if it intersects either. $C_{isect}$ is the cost of testing the shape itself against the ray. In cranberray, this would be the cost of testing the triangles against the ray. Finally, $P_A$ and $P_B$ are the probabilities that the bounding volumes containing the children nodes intersect the ray.

Now, well make a slight modification and reformulate $P_A$ and $P_B$ as $P(A|C)$ and $P(B|C)$ where C is our parent node. This states that $P(A|C)$ represents the probability of the event A happening given the event C has already happened. In this context, this means that $P(A|C)$ represents the probability that we intersect our volume A given that we’ve already intersected our parent volume C.

With this change, our function is now:

$C_o = C_{trav} + P(A|C)\sum C_{isect}(a_i) + P(B|C)\sum C_{isect}(b_i)$

Now, let’s imagine that we’re a lonesome ray traveling through our BVH. We arrive at a node and we test both of it’s children to see if we intersect it. This takes us 10 units of time ($C_{trav}$). At this point we would intersect A with a probability of $P(A|C)$ and we would intersect B with a probability of $P(B|C)$. In this example, we only intersect with A, and so we test our ray against it’s contained shapes. Let’s say that A has 10 shapes, as a result, we have to pay the cost of testing the intersection with all 10 shapes. This is $\sum_{i=1}^{10} C_{isect}(a_i)$. Finally, let’s say that the cost of $C_{isect}$ is 15 for every child shape.

This gives us the final actual cost of:

$C_o = 10+sum_{i=1}^{10} 15 = 10+150 = 160$

What our original cost function is doing, is calculating the expected cost of doing this operation for any possible ray.

As a result, if we were to build with these values:

$C_{trav}=10$

$C_{isect}=15$

$P(A|C)=1$

$P(B|C)=0$

$A_n = 10$

$B_n = 0$

Then our expected cost for this node would be:

$C_o = 10 + 1*\sum_{i=1}^{10} 15 + 0*\sum_{i=1}^{0} 15 = 10+150 = 160$

Which is exactly the cost of our example above because if our A node has a probability of 1, we would always experience the scenario above.

Of course, if we were to take some more likely value such as:

$P(A|C)=0.5$

$P(B|C)=0.6$

$A_n = 10$

$B_n = 12$

$C_o = 10 + 0.5*\sum_{i=1}^{10} 15 + 0.6*\sum_{i=1}^{12} 15 = 10+0.5*150+0.6*180 = 193$

Which would be the expected value of our node. Notice that our probabilities can add up to more than or less than 1 because it is reasonable that we might be able to intersect with both volumes or none at all.

Now we need to determine how we want to define $P(A|C)$. This probability is the probability of a random ray intersecting with our volume given that we’ve intersected with our parent volume.

$P(A|C)$ is given as the surface area of our volume A divided by the surface area of our parent volume C. As a result, we can define $P(A|C) = \frac{S_A}{S_C}$ as the probability of intersecting A given that we’ve intersected C.

### Derivation In 2D

In 2 dimensions, this makes sense, as we can image a rectangle, embedded in another rectangle. The probability of a random point being placed in the child rectangle given that it is contained in the parent rectangle is equal to the area of the embedded rectangle divided by the area of the parent rectangle.

Now, instead we’ll imagine uniformally random ray instead of a random line. To select a random ray, we can select a random direction and then select a random ray along the plane defined by that direction.

With this in mind, we want to calculate the probability that our ray will intersect the children volume given that it has intersected the parent volume.

To do this, we want to calculate the average projected area of our shape.

Imagine that we’ve selected our random direction for our ray, we can see that our ray will have intersected our shape if it’s projection onto the plane intersects our random ray.

Now if we add a second volume embedded in the first volume, we can see that the probability of intersecting the first volume would be given by the projected area of the child volume divided by the projected area of the parent volume.

This example only works for a single direction. However, since we’re looking for a probability given any ray direction, we need to calculate the average of this projection across the sphere of all directions.

In 2 dimensions, the projected area of a circle is always it’s diameter. We will use this fact to test our work.

To calculate the average projected area of an arbitrary convex shape[4][5], we first want to split that shape into a series of small discrete patches.

This will be our integration domain.

Once we’ve split our shape into discrete patches, we want to calculate the average projected area of a single patch.

Now we’ll integrate across the circle of directions containing the patch and divide it by the area of our integration. This will effectively give us the average projected area for a single patch of our shape. This works because we will have calculated the projection for every single direction.

$dA_p = \frac{1}{2\pi}\int_{\Omega} |cos\theta| dA d\theta$

Integrating with an absolute value is tricky. Instead due to the symmetry of the problem, we can look at only a quarter of our circle and multiply by 4 and remove our absolute term.

$dA_p = \frac{2}{\pi} dA \int_{0}^{\frac{\pi}{2}} cos\theta d\theta$

Since $\int_{0}^{\frac{\pi}{2}} cos\theta d\theta = 1$ our expression reduces to

$dA_p = \frac{2}{\pi} dA$

Finally, we want to divide our value by 2, this is because we’ve now counted our projected area twice.

$dA_p = \frac{1}{\pi} dA$

Using this formulation, we can calculate the average projected area of our shape by adding all of our average projected areas together.

$A_p = \int_A dA_p$

$A_p = \int_A \frac{1}{\pi} dA$

$A_p = \frac{1}{\pi} A$

And there you have it, the average projected area of a 2 dimensional convex object is $\frac{1}{\pi}$ the area of said object.

Notice that if we apply this to the surface area of a circle with the formula $2\pi r$ we get $A_p = \frac{2\pi r}{\pi} = 2r$ which is our diameter which matches up with our expectation.

### Derivation In 3D

We can take the same steps in 3D.

Calculate the average projection over our sphere of directions:

$dA_p = \frac{1}{4\pi} \int_0^{2\pi} \int_0^{\pi} |cos\theta| sin\theta dA d\theta d\phi$

Integrating across the positive hemisphere and multiplying by 2

$dA_p = \frac{1}{2\pi} dA \int_0^{2\pi} \int_0^{\frac{\pi}{2}} cos\theta sin\theta d\theta d\phi$

Since $\int_0^{\frac{\pi}{2}} cos\theta sin\theta d\theta = \frac{1}{2}$

$dA_p = \frac{1}{2\pi} dA \int_0^{2\pi} \frac{1}{2} d\phi$

$dA_p = \frac{1}{4\pi} dA \int_0^{2\pi} d\phi$

$dA_p = \frac{1}{2} dA$

Finally, dividing by 2 for our double projection

$dA_p = \frac{1}{4} dA$

And plugging into our surface area calculation

$A_p = \int_A dA_p$

$A_p = \frac{1}{4} \int_A dA$

$A_p = \frac{1}{4} A$

### Putting It Together

Finally we can see that our average projected area in 3D is $\frac{1}{4}$ it’s surface area.

To calculate our probability, we simply want to divide our parent’s projected area (C) divided by our child’s projected area (A)

$P(A|C) = \frac{A_p}{C_p}$

$P(A|C) = \frac{0.25*A}{0.25*C}$

$P(A|C) = \frac{A}{C}$

Where A and C is the surface area of our volumes. And voila, that’s how we got that original formula.

## In Depth View

Now that we have all the pieces, we can take a look at the construction of our BVH.

Cranberray builds it’s BVH from the top down. Starting from a containing bounding volume and splitting it in 2 recursively.

Cranberray keeps a queue of bounding volumes to process as a ring buffer. This makes management of front popping and back pushing very simple but makes resizing the queue trickier. As a result, Cranberray simply allocates a large buffer. We could likely allocate the maximum possible number of elements in the ring buffer instead (something along the lines of $2n-1$ where n is the next largest power of 2)

Cranberray then selects the axis with the widest breadth of centroids. The code for this looks like so:

cv2 axisSpan[3];
for (uint32_t axis = 0; axis < 3; axis++)
{
for (uint32_t i = 0; i < count; i++)
{
axisSpan[axis].x = fminf(axisSpan[axis].x, caabb_centroid(start[i].bound, axis));
axisSpan[axis].y = fmaxf(axisSpan[axis].y, caabb_centroid(start[i].bound, axis));
}
}

uint32_t axis;
if (axisSpan[0].y - axisSpan[0].x > axisSpan[1].y - axisSpan[1].x && axisSpan[0].y - axisSpan[0].x > axisSpan[2].y - axisSpan[2].x)
{
axis = 0;
}
else if (axisSpan[1].y - axisSpan[1].x > axisSpan[2].y - axisSpan[2].x)
{
axis = 1;
}
else
{
axis = 2;
}


Once we’ve selected our axis, we split our axis in 12 distinct buckets as. (See PBRT [2] for more info on this approach)

We then calculate the cost of each seperation by adding up all the volumes on the left of the seperation and all the buckets on the right of the seperation.

We then store the cost of each seperation and select the seperation with the minimal cost as our split.

We then continue in this recursion until we’ve run out of items in our queue. (When we’ve partitioned all of our child volumes into leaf nodes).

Finally, our BVH is restructured somewhat for an improvement in memory usage.

Our BVH is stored in 2 arrays, a “jump index” array and a bounds array. This allows us to load the bounds array without having to load the jump indices into memory until we absolutely need them.

We read from our bounds memory much more frequently than our jump memory, as a result, splitting them allows us to make more effective use of our caches.

Our BVH structure looks like this:


typedef struct
{
union
{
struct
{
uint32_t left;
uint32_t right;
} jumpIndices;

struct
{
uint32_t index;
} leaf;
};
} bvh_jump_t;

typedef struct
{
caabb* bounds;
bvh_jump_t* jumps;
uint32_t count;
uint32_t leafCount;
} bvh_t;



The final special format of our BVH is that all the leaf nodes in our tree are stored at the end of our array. This allows us to test if a node is a leaf node by simply comparing the index in our array against the size of our array minus the number of leaves contained in the tree. This allows us to use data that we’ve already loaded into memory instead of requiring us to load extra data to use in our branch introducing a data dependency.

You can find the source for BVH construction here: https://github.com/AlexSabourinDev/cranberries/blob/5fe9c25e1df23d558b7ef8b5475717d9e67a19fc/cranberray.c#L963

We’ll be taking a look at the BVH traversal next. Until then, happy coding!

## Future Work

The primary addition to this BVH construction algorithm would likely be to look into parallelizing it’s construction. [6]

## Intro

This is the first in a series of posts detailing the functionality of my hobby path tracing, we will be looking at the underpinning’s of it all, the math library cranberry_math.

The goals of cranberry_math are simple. Making SIMD accelerated mathematics easier and fun to use. There are already a huge swath of math libraries out there that do things very well (DirectXMath and GLM to name a few), however I decided to write my own for fun!

## Ideology

The basis of cranberry_math is the concept of lanes. If you’re not familiar with SIMD, the concept is relatively simple. You can imagine that a SIMD register (or vector register) is a set of 4 “lanes”, these lanes are generally independent of each other but allow you to perform operations on them using another SIMD register. For the purpose of demonstration, we will write a SIMD register as $v_n (v_0, v_1, v_2)$ and the element of one of those registers as $v_ne_m (v_0e_0)$ where $v_0e_0$ would be element 0 of SIMD register 0.

I.e. if I take $v_0$ and add it to $v_1$ the result is the equivalent of $v_2 = (v_0e_0+v_1e_0, v_0e_1+v_1e_1, v_0e_2+v_1e_2, v_0e_3+v_1e_3)$

On it’s own, this might not be particularly useful. But vector register operations have a number of desirable properties.

This is a highly simplified version of how your processor might implement vector adds.

Now, with this added information, we can see why we would want to use these operations efficiently, we could reduce our program latency by at most 4x! (In reality this is unlikely to happen, but we can get fairly close).

Cranberry_math takes this concept and tries to make the best use of it while still making it approachable to the user.

An intuitive and simple use of these registers would be to store our X,Y,Z components into each component such as $v_0 = (x, y, z, unused)$ and working with them in that manner. In some common operations such as adding vectors, this can give you a nice hypothetical improvement of 3x over your original single element addition. However, we’re still leaving 25% of our optimal performance on the table! This approach also causes some issues with common operations such as the dot product.

Let’s say in our fantasy processor, our dot product operation takes 2 cycles to complete when applied to 1 vector. That means we get a throughput of 2 cycles/dot product. I propose (and has been proposed elsewhere as well) that you instead use each register to store a single component of a vector. Such as $v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3)$, etc. With this in mind, we can achieve the dot product by doing $v_x*v_x+v_y*v_y+v_z*v_z$ which is a total of 5 operations that all take 1 cycle in our fantasy processor for 4 vectors or a throughput of 1.25 cycles/dot product.

For SSE according to the intel intrinsics guide a multiply operation has a latency of 4 cycles (on Skylake), an add has a latency of 4 cycles. As a result a dot product in a fantasy processor with no out of order functionality or multiple issue, our wide dot product would take roughly 20 cycles to complete or $20/4=5$ cycles per dot product. While if we look at the dot product instruction, it has a latency of 11 cycles. (See [1] and [2] for info on reading the intel intrinsics guide values) This is not indicative of real performance numbers. Modern processors are significantly more complex than what I’m implying here.

Another benefit of storing vectors this way, is that it scales trivially to SIMD registers of size 8, or 16! You only need to throw more data at it!

Now, this approach to storing vectors in SIMD registers is not fool proof. It introduces difficulty if we want to add similar components together. (x with x, y with y) within one vector, I however have not found this to be a problem while writing my path tracer.

This isn’t a silver bullet, but I recommend considering adding a path for this type of storage if you can afford it.

## In Depth View

With that out of the way, here are the primary types for our library:

#define cran_lane_count 4
cran_align(16) typedef union
{
float f[cran_lane_count];
__m128 sse;
} cfl;

typedef union
{
struct
{
float x, y, z;
};

} cv3;

typedef struct
{
cfl x;
cfl y;
cfl z;
} cv3l;


The theme of this library is that SIMD registers are referred to as lanes. Every element is expected to stay within it’s respective lane. cfl represents “cranberry float lane”, cv3 represents “cranberry vector 3” and cv3l represents “cranberry vector 3 lanes”. Notice how every component in cv3l has it’s own set of lanes? That is a direct implementation of the explanation that was presented above.

cran_forceinline cv3l cv3l_add(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_sub(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_mul(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_min(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_max(cv3l l, cv3l r);


The API for cv3l is simple, it looks exactly the same as it’s cv3 counterpart. This makes it relatively easy to switch from one to the other simply by loading the appropriate data correctly.

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount);


One function that we should take a particular look at is cv3l_indexed_load. It’s likely you don’t want to store your data in cv3l format. It doesn’t lend itself particularly well to general processing and works much better in batch processing. As a result your data needs to be transformed from x,y,z,x,y,z to x,x,x,y,y,y,z,z,z to facilitate our maximum throughput.

One way to shuffle our data would be to load each element individually and store it in the appropriate locations. However this is surprisingly slow (I forget the exact reasons, a reference for that would be awesome).

$v_0 = (x_0, y_0, z_0, u), v_1 = (x_1, y_1, z_1, u), v_2 = (x_2, y_2, z_2, u), v_3 = (x_3, y_3, z_3, u)$

and then shuffle the values into a set of temporary registers

$v_{xy0} = (x_0, y_0, x_1, y_1), v_{xy1} = (x_2,y_2,x_3,y_3), v_{z0} = (z_0, u, z_1, u), v_{z1} = (z_2, u, z_3, u)$

and finally shuffle them into the final registers

$v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3), v_z =(z_0, z_1, z_2, z_3)$

(There’s a mistake with the arrows for $Z_2$ and $Z_3$ they should be pointed to the 4 register)

Here’s the source for this operation:

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount)
{
for (uint32_t i = 0; i < indexCount && i < cran_lane_count; i++)
{
uint8_t const* vectorData = (uint8_t*)vectors;
}

return (cv3l)
{
.x = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(2, 0, 2, 0))},
.y = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(3, 1, 3, 1))},
.z = {.sse = _mm_shuffle_ps(Z0, Z1, _MM_SHUFFLE(2, 0, 2, 0))}
};
}


With all this, we can see that I can write a vectorized ray/AABB intersection using this new API:

cran_forceinline uint32_t caabb_does_ray_intersect_lanes(cv3 rayO, cv3 rayD, float rayMin, float rayMax, cv3l aabbMin, cv3l aabbMax)
{
cv3l rayOLanes = cv3l_replicate(rayO);
cv3l invD = cv3l_replicate(cv3_rcp(rayD));
cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);
cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD);

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger  = cv3l_max(t0s, t1s);

cfl rayMinLane = cfl_replicate(rayMin);
cfl rayMaxLane = cfl_replicate(rayMax);
cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);
}


That’s the gist of it! Writing vectorized BVH traversal using this API has been a breeze. You can find the rest of the library here: https://github.com/AlexSabourinDev/cranberries/blob/cranberray/cranberry_math.h

Here is the library in use for BVH AABB testing:

uint32_t activeLaneCount = min((uint32_t)(testQueueEnd - testQueueIter), cran_lane_count);
cv3l boundMins = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, min), testQueueIter, activeLaneCount);
cv3l boundMaxs = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, max), testQueueIter, activeLaneCount);
uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);


We’ll be taking a deeper look at the SSE optimized BVH traversal in the future. For now, happy coding!

## References:

Within the past few months I’ve been implementing a path tracer during my weekends and evenings. Having reached what I consider to be an acceptable stopping point, I thought it might be interesting to detail my current implementation alongside the references that inspired me. Let’s jump in!

This series will touch on a variety of features that were implemented in the path tracer. Notably:

These posts will try to explain every aspect of the path tracer as they stand but they might not be as comprehensive as the sources referenced to create it. As a result, I will strive to include all relevant sources for each feature.

You can find the path tracer here: https://github.com/AlexSabourinDev/cranberries/tree/cranberray

Next we’ll talk a about the underpinning of the path tracer, the cranberry_math library, see you then!

## Intro

In this post, we’ll be looking at the steps taken to create an optimized transform hierarchy. This is by no means the be all end all of transform hierarchies. If anything, it’s more of a log of considerations that have taken place as the creation of the transform hierarchy took place.

### Recap of part 4

In part 4, we looked at SIMD for computations, more quantization, software prefetching and a bug fix. This time, we’re going to be looking at some more bug fixing, more SIMD and some surprising performance impacts!

### Bug fixes

While I was debugging a new optimization for part 5, I ran into a few bugs in the program…

Farmer and crop removal had a bug in it. When we were removing these elements, we would loop from the start of the removed indices and we would take the last valid element from the back and copy it onto the current index we want removed. The problem with this solution, is that if our last valid element was also to be removed, we would copy it onto the current index. This is a problem, because we would need to remove it! It’s not a valid object anymore! Instead, if we loop from the back of the array, we will only replace our current index with an element that is guaranteed to be valid because the indices are ordered.

Farmer and crop removal also had a bug in it with the second call to simd_moveMaskToIndexMask. Before the fix, the call would only mask out the lower 8 bits from the indexMask. However, this is not correct, because the function itself only works with the lower 8 bits! As a result, the indices would all be the same. In order to modify this, I simply shifted the indexMask by 8 bits to the right.

Before:

simd_moveMaskToIndexMask(indexMask & 0xFF00);


After:

simd_moveMaskToIndexMask((indexMask & 0xFF00) >> 8);


It would probably have been best to take a uint8_t as an argument instead of an unsigned int but I had misunderstood the functionality of the _pdep_u64 intrinsic used in simd_moveMaskToIndexMask.

Finally, there was also a bug in converting the result of a 16 bit compare to a move mask. Because there is no movemask intrinsic for 16 bit integer, I had used _mm256_packs_epi16(a, b) which I believed to convert the 16 bit integers to 8 bit integers at their respective element. However, I had not realized that these actually worked within lanes! The first 8 elements were from a and the next 8 were from b and then the next 8 were from a and then from b. I expected the format to be a, a, b, b. As a result, the movemask would end up incorrect!

Instead, I modified the code to execute the movemask on the 16 bit results and then converted the mask to an 8 bit movemask.

__m256i cmpRes = _mm256_cmpgt_epi16(zeroI, lifetime);


The magic happens in _pext_u32(moveMask, 0x55555555). Because we’re creating a movemask from 16 bit integers and not 8 bit integers, the mask bits will actually double! If the result of our compare was 0xFFFF, 0x0000, 0xFFFF, 0xFFFF then our movemask would be 11001111 which is not correct! We want our movemask to look like 1011. As a result, I used _pext_u32. _pext_u32 will take the bits corresponding set bits in mask (0x55555555) and pack them from least significant bit to most significant bit. This means that we’re taking all our even bits and packing them. Because 0x55555555 in binary is 01010101010101… we’re taking x1x0x1x1 from our movemask and packing them to 1011! (More cool bit tricks can be found here)

As a result to all these changes, the performance of the program degraded. I believe it is because we weren’t adding all the indices to our removal list, and we were also not adding the correct indices! Now we actually loop through all the correct indices and chrome://tracing says that our average ai_tick performance degraded from 1.67ms to 1.87ms. Our average tick is now 2.75ms instead of 2.6ms.

### Bucket farming

At this point we’ve made our program very efficient by improving the speed at which we access and modify our data. But now, it’s gotten quite a bit harder to improve the performance of our algorithm this way.

Instead, we’re going to modify our algorithm. According to VTune, most of our time is spent decrementing timers and moving farmers. We’re going to be tackling the timers.

The timers are currently all decremented at a rate of 16ms per tick, that means at times we can be decrementing 1 million timers per tick! Instead of doing this, we can do something much better, we can group our timers in buckets and only decrement the global bucket timers instead of the timers themselves. Then, in order to retain our fine grained timing, we keep track of which bucket will require fine grained decrementation and we will only decrement that bucket!

Looking at our farm state, we can see that our state decrements a timer between the ranges of 3 seconds and 5 seconds. In order to split this up into buckets, I decided to split these buckets up in 6 buckets where each bucket holds the timers of a specific time range.

We start off with an index indicating which bucket needs to be finely decremented. We then increment our global bucket timer and decrement the timers in the bucket referenced by our index. Once out timer reaches 1s, we reset the timer and advance our index by 1 and modulo by the number of buckets in order to get it to wrap around.

Say our current fine bucket index is 5, then our timer reaches 1s, we advance our index by 1 to 6. Since 6 % 6 is 0, that’s our next fine decrementation bucket. We can guarantee this property because we place the timers in the buckets based on their number of seconds.

int16_t farmerTimer = rand_range(AI_FarmerFarmSpeedMin, AI_FarmerFarmSpeedMax);
uint32_t bucketSecond = (farmerTimer + AI_FarmersFarmBucketTransitionTimer) / AI_TimePrecision;
uint32_t bucketFarmerCount = AI_FarmersFarmHotBucketCounts[bucketIndex];

AI_FarmersFarmHotBuckets[bucketIndex][bucketFarmerCount] = farmerTimer - bucketSecond * AI_TimePrecision;
AI_FarmersFarmHotBucketCounts[bucketIndex]++;


Looking at chrome://tracing indicates that ai_tick now runs at an average of 1.7ms from our original 1.8ms. Interestingly, game_gen_instance_buffer now runs 0.1ms slower. I wonder if this a result of ai_tick not allowing game_gen_instance_buffer to complete it’s work in the background.

Looking at VTune indicates that we’re now 75% back-end bound in ai_tick and we’re now retiring almost 30% of our instructions, very good results from bucketing the farm timers.

Doing the same modification to the search timers doesn’t produce exceptional results, but I will keep this modification in order to keep things consistent.

### More quantization!

Now that our timers are only in the range of 0s to 1s, we can quantize our data even more! Now I’m going to quantize our values to a range of 1s to 10ms. This will cause our precision to drop significantly, but since our timers are for AI, I think this cost is appropriate.

As a result, I changed all the farmer timers to int8_t and change the AI_TimePrecision to 10ms. With this change, chrome://tracing notes that ai_tick now runs at an average of 1.25ms from our 1.7ms! However, once again, game_gen_instance buffer slowed down from 0.9ms to 1.17ms…

Big thanks to Zack Dawson (twitter) for the idea!

### Streaming again?

When we left off for game_gen_instance_buffer, we were mostly using memcpy to copy all of our data from one position buffer to another. At that time, I had decided to use memcpy because memcpy is very fast and the location being written to was not guaranteed to be aligned with the location being read from, restricting me from using _mm256_stream_si256.

As a result, I decided to write the last element of an array multiple times in order to align the write index to a 64 byte boundary. This gave me the opportunity to use the stream intrinsics for our homegrown memcpy:

void simd_streamMemCpy(__m256i* dstWrite, __m256i* srcRead, size_t size)
{
if (size == 0)
{
return;
}

__m256i* dstEnd = dstWrite + (size >> 5);
for (; dstWrite <= dstEnd; dstWrite += 2, srcRead += 2)
{
_mm256_stream_si256(dstWrite, src256);
_mm256_stream_si256(dstWrite + 1, src256_2);
}
}


Which compiles to:

    test    rdx, rdx
je      .LBB0_3
and     rdx, -32
xor     eax, eax
.LBB0_2:
vmovaps ymm0, ymmword ptr [rsi + rax]
vmovntps        ymmword ptr [rdi + rax], ymm0
vmovaps ymm0, ymmword ptr [rsi + rax + 32]
vmovntps        ymmword ptr [rdi + rax + 32], ymm0
lea     rcx, [rdi + rax]
cmp     rcx, rdx
jbe     .LBB0_2
.LBB0_3:
vzeroupper
ret


Taking a look at chrome://tracing shows us that gen_instance_buffer now runs at an average of 1ms instead of 1.1ms, saving us an additional 0.1ms.

And VTune:

Both our slowest functions are now taking less than 1s!

### Can we avoid reading the tile stage?

At first, in order to avoid reading the tile stage and to be able to just store the index without having to access the memory, I thought storing a collection of the unplanted indices might be effective. However, I had missed the obvious that in order to get the index from this collection of indices, I would have to read that memory instead. This added complexity simply slowed down ai_tick by 0.2ms.

As a result, I simply changed the tile from a 32 bit integer, to an 8 bit integer giving speedups in ai_tick but slowing down game_gen_instance_buffer again.

### The return of game_gen_instance_buffer

At this point, game_gen_instance_buffer is running at an average of 1ms per tick. Still slower than the original 0.9ms from part 4. As a result, we’re going to tackle it some more with quite a few modifications.

The first, we changed the type for sprite indices from uint16_t to uint8_t. Since the range of values for these indices is 0 to 11, we have plenty of space for these 11 values in a uint8_t.

The next modification was quite a bit more complex than the uint8_t modification. This change takes into account that the state of the farmers will dictate what sprite index will be used to render the farmers and that all farmers share the same scale.

In order to keep the blog post from becoming all about this optimization, I’m only going to go through the explanation for the scales.

The buffer generation is very simple: I write every scale into the array in the order of tiles, crops, farmers. The number of tiles is constant, thus, we don’t need to worry about it. The number of crops however, is variable. This means that were our farmer scales starts and end is determined by the number of crops rendered that tick.

While rendering, I keep track of a writing index. This index determines where the next sprite instance will be rendered.

Like this:

int writeIndex = 0;

__m256i farmerScale = _mm256_set1_epi16(AI_FarmerScale);

if (Gen_PreviousFarmerScaleStart == 0)
{
Gen_PreviousFarmerScaleEnd = writeIndex + AI_FarmerCount;
Gen_PreviousFarmerScaleEnd += 64 - (AI_FarmerSearchCount % 64);
Gen_PreviousFarmerScaleEnd += 64 - (AI_FarmerMoveCount % 64);

simd_memSetToValue((__m256i*)(buffer->scales + writeIndex), farmerScale, (Gen_PreviousFarmerScaleEnd - writeIndex) * sizeof(uint16_t));
Gen_PreviousFarmerScaleStart = writeIndex;
}
else
{
uint32_t newEnd = writeIndex + AI_FarmerCount;
newEnd += 64 - (AI_FarmerSearchCount % 64);
newEnd += 64 - (AI_FarmerMoveCount % 64);

if (newEnd > Gen_PreviousFarmerScaleEnd)
{
uint32_t extra = Gen_PreviousFarmerScaleEnd % 64;
uint32_t writeLoc = Gen_PreviousFarmerScaleEnd - extra;
simd_memSetToValue((__m256i*)(buffer->scales + writeLoc), farmerScale, (newEnd - Gen_PreviousFarmerScaleEnd + extra) * sizeof(uint16_t));
}

if (writeIndex < Gen_PreviousFarmerScaleStart)
{
simd_memSetToValue((__m256i*)(buffer->scales + writeIndex), farmerScale, (Gen_PreviousFarmerScaleStart - writeIndex) * sizeof(uint16_t));
}

Gen_PreviousFarmerScaleEnd = newEnd;
Gen_PreviousFarmerScaleStart = writeIndex;
}


And that’s it! This change adds quite a bit of complexity to our generation code but the results are worth it!

chrome://tracing indicates that our average game_gen_instance_buffer tick is now 0.72ms, faster than our original performance! ai_tick also runs at an average of 1.2ms and our average tick is now 1.9ms.

VTune shows us that game_gen_instance_buffer is now at

### More bug fixes…

At this point, I realized that I hadn’t ran the game out of profile mode in a bit… I had gotten too enthralled in the performance side of things. And when I ran it… nothing rendered…

It turns out I had made a mistake very early on when I changed the format of Game_InstanceBuffer.

As a refresher, our buffer looks like this:

typedef struct
{
uint8_t spriteIndices[GAME_MAX_INSTANCE_COUNT];
uint16_t scales[GAME_MAX_INSTANCE_COUNT];
uint16_t positionX[GAME_MAX_INSTANCE_COUNT];
uint16_t positionY[GAME_MAX_INSTANCE_COUNT];
} Game_InstanceBuffer;


The code was set up in such a way that I would copy the data needed for all the rendered instances with this call:

sg_update_buffer(Render_DrawState.vertex_buffers[0], Render_InstanceBuffer, (sizeof(uint16_t) * 3 + sizeof(uint8_t)) * renderInstances);


This doesn’t work! If our renderInstances is 1, we’re rendering 7 bytes of data from the sprite indices array, not spriteIndices, scales, positionX and positionY! This caused the GPU to only get a little bit of the data that it needed and the rest was completely missing…

This could be seen as a lesson in testing all aspects of your code…

### One last hoorah

At this point, it was very clear to me that I had to address the movement code. The problem with this, is that we have to access every target position and velocity in order to move the farmer.

One approach that I attempted was to store the future positions of the farmers into N buffers and read from those and only process a fraction of the active move farmers while using the buffered positions for the other fractions. A rough prototype of this approach however showed that this was slower than the simple processing of all the farmers due to needing to touch quite a bit of memory to store the future positions of the farmers.

Another consideration was to bucket the farmer movements by cardinal direction. If the farmers were close enough to a cardinal direction, they would use that direction as an approximation of their velocity. This attempt only managed to provide speedups if a large portion of the farmers were bound to predefined velocities and quickly introduced visual artifacts. As a result, this solution although potentially viable with a large amount of predefined velocities didn’t seem particularly viable for this project.

After these two attempts, I’m going to call it for this project. We made some excellent progress and I learned a ton in the process. I hope you did as well!

### Where are we now?

Looking at chrome://tracing tells us that we’re now at an average tick of around 2ms from our original 42ms. 21 times faster than our original performance! We tackled a lot from memory access patterns, quantization, SIMD, software prefetching and non-temporal stores. As a result however, our program went from ~400 lines to almost 1k lines of code. Our program is now harder to change and harder to read, but code cleanliness was not one of our concerns here.

I had great fun and I hope you did too!

Until next time!

Find the repo on github here!