
## The Introduction

Recently, I’ve been doing a bit of extra research into exposure in the context of physically based rendering. Primarily, I was looking to strengthen my understanding of the underlying concepts and values involved. This post can perhaps be described as a diary and something of an explainer? Either way, here we go!


## Where is the cosine? Lambertian Reflectors and Rendering

For quite some time, I’ve been plagued by a question about Lambertian reflectors and sampling them in a renderer. The issue arises from the definition of a Lambertian reflector. You might be familiar with the formulation of the Lambertian BRDF as $\frac{Albedo}{\pi}$. This formulation is simple and constant across viewing direction and illumination direction. However, there is actually an implicit cosine factor based on viewing direction involved.

Lambertian surfaces emit less light as the viewing angle from the surface’s normal increases. The factor is actually $cos(\theta_o)$ where $\theta_o$ is the angle from our viewing direction to our normal. [1]

My question was, why don’t we apply this cosine factor when sampling our surfaces?

After all, we’re asking for how much light is being emitted from a surface at a given point.

If we wanted this answer, shouldn’t we also multiply it by the cosine of our viewing angle?

Shouldn’t it be $\frac{Albedo}{\pi} * cos(\theta_o) * L_i * cos(\theta_i)$ instead of $\frac{Albedo}{\pi} * L_i * cos(\theta_i)$?

The problem lies in the mental model I’ve presented.

When rendering, we’re not simply asking how much light is being emitted at a given point.

We’re asking how much light is being emitted from a given area. This area is the projection of our pixel onto our surface, $A_p$.

Imagine our rendering problem as a two dimensional problem, where our image plane is represented by a line segment and a pixel is represented as a segment of that line.

What we’re asking when rendering our surface is not the light being emitted from a single sample point, but the light being emitted from the surface area $A_p$.

$EmittedLight = L_o * A_p$, where $L_o$ represents the amount of light being emitted in a particular direction per unit area.

As a result, if we do a bit of trigonometry, we can see that $A_p = \frac{A}{cos(\theta_o)}$.

And we also know that our Lambertian reflector emits light as $L_o = \frac{Albedo}{\pi} * cos(\theta_o) * L_i * cos(\theta_i)$

If we place these values in our equation for emitted light we get:

$EmittedLight = L_o * A_p$

$EmittedLight = \frac{Albedo}{\pi} * cos(\theta_o) * L_i * cos(\theta_i) * \frac{A}{cos(\theta_o)}$

$EmittedLight = \frac{Albedo}{\pi} * L_i * cos(\theta_i) * A$

Which cancels out our cosine terms! ($A$ is implicit when rendering a pixel grid)

This led me to some interesting insights/questions.

• The value represented by a pixel can be seen as radiant intensity.
• The value calculated by our shader can be seen as radiant intensity. (If it were radiance, we would need to reintroduce the cosine term, since we would not be multiplying by the projected area.)
• We can think of the pixel grid as one big integration over our image plane; thinking of the pixels as little squares can be a valuable tool.
• Do all BRDFs have an implicit cosine term? This line of reasoning leads me to think so.

That’s it for today! Thanks for popping by, hopefully you found this as insightful as I have.

## Memory Alignment And Why You (Might) Care

Something interesting came up about memory alignment yesterday and I figured it could be neat to write up a quick synopsis of memory alignment.

### What is memory alignment?

Memory alignment is the requirement that the address of your object must be a multiple of its alignment.

For example, a 4 byte integer has a memory alignment of 4 bytes; this means that you can find the integer at addresses 0, 4, 8, 12, 16, 20, 24, etc.

Unaligned memory accesses used to be a pretty significant problem, since CPUs could only load memory on particular alignment boundaries. (You could only load a 4 byte word at addresses that were a multiple of 4 bytes. I assume compilers could likely handle the case where memory was not aligned at the price of some performance – needs a reference.)

However, x86-64’s load instruction (mov) supports unaligned accesses [1]. This is not necessarily the case for other architectures, so be careful!

### Why do we care?

Usually we don’t!

Because:

• Most memory will be aligned on an 8 byte boundary if you are using a standard allocator (see max_align_t [2] which defines the alignment of all allocations by the standard allocators).
• Most memory can be loaded on an unaligned address without any problems anyways (sortof).

However there is a case that can cause problems.

### Welcome to SIMD

SIMD stands for Single Instruction, Multiple Data. I’ll be glossing over it for now, feel free to reach out if you’d like to talk about it some more.

The important bit about SIMD in x86-64 is that its primary type, __m128, is 16 bytes large and has an alignment requirement of 16 bytes.

This type is not loaded using the standard x86 load instruction. It uses a particular instruction (movaps) that _requires_ the address to be aligned on a 16 byte boundary. If the address being loaded isn’t a multiple of 16 bytes, your application will throw an exception.

This means that if your standard allocator allocates on an 8 byte boundary, you are likely to crash if you allocate __m128 without telling it to align on a 16 byte boundary.

There is an instruction that allows you to load this type without alignment (movups) and the compiler _can_ emit this instruction.

However, this is easy to break.

Consider this small program.

uint8_t bytes[40];
__m128& vector1 = *(__m128*)&bytes[8];
__m128& vector2 = *(__m128*)&bytes[24];
vector1 = _mm_add_ps(vector1, vector2);


MSVC will generate something like this (edited for readability):

movups xmm0, xmmword ptr [vector1]
addps xmm0, xmmword ptr [vector2]
movups xmmword ptr [vector1], xmm0


Notice that the first instruction is an unaligned load (yay!). However, notice that we’re also referencing vector2 in our addps instruction.

addps requires its source arguments to be aligned to a 16 byte boundary, and our memory very likely is not. (24 is not a multiple of 16!)

This will crash.

The worst part? It crashes inconsistently. It all depends on the original address of “bytes”. If it was at address 8, then vector1 will be at address 8+8=16 and vector2 will be at address 8+24=32: no crash. However, if it was at address 0, well…….

## More things!

### How can we break this alignment?

#### Packing into a buffer

Let’s say I allocate a large block of memory, maybe 1000TB… or 32 bytes for simplicity.

[................................]

Maybe what I want to do is add a series of objects into that buffer to make them nice and packed.

First I add a byte into it.

[1...............................]

Then maybe I want to add a 4 byte integer into it

[12222...........................]

Then maybe I want to add our __m128 object

[122223333333333333333...........]

Uh oh! Now our __m128 object is at address 5, which definitely isn’t aligned to a 16 byte boundary.

#### Custom allocation

Another possibility is to allocate our __m128 object using a standard allocator.

Say I do this:

__m128* myVector = new __m128{};

This can easily cause a crash later on, since standard allocators are only guaranteed to align to max_align_t, which on most platforms is 8 bytes. (It seems this might change in C++17 [3].)

### What are the benefits of alignment?

There are a few that I can think of:

#### CPU Architectural simplicity.

Disclaimer: this is pure speculation. If I can guarantee that all my memory loads are aligned at particular boundaries, I can likely make my CPU much simpler, which takes less space, less power and less heat. More space is more good.

If I have memory aligned at powers of 2 such as 2,4,8,16,etc, then I can guarantee that I will not have a small data type that will cross a cache line boundary.

Notice that with an 8 byte cache line, I can fit one 8 byte object. However, if the object is unaligned, it could straddle 2 cache lines!

[........][........]

[11111111][........] // Good

[....1111][1111....] // Bad

Our CPUs load memory as these singular cache lines; requiring 2 cache lines to be loaded for a single object can introduce some undesired latency in a hot loop. [6]

This gets even worse with memory pages, however, due to a bout of extreme laziness I will not be going into details about it.

### Conclusion

And that’s it! Thanks for coming by, hope this was informative! Got any questions? Please reach out!

## Diary of a Path Tracer – Wrapping Up

### Intro

In our last post, we took a look at texture sampling and bump mapping. In this post, we’ll be talking about the last few additions that were made to cranberray before it was wrapped up: converting from a recursive path tracer to an iterative path tracer, and implementing tone mapping using the ACES tone mapping curves.

### Iterative Path Tracing

Converting cranberray from a recursive path tracer to an iterative path tracer was surprisingly simple. Before this change, cranberray would shoot a ray from the camera and run the appropriate shader at the intersection point. This shader would in turn shoot a ray in some arbitrary direction to determine how much light should illuminate that location. This would continue recursively until the rays would either reach their maximum recursive depth or the ray would terminate by exiting the scene completely and sampling our skybox.

This approach makes sense if you think of every point as reflecting light that it has received from another location. As a result, you need to know how much light you’ve received before you can shade your point.

However, we can instead consider the opposite view: that every point is continuously absorbing light. We can see that the amount of light the viewer receives is modeled by

$light = lightSource * absorb_0 * absorb_1 * absorb_2 ... * absorb_n$

With that in mind, we can see that we can simply calculate how much light we absorb at each point without needing to know the quality or the quantity of the light that we are receiving.

Instead of thinking of it as the light source being recursively reflected to the viewer.

light recursive(i)
{
    return emissionAtPoint(i) + absorptionAtPoint(i) * recursive(i + 1);
}


We instead think of it as the light being absorbed at every stage

light iterative()
{
    for(i = 0; i < maxDepth; i++)
    {
        light += absorption * emissionAtPoint(i);
        absorption *= absorptionAtPoint(i);
    }
    return light;
}


This has some interesting properties!

• If we notice that our absorption has reached 0, we can stop iterating.
• We can multi-thread this much more easily than the recursive approach.
• It has the desirable property of not easily introducing stack overflows.

I also find this version easier to think about which allows me more flexibility in the solutions I can imagine.

### Tone Mapping

I will only touch briefly on tone mapping, as I find that [1], [2] and [3] already provide excellent overviews of this topic.

Cranberray makes use of the histogram based average luminance calculations alongside the ACES tone mapping operator. You can find more information about these in the referenced articles.

However, one portion that I would like to talk about is why we are using the geometric mean to calculate the average luminance and hopefully provide some intuitions about the geometric mean.

As you can see in [1], the histogram luminance is stored as a histogram of the logarithms of our luminance values instead of their raw values. We sum up the logarithms, divide by the total number of values and convert back to a luminance.

This is the definition of the geometric mean. You can calculate the geometric mean either through repeated multiplication:

$\sqrt[n]{i_1*i_2*...*i_n}$

Or by using the logarithmic identities:

$e^{\frac{1}{n}(ln(i_1)+ln(i_2)+...+ln(i_n))}$

It took me quite some time to understand why the geometric mean would be a desirable value when doing tone mapping. Then, I realized that it is desirable when you are speaking in terms of multiplicative properties.

If I’m speaking of light, I’m thinking of the intensity either being twice as bright, or half as bright. In most scenarios, I don’t necessarily care if something is 2 units brighter versus 2 units dimmer. The center value of something that is twice as bright and half as bright is the geometric mean.

You can derive this by imagining that we have our center value $x$.

Imagine that we have a set of numbers, $0.5x, 1x, 2x$ and I would like to determine the value of $x$. If we repeatedly multiply our values, we get $0.5*1*2*x^3=x^3$ and we can then retrieve our value by taking the cube root.

Generalizing, given 3 variables

$y,z,w$

where

$y=ax, z=bx, w=cx$

and

$a*b*c = 1$ (1)

Then we can find a value for x by

$\sqrt[3]{y*z*w}=m$

since

$\sqrt[3]{ax*bx*cx}=m$

$\sqrt[3]{a*b*c*x^3}=m$

given (1), then

$\sqrt[3]{x^3}=m$

$x=m$

As an example, say we have the values 0.2, 0.7 and 1.3. Taking the geometric mean gives $\sqrt[3]{0.2*0.7*1.3} \approx 0.5667$. Now if we divide our original values by that value we get roughly 0.3529, 1.2352 and 2.2940, which when multiplied together give 1!

In terms of desirable properties, let’s take a look at a more visual example.

These images show the difference between using the arithmetic mean to normalize our values and using the geometric mean to normalize our values. The top row is our arithmetic mean and the bottom row is our geometric mean.

Our colors are simply generated by starting with 2, 4, 8, 16, 32, …, 2^n and calculating their arithmetic and geometric means, then dividing each value by the respective means.

In our first example we have 2, 4 and 8. We divide by 4.6666 with the arithmetic mean and divide by 4 with the geometric mean.

You’ll notice that as our numbers get larger and larger, the arithmetic mean slowly loses all detail in the low range of our values and only maintains the brightest values at the edges.

Notice also that the geometric mean remains centered despite the ever growing magnitude of our values. This is desirable since most of our image would become very dark, very quickly.

### Conclusion

That’s it for this series! Our next series will likely focus on a completely different topic, possibly in the world of real-time rendering. Cranberray was an excellent experience in various areas of path tracing. I would love to come back to it someday. Until next time, happy coding!

## Diary of a Path Tracer – Texture Sampling And Bump Mapping

### Intro

In our last post, we looked at the various utilities used by cranberray. In this post, we’re going to return to the world of graphics and take a look at cranberray’s texture sampling and bump mapping strategy.

### Texture Sampling

Cranberray takes inspiration from graphics APIs and splits texture sampling into 2 parts: the sampler and the texture.

The sampler in cranberray stores information such as sample type (nearest, bilinear) and some other useful flags such as gamma_to_linear to convert our textures stored in sRGBA to linear RGBA. Textures store pixel data and simply support sampling a single point in them.

Sampling a texture in cranberray is quite simple: you take your UV coordinates and convert them to an array index.

static cv4 sample_r_u8(cv2 uv, uint8_t* cran_restrict image, uint32_t width, uint32_t height, uint32_t offsetX, uint32_t offsetY)
{
    float readY = uv.y * (float)height;
    float readX = uv.x * (float)width;

    uint32_t y = (uint32_t)floorf(readY) + offsetY;
    y = y >= height ? height - 1 : y;
    uint32_t x = (uint32_t)floorf(readX) + offsetX;
    x = x >= width ? width - 1 : x;
    uint32_t readIndex = y * width + x;

    float f = (float)image[readIndex] / 255.0f;
    return (cv4) { f, f, f, f };
}


Once you can retrieve distinct pixel values, you can either sample nearest or use bilinear interpolation between 4 different samples.

if (sampleType == sample_type_nearest)
{
    color = samplers[texture->format](uv, texture->data, texture->width, texture->height, 0, 0);
}
else if (sampleType == sample_type_bilinear)
{
    cv4 s00 = samplers[texture->format](uv, texture->data, texture->width, texture->height, 0, 0);
    cv4 s01 = samplers[texture->format](uv, texture->data, texture->width, texture->height, 0, 1);
    cv4 s10 = samplers[texture->format](uv, texture->data, texture->width, texture->height, 1, 0);
    cv4 s11 = samplers[texture->format](uv, texture->data, texture->width, texture->height, 1, 1);

    float wf = cf_frac((float)texture->width * uv.x);
    float hf = cf_frac((float)texture->height * uv.y);
    wf = wf < 0.0f ? 1.0f + wf : wf;
    hf = hf < 0.0f ? 1.0f + hf : hf;
    color = (cv4)
    {
        cf_bilinear(s00.x, s01.x, s10.x, s11.x, wf, hf),
        cf_bilinear(s00.y, s01.y, s10.y, s11.y, wf, hf),
        cf_bilinear(s00.z, s01.z, s10.z, s11.z, wf, hf),
        cf_bilinear(s00.w, s01.w, s10.w, s11.w, wf, hf)
    };
}


And that’s pretty much it for how cranberray samples its textures!

### Bump Mapping

Bump mapping is a bit more fun than texture sampling.

A very important and interesting point about the tangent frame is that it is the set of vectors that represent the basis for our texture coordinates. Originally, I believed that any tangent and bitangent could be selected as long as they were orthonormal. In the context of bump mapping, this is not correct. We actually want to select our tangent and bitangent so as to have them represent the flow of the U and V coordinates in space. ([2] has an excellent visualization for this)

To construct your tangent and bitangent, you can imagine that your triangle edge is a construction of the tangent and bitangent vectors.

Once you know you can construct your edges from some contribution of the tangent and bitangent, you can solve for your tangent and bitangent vectors with a little bit of algebra.

// e0=du0T+dv0B (1)
// e1=du1T+dv1B (2)
// solve (1) for B
// B=(e0-du0T)/dv0
// plug into (2)
// e1=du1T+dv1(e0-du0T)/dv0
// solve for T
// e1=du1dv0T/dv0+dv1e0/dv0-dv1du0T/dv0
// dv0e1=du1dv0T+dv1e0-dv1du0T
// dv0e1-dv1e0=du1dv0T-dv1du0T
// dv0e1-dv1e0=T(du1dv0-dv1du0)
// T = (dv0e1-dv1e0)/(dv0du1-dv1du0)


Calculating tangent frames caused a surprising amount of issues with degenerate triangles and NaNs. Be careful here.

And now that you have your tangent and bitangent vectors, you can “bump” them using the partial derivatives of a height map. To bump your tangent and bitangent vectors, you add a scaled normal to the tangent and bitangent vectors and recalculate the normal using the cross product of said vectors.

cv3 normal = cv3_normalize(inputs.normal);
{
    cv4 partialDerivative = sampler_sample(&scene->textureStore, microfacetData.bumpSampler, microfacetData.bumpTexture, inputs.uv);
    // Bump the tangent and bitangent by the normal scaled by the height map's
    // partial derivatives, then rebuild the normal from their cross product.
    // (The tangent/bitangent math is elided in this excerpt.)
    normal = cv3_normalize(normal);
    cran_assert(cv3_dot(normal, inputs.normal) >= 0.0f);
}


### Conclusion

And that’s it! This part of cranberray is quite simple but was fun to write! Next time we’ll look at the iterative path tracing of cranberray. Until then, happy coding!

### Future Work

I would like to add a few things to the texture sampling system, as it’s quite barebones: mip map selection, trilinear interpolation, as well as a texture caching system. (Currently cranberray keeps all textures resident in memory.)

## Diary of a Path Tracer – Utilities

### Intro

In our last post, we took a look at a variety of things. Namely the GGX-Smith BRDF, importance sampling and multiple importance sampling. These are really interesting aspects of path tracing, I recommend the read!

In this post, we’re going to look at all the small utilities for cranberray. Obj loading, multi-threading and our allocator. All these pieces are relatively simple but were a lot of fun to write!

### Allocator

Cranberray’s allocator is tiny. It’s roughly 60 lines of code.

typedef struct
{
    // Grows from bottom to top
    uint8_t* mem;
    uint64_t top;
    uint64_t size;
    bool locked;
} crana_stack_t;

void* crana_stack_alloc(crana_stack_t* stack, uint64_t size)
{
    cran_assert(!stack->locked);
    uint8_t* ptr = stack->mem + stack->top;
    stack->top += size + sizeof(uint64_t);
    cran_assert(stack->top <= stack->size);

    *(uint64_t*)ptr = size + sizeof(uint64_t);
    return ptr + sizeof(uint64_t);
}

void crana_stack_free(crana_stack_t* stack, void* memory)
{
    cran_assert(!stack->locked);
    stack->top -= *((uint64_t*)memory - 1);
    cran_assert(stack->top <= stack->size);
}

// Lock our stack, this means we have free range on the memory pointer
// Commit the pointer back to the stack to complete the operation.
// While the stack is locked, alloc and free cannot be called.
void* crana_stack_lock(crana_stack_t* stack)
{
    cran_assert(!stack->locked);
    stack->locked = true;
    return stack->mem + stack->top + sizeof(uint64_t);
}

// Commit our new pointer. You can have moved it forwards or backwards
// Make sure that the pointer is still from the same memory pool as the stack
void crana_stack_commit(crana_stack_t* stack, void* memory)
{
    cran_assert(stack->locked);
    cran_assert((uint64_t)((uint8_t*)memory - stack->mem) <= stack->size);
    *(uint64_t*)(stack->mem + stack->top) = (uint8_t*)memory - (stack->mem + stack->top);
    stack->top = (uint8_t*)memory - stack->mem;
    stack->locked = false;
}

// Cancel our lock, all the memory we were planning on committing can be ignored
void crana_stack_revert(crana_stack_t* stack)
{
    cran_assert(stack->locked);
    stack->locked = false;
}

As you can see, it’s a simple stack allocator. It grows upwards, stores the size of your allocation before your allocation and shrinks when you pop from it. It has very few safety checks and is simple to the point of being flawed.

The part that I thought was interesting is the lock, commit and revert API. It’s a simple concept that I really can only afford because of the simplicity of cranberray.

Using the locking API is simple: you call crana_stack_lock to retrieve the raw pointer from the allocator. This allows you to move the pointer forward to any size without needing to continuously allocate chunks of memory. The peace of mind it gives and the ease of use is excellent.

Once you’re done with your memory, you have two options. You can commit your pointer, finalizing your changes. Or you can revert, returning the top pointer to its location from before the call to lock. This allows you to work with a transient buffer of memory that you can simply “wish away” once you’re done with it.

Coding with this allocator was a blast. I would consider adding this kind of extension to any custom allocator if it could be made a bit safer.

Maybe what could be done, is to write it on top of another more general allocator. The stack allocator could request a fixed sized buffer from the allocator and allow the user to work within that range. Once you’ve finished with the stack allocator, it could return the unused parts to the primary allocator.

### Obj Loading

When I speak of Obj, I mean the text version of the Wavefront geometry format.

I won’t be going into the details of the format. They are well explained at [1].

Cranberray’s Obj loader is simple. First we memory map the file, this allows us to easily read our memory using a raw pointer and allows us to suspend our knowledge that this is an IO operation.

Once our file is mapped, we process our file in 2 passes. We first loop through our file to determine how many of each element we should be allocating, and once we’ve done that, we parse the data into the appropriate buffers.

for (char* cran_restrict fileIter = fileStart; fileIter != fileEnd; fileIter = token(fileIter, fileEnd, delim))
{
    if (memcmp(fileIter, "vt ", 3) == 0)
    {
        uvCount++;
    }
    else if (memcmp(fileIter, "vn ", 3) == 0)
    {
        normalCount++;
    }
    else if (memcmp(fileIter, "v ", 2) == 0)
    {
        vertexCount++;
    }
    else if (memcmp(fileIter, "g ", 2) == 0)
    {
        groupCount++;
    }
    else if (memcmp(fileIter, "f ", 2) == 0)
    {
        uint32_t vertCount = 0;
        fileIter += 1;
        for (uint32_t i = 0; i < 4; i++)
        {
            int32_t v = strtol(fileIter, &fileIter, 10);
            if (v != 0)
            {
                strtol(fileIter + 1, &fileIter, 10);
                strtol(fileIter + 1, &fileIter, 10);
                vertCount++;
            }
            else
            {
                break;
            }
        }
        assert(vertCount == 3 || vertCount == 4);
        faceCount += vertCount == 3 ? 1 : 2;
    }
    else if (memcmp(fileIter, "usemtl ", 7) == 0)
    {
        materialCount++;
    }
    else if (memcmp(fileIter, "mtllib ", 7) == 0)
    {
        materialLibCount++;
    }
    else if (memcmp(fileIter, "# ", 2) == 0)
    {
    }
}


Cranberray’s Obj loader simply assumes that every space, tab or return carriage is a potential delimiter of a token. You might be thinking “this is fragile!” and you would be right! But it works for my use case.

Once done iterating through every token to count how much memory it should allocate, we reset the pointer and gather our data.

We could write this loader as a state machine, but I find this solution mildly simpler (and less effort to think about).

Something that I think is of note for the loader, is how materials are handled. Instead of storing our mesh in chunks for each material type, I simply store every material boundary. This allows me to store our mesh as one big chunk while also handling multiple materials per chunk.

Honestly, that’s pretty much it. It’s just a series of switch statements and loops.

### Multi-threading

Multi-threading cranberray was a piece of cake. Since all mutated state is contained within each thread, there is little to no effort needed to parallelize it.

Every thread in cranberray gets access to the immutable source render data (scene, lights, etc), its own render context (random seed, stack allocator, stats object) and the render queue.

Work for cranberray is split into render “chunks” which are simply rectangular segments that every thread will be rendering to. These chunks are stored in a queue (the render queue).

To get an element from the queue, every thread is simply using an atomic increment to retrieve their next chunk.

int32_t chunkIdx = cranpl_atomic_increment(&renderQueue->next) - 1;
for (; chunkIdx < renderQueue->count; chunkIdx = cranpl_atomic_increment(&renderQueue->next) - 1)


Note the “-1” here. If we simply read from next and then incremented, multiple threads could read the same value for their chunk index. Instead, we take the post-increment value returned by the atomic instruction and subtract one, recovering the value that only our thread observed. This guarantees that our queue index is unique to us.

Once a thread has its chunk, it simply writes to the output target at its specified location.

Cranberray also captures some useful metrics from its rendering, such as rays spawned, time spent and other useful bits and pieces.

Originally this data was global, but we can’t have that if we’re going to access it from different threads. One option is to modify every item in that data with atomic operations, but atomic operations aren’t free, and the user has to be careful about where and how they update the data. Instead, a much simpler approach is to store metric data per-thread and merge it at the end. This is the approach cranberray takes, and the peace of mind afforded by this strategy is sublime.

### Conclusion

This post was a simple rundown of a few small but essential things from cranberray. Next post we’ll be looking at texture mapping and bump mapping! Until next time. Happy coding!

## Diary of a Path Tracer – GGX-Smith and Multiple Importance Sampling

### Intro

In our last post, we looked at cranberray’s sampling strategy and its implementation of N-Rooks sampling. In this post, we’re going to look at cranberray’s BRDF implementations and multiple importance sampling.

Cranberray’s shading pipeline is very simple. When loading a mesh, ranges of triangles (or submeshes) are tagged to use a specific material. When intersecting the geometry, cranberray will use the triangle’s index to look up which material it should use for said triangle. Each material has an associated shader and an appropriate data object. Once we know which material to use, we call the function associated with that shader and light our pixel. The heavy lifting happens inside the shader.

The primary lighting shader in cranberray is shader_microfacet. This function uses Lambert as its diffuse BRDF and GGX-Smith as its specular BRDF. It selects and blends both using multiple importance sampling.

In this post we’ll be taking a look at the magic behind the shader starting with our Lambert BRDF!

### Understanding The Lambertian BRDF

For this post, we’ll assume that there is some familiarity with the fact that the Lambert BRDF is simply the Albedo of the surface divided by $\pi$.

However, one bit I’d like to touch on is that a Lambertian reflector emits energy at a constant rate in every direction.

At least, that was how I understood a Lambertian reflector. This is incorrect. A Lambertian reflector emits energy at a constant rate in every direction per unit area.

The difference here is that the first definition describes radiant intensity and the second defines radiance. [3]

What that means is that the amount of energy emitted is proportional to the angle at which the surface is viewed. Wikipedia was quite insightful here [1].

Imagine a rectangular Lambertian reflector of 1 meter on each side that reflects 1 unit of radiant intensity.

Notice that in that case, our Lambertian reflector is reflecting $L = \frac{1}{1} = 1$ unit of radiance.

Now, imagine that we’re looking at the reflector at an angle of 60 degrees

Notice that the perceived size of our reflector is now 0.5 meters squared. If our reflector were to emit 1 unit of radiant intensity when viewed at this angle, it would emit $L = \frac{1}{0.5} = 2$ units of radiance. Twice as much energy per unit area! Meaning that if we looked at it at an angle, it would emit the same amount of energy for less area in our field of view making it brighter.

As a result, you can imagine that this is problematic when we’re trying to render a surface that looks uniform when viewed from all directions.

A Lambertian reflector should reflect constant radiance. As a result, the energy reflected per direction by our material should be proportional to the change in size of our reflector when viewed at an angle.

I’ll save you the trigonometry and tell you that this factor is $cos(\theta)$ where $\theta$ is the zenith angle of our viewer.

As a result, you can see that to emit constant radiance, our reflector’s radiant intensity should be reduced by $cos(\theta)$ and if we do that, our reflector would emit 0.5 units of radiant intensity and as a result would emit $L = \frac{0.5}{0.5} = 1$ unit of radiance.

Now, like me, you might be thinking “If we need a cosine here to make our radiant intensity proportional to the foreshortening of our area, why don’t we multiply our emitted light by the cosine factor when rendering?”

The reason is that this cosine factor is relevant when we’re trying to calculate radiance from radiant intensity and the area being shaded. However, when we’re rendering, the result of our shading math is radiance not radiant intensity. If we wanted to instead calculate the radiant intensity, we would have to integrate across our surface.

$I = \int_A L dA$

Taking our previous surface as an example, when we’re shading it in our renderer, we’re asking “Given this irradiance, what is its radiance?”. In our previous example, our surface was being shaded with an irradiance of $\pi$ (conveniently skipping over why here), and as a result, our radiance was 1. If we were to ask what the radiant intensity was in any given direction, we would have to multiply the radiance by the projected area of our surface. At an angle of 60 degrees, this would be 0.5, making our radiant intensity $I = L * A = 1 * 0.5 = 0.5$.

This was a long winded attempt at explaining a neat and surprising (to me) property of Lambertian reflectors.

### Smith-GGX and Microfacet BRDFs

Aside from our Lambert BRDF for diffuse, cranberray makes use of the Smith-GGX specular formulation. This formulation is very common and well documented. I refer the interested reader to [6] for an overview of the topic.

I won’t explain microfacet BRDFs here, but I highly recommend [6] and [9] for a derivation of the specular microfacet BRDF formula, and [8] for more info about the origins of the formula.

As presented in the heading, we’re making use of the Smith uncorrelated shadowing function and the GGX normal distribution function (NDF). [9]

### Importance Sampling

At this point in our rendering, we have a specular BRDF and a diffuse BRDF; we now need to decide where to send our next ray.

The simplest method for selecting a ray is to uniformly select a point on the hemisphere centered about the normal of our surface.

This works well to get started but can converge very slowly, especially in scenarios with sharp highlights, which often happen with smooth surfaces.

If we sample only a few times, we might miss all the important details.

As a result, we want to make use of something called importance sampling. Simply put, importance sampling is the idea of trying to sample the most “important” parts of a function.

I assume the reader has some familiarity with Monte Carlo estimation. If not, I suggest going to [15]. Their treatment of importance sampling and Monte Carlo estimators is excellent.

To sample the most important parts of our scene, we want to take samples where we expect the most contributions to our image would live. We unfortunately don’t know exactly where that is, but we can guess. And to guess, we can use an approximate function to choose what direction we want to send our ray.

Let’s imagine our basic Lambert BRDF, we know that rays that are closest to our normal will contribute the most light to our illumination. As a result, we can try to send more rays towards our normal’s direction with some falloff. [16]

To select a ray, we want to convert our PDF to a CDF and randomly select a point on our CDF using a uniform random number. [13] This allows us to generate a random number following the desired distribution from a random number with a uniform distribution.

Once we’ve selected our value, we can calculate the PDF and plug it into our Monte Carlo estimator.

A potential downside of selecting an approximate function is that it might not accurately reflect the illumination in the scene, causing us to increase our variance instead of reducing it. [15]

Cranberray uses the BRDF functions as the approximate representation of the illumination in the scene. This is easy to implement and also quite effective, since there tend to be quite a few rays that would contribute effectively nothing at all to the image if we sampled uniformly.

For Lambert, I use the derivation from a previous post [16].

And for GGX I use the derivation found here [13].

static cv3 hemisphere_surface_random_lambert(float r1, float r2)
{
	float theta = acosf(1.0f - 2.0f*r1) * 0.5f;
	float cosTheta = cosf(theta);
	float sinTheta = sinf(theta);
	float phi = cran_tao * r2;

	return (cv3) { sinTheta*cosf(phi), sinTheta*sinf(phi), cosTheta };
}

static float lambert_pdf(cv3 d, cv3 n)
{
	return cv3_dot(d, n) * cran_rpi;
}

static cv3 hemisphere_surface_random_ggx_h(float r1, float r2, float a)
{
	float cosTheta = sqrtf((1.0f-r1)*cf_rcp(r1*(a*a-1.0f)+1.0f));
	float sinTheta = sqrtf(1.0f - cosTheta*cosTheta);
	float phi = cran_tao * r2;
	return (cv3) { sinTheta*cosf(phi), sinTheta*sinf(phi), cosTheta };
}

static float ggx_pdf(float roughness, float hdotn, float vdoth)
{
	float t = hdotn*hdotn*roughness*roughness - (hdotn*hdotn - 1.0f);
	float D = (roughness*roughness)*cf_rcp(t*t)*cran_rpi;
	return D*hdotn*cf_rcp(4.0f*fabsf(vdoth));
}


### Multiple Importance Sampling and blending our BRDFs

With these importance functions, we need a way to select between our diffuse and specular BRDFs and to blend between them. This is where Multiple Importance Sampling comes in.

Multiple Importance Sampling allows us to blend our functions using weights assigned to each PDF. [14]

In cranberray, we use a simple strategy for determining if we select from either our specular BRDF or our diffuse BRDF. We use the Fresnel factor for our geometric normal and select our specular BRDF when our random number is less than the Fresnel factor and our diffuse BRDF in the other case.

float geometricFresnel = cmi_fresnel_schlick(1.0f, microfacetData.refractiveIndex, normal, viewDir);
geometricFresnel = fmaxf(geometricFresnel, microfacetData.specularTint.r * gloss); // Force how much we can reflect at a minimum
float weights[distribution_count] =
{
	[distribution_lambert] = 1.0f - geometricFresnel,
	[distribution_ggx] = geometricFresnel
};

bool reflected = random01f(&context->randomSeed) < weights[distribution_ggx];


Once we’ve selected our BRDF and sampled from it, we want to blend our results with our previous sampling results. This is where multiple importance sampling comes in. (See [17] for an excellent treatment on the topic)

To see why, let’s imagine our scene as a simplified function.

Let’s assume this function has an area of 1.2 units.

Imagine that our function is a combination of 2 simpler functions.

$f(X) = g(X)h(X)$

Let’s imagine that we uniformly at random select one of those functions to drive our importance sampling, and we pick $g(x)$ (with PDF $p_g$).

To get our value using importance sampling, we need to divide by our PDF and the chance of selecting that PDF as our distribution (Which is 0.5 in this example).

If we were to use this single sample as our estimate using $p_g$ as our PDF and the values $g(x) = 0.1$, $h(x) = 2$ and $p_g(x) = 0.1$ we would get $y = \frac{g(x)h(x)}{0.5*p_g(x)} = \frac{0.1*2}{0.5*0.1} = \frac{0.2}{0.05} = 4$ which is a poor estimate of our area of 1.2.

If we were instead to use $p_h$ as our importance function, where $p_h(x) = 1$, we would get $y = \frac{g(x)h(x)}{0.5*p_h(x)} = \frac{0.1*2}{0.5*1} = \frac{0.2}{0.5} = 0.4$ which is a closer estimate of our area of 1.2. (Admittedly, this example is a bit contrived.)

The issue presented here (and in [17]) is that our sample location had a low density ($p_g(x) = 0.1$) while our second function $h(x)$ had a relatively high value, so our estimate spikes, since $2/0.1$ is quite large. It’s important to note that if we were to continue taking samples, our estimator would still converge to 1.2.

Multiple Importance Sampling suggests a method to reduce the overall impact of these spikes to reduce the variance of our estimates by using a set of appropriate weighting functions instead of simply using our PDF directly as we just did.

The one sample model presented in [14] is what we’ll be using for this example, as it’s simpler to think about.

Instead of only dividing by our PDF, we want to apply a weighting function to our calculation.

$F = \frac{w_I(X)f(X)}{c_Ip_I(X)}$

Where $I$ represents the index of our selected PDF, $w_I$ represents the weight assigned to our PDF, $c_I$ represents the probability that we select that PDF and $p_I$ is our PDF function.

In our previous example, we did not have $w_I(X)$. This function does all the magic.

[14] presents the balance heuristic as an option for a weighting function which has the form

$w_i(x) = \frac{c_i p_i(x)}{\sum_k c_k p_k(x)}$

Notice that this is simply a weighted average of our probability density functions.

If we add this function to our single sample model we get

$F = \frac{c_i p_i(x)}{\sum_k c_k p_k(x)} \frac{f(x)}{c_ip_i(x)}$

$F = \frac{f(X)}{\sum_k c_k p_k(x)}$

If we look at our previous example using $p_g(x)$ as our PDF, we can make use of the same values here (notice that our final formulation no longer divides by $p_g(x)$ individually).

$c_g = 0.5$, $c_h = 0.5$, $p_g(x) = 0.1$, $g(x) = 0.1$, $p_h(x) = 1$ and $h(x) = 2$

$y = \frac{g(x)h(x)}{0.5*p_g(x)+0.5*p_h(x)}$

$y = \frac{0.1*2}{0.5*0.1+0.5*1}$

$y = \frac{0.2}{0.55}$

$y \approx 0.36$

Although this is not quite as good as our earlier estimate of 0.4, it is much better than our estimate of 4 when using $p_g(x)$ alone.

We can see that this is unbiased by noticing that

$F(x) = \frac{g(x)h(x)}{c_g p_g(x) + c_h p_h(x)}$

$E[F(x)] = c_g p_g(x) \frac{g(x)h(x)}{c_g p_g(x) + c_h p_h(x)} + c_h p_h(x) \frac{g(x)h(x)}{c_g p_g(x) + c_h p_h(x)}$

$E[F(x)] = (c_g p_g(x) + c_h p_h(x)) \frac{g(x)h(x)}{c_g p_g(x) + c_h p_h(x)}$

$E[F(x)] = g(x)h(x)$

(I took some shortcuts here, I recommend reading [14] for proofs)

An interesting property of the balance heuristic is that our values are bounded.

Notice that since

$\frac{g(x)}{c_g p_g(x) + c_h p_h(x)} \leq \frac{g(x)}{c_g p_g(x)}$

then

$\frac{g(x)h(x)}{c_g p_g(x) + c_h p_h(x)} \leq \frac{g(x)h(x)}{c_g p_g(x)}$

as well as

$\frac{g(x)h(x)}{c_g p_g(x) + c_h p_h(x)} \leq \frac{g(x)h(x)}{c_h p_h(x)}$

The right-hand sides of both inequalities are the formulas we used originally when sampling with either $p_g(x)$ or $p_h(x)$ (where $c_g = 0.5$ and $c_h = 0.5$).

This means our balanced estimate is bounded by the smaller of the two values generated by our individual PDFs. Neat! (Note that with uniform sampling, our value would simply be 0.2.)

All that to say, this is our weighting code:

float sum = 0.00001f;
for (uint32_t i = 0; i < distribution_count; i++)
{
	sum += weights[i] * PDFs[i];
}
weight = cf_rcp(sum);


And that’s it! That was a lot of words for a very little chunk of code.

### Conclusion

This post is definitely a large one. But it was a lot of fun to learn enough about multiple importance sampling and the Lambertian BRDF to at least convey some sense of it in this blog post. I hope you enjoyed it! Next time we’ll be taking a look at the much simpler Obj loading code in cranberray. Until next time, happy coding!

### Future Work

I’d like to make more use of a better importance sampling function for Smith-GGX as seen in [10]. I also wonder if it’s possible to only select from half vectors that reflect on the upper half of the hemisphere instead of requiring us to discard rays that would be generated in the lower half of our sphere.

I would also like to make more effective use of multiple importance sampling in the path tracer as a whole. At this moment, it is only used to combine our lambertian importance function and our GGX-Smith importance function.

## Intro

In our last post, we looked at the SSE optimized BVH traversal of cranberray. BVH construction and traversal are a fairly complicated portion of cranberray; cranberray’s sampling strategy, however, is quite a bit simpler.

Sampling in cranberray is simple. We first split our image into a grid representing every pixel of our image. (Note that a rectangular pixel is a simplifying assumption. [4]) Then we take our pixel box, sample it X times at X different sample points, and average the results. Our sampling strategy determines how we spread out those X different sample points within our pixel rectangle.

The simplest strategy is to select a point at random within our pixel box for every sample. However, this has the undesirable property of potentially introducing clumping, causing us to miss some desirable information. [2][3]

Instead, cranberray makes use of a simple technique called N-Rooks sampling.

## N-Rooks

N-Rooks is what’s known as a low discrepancy sequence. It has the desirable property of avoiding the clumping of samples that random sampling has and avoiding the aliasing issues of regular sampling. We won’t be explaining low discrepancy sequences here, but I refer the interested reader to [2] and [3] for excellent treatments on the subject.

N-Rooks sampling is very simple.

You first generate jittered samples along the diagonal of an X by X box.

And then you shuffle the columns of said box: starting from the leftmost column, randomly swap it with one of the columns to its right (including itself). This is known as the Fisher-Yates shuffle. [6] It has the desirable property of uniformly selecting any of the X! permutations of our columns.

And that’s it! We now sample our scene using these sample points and average their results to get our final output.

In our next post, we’ll be looking at BRDF multiple importance sampling using the balance heuristic as well as the GGX-Smith microfacet specular BRDF. Until then, happy coding!

## Future Work

While writing this, I was considering the possibility of selecting the value of our pixel using a different strategy than the mean filter used with Monte Carlo integration. I would have to do some further digging to see if this has been presented anywhere. Something along the lines of a Gaussian filter or potentially something else entirely.

I would also like to implement other low discrepancy sequences in cranberray if ever time permits. N-Rooks was very simple, it would be interesting to be able to select between different sampling strategies.

## Intro

In our last post, we looked at building our BVH using SAH as our guiding heuristic. Now it’s time to look at how we intersect that BVH.

## High Level

Querying the BVH is actually quite simple. We test our ray against our parent node; if it intersects, we test the ray against both of its children. If the ray intersects either of its children, we test that child’s children. Once we reach a leaf node, we add it to our list of intersected nodes.

This sounds perfect for recursion, but deep recursion carries function call overhead and risks exhausting the stack. Instead, an explicit queue or stack is a more effective way to track the nodes to be tested.

There are three pieces that bring the BVH traversal together. The math library, the AABB intersection code and the (roughly) branchless intersection response code.

We’ve already seen the math library in this post, as a result, let’s take a dive into the intersection routine!

## Ray/AABB Intersection Routine

There are quite a few excellent sources on the ray/AABB intersection routine [1][2][3]. We’ll do a small dive into how we get to the final implementation.

The intersection routine is based on the slab intersection method presented in [3]. We want to define our AABB as a set of 3 slabs in each axis. A slab is simply represented by a pair of planes, the slab being the space in between them.

In order to determine if we’re intersecting the AABB, we want to determine if our ray passes through the shared intersection of each slab.

In order to determine that, we want to determine if our ray enters each slab before it exits any of the other slabs.

A simple approach to this could be:

slabEntryX = intersect(ray, startPlaneX)
slabExitX = intersect(ray, endPlaneX)
slabEntryY = intersect(ray, startPlaneY)
slabExitY = intersect(ray, endPlaneY)
slabEntryZ = intersect(ray, startPlaneZ)
slabExitZ = intersect(ray, endPlaneZ)

latestEntry = max(slabEntryX, slabEntryY, slabEntryZ)
earliestExit = min(slabExitX, slabExitY, slabExitZ)

if(latestEntry is greater than earliestExit)
return missed collision

return hit collision


As a result, our core approach is to find whether any entry value is greater than any of our exit values. If so, we’re not intersecting the AABB.

To intersect a ray and a plane, we need to define our ray as:

$\vec{p} = \vec{a}*t+\vec{b}$

and our plane as

$0 = \vec{n}.\vec{p}-d$

We can then plug our ray function into our plane function to get:

$0 = \vec{n}.(\vec{a}*t+\vec{b})-d$

And solving for $t$

$d = \vec{n}.\vec{a}*t+\vec{n}.\vec{b}$

$d-\vec{n}.\vec{b} = \vec{n}.\vec{a}*t$

$\frac{d-\vec{n}.\vec{b}}{\vec{n}.\vec{a}} = t$

Now that we have $t$, we can calculate it for each plane. Since our planes are axis aligned, the intersection math gets quite simple. Here is an example for our x axis:

$\frac{d-\vec{n}.\vec{b}}{\vec{n}.\vec{a}} = t$

where $n = (1, 0, 0)$

$\frac{d_x-b_x}{a_x} = t$

And repeat for each axis.

Now let’s look at our intersection code:

cv3l rayOLanes = cv3l_replicate(rayO);
cv3l invD = cv3l_replicate(cv3_rcp(rayD));
cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);
cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD);

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger  = cv3l_max(t0s, t1s);

cfl rayMinLane = cfl_replicate(rayMin);
cfl rayMaxLane = cfl_replicate(rayMax);
cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);


This code looks different than our math, but its essence is the same.

In this code we’re testing 4 AABBs against a single ray instead of one ray against one AABB.

Notice that in this line:

cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);


We’re implementing $\frac{d_x-b_x}{a_x} = t$ where $b_x, b_y, b_z$ is rayOLanes, invD is $\frac{1}{a_x}, \frac{1}{a_y}, \frac{1}{a_z}$ and aabbMin is $d_x, d_y, d_z$.

The next few lines:

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger = cv3l_max(t0s, t1s);


Are there to account for the situation where our ray direction is negative. If our ray direction is negative on one axis, t0 and t1 will actually be flipped. As a result, we want to simply make sure that t0 is the smaller one and t1 is the larger one. [1]

And finally,

cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);


These lines check, for each AABB, whether its latest entrance is earlier than its earliest exit. If so, the appropriate bit for that AABB gets set.

That’s it for our intersection routine! Now we can take a look at what we do with these results.

## Intersection Response

The previous intersection code is nicely packaged in the caabb_does_ray_intersect_lanes routine.

In our bvh traversal, the call to it looks like this:

uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);


Where intersections holds the bits of the AABBs that intersect with our ray.

One approach to handling these bits is to simply loop through each AABB, determine if its bit is set, and add its children volumes to the test queue or to the final list of leaf nodes.

However, that’s no fun.

Instead it’s a lot more fun to make this code branchless using some SSE intrinsics!

Fair warning, this code can be complicated if you aren’t used to reading SSE intrinsics.

#define _MM_SHUFFLE_EPI8(i3,i2,i1,i0) _mm_set_epi8(i3*4+3,i3*4+2,i3*4+1,i3*4,i2*4+3,i2*4+2,i2*4+1,i2*4,i1*4+3,i1*4+2,i1*4+1,i1*4,i0*4+3,i0*4+2,i0*4+1,i0*4)
__m128i shuffles[16] =
{
	_MM_SHUFFLE_EPI8(0,0,0,0), _MM_SHUFFLE_EPI8(0,0,0,0), _MM_SHUFFLE_EPI8(0,0,0,1), _MM_SHUFFLE_EPI8(0,0,1,0), // 0000, 0001, 0010, 0011
	_MM_SHUFFLE_EPI8(0,0,0,2), _MM_SHUFFLE_EPI8(0,0,2,0), _MM_SHUFFLE_EPI8(0,0,2,1), _MM_SHUFFLE_EPI8(0,2,1,0), // 0100, 0101, 0110, 0111
	_MM_SHUFFLE_EPI8(0,0,0,3), _MM_SHUFFLE_EPI8(0,0,3,0), _MM_SHUFFLE_EPI8(0,0,3,1), _MM_SHUFFLE_EPI8(0,3,1,0), // 1000, 1001, 1010, 1011
	_MM_SHUFFLE_EPI8(0,0,3,2), _MM_SHUFFLE_EPI8(0,3,2,0), _MM_SHUFFLE_EPI8(0,3,2,1), _MM_SHUFFLE_EPI8(3,2,1,0)  // 1100, 1101, 1110, 1111
};

uint32_t leafLine = bvh->count - bvh->leafCount;
{
__m128i isParent = _mm_cmplt_epi32(queueIndices, _mm_set_epi32(leafLine, leafLine, leafLine, leafLine));

}

union
{
	uint32_t i[4];
	__m128i v;
} indexUnion;

indexUnion.v = childIndices;
for (uint32_t i = 0; i < leafCount; i++)
{
	uint32_t nodeIndex = indexUnion.i[i];
	candidateIter[i] = bvh->jumps[nodeIndex].leaf.index;
}
candidateIter += leafCount;

indexUnion.v = parentIndices;
for (uint32_t i = 0; i < parentCount; i++)
{
	uint32_t nodeIndex = indexUnion.i[i];
	cran_assert(nodeIndex < bvh->count);
	testQueueEnd[i*2] = bvh->jumps[nodeIndex].jumpIndices.left;
	testQueueEnd[i*2 + 1] = bvh->jumps[nodeIndex].jumpIndices.right;
}
testQueueEnd += parentCount * 2;


Now this looks like a lot, so let’s take a look at it piece by piece. As we presented in the previous post, our leaf nodes are stored at the end of our array. This allows us to determine which nodes are leaf nodes simply by comparing their indices against that boundary. The following bit of code does exactly that:

__m128i isParent = _mm_cmplt_epi32(queueIndices, _mm_set_epi32(leafLine, leafLine, leafLine, leafLine));


The first line here simply compares our indices against our “leaf boundary”. The result of this tells us if our index is a parent or a leaf node.

Once we know which ones are our leaf nodes, we convert this to a bitmask representing the elements of the vector.

And finally, we mask out the indices that have no intersections associated with them.

The result of this operation is that we have two bitmasks that represents the parents that have intersected the ray and the leaves that have intersected the ray.

Next we want to turn this bit mask into a series of indices.

uint32_t leafCount = __popcnt(childIndexMask);


These lines are very simple: they tell us how many leaf nodes and how many parent nodes have intersections. The next pair of lines requires a little more explaining.

Right now, our queueIndices vector represents the indices to the AABBs that we’ve tested against our rays.

One approach we could implement is to simply loop through every element, test their bits and read the values if those bits are set.

However, what we can do instead is pack all of our active indices to the front of our vector, allowing us to read only the ones that have reported an intersection. This reduces our branching at the cost of shuffling our data. That’s what _mm_shuffle_epi8 is doing. It’s taking a lookup table (shuffles[]) and determining where our values should be moved in the vector to pack them to the front of the vector.

Since our masks only contain 4 bits, they have only 16 possible values, making for a very manageable lookup table.

Once we have our indices packed, we simply have to read them.

union
{
	uint32_t i[4];
	__m128i v;
} indexUnion;

indexUnion.v = childIndices;
for (uint32_t i = 0; i < leafCount; i++)
{
	uint32_t nodeIndex = indexUnion.i[i];
	candidateIter[i] = bvh->jumps[nodeIndex].leaf.index;
}
candidateIter += leafCount;

indexUnion.v = parentIndices;
for (uint32_t i = 0; i < parentCount; i++)
{
	uint32_t nodeIndex = indexUnion.i[i];
	cran_assert(nodeIndex < bvh->count);
	testQueueEnd[i*2] = bvh->jumps[nodeIndex].jumpIndices.left;
	testQueueEnd[i*2 + 1] = bvh->jumps[nodeIndex].jumpIndices.right;
}
testQueueEnd += parentCount * 2;


These parts are relatively self-explanatory. We read our indices and do the appropriate work depending on whether our values are leaves or parent nodes. To understand why we’re using a union to convert our SSE vector, I recommend looking into type punning. [5]

## Conclusion

That’s it! I don’t really have any future work for this part of cranberray, I’m actually quite happy with the end result.

Next we’ll be looking at the ray tracer’s sampling strategy. Until then, happy coding!