# Cranberry Lair

## Intro

In our last post, we looked at building our BVH using SAH as our guiding heuristic. Now it’s time to look at how we intersect that BVH.

## High Level

Querying the BVH is actually quite simple. We test our ray against our parent node, if it intersects, we test the ray against both of it’s children. If the ray intersects either of it’s children, we test that children’s children. Once we reach a leaf node, we add it to our list of nodes that we’ve intersected.

This sounds perfect for recursion, but recursion never scales particularly well. Instead a queue or stack is a more effective method to represent the nodes to be tested.

There are three pieces that bring the BVH traversal together. The math library, the AABB intersection code and the (roughly) branchless intersection response code.

We’ve already seen the math library in this post, as a result, let’s take a dive into the intersection routine!

## Ray/AABB Intersection Routine

There are quite a few excellent sources on the ray/AABB intersection routine [1][2][3]. We’ll do a small dive into how we get to the final implementation.

The intersection routine is based on the slab intersection method presented in [3]. We want to define our AABB as a set of 3 slabs in each axis. A slab is simply represented by a pair of planes, the slab being the space in between them.

In order to determine if we’re intersecting the AABB, we want to determine if our ray passes through the shared intersection of each slab.

In order to determine that, we want to determine if our ray enters each slab before it exits any of the other slabs.

A simple approach to this could be:

slabEntryX = intersect(ray, startPlaneX)
slabExitX = intersect(ray, endPlaneX)
slabEntryY = intersect(ray, startPlaneY)
slabExitY = intersect(ray, endPlaneY)

if(slabEntryX is greater than slabExitY or slabEntryY is greater than slabExitX)
return missed collision

return hit collision


As a result, our core approach is to find if there is any entry value that is greater than our exit values. And if so, we’re not intersecting the slab.

To intersect a ray and a plane, we need to define our ray as:

$\vec{p} = \vec{a}*t+\vec{b}$

and our plane as

$0 = \vec{n}.\vec{p}-d$

We can then plug our ray function into our plane function to get:

$0 = \vec{n}.(\vec{a}*t+\vec{b})-d$

And solving for $t$

$d = \vec{n}.\vec{a}*t+vec{n}.\vec{b}$

$d-\vec{n}.\vec{b} = vec{n}.\vec{a}*t$

$\frac{d-\vec{n}.\vec{b}}{vec{n}.\vec{a}} = t$

Now that we have $t$ we can calculate it for each plane, since our planes are axis aligned, our intersection math gets quite simple. Here is an example for our x axis.

$\frac{d-\vec{n}.\vec{b}}{vec{n}.\vec{a}} = t$

where $n = (1, 0, 0)$

$\frac{d_x-b_x}{a_x} = t$

And repeat for each axis.

Now let’s look at our intersection code:

cv3l rayOLanes = cv3l_replicate(rayO);
cv3l invD = cv3l_replicate(cv3_rcp(rayD));
cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);
cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD);

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger  = cv3l_max(t0s, t1s);

cfl rayMinLane = cfl_replicate(rayMin);
cfl rayMaxLane = cfl_replicate(rayMax);
cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);


This code looks different than our math, but it’s essence is the same.

In this code we’re testing 4 AABBs against a single ray instead of one ray against one AABB.

Notice that in this line:

cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);


We’re implementing $\frac{d_x-b_x}{a_x} = t$ where $b_x, b_y, b_z$ is rayOLanes, invD is $\frac{1}{a_x}, \frac{1}{a_y}, \frac{1}{a_z}$ and aabbMin is $d_x, d_y, d_z$.

The next few lines:

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger = cv3l_max(t0s, t1s);


Are there to account for the situation where our ray direction is negative. If our ray direction is negative on one axis, t0 and t1 will actually be flipped. As a result, we want to simply make sure that t0 is the smaller one and t1 is the larger one. [1]

And finally,

cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);


These lines are calculating to see if any of our latest entrance is earlier than our earliest exit. If so, cfl_mask will set the appropriate bit for that AABB.

That’s it for our intersection routine! Now we can take a look at what we do with these results.

## Intersection Response

The previous intersection code is nicely packaged in the caabb_does_ray_intersect_lanes routine.

In our bvh traversal, the call to it looks like this:

uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);


Where intersections hold the bits of the AABBs that intersect with our ray.

One approach to handling these bits is to simply loop through each AABB, determine if it’s bit is set and add it’s children volumes to the queue of being tested or to the final list of leaf nodes.

However, that’s no fun.

Instead it’s a lot more fun to make this code branchless using some SSE intrinsics!

Fair warning, this code can be complicated if you aren’t used to reading SSE intrinsics.

#define _MM_SHUFFLE_EPI8(i3,i2,i1,i0) _mm_set_epi8(i3*4+3,i3*4+2,i3*4+1,i3*4,i2*4+3,i2*4+2,i2*4+1,i2*4,i1*4+3,i1*4+2,i1*4+1,i1*4,i0*4+3,i0*4+2,i0*4+1,i0*4)
__m128i shuffles[16] =
{
_MM_SHUFFLE_EPI8(0,0,0,0), _MM_SHUFFLE_EPI8(0,0,0,0), _MM_SHUFFLE_EPI8(0,0,0,1), _MM_SHUFFLE_EPI8(0,0,1,0), // 0000, 0001, 0010, 0011
_MM_SHUFFLE_EPI8(0,0,0,2), _MM_SHUFFLE_EPI8(0,0,2,0), _MM_SHUFFLE_EPI8(0,0,2,1), _MM_SHUFFLE_EPI8(0,2,1,0), // 0100, 0101, 0110, 0111
_MM_SHUFFLE_EPI8(0,0,0,3), _MM_SHUFFLE_EPI8(0,0,3,0), _MM_SHUFFLE_EPI8(0,0,3,1), _MM_SHUFFLE_EPI8(0,3,1,0), // 1000, 1001, 1010, 1011
_MM_SHUFFLE_EPI8(0,0,3,2), _MM_SHUFFLE_EPI8(0,3,2,0), _MM_SHUFFLE_EPI8(0,3,2,1), _MM_SHUFFLE_EPI8(3,2,1,0)  // 1100, 1101, 1110, 1111
};

uint32_t leafLine = bvh->count - bvh->leafCount;
{
__m128i isParent = _mm_cmplt_epi32(queueIndices, _mm_set_epi32(leafLine, leafLine, leafLine, leafLine));

}

union
{
uint32_t i[4];
__m128i v;
} indexUnion;

indexUnion.v = childIndices;
for (uint32_t i = 0; i < leafCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
candidateIter[i] = bvh->jumps[nodeIndex].leaf.index;
}
candidateIter+=leafCount;

indexUnion.v = parentIndices;
for (uint32_t i = 0; i < parentCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
cran_assert(nodeIndex < bvh->count);
testQueueEnd[i*2] = bvh->jumps[nodeIndex].jumpIndices.left;
testQueueEnd[i*2 + 1] = bvh->jumps[nodeIndex].jumpIndices.right;
}
testQueueEnd += parentCount * 2;


Now this looks like a lot, so let’s take a look at it piece by piece. As we presented in the previous post, our leaf nodes are stored at the end of our array. This allows us to determine which nodes are leaf nodes simply by comparing their indices against that boundary. The following bit of code does exactly that:

__m128i isParent = _mm_cmplt_epi32(queueIndices, _mm_set_epi32(leafLine, leafLine, leafLine, leafLine));


The first line here simply compares our indices against our “leaf boundary”. The result of this tells us if our index is a parent or a leaf node.

Once we know which ones are our leaf nodes, we convert this to a bitmask representing the elements of the vector.

And finally, we mask out the indices that have no intersections associated with them.

The result of this operation is that we have two bitmasks that represents the parents that have intersected the ray and the leaves that have intersected the ray.

Next we want to turn this bit mask into a series of indices.

uint32_t leafCount = __popcnt(childIndexMask);


The first two lines are very simple, it tells us how many leaf nodes have intersections and how many parent nodes. The next pair of lines require a little more explaining.

Right now, our queueIndices vector represents the indices to the AABBs that we’ve tested against our rays.

One approach we could implement is to simply loop through every element, test their bits and read the values if those bits are set.

However, what we can do instead is pack all of our active indices to the front of our vector allowing us to read only the ones that have reported an intersection. Reducing our branching at the cost of shuffling our data. That’s what _mm_shuffle_epi8 is doing. It’s taking a lookup table (shuffles[]) and determining where our values should be moved in the vector to pack them to the front of the vector.

Since our masks only contain 4 bits, they have a maximum of 15 values making for a very manageable look up table.

Once we have our indices packed, we simply have to read them.

union
{
uint32_t i[4];
__m128i v;
} indexUnion;

indexUnion.v = childIndices;
for (uint32_t i = 0; i < leafCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
candidateIter[i] = bvh->jumps[nodeIndex].leaf.index;
}
candidateIter+=leafCount;

indexUnion.v = parentIndices;
for (uint32_t i = 0; i < parentCount; i++)
{
uint32_t nodeIndex = indexUnion.i[i];
cran_assert(nodeIndex < bvh->count);
testQueueEnd[i*2] = bvh->jumps[nodeIndex].jumpIndices.left;
testQueueEnd[i*2 + 1] = bvh->jumps[nodeIndex].jumpIndices.right;
}
testQueueEnd += parentCount * 2;


These parts are relatively self explanatory. We’re reading our indices and doing the appropriate work if our values are leaves or parent nodes. To understand why we’re using a union to convert our SSE vector, I recommend looking into type punning. [5]

## Conclusion

That’s it! I don’t really have any future work for this part of cranberray, I’m actually quite happy with the end result.

Next we’ll be looking at the ray tracer’s sampling strategy. Until then, happy coding!

## Intro

In our last post, we looked at our SIMD-friendly math library. In this post, we’ll be taking a look at how cranberray builds its BVH. For a primer to BVHs, I recommend going to PBRT’s treatment on the topic here.

Canberray’s BVH is built recursively as a binary tree. It starts with the top level node containing all nodes and recursively splits it into 2 children at each level. Originally, the BVH was split simply by selecting the left and right halves of the children nodes. The next approach was to split according to the median of a particular axis (typically the largest one), and now it makes use of the surface area heuristic.

## SAH

The Surface Area Heuristic or SAH is a well known strategy to build an efficient BVH that tries to minimize the number of intersection tests applied on a tree.

PBRT has an excellent treatment on the topic [2], and I’ve linked one of the original papers referencing the idea [1]. I will try to instead provide an intuitive picture of why and how SAH works.

The idea behind SAH is to discover an efficient partition of a collection that will try to minimize the cost of traversing the tree for any ray.

The SAH function is presented as:

$C_o = C_{trav} + P_A\sum C_{isect}(a_i) + P_B\sum C_{isect}(b_i)$

This cost model is based on probabilities. $C_{trav}$ is the cost of traversing the parent node to determine if any of the two children nodes intersect the ray. In cranberray, this would be the cost of testing the ray against both children AABBs and determining if it intersects either. $C_{isect}$ is the cost of testing the shape itself against the ray. In cranberray, this would be the cost of testing the triangles against the ray. Finally, $P_A$ and $P_B$ are the probabilities that the bounding volumes containing the children nodes intersect the ray.

Now, well make a slight modification and reformulate $P_A$ and $P_B$ as $P(A|C)$ and $P(B|C)$ where C is our parent node. This states that $P(A|C)$ represents the probability of the event A happening given the event C has already happened. In this context, this means that $P(A|C)$ represents the probability that we intersect our volume A given that we’ve already intersected our parent volume C.

With this change, our function is now:

$C_o = C_{trav} + P(A|C)\sum C_{isect}(a_i) + P(B|C)\sum C_{isect}(b_i)$

Now, let’s imagine that we’re a lonesome ray traveling through our BVH. We arrive at a node and we test both of it’s children to see if we intersect it. This takes us 10 units of time ($C_{trav}$). At this point we would intersect A with a probability of $P(A|C)$ and we would intersect B with a probability of $P(B|C)$. In this example, we only intersect with A, and so we test our ray against it’s contained shapes. Let’s say that A has 10 shapes, as a result, we have to pay the cost of testing the intersection with all 10 shapes. This is $\sum_{i=1}^{10} C_{isect}(a_i)$. Finally, let’s say that the cost of $C_{isect}$ is 15 for every child shape.

This gives us the final actual cost of:

$C_o = 10+sum_{i=1}^{10} 15 = 10+150 = 160$

What our original cost function is doing, is calculating the expected cost of doing this operation for any possible ray.

As a result, if we were to build with these values:

$C_{trav}=10$

$C_{isect}=15$

$P(A|C)=1$

$P(B|C)=0$

$A_n = 10$

$B_n = 0$

Then our expected cost for this node would be:

$C_o = 10 + 1*\sum_{i=1}^{10} 15 + 0*\sum_{i=1}^{0} 15 = 10+150 = 160$

Which is exactly the cost of our example above because if our A node has a probability of 1, we would always experience the scenario above.

Of course, if we were to take some more likely value such as:

$P(A|C)=0.5$

$P(B|C)=0.6$

$A_n = 10$

$B_n = 12$

$C_o = 10 + 0.5*\sum_{i=1}^{10} 15 + 0.6*\sum_{i=1}^{12} 15 = 10+0.5*150+0.6*180 = 193$

Which would be the expected value of our node. Notice that our probabilities can add up to more than or less than 1 because it is reasonable that we might be able to intersect with both volumes or none at all.

Now we need to determine how we want to define $P(A|C)$. This probability is the probability of a random ray intersecting with our volume given that we’ve intersected with our parent volume.

$P(A|C)$ is given as the surface area of our volume A divided by the surface area of our parent volume C. As a result, we can define $P(A|C) = \frac{S_A}{S_C}$ as the probability of intersecting A given that we’ve intersected C.

### Derivation In 2D

In 2 dimensions, this makes sense, as we can image a rectangle, embedded in another rectangle. The probability of a random point being placed in the child rectangle given that it is contained in the parent rectangle is equal to the area of the embedded rectangle divided by the area of the parent rectangle.

Now, instead we’ll imagine uniformally random ray instead of a random line. To select a random ray, we can select a random direction and then select a random ray along the plane defined by that direction.

With this in mind, we want to calculate the probability that our ray will intersect the children volume given that it has intersected the parent volume.

To do this, we want to calculate the average projected area of our shape.

Imagine that we’ve selected our random direction for our ray, we can see that our ray will have intersected our shape if it’s projection onto the plane intersects our random ray.

Now if we add a second volume embedded in the first volume, we can see that the probability of intersecting the first volume would be given by the projected area of the child volume divided by the projected area of the parent volume.

This example only works for a single direction. However, since we’re looking for a probability given any ray direction, we need to calculate the average of this projection across the sphere of all directions.

In 2 dimensions, the projected area of a circle is always it’s diameter. We will use this fact to test our work.

To calculate the average projected area of an arbitrary convex shape[4][5], we first want to split that shape into a series of small discrete patches.

This will be our integration domain.

Once we’ve split our shape into discrete patches, we want to calculate the average projected area of a single patch.

Now we’ll integrate across the circle of directions containing the patch and divide it by the area of our integration. This will effectively give us the average projected area for a single patch of our shape. This works because we will have calculated the projection for every single direction.

$dA_p = \frac{1}{2\pi}\int_{\Omega} |cos\theta| dA d\theta$

Integrating with an absolute value is tricky. Instead due to the symmetry of the problem, we can look at only a quarter of our circle and multiply by 4 and remove our absolute term.

$dA_p = \frac{2}{\pi} dA \int_{0}^{\frac{\pi}{2}} cos\theta d\theta$

Since $\int_{0}^{\frac{\pi}{2}} cos\theta d\theta = 1$ our expression reduces to

$dA_p = \frac{2}{\pi} dA$

Finally, we want to divide our value by 2, this is because we’ve now counted our projected area twice.

$dA_p = \frac{1}{\pi} dA$

Using this formulation, we can calculate the average projected area of our shape by adding all of our average projected areas together.

$A_p = \int_A dA_p$

$A_p = \int_A \frac{1}{\pi} dA$

$A_p = \frac{1}{\pi} A$

And there you have it, the average projected area of a 2 dimensional convex object is $\frac{1}{\pi}$ the area of said object.

Notice that if we apply this to the surface area of a circle with the formula $2\pi r$ we get $A_p = \frac{2\pi r}{\pi} = 2r$ which is our diameter which matches up with our expectation.

### Derivation In 3D

We can take the same steps in 3D.

Calculate the average projection over our sphere of directions:

$dA_p = \frac{1}{4\pi} \int_0^{2\pi} \int_0^{\pi} |cos\theta| sin\theta dA d\theta d\phi$

Integrating across the positive hemisphere and multiplying by 2

$dA_p = \frac{1}{2\pi} dA \int_0^{2\pi} \int_0^{\frac{\pi}{2}} cos\theta sin\theta d\theta d\phi$

Since $\int_0^{\frac{\pi}{2}} cos\theta sin\theta d\theta = \frac{1}{2}$

$dA_p = \frac{1}{2\pi} dA \int_0^{2\pi} \frac{1}{2} d\phi$

$dA_p = \frac{1}{4\pi} dA \int_0^{2\pi} d\phi$

$dA_p = \frac{1}{2} dA$

Finally, dividing by 2 for our double projection

$dA_p = \frac{1}{4} dA$

And plugging into our surface area calculation

$A_p = \int_A dA_p$

$A_p = \frac{1}{4} \int_A dA$

$A_p = \frac{1}{4} A$

### Putting It Together

Finally we can see that our average projected area in 3D is $\frac{1}{4}$ it’s surface area.

To calculate our probability, we simply want to divide our parent’s projected area (C) divided by our child’s projected area (A)

$P(A|C) = \frac{A_p}{C_p}$

$P(A|C) = \frac{0.25*A}{0.25*C}$

$P(A|C) = \frac{A}{C}$

Where A and C is the surface area of our volumes. And voila, that’s how we got that original formula.

## In Depth View

Now that we have all the pieces, we can take a look at the construction of our BVH.

Cranberray builds it’s BVH from the top down. Starting from a containing bounding volume and splitting it in 2 recursively.

Cranberray keeps a queue of bounding volumes to process as a ring buffer. This makes management of front popping and back pushing very simple but makes resizing the queue trickier. As a result, Cranberray simply allocates a large buffer. We could likely allocate the maximum possible number of elements in the ring buffer instead (something along the lines of $2n-1$ where n is the next largest power of 2)

Cranberray then selects the axis with the widest breadth of centroids. The code for this looks like so:

cv2 axisSpan[3];
for (uint32_t axis = 0; axis < 3; axis++)
{
for (uint32_t i = 0; i < count; i++)
{
axisSpan[axis].x = fminf(axisSpan[axis].x, caabb_centroid(start[i].bound, axis));
axisSpan[axis].y = fmaxf(axisSpan[axis].y, caabb_centroid(start[i].bound, axis));
}
}

uint32_t axis;
if (axisSpan[0].y - axisSpan[0].x > axisSpan[1].y - axisSpan[1].x && axisSpan[0].y - axisSpan[0].x > axisSpan[2].y - axisSpan[2].x)
{
axis = 0;
}
else if (axisSpan[1].y - axisSpan[1].x > axisSpan[2].y - axisSpan[2].x)
{
axis = 1;
}
else
{
axis = 2;
}


Once we’ve selected our axis, we split our axis in 12 distinct buckets as. (See PBRT [2] for more info on this approach)

We then calculate the cost of each seperation by adding up all the volumes on the left of the seperation and all the buckets on the right of the seperation.

We then store the cost of each seperation and select the seperation with the minimal cost as our split.

We then continue in this recursion until we’ve run out of items in our queue. (When we’ve partitioned all of our child volumes into leaf nodes).

Finally, our BVH is restructured somewhat for an improvement in memory usage.

Our BVH is stored in 2 arrays, a “jump index” array and a bounds array. This allows us to load the bounds array without having to load the jump indices into memory until we absolutely need them.

We read from our bounds memory much more frequently than our jump memory, as a result, splitting them allows us to make more effective use of our caches.

Our BVH structure looks like this:


typedef struct
{
union
{
struct
{
uint32_t left;
uint32_t right;
} jumpIndices;

struct
{
uint32_t index;
} leaf;
};
} bvh_jump_t;

typedef struct
{
caabb* bounds;
bvh_jump_t* jumps;
uint32_t count;
uint32_t leafCount;
} bvh_t;



The final special format of our BVH is that all the leaf nodes in our tree are stored at the end of our array. This allows us to test if a node is a leaf node by simply comparing the index in our array against the size of our array minus the number of leaves contained in the tree. This allows us to use data that we’ve already loaded into memory instead of requiring us to load extra data to use in our branch introducing a data dependency.

You can find the source for BVH construction here: https://github.com/AlexSabourinDev/cranberries/blob/5fe9c25e1df23d558b7ef8b5475717d9e67a19fc/cranberray.c#L963

We’ll be taking a look at the BVH traversal next. Until then, happy coding!

## Future Work

The primary addition to this BVH construction algorithm would likely be to look into parallelizing it’s construction. [6]

## Intro

This is the first in a series of posts detailing the functionality of my hobby path tracing, we will be looking at the underpinning’s of it all, the math library cranberry_math.

The goals of cranberry_math are simple. Making SIMD accelerated mathematics easier and fun to use. There are already a huge swath of math libraries out there that do things very well (DirectXMath and GLM to name a few), however I decided to write my own for fun!

## Ideology

The basis of cranberry_math is the concept of lanes. If you’re not familiar with SIMD, the concept is relatively simple. You can imagine that a SIMD register (or vector register) is a set of 4 “lanes”, these lanes are generally independent of each other but allow you to perform operations on them using another SIMD register. For the purpose of demonstration, we will write a SIMD register as $v_n (v_0, v_1, v_2)$ and the element of one of those registers as $v_ne_m (v_0e_0)$ where $v_0e_0$ would be element 0 of SIMD register 0.

I.e. if I take $v_0$ and add it to $v_1$ the result is the equivalent of $v_2 = (v_0e_0+v_1e_0, v_0e_1+v_1e_1, v_0e_2+v_1e_2, v_0e_3+v_1e_3)$

On it’s own, this might not be particularly useful. But vector register operations have a number of desirable properties.

This is a highly simplified version of how your processor might implement vector adds.

Now, with this added information, we can see why we would want to use these operations efficiently, we could reduce our program latency by at most 4x! (In reality this is unlikely to happen, but we can get fairly close).

Cranberry_math takes this concept and tries to make the best use of it while still making it approachable to the user.

An intuitive and simple use of these registers would be to store our X,Y,Z components into each component such as $v_0 = (x, y, z, unused)$ and working with them in that manner. In some common operations such as adding vectors, this can give you a nice hypothetical improvement of 3x over your original single element addition. However, we’re still leaving 25% of our optimal performance on the table! This approach also causes some issues with common operations such as the dot product.

Let’s say in our fantasy processor, our dot product operation takes 2 cycles to complete when applied to 1 vector. That means we get a throughput of 2 cycles/dot product. I propose (and has been proposed elsewhere as well) that you instead use each register to store a single component of a vector. Such as $v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3)$, etc. With this in mind, we can achieve the dot product by doing $v_x*v_x+v_y*v_y+v_z*v_z$ which is a total of 5 operations that all take 1 cycle in our fantasy processor for 4 vectors or a throughput of 1.25 cycles/dot product.

For SSE according to the intel intrinsics guide a multiply operation has a latency of 4 cycles (on Skylake), an add has a latency of 4 cycles. As a result a dot product in a fantasy processor with no out of order functionality or multiple issue, our wide dot product would take roughly 20 cycles to complete or $20/4=5$ cycles per dot product. While if we look at the dot product instruction, it has a latency of 11 cycles. (See [1] and [2] for info on reading the intel intrinsics guide values) This is not indicative of real performance numbers. Modern processors are significantly more complex than what I’m implying here.

Another benefit of storing vectors this way, is that it scales trivially to SIMD registers of size 8, or 16! You only need to throw more data at it!

Now, this approach to storing vectors in SIMD registers is not fool proof. It introduces difficulty if we want to add similar components together. (x with x, y with y) within one vector, I however have not found this to be a problem while writing my path tracer.

This isn’t a silver bullet, but I recommend considering adding a path for this type of storage if you can afford it.

## In Depth View

With that out of the way, here are the primary types for our library:

#define cran_lane_count 4
cran_align(16) typedef union
{
float f[cran_lane_count];
__m128 sse;
} cfl;

typedef union
{
struct
{
float x, y, z;
};

} cv3;

typedef struct
{
cfl x;
cfl y;
cfl z;
} cv3l;


The theme of this library is that SIMD registers are referred to as lanes. Every element is expected to stay within it’s respective lane. cfl represents “cranberry float lane”, cv3 represents “cranberry vector 3” and cv3l represents “cranberry vector 3 lanes”. Notice how every component in cv3l has it’s own set of lanes? That is a direct implementation of the explanation that was presented above.

cran_forceinline cv3l cv3l_add(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_sub(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_mul(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_min(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_max(cv3l l, cv3l r);


The API for cv3l is simple, it looks exactly the same as it’s cv3 counterpart. This makes it relatively easy to switch from one to the other simply by loading the appropriate data correctly.

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount);


One function that we should take a particular look at is cv3l_indexed_load. It’s likely you don’t want to store your data in cv3l format. It doesn’t lend itself particularly well to general processing and works much better in batch processing. As a result your data needs to be transformed from x,y,z,x,y,z to x,x,x,y,y,y,z,z,z to facilitate our maximum throughput.

One way to shuffle our data would be to load each element individually and store it in the appropriate locations. However this is surprisingly slow (I forget the exact reasons, a reference for that would be awesome).

$v_0 = (x_0, y_0, z_0, u), v_1 = (x_1, y_1, z_1, u), v_2 = (x_2, y_2, z_2, u), v_3 = (x_3, y_3, z_3, u)$

and then shuffle the values into a set of temporary registers

$v_{xy0} = (x_0, y_0, x_1, y_1), v_{xy1} = (x_2,y_2,x_3,y_3), v_{z0} = (z_0, u, z_1, u), v_{z1} = (z_2, u, z_3, u)$

and finally shuffle them into the final registers

$v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3), v_z =(z_0, z_1, z_2, z_3)$

(There’s a mistake with the arrows for $Z_2$ and $Z_3$ they should be pointed to the 4 register)

Here’s the source for this operation:

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount)
{
for (uint32_t i = 0; i < indexCount && i < cran_lane_count; i++)
{
uint8_t const* vectorData = (uint8_t*)vectors;
}

return (cv3l)
{
.x = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(2, 0, 2, 0))},
.y = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(3, 1, 3, 1))},
.z = {.sse = _mm_shuffle_ps(Z0, Z1, _MM_SHUFFLE(2, 0, 2, 0))}
};
}


With all this, we can see that I can write a vectorized ray/AABB intersection using this new API:

cran_forceinline uint32_t caabb_does_ray_intersect_lanes(cv3 rayO, cv3 rayD, float rayMin, float rayMax, cv3l aabbMin, cv3l aabbMax)
{
cv3l rayOLanes = cv3l_replicate(rayO);
cv3l invD = cv3l_replicate(cv3_rcp(rayD));
cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);
cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD);

cv3l tsmaller = cv3l_min(t0s, t1s);
cv3l tbigger  = cv3l_max(t0s, t1s);

cfl rayMinLane = cfl_replicate(rayMin);
cfl rayMaxLane = cfl_replicate(rayMax);
cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
cfl result = cfl_less(tmin, tmax);
}


That’s the gist of it! Writing vectorized BVH traversal using this API has been a breeze. You can find the rest of the library here: https://github.com/AlexSabourinDev/cranberries/blob/cranberray/cranberry_math.h

Here is the library in use for BVH AABB testing:

uint32_t activeLaneCount = min((uint32_t)(testQueueEnd - testQueueIter), cran_lane_count);
cv3l boundMins = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, min), testQueueIter, activeLaneCount);
cv3l boundMaxs = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, max), testQueueIter, activeLaneCount);
uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);


We’ll be taking a deeper look at the SSE optimized BVH traversal in the future. For now, happy coding!

## Intro

In this post, we’ll be looking at the steps taken to create an optimized transform hierarchy. This is by no means the be all end all of transform hierarchies. If anything, it’s more of a log of considerations that have taken place as the creation of the transform hierarchy took place.

### Recap of part 4

In part 4, we looked at SIMD for computations, more quantization, software prefetching and a bug fix. This time, we’re going to be looking at some more bug fixing, more SIMD and some surprising performance impacts!

### Bug fixes

While I was debugging a new optimization for part 5, I ran into a few bugs in the program…

Farmer and crop removal had a bug in it. When we were removing these elements, we would loop from the start of the removed indices and we would take the last valid element from the back and copy it onto the current index we want removed. The problem with this solution, is that if our last valid element was also to be removed, we would copy it onto the current index. This is a problem, because we would need to remove it! It’s not a valid object anymore! Instead, if we loop from the back of the array, we will only replace our current index with an element that is guaranteed to be valid because the indices are ordered.

Farmer and crop removal also had a bug in it with the second call to simd_moveMaskToIndexMask. Before the fix, the call would only mask out the lower 8 bits from the indexMask. However, this is not correct, because the function itself only works with the lower 8 bits! As a result, the indices would all be the same. In order to modify this, I simply shifted the indexMask by 8 bits to the right.

Before:

simd_moveMaskToIndexMask(indexMask & 0xFF00);


After:

simd_moveMaskToIndexMask((indexMask & 0xFF00) >> 8);


It would probably have been best to take a uint8_t as an argument instead of an unsigned int but I had misunderstood the functionality of the _pdep_u64 intrinsic used in simd_moveMaskToIndexMask.

Finally, there was also a bug in converting the result of a 16 bit compare to a move mask. Because there is no movemask intrinsic for 16 bit integer, I had used _mm256_packs_epi16(a, b) which I believed to convert the 16 bit integers to 8 bit integers at their respective element. However, I had not realized that these actually worked within lanes! The first 8 elements were from a and the next 8 were from b and then the next 8 were from a and then from b. I expected the format to be a, a, b, b. As a result, the movemask would end up incorrect!

Instead, I modified the code to execute the movemask on the 16 bit results and then converted the mask to an 8 bit movemask.

__m256i cmpRes = _mm256_cmpgt_epi16(zeroI, lifetime);


The magic happens in _pext_u32(moveMask, 0x55555555). Because we’re creating a movemask from 16 bit integers and not 8 bit integers, the mask bits will actually double! If the result of our compare was 0xFFFF, 0x0000, 0xFFFF, 0xFFFF then our movemask would be 11001111 which is not correct! We want our movemask to look like 1011. As a result, I used _pext_u32. _pext_u32 will take the bits corresponding set bits in mask (0x55555555) and pack them from least significant bit to most significant bit. This means that we’re taking all our even bits and packing them. Because 0x55555555 in binary is 01010101010101… we’re taking x1x0x1x1 from our movemask and packing them to 1011! (More cool bit tricks can be found here)

As a result to all these changes, the performance of the program degraded. I believe it is because we weren’t adding all the indices to our removal list, and we were also not adding the correct indices! Now we actually loop through all the correct indices and chrome://tracing says that our average ai_tick performance degraded from 1.67ms to 1.87ms. Our average tick is now 2.75ms instead of 2.6ms.

### Bucket farming

At this point we’ve made our program very efficient by improving the speed at which we access and modify our data. But now, it’s gotten quite a bit harder to improve the performance of our algorithm this way.

Instead, we’re going to modify our algorithm. According to VTune, most of our time is spent decrementing timers and moving farmers. We’re going to be tackling the timers.

The timers are currently all decremented at a rate of 16ms per tick, that means at times we can be decrementing 1 million timers per tick! Instead of doing this, we can do something much better, we can group our timers in buckets and only decrement the global bucket timers instead of the timers themselves. Then, in order to retain our fine grained timing, we keep track of which bucket will require fine grained decrementation and we will only decrement that bucket!

Looking at our farm state, we can see that our state decrements a timer between the ranges of 3 seconds and 5 seconds. In order to split this up into buckets, I decided to split these buckets up in 6 buckets where each bucket holds the timers of a specific time range.

We start off with an index indicating which bucket needs to be finely decremented. We then increment our global bucket timer and decrement the timers in the bucket referenced by our index. Once out timer reaches 1s, we reset the timer and advance our index by 1 and modulo by the number of buckets in order to get it to wrap around.

Say our current fine bucket index is 5, then our timer reaches 1s, we advance our index by 1 to 6. Since 6 % 6 is 0, that’s our next fine decrementation bucket. We can guarantee this property because we place the timers in the buckets based on their number of seconds.

int16_t farmerTimer = rand_range(AI_FarmerFarmSpeedMin, AI_FarmerFarmSpeedMax);
uint32_t bucketSecond = (farmerTimer + AI_FarmersFarmBucketTransitionTimer) / AI_TimePrecision;
uint32_t bucketFarmerCount = AI_FarmersFarmHotBucketCounts[bucketIndex];

AI_FarmersFarmHotBuckets[bucketIndex][bucketFarmerCount] = farmerTimer - bucketSecond * AI_TimePrecision;
AI_FarmersFarmHotBucketCounts[bucketIndex]++;


Looking at chrome://tracing indicates that ai_tick now runs at an average of 1.7ms from our original 1.8ms. Interestingly, game_gen_instance_buffer now runs 0.1ms slower. I wonder if this a result of ai_tick not allowing game_gen_instance_buffer to complete it’s work in the background.

Looking at VTune indicates that we’re now 75% back-end bound in ai_tick and we’re now retiring almost 30% of our instructions, very good results from bucketing the farm timers.

Doing the same modification to the search timers doesn’t produce exceptional results, but I will keep this modification in order to keep things consistent.

### More quantization!

Now that our timers are only in the range of 0s to 1s, we can quantize our data even more! Now I’m going to quantize our values to a range of 1s to 10ms. This will cause our precision to drop significantly, but since our timers are for AI, I think this cost is appropriate.

As a result, I changed all the farmer timers to int8_t and change the AI_TimePrecision to 10ms. With this change, chrome://tracing notes that ai_tick now runs at an average of 1.25ms from our 1.7ms! However, once again, game_gen_instance buffer slowed down from 0.9ms to 1.17ms…

Big thanks to Zack Dawson (twitter) for the idea!

### Streaming again?

When we left off for game_gen_instance_buffer, we were mostly using memcpy to copy all of our data from one position buffer to another. At that time, I had decided to use memcpy because memcpy is very fast and the location being written to was not guaranteed to be aligned with the location being read from, restricting me from using _mm256_stream_si256.

As a result, I decided to write the last element of an array multiple times in order to align the write index to a 64 byte boundary. This gave me the opportunity to use the stream intrinsics for our homegrown memcpy:

void simd_streamMemCpy(__m256i* dstWrite, __m256i* srcRead, size_t size)
{
if (size == 0)
{
return;
}

__m256i* dstEnd = dstWrite + (size >> 5);
for (; dstWrite <= dstEnd; dstWrite += 2, srcRead += 2)
{
_mm256_stream_si256(dstWrite, src256);
_mm256_stream_si256(dstWrite + 1, src256_2);
}
}


Which compiles to:

    test    rdx, rdx
je      .LBB0_3
and     rdx, -32
xor     eax, eax
.LBB0_2:
vmovaps ymm0, ymmword ptr [rsi + rax]
vmovntps        ymmword ptr [rdi + rax], ymm0
vmovaps ymm0, ymmword ptr [rsi + rax + 32]
vmovntps        ymmword ptr [rdi + rax + 32], ymm0
lea     rcx, [rdi + rax]
cmp     rcx, rdx
jbe     .LBB0_2
.LBB0_3:
vzeroupper
ret


Taking a look at chrome://tracing shows us that gen_instance_buffer now runs at an average of 1ms instead of 1.1ms, saving us an additional 0.1ms.

And VTune:

Both our slowest functions are now taking less than 1s!

### Can we avoid reading the tile stage?

At first, in order to avoid reading the tile stage and to be able to just store the index without having to access the memory, I thought storing a collection of the unplanted indices might be effective. However, I had missed the obvious that in order to get the index from this collection of indices, I would have to read that memory instead. This added complexity simply slowed down ai_tick by 0.2ms.

As a result, I simply changed the tile from a 32 bit integer, to an 8 bit integer giving speedups in ai_tick but slowing down game_gen_instance_buffer again.

### The return of game_gen_instance_buffer

At this point, game_gen_instance_buffer is running at an average of 1ms per tick. Still slower than the original 0.9ms from part 4. As a result, we’re going to tackle it some more with quite a few modifications.

The first, we changed the type for sprite indices from uint16_t to uint8_t. Since the range of values for these indices is 0 to 11, we have plenty of space for these 11 values in a uint8_t.

The next modification was quite a bit more complex than the uint8_t modification. This change takes into account that the state of the farmers will dictate what sprite index will be used to render the farmers and that all farmers share the same scale.

In order to keep the blog post from becoming all about this optimization, I’m only going to go through the explanation for the scales.

The buffer generation is very simple: I write every scale into the array in the order of tiles, crops, farmers. The number of tiles is constant, thus, we don’t need to worry about it. The number of crops however, is variable. This means that were our farmer scales starts and end is determined by the number of crops rendered that tick.

While rendering, I keep track of a writing index. This index determines where the next sprite instance will be rendered.

Like this:

int writeIndex = 0;

__m256i farmerScale = _mm256_set1_epi16(AI_FarmerScale);

if (Gen_PreviousFarmerScaleStart == 0)
{
Gen_PreviousFarmerScaleEnd = writeIndex + AI_FarmerCount;
Gen_PreviousFarmerScaleEnd += 64 - (AI_FarmerSearchCount % 64);
Gen_PreviousFarmerScaleEnd += 64 - (AI_FarmerMoveCount % 64);

simd_memSetToValue((__m256i*)(buffer->scales + writeIndex), farmerScale, (Gen_PreviousFarmerScaleEnd - writeIndex) * sizeof(uint16_t));
Gen_PreviousFarmerScaleStart = writeIndex;
}
else
{
uint32_t newEnd = writeIndex + AI_FarmerCount;
newEnd += 64 - (AI_FarmerSearchCount % 64);
newEnd += 64 - (AI_FarmerMoveCount % 64);

if (newEnd > Gen_PreviousFarmerScaleEnd)
{
uint32_t extra = Gen_PreviousFarmerScaleEnd % 64;
uint32_t writeLoc = Gen_PreviousFarmerScaleEnd - extra;
simd_memSetToValue((__m256i*)(buffer->scales + writeLoc), farmerScale, (newEnd - Gen_PreviousFarmerScaleEnd + extra) * sizeof(uint16_t));
}

if (writeIndex < Gen_PreviousFarmerScaleStart)
{
simd_memSetToValue((__m256i*)(buffer->scales + writeIndex), farmerScale, (Gen_PreviousFarmerScaleStart - writeIndex) * sizeof(uint16_t));
}

Gen_PreviousFarmerScaleEnd = newEnd;
Gen_PreviousFarmerScaleStart = writeIndex;
}


And that’s it! This change adds quite a bit of complexity to our generation code but the results are worth it!

chrome://tracing indicates that our average game_gen_instance_buffer tick is now 0.72ms, faster than our original performance! ai_tick also runs at an average of 1.2ms and our average tick is now 1.9ms.

VTune shows us that game_gen_instance_buffer is now at

### More bug fixes…

At this point, I realized that I hadn’t ran the game out of profile mode in a bit… I had gotten too enthralled in the performance side of things. And when I ran it… nothing rendered…

It turns out I had made a mistake very early on when I changed the format of Game_InstanceBuffer.

As a refresher, our buffer looks like this:

typedef struct
{
uint8_t spriteIndices[GAME_MAX_INSTANCE_COUNT];
uint16_t scales[GAME_MAX_INSTANCE_COUNT];
uint16_t positionX[GAME_MAX_INSTANCE_COUNT];
uint16_t positionY[GAME_MAX_INSTANCE_COUNT];
} Game_InstanceBuffer;


The code was set up in such a way that I would copy the data needed for all the rendered instances with this call:

sg_update_buffer(Render_DrawState.vertex_buffers[0], Render_InstanceBuffer, (sizeof(uint16_t) * 3 + sizeof(uint8_t)) * renderInstances);


This doesn’t work! If our renderInstances is 1, we’re rendering 7 bytes of data from the sprite indices array, not spriteIndices, scales, positionX and positionY! This caused the GPU to only get a little bit of the data that it needed and the rest was completely missing…

This could be seen as a lesson in testing all aspects of your code…

### One last hoorah

At this point, it was very clear to me that I had to address the movement code. The problem with this, is that we have to access every target position and velocity in order to move the farmer.

One approach that I attempted was to store the future positions of the farmers into N buffers and read from those and only process a fraction of the active move farmers while using the buffered positions for the other fractions. A rough prototype of this approach however showed that this was slower than the simple processing of all the farmers due to needing to touch quite a bit of memory to store the future positions of the farmers.

Another consideration was to bucket the farmer movements by cardinal direction. If the farmers were close enough to a cardinal direction, they would use that direction as an approximation of their velocity. This attempt only managed to provide speedups if a large portion of the farmers were bound to predefined velocities and quickly introduced visual artifacts. As a result, this solution although potentially viable with a large amount of predefined velocities didn’t seem particularly viable for this project.

After these two attempts, I’m going to call it for this project. We made some excellent progress and I learned a ton in the process. I hope you did as well!

### Where are we now?

Looking at chrome://tracing tells us that we’re now at an average tick of around 2ms from our original 42ms. 21 times faster than our original performance! We tackled a lot from memory access patterns, quantization, SIMD, software prefetching and non-temporal stores. As a result however, our program went from ~400 lines to almost 1k lines of code. Our program is now harder to change and harder to read, but code cleanliness was not one of our concerns here.

I had great fun and I hope you did too!

Until next time!

Find the repo on github here!

### Part 3 Recap

In part 3 we looked at quantization, field_crop memory access, dirty draw commands, SIMD for memory and non-temporal stores. In part 4, we’re going to be looking at SIMD for computation, more quantization and software prefetching.

### Back at it again

As always, let’s take a look at our VTune profile.

It seems that we’re fairly back-end bound, however not nearly as much as we used to be. At this point, it’s not entirely clear where we could be improving our performance in ai_tick. VTune tells us that most of our operations are fairly efficient.

As a result, I considered that SIMD might be the best approach to improve the performance of this function. It does a lot of computations and if we can make use of the larger SIMD registers for loading as well, we could be more efficient.

To introduce SIMD, we have quite a lot of setup to do.

Currently, most code that requires positions looks like this:

typedef struct
{
Vec2 tilePos;
} AI_FarmerMveStateHot;


What we need to do, is to split up these structures into an X and Y component array.

Like this:

static float* AI_FarmersMoveHotX = NULL;
static float* AI_FarmersMoveHotY = NULL;


There’s a very simple reason for doing this. We could load our registers with 2 vec2s ([x1, y1, x2, y2]) but then all our following operations are only working on 2 positions. However, by splitting the positions into 2 arrays, we can actually load 4 positions at once and process those elements at the cost of a few more registers ([x1, x2, x3, x4], [y1, y2, y3, y4]).

It also takes away the need to do horizontal adds. (Horizontal adds will add the elements of a vector across a register. For a vector of [x1, y1, x2, y2] it would be x1 + y1 + x2 + y2). We want to avoid doing horizontal adds because their performance is inferior to our usual vector operations. (See here)

Another approach to processing 4 positions at once is to shuffle 4 of our vectors into the X only, Y only ([x1, x2, x3, x4] [y1, y2, y3, y4]) format but this is additional code that we can avoid because we have complete control over the layout of our data.

A few of the changes we had to make were:

Changing the instance buffer to have a positionX and positionY array to facilitate our memcpy in the instance buffer generation.

typedef struct
{
int16_t spriteIndicesAndScales[GAME_MAX_INSTANCE_COUNT * 2];
float positionX[GAME_MAX_INSTANCE_COUNT];
float positionY[GAME_MAX_INSTANCE_COUNT];
} Game_InstanceBuffer;


Removal of the vec2 structure, everything is an array of floats now. Our usage of vec2s is completely erased. Our movement code is now purely floating point math.

float farmerX = AI_FarmersMoveHotX[i];
float farmerY = AI_FarmersMoveHotY[i];
float genFarmerX = AI_FarmersMoveGenX[i];
float genFarmerY = AI_FarmersMoveGenY[i];

float dirVecX = farmerX - genFarmerX;
float dirVecY = farmerY - genFarmerY;
float mag = sqrtf(dirVecX * dirVecX + dirVecY * dirVecY);

float velX = dirVecX * velMag / mag;
float velY = dirVecY * velMag / mag;
AI_FarmersMoveGenX[i] = genFarmerX + velX;
AI_FarmersMoveGenY[i] = genFarmerY + velY;


Finally, another change was the removal of referencing position in tiles and in the crops. This was done because we can determine the position of a tile and crop from it’s tile index. This removes 32 bytes from these structures at the cost of a little more computation.

typedef struct
{
float grown;
uint32_t cropType;
uint32_t tileIndex;
} Field_Crop;

typedef struct
{
Field_Stage stage;
} Field_Tile;


These changes had a better performance impact than I expected.

ai_tick improved by one second from 5.3s to 4.1s and field_tick doubled in performance from 1.5s to 0.7s.

Looking at chrome://tracing, ai_tick improved from 4.4ms to 4.1ms, field tick improved from 0.7ms to 0.5ms and our overall tick improved from 6.6ms to 6.1ms.

Link to this change is here.

### The SIMD-fication

Here is how our movement code started after we split our positions into X and Y arrays:

float farmerX = AI_FarmersMoveHotX[i];
float farmerY = AI_FarmersMoveHotY[i];
float genFarmerX = AI_FarmersMoveGenX[i];
float genFarmerY = AI_FarmersMoveGenY[i];

float dirVecX = farmerX - genFarmerX;
float dirVecY = farmerY - genFarmerY;
float mag = sqrtf(dirVecX * dirVecX + dirVecY * dirVecY);

float velX = dirVecX * velMag / mag;
float velY = dirVecY * velMag / mag;
AI_FarmersMoveGenX[i] = genFarmerX + velX;
AI_FarmersMoveGenY[i] = genFarmerY + velY;

if (velMag > mag)
{
// Move is complete
}


The first step to SIMD-ing this code is the linear algebra:

__m256 farmerX = _mm256_load_ps(AI_FarmersMoveHotX + i);
__m256 farmerY = _mm256_load_ps(AI_FarmersMoveHotY + i);
__m256 genFarmerX = _mm256_load_ps(AI_FarmersMoveGenX + i);
__m256 genFarmerY = _mm256_load_ps(AI_FarmersMoveGenY + i);

__m256 dirVecX = _mm256_sub_ps(farmerX, genFarmerX);
__m256 dirVecY = _mm256_sub_ps(farmerY, genFarmerY);
__m256 rmag = _mm256_rsqrt_ps(_mm256_add_ps(_mm256_mul_ps(dirVecX, dirVecX), _mm256_mul_ps(dirVecY, dirVecY)));

__m256 velX = _mm256_mul_ps(dirVecX, _mm256_mul_ps(velMag, rmag));
__m256 velY = _mm256_mul_ps(dirVecY, _mm256_mul_ps(velMag, rmag));


This looks like a lot, but it maps almost exactly 1 to 1 to our original code. You might notice that instead of using sqrt, we’re using rsqrt. The reason for this is simple, the reciprocal of a square root has a fast approximation that will reduce the precision of our operation but will also improve our performance. rsqrt has a latency of around 5 units, whilst sqrt has a latency of 13. (see the Intel Intrinsics guide and wikipedia)

The next step is more challenging. We now have 8 magnitudes to compare. Before SIMD we could have a simple branch, but now this has become more difficult.

One approach would be to loop through each element and do a single comparison… This is alright, but it’s not very “SIMD”, we’re not processing multiple elements at once anymore.

We’ll be doing our comparison and storing the indices of the magnitudes that passed our comparison. (There’s an excellent collection of slides on doing this from Andreas Fredriksson detailing how to do this here.)

We’re currently using AVX for our movement code, this allows us to process 8 elements at once instead of 4, which means we also get access to additional intrinsics that we would not typically have access to with SSE.

Let’s take a look at the code and break it down line by line:

// Thanks to https://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask for this great algorithm
// Uses 64bit pdep / pext to save a step in unpacking.
{
uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);  // unpack each bit to a byte

const uint64_t identity_indices = 0x0706050403020100;    // the identity shuffle for vpermps, packed to one index per byte

__m128i bytevec = _mm_cvtsi64_si128(wanted_indices);

}

int bitMask = (1 << math_min(previousFarmerMoveCount - i, 8)) - 1;

__m256 cmpRes = _mm256_cmp_ps(rvelMag, rmag, _CMP_LT_OQ);

__m256i indices = _mm256_set_epi32(i, i, i, i, i, i, i, i);

_mm256_storeu_si256((__m256i*)(AI_FarmerRemovalIndices + removedFarmerCount), indices);


First we need to mask out the comparison of elements that aren’t valid farmers. We can create a bit mask of valid farmers like so:

int bitMask = (1 << math_min(previousFarmerMoveCount - i, 8)) - 1;


This line will simply determine how many farmers we’re processing in this iteration. (If we were at index 8 but only had 14 farmers we would get 14 – 8 which would mean we’re processing 6 farmers and we create a mask of the first 6 bits set.

Then, we do our comparison:

__m256 cmpRes = _mm256_cmp_ps(rvelMag, rmag, _CMP_LT_OQ);


This compares our velocity against our magnitude, similarly to how we did it before. When a velocity’s magnitude is higher than our distance, the element in the result register will have all of its bits set. (If the magnitudes are [0, 2, 3, 4] and the velocities are [1, 1, 1, 1], the resulting vector would be [0xFF, 0, 0, 0]) This isn’t of much use to us yet, we need to convert this register into a mask where the bits that passed the comparison are set. (For the previous example, the value would be 1000)

As this point, we want to be able to take this mask and convert it to a collection of packed indices.

We want to pack our indices to the left of a register because we want to be able to store only the indices that we want to remove from the collection of farmers. Currently our mask has a value, maybe 01001100, but what we want is a register with the values i + 3, i + 4, i + 7.

To do this, we call simd_moveMaskToIndexMask which is a modified version of the code found here. This function will return a collection of packed integers as a result of the mask provided. (for the mask 01001100 we would get 3, 4, 7)

Then we can take these values and add them to i to get our removed farmer indices.

__m256i indices = _mm256_set_epi32(i, i, i, i, i, i, i, i);

_mm256_storeu_si256((__m256i*)(AI_FarmerRemovalIndices + removedFarmerCount), indices);


Using the previous example, our resulting vector would be [i + 3, i + 4, i + 7, don’t care, don’t care, don’t care…] We don’t care about the following elements because they didn’t pass the comparison.

And then we move our write index forward by the number of farmers removed:

removedFarmerCount += _mm_popcnt_u32(indexMask); // _mm_popcnt_u32 counts the number of bits set in a u32


Finally we can just loop through our removed farmer indices:

for(uint32_t i = 0; i < removedFarmerCount; i++)
{
int r = AI_FarmerRemovalIndices[i];
}


That’s it! We SIMD-ed our movement.

Let’s take a look at VTune!

We’ve shaved off another second from our ai_tick from 4.1s to 3.2s. chrome://tracing says that our ai_tick is now 3.3ms from 4.1ms. The average tick has improved from 6.1ms to 5.2ms.

You might notice that we’re more back-end bound and our CPI has increased. This is to be expected, SIMD instructions have a higher CPI but this is fine since we’re now processing 8 elements instead of 1 element at a time. As for being more back-end bound, since we’re now processing data quickly, I suspect that we’re not running slow enough to hide the latency of the memory loads. I’m not quite sure how to address this.

Link to this change is here.

### Field_tick is back?

Before I SIMD everything else, there’s one change to be made to field_tick.

The Field_Crop structure looks like this:

typedef struct
{
float grown;
uint32_t cropType;
uint32_t tileIndex;
} Field_Crop;


Looking at this structure, we don’t actually need lifetime and grown! We can use a single float to represent how long the crop has to live and instead of incrementing and comparing against grown, we will decrement and compare against zero. (We will also split the Field_Crop structure, it will simplify our SIMD code)

typedef struct
{
uint32_t cropType;
uint32_t tileIndex;
} Field_Crop;



With this simple change, field_tick has improved from 0.49ms to 0.052ms… I doubt we will be looking at field_tick anytime soon.

Link to this change is here.

### The Great SIMD-fication

Following the field_tick improvement. I SIMD-ed the rest of the game code. chrome://tracing states that ai_tick improved from 3.3ms to 2.8ms. VTune notes that the performance has slightly improved from 3.2s to 2.8s.

Link to this change is here.

### The Useless Optimization

I then went on to try to using the streaming instructions _mm_stream_ps and _mm_stream_load_si128 in an attempt to maybe speed things up by not evicting anything I might need from the cache.

My reasoning behind the performance not improving was that I had no data to evict that I might need in the first place. And since I still had to read the data, I couldn’t have the benefit of being able to write without loading.

### Quantization Again? That’s half the truth

At this point, I couldn’t just “SIMD” everything and get more performance (We already did that!). But I ran into something interesting. AVX has a series of float to half conversion intrinsics that could be used to store our data in half the size of our previous data. (For the curious half is a form of floating point storage that takes up half the size of a typical float (16 bits) more information on them can be found here)

This is very similar to the quantization I implemented in part 3 where we take a range of values and map them to an integer range.

The modifications for this were simple.

I added 2 new macros that did all the work for me on load and store:

#define SIMD_FLOAT_TO_HALF(f) _mm_extract_epi16(_mm_cvtps_ph(_mm_set_ss(f), _MM_FROUND_NO_EXC), 0)


And then simply changed the data types from floats to 16 bit integers for storage.

This change had some beneficial effects on all of the gameplay code. According to chrome://tracing ai_tick went from 2.8ms to 2.6ms, field_tick went from 0.052ms to 0.022ms and game_gen_instance_buffer went from 1.5ms to 0.8ms. On average, the tick improved from 4.4ms to 3.4ms!

Link to this change is here.

### More failures

At this point, I tried a variety of other things to improve performance:

• I attempted to store field tiles in 4 bits instead of 8 bits. This was a detriment to the performance overall. I assume that the load needed to write to the tile might have affected the CPUs ability to send of the write without having to load anything into registers?
• I attempted to store the crop type in the tile index uint8_t instead of storing it as another uint32_t. But this caused the program to also have reduced performance.

### The prefetching chronicles

At this point, it got harder to make things faster. I was out of “tricks” for improving the performance. And so I turned to VTune. According to VTune I was still highly back-end bound, especially in the farmer removal code where farmers are transitioned from state to state.

A potential reason why removal is causing our code to be back-end bound is that we have a layer of indirection that keeps track of the indices that we want to remove from the the farmer list. After processing all the farmers, we then loop through this collection of removed indices and remove the farmers associated to those indices.

The problem with this indirection, is that our memory patterns are slightly irregular and our hardware prefetcher can’t predict where our next farmer will be. As a result, I decided to turn to software prefetching.

### What is software prefetching?

Software prefetching is a mechanism built in to modern CPUs that allows you to hint to your CPU that you would like it to start prefetching some memory at some address for you. It’s similar to hardware prefetching in the sense that it will go gather memory for you, but it differs by allowing you to ask for memory that has a more irregular pattern of fetching.

A potential downside of asking for memory to be prefetched is that it might evict something from the cache that is currently in use or our cache line might be evicted before we even get to it. (A few sources on software prefetching can be found here and here)

Software prefetching is very tricky to get right, and I don’t know all of the intricacies of using it, but it made sense for the removal of farmers.

The change was simple. The code for transitioning a moving farmer to a farming farmer looked like this:

for(uint32_t i = 0; i < removedFarmerCount; i++)
{
int r = SIMD_FarmerRemovalIndices[i];

AI_FarmersFarmHot[AI_FarmerFarmCount + i] = rand_range(AI_FarmerFarmSpeedMin, AI_FarmerFarmSpeedMax);
AI_FarmersFarmCold[AI_FarmerFarmCount + i] = AI_FarmersMoveCold[r];
AI_FarmersFarmGenX[AI_FarmerFarmCount + i] = AI_FarmersMoveGenX[r];
AI_FarmersFarmGenY[AI_FarmerFarmCount + i] = AI_FarmersMoveGenY[r];

AI_FarmersMoveHotX[r] = AI_FarmersMoveHotX[AI_FarmerMoveCount - 1 - i];
AI_FarmersMoveHotY[r] = AI_FarmersMoveHotY[AI_FarmerMoveCount - 1 - i];
AI_FarmersMoveCold[r] = AI_FarmersMoveCold[AI_FarmerMoveCount - 1 - i];
AI_FarmersMoveGenX[r] = AI_FarmersMoveGenX[AI_FarmerMoveCount - 1 - i];
AI_FarmersMoveGenY[r] = AI_FarmersMoveGenY[AI_FarmerMoveCount - 1 - i];
}


And now the code looks like this:

for(uint32_t i = 0; i < removedFarmerCount; i++)
{
int r = SIMD_FarmerRemovalIndices[i];

int nextRemoval = SIMD_FarmerRemovalIndices[i + 2];
_mm_prefetch((const char*)(AI_FarmersMoveCold + nextRemoval), _MM_HINT_T0);
_mm_prefetch((const char*)(AI_FarmersMoveGenX + nextRemoval), _MM_HINT_T0);
_mm_prefetch((const char*)(AI_FarmersMoveGenY + nextRemoval), _MM_HINT_T0);

AI_FarmersFarmHot[AI_FarmerFarmCount + i] = rand_range(AI_FarmerFarmSpeedMin, AI_FarmerFarmSpeedMax);
AI_FarmersFarmCold[AI_FarmerFarmCount + i] = AI_FarmersMoveCold[r];
AI_FarmersFarmGenX[AI_FarmerFarmCount + i] = AI_FarmersMoveGenX[r];
AI_FarmersFarmGenY[AI_FarmerFarmCount + i] = AI_FarmersMoveGenY[r];

AI_FarmersMoveHotX[r] = AI_FarmersMoveHotX[AI_FarmerMoveCount - 1 - i];
AI_FarmersMoveHotY[r] = AI_FarmersMoveHotY[AI_FarmerMoveCount - 1 - i];
AI_FarmersMoveCold[r] = AI_FarmersMoveCold[AI_FarmerMoveCount - 1 - i];
AI_FarmersMoveGenX[r] = AI_FarmersMoveGenX[AI_FarmerMoveCount - 1 - i];
AI_FarmersMoveGenY[r] = AI_FarmersMoveGenY[AI_FarmerMoveCount - 1 - i];
}


Notice the addition of _mm_prefetch? Those are the SSE prefetch instrinsics. They take a hint that notes which cache level you would prefer the memory you are requesting to be in. I decided that it would go to L1 because I’m going to use it very soon.

You might notice that the nextRemoval index is not the next one, but actually the second one after. The reason for this, is that asking for the next one would most likely not complete the prefetch before the next iteration. Through some experimentation, I found that the second one was an effective value for the prefetch.

This change was coupled with the next change. As a result, we won’t take a look at chrome://tracing just yet.

### More quantization?

While looking for some ways to improve the performance of the farmers, I came upon an idea.

If I convert the timers from 16 bit halves to 16 bit integer values, I would be able to load 16 timers instead of 8 at a time. Currently we’re limited to loading 8 at a time because we have to expand them to floats before we can operate on them. If instead, we load and store them as 16 bit integer values, we can use the full power of the 256 registers.

We store these values in integers by multiplying the floating point by 1000 and storing the result in an int16_t. Instead of decrementing floating point, we reduce the precision of our numbers and load our 16 bit integers as a number representing the number of milliseconds left. This means we are limiting our range of precision to 32s and 1ms, which is enough precision for a farmer brain.

With this simple change, our timer code now looks like this:

__m256i farmerSearchTimer = _mm256_load_si256((__m256i*)(AI_FarmersSearchHot + i));
farmerSearchTimer = _mm256_sub_epi16(farmerSearchTimer, delta256);


And that’s it!

With these changes in hand, chrome://tracing says that ai_tick is now 1.78ms from 2.7ms, and crops are now 0.013ms from 0.021ms. This leaves us with an average tick of 2.7ms. Achieving our goal of reaching less than 3ms.

Link to this change is here.

### The bug fix that saved the world

As I was thinking about the program during the week, and trying to think of different ways to improve the performance, I came upon a bug!!

In the index packing code, we were only packing the first 8 indices and forgetting the next 8. This caused our removal code to access uninitialized indices and (spoiler) severely affecting the performance of our ai_tick code!

int indexMask = _mm256_movemask_ps(cmpRes) & bitMask;

__m256i indices = _mm256_set_epi32(i, i, i, i, i, i, i, i);

_mm256_storeu_si256((__m256i*)(AI_FarmerRemovalIndices + removedFarmerCount), indices);


Looking at this code, you’ll notice that __m256i can only hold 8 32 bit integers. However, we’re processing 16 timers. And since we’re telling our removal count that we’re removing up to 16 timers, this causes us to loop over too much memory.

The quick fix for this problem was simply to handle an additional 8 indices:

__m256i indices = _mm256_set_epi32(i, i, i, i, i, i, i, i);

_mm256_storeu_si256((__m256i*)(SIMD_FarmerRemovalIndices + removedFarmerCount), indices);

__m256i next8Indices = _mm256_set_epi32(i + 8, i + 8, i + 8, i + 8, i + 8, i + 8, i + 8, i + 8);

_mm256_storeu_si256((__m256i*)(SIMD_FarmerRemovalIndices + removedFarmerCount), next8Indices);


And the results are a slight improvement in performance. chrome://tracing says that ai_tick now runs at an average of 1.67ms instead of 1.7ms.

Link to this change is here.

### That’s it for part 4!

In this part, we looked at:

• SIMD computations
• Half floating point
• More quantization
• Software prefetching

Our program now runs at an average 2.6ms per tick. Achieving our goal of less than 3ms per tick!

In part 5, we’ll look into wrapping up the series, we’re going to tackle game_gen_instance_buffer’s alignment issues, attempt to use the AVX gather instructions in order to potentially vectorize our removal loop and reducing the amount of timers that we need to decrement every tick!

Until next time!

### What happened again?

In part 2, we looked at branch mispredictions and making effective use of cache memory. We split our farmer structure from a union into a series of smaller structures that allows us to make the most use of our cache line and also significantly reduced our branch mispredictions.

### Let’s jump right in!

Jumping off from our changes in part 2, we’ll launch VTune and see where our performance is going!

Looking at our capture, we can see that game_gen_instance_buffer completely overshadows the rest of our methods. At 77.4% back-end bound, it is severely limited by our memory access patterns. Addressing that will be our first task.

If we take a look at chrome://tracing we also notice that game_gen_instance_buffer runs at an average of 12 ms per tick.

The profiler notes that this chunk of code is largely back-end bound:

if (Field_Tiles[i].crop != NULL)
{
uint32_t cropWriteIndex = writeIndex++;
buffer->instances[cropWriteIndex].spriteIndex = 7.0f + Field_Tiles[i].crop->cropType;
buffer->instances[cropWriteIndex].scale = 2.0f / Field_Width;

buffer->instances[cropWriteIndex].pos[0] = Field_Tiles[i].pos.x;
buffer->instances[cropWriteIndex].pos[1] = Field_Tiles[i].pos.y;
}


More accurately this line is a large contributor to our memory access woes:

buffer->instances[cropWriteIndex].spriteIndex = 7.0f + Field_Tiles[i].crop->cropType;


Why is that? We’ll if we take a look at our Field_Tile structure:

typedef struct
{
Field_Crop* crop;
Field_Stage stage;
Vec2 pos;
} Field_Tile;


Since “crop” is a pointer to a Field_Crop allocated at an arbitrary position on the heap, our hardware prefetchers can’t fetch it for us! (More information on heap allocations using malloc can be found here, See Part 2 for more information about the memory hierarchy)

As a result, we’re going to want to modify how crops are allocated and accessed. Instead of accessing the crops through their tiles, we’re going to move the crops into their own array and loop through the array directly as we generate the rendering information. This will improve our access patterns for the prefetcher and allow our memory to be nicely packed into cache lines.

Following this modification, we might notice that field_tick (inlined into game_tick) might also become faster! This is because instead of indirectly accessing the crops through their tile, we’ll be able to access them linearly in memory.

While implementing this optimization, I noticed that I didn’t correctly address removing the farmer states when they were complete… I forgot to move the iterator index back 1 after the swap which would cause the state that was swapped to not be processed… This was also fixed! …oops…

Another resulting optimization was the conversion of pointers to 32 bit indices, this simplified a lot of the addressing and also reduced the size of the structures on a 64 bit system by 32 bits!

Looking at our Field_Crop and Field_Tile structures we can see that their relationship was inverted:

typedef struct
{
float grown;
uint32_t cropType;
uint32_t tileIndex;
Vec2 pos;
} Field_Crop;

static uint32_t Field_CropCount = 0;
static Field_Crop* Field_Crops = NULL;

typedef struct
{
Field_Stage stage;
Vec2 pos;
} Field_Tile;


Field_Crop now references Field_Tiles through an index instead of Field_Tile holding a pointer to the Field_Crop memory, and our Field_Crop memory is now stored in a nice linear array.

Our Field_Tick code used to look like this:

void field_tick(float delta)
{
for (uint32_t i = 0; i lifetime += delta;
{
Field_Tiles[i].stage = FieldStage_Grown;
}
}
}
}


But it now looks like this:

void field_tick(float delta)
{
for (uint32_t i = 0; i lifetime += delta;

{
Field_Tile* tile = &Field_Tiles[crop->tileIndex];
tile->stage = FieldStage_Grown;

SWAP(Field_Crop, *crop, Field_Crops[Field_CropCount - 1]);
Field_CropCount--;
i--;
}
}
}


As you might notice, instead of looping through our tiles, we’re looping through our crops, which is the data that we want to access the most. Far better!

Our Field_Crop generation code in game_gen_instance_buffer used to look like so:

for (uint32_t i = 0; i instances[writeLoc].spriteIndex = Game_FieldImageTable[Field_Tiles[i].stage];
buffer->instances[writeLoc].scale = 2.0f / Field_Width;
buffer->instances[writeLoc].pos[0] = Field_Tiles[i].pos.x;
buffer->instances[writeLoc].pos[1] = Field_Tiles[i].pos.y;
if (Field_Tiles[i].crop != NULL)
{
uint32_t cropWriteIndex = writeIndex++;
buffer->instances[cropWriteIndex].spriteIndex = 7.0f + Field_Tiles[i].crop->cropType;
buffer->instances[cropWriteIndex].scale = 2.0f / Field_Width;
buffer->instances[cropWriteIndex].pos[0] = Field_Tiles[i].pos.x;
buffer->instances[cropWriteIndex].pos[1] = Field_Tiles[i].pos.y;
}
}


But now it looks like:

for (uint32_t i = 0; i instances[writeLoc].spriteIndex = Game_FieldImageTable[Field_Tiles[i].stage];
buffer->instances[writeLoc].scale = 2.0f / Field_Width;
buffer->instances[writeLoc].pos[0] = Field_Tiles[i].pos.x;
buffer->instances[writeLoc].pos[1] = Field_Tiles[i].pos.y;
}

for (uint32_t i = 0; i instances[cropWriteIndex].spriteIndex = 7.0f + Field_Crops[i].cropType;
buffer->instances[cropWriteIndex].scale = 2.0f / Field_Width;
buffer->instances[cropWriteIndex].pos[0] = Field_Crops[i].pos.x;
buffer->instances[cropWriteIndex].pos[1] = Field_Crops[i].pos.y;
}


The amount of pointer indirection was reduced and now we access our crops in a nice linear fashion.

Taking a look at our results we can see quite a few things. game_gen_instance_buffer has improved by 9 seconds and game_tick (secretly field_tick) has improved by 6 seconds! This was simply by changing our access patterns from memory allocated in random spots on the heap to memory being nicely packed in an array.

When we take a look at chrome://tracing, we can see that our average tick is now 11ms from our 21ms in part 2. field_tick is now 1.2ms from 5ms and game_gen_instance_buffer is now 6ms from 11ms. Excellent results for a relatively simple change.

Link to this change is here.

### Dirty what?

Up next, we’re going to tackle how much data we’re storing.

Taking a look at VTune tells us that we’re heavily store bound! How do we address this? To begin, we know that our Field_Tiles never change position, they only ever change state. As a result, we can store a series of “state” change commands instead of updating the tiles every tick.

I came across this technique when working on my NES version of GO. When drawing on the NES, you only have VBlank to render everything. If you take too long, too bad, you’re going to get major flickering. Instead, one approach is to only draw what has changed during the frame by storing a series of commands indicating a change and that’s exactly what we’re going to do.

Our command structure looks like this:

typedef struct
{
uint32_t writeIndex;
float spriteIndex;
} Field_TileDrawCommand;


And our tile generation code now looks like this:

for (uint32_t i = 0; i instances[command->writeIndex].spriteIndex = command->spriteIndex;
}
Field_TileDrawCommandCount = 0;


As you can see, the tile generation code is much simpler and will touch far less memory.

Looking at our VTune results, you’ll notice that our performance has improved from 11s to 6s! Now our ai_tick and game_gen_instance_buffer functions are even. But we’ll keep tackling game_gen_instance_buffer for this part of the series.

Chrome://tracing tells us that game_gen_instance_buffer averages at 3.3ms from our 6ms and our average tick is now 9ms! That’s much better than our target 16ms. But we’re not done, there’s so much more to do. As a new goal, we’ll try to achieve less than 3ms per tick. Can we do it? I don’t know…

### The crop treatment

We’re now going to do the same draw command optimization to our crops as well.

Our crop command looks like this:

typedef struct
{
uint32_t writeIndex;
float spriteIndex;
Vec2 pos;
} Field_CropDrawCommand;


And our crop generation code looks like:

for (uint32_t i = 0; i instances[command->writeIndex].spriteIndex = command->spriteIndex;
buffer->instances[command->writeIndex].scale = 2.0f / Field_Width;
buffer->instances[command->writeIndex].pos[0] = command->pos.x;
buffer->instances[command->writeIndex].pos[1] = command->pos.y;
}
Field_CropDrawCommandCount = 0;


Similarly to the previous optimization, the commands allow us to only update a few bits of data instead of many. This reduces how much memory we have to load and store which relieves the pressure on our memory systems.

Looking at our VTune profile, we see that we’re now firmly behind ai_tick at 4s. We shaved off another 2s.

chrome://tracing reports that our average game_gen_instance_buffer tick is now 2.7ms instead of 3.3ms and our average tick is now 8.4ms from 9ms.

Link to this change is here.

### The big guns

At this point, we’re going to pull out the big guns to improve the performance of game_gen_instance_buffer. SIMD. We’re now spending a lot of our time storing memory, but we could improve this by using the larger SIMD registers to store our memory.

### A quick background on SIMD

SIMD is a feature present in almost all modern CPUs that allows you to work on multiple data at once by having larger registers than what we’re used to. Instead of 64 bit registers, we now have access to 128 bit registers that can act as “arrays” of 4 32 bit elements, 2 64 bit elements or 16 8 bit elements. These registers can act on 2 of these registers as if you were acting on the individual elements of these 2 arrays.

Say we wanted to add 2 128 bit registers of 4 32 bit elements together, in SSE we would call _mm_add_ps(…). If our registers held the values [0, 1, 2, 3] and [4, 5, 6, 7] respectively, _mm_add_ps would return the values [4, 6, 8, 10].

A few links that explain SIMD and ways to use it can be found here, here and on wikipedia.

### Back to the game!

However, despite it’s use for mathematics, we’ll be using these larger registers for storing memory more effectively. An interesting trick about these registers is that their memory can be loaded and stored in 128 bit chunks. As a result, instead of storing our memory 64 bits at a time, we’ll be storing and loading our memory 128 bits at a time.

If we look at the code for generating AI_FarmerSearchState:

for (uint32_t i = 0; i instances[writeLoc].spriteIndex = FarmerState_Search;
buffer->instances[writeLoc].scale = 0.025f;
}


We see that we’re storing spriteIndex, scale and pos one at a time. The generated assembly looks like this:

 mov         qword ptr [rcx+rsi],r11
mov         edi,dword ptr [r9+rdx*8]
mov         dword ptr [rcx+rsi+8],edi
mov         edi,dword ptr [r9+rdx*8+4]
mov         dword ptr [rcx+rsi+0Ch],edi


All “mov”s to rcx+rsi match up to spriteIndex+scale, pos[0] and pos[1] respectively. Instead, our new code will store 4 sprite indices at once, 4 scales at once and 2 farmer positions at once.

How? We’re going to split our buffer into 3 different buffers!

If we take a look at our buffer structure, it looks like this:

#define GAME_MAX_INSTANCE_COUNT 50000000
typedef struct
{
float spriteIndex;
float scale;
float pos[2];
} Game_Instance;

typedef struct
{
Game_Instance instances[GAME_MAX_INSTANCE_COUNT];
} Game_InstanceBuffer;


We need to split this buffer because we want to be able to load our farmer positions directly from our arrays without having to shuffle things around. If we left things as they were, we would have to add work to load the spriteIndex, scale and position into one register, however if we split the buffers, we’re allowing ourselves to simply load the memory and store it directly using our larger registers.

We’re also going to need to split our positions out of our farmer states into new “Gen” states that will only hold their positions and nothing else in order to be able to copy them without the need to load the rest of the farmer state’s data into memory.

Here is our new instance buffer:

#define GAME_MAX_INSTANCE_COUNT 50000000

typedef struct
{
float spriteIndices[GAME_MAX_INSTANCE_COUNT];
float scales[GAME_MAX_INSTANCE_COUNT];
float positions[GAME_MAX_INSTANCE_COUNT * 2];
} Game_InstanceBuffer;


And our new generation code without SIMD looks like this:

memcpy(&buffer->positions[writeIndex * 2], AI_FarmersSearchGen, sizeof(float) * 2 * AI_FarmerSearchCount);
for (uint32_t i = 0; i spriteIndices[writeLoc] = FarmerState_Search;
buffer->scales[writeLoc] = 0.025f;
}


Notice that we don’t store our positions one by one anymore. Due to the “Gen” states only storing position, this allows us to straight memcpy them into our positions buffer! A great win because memcpy’s performance is much faster than loading and storing one position at a time.

Let’s take a look at VTune to see how this change affected our performance:

Wow! game_gen_instance_buffer is now at 1.827s! …or is it? There’s now a new contender to our perfomance woes, an anonymous function. I suspect that this is the result of the new memcpy calls introduced. As a result, we’ll add it to our game_gen_instance_buffer time which is now 3.5 instead of 4s. Slightly better but not excellent.

chrome://tracing verifies the assumption that the anonymous function is memcpy, as our game_gen_instance_buffer average tick only improved to 2.2ms from 2.7ms.

Link to this change is here.

### It’s time to SIMD! (Or is it?)

Following the split, I implemented the vectorization code for the loading and storing.

Here is our new code:

float search = FarmerState_Search;
__m256 searchIndices = _mm256_set_ps(search, search, search, search, search, search, search, search);
_mm256_storeu_ps(&buffer->spriteIndices[writeIndex], searchIndices);
for (uint32_t i = 8 - writeIndex % 8; i spriteIndices[writeIndex + i], searchIndices);
}


At this point, I was surprised at how little the performance actually improved. And looking back, it’s because the compiler had already SIMD-ed my code! Scale and indices we’re already getting loaded and stored into the array with the movups instruction.

Here is the assembly pre-SIMD:

 movups      xmmword ptr [rsi+rdi*4],xmm0


Here is the assembly post-SIMD:

 vmovaps     ymmword ptr [rsi+rcx*4],ymm0


This is a lesson in looking at the generated assembly and the profiler before making an assumption at how to optimize the code. The change only made game_gen_instance_buffers run at around 0.2ms faster per tick.

As a result, we’ll be reverting the changes back to the non SIMD-ed version for now, as this will keep the code easier to follow.

### The optimization that wasn’t

Another attempt at improving game_gen_instance_buffer was to not call it at all and write into the buffer while processing the AI. The reasoning behind this optimization was that the data was already in memory (at least for the move code) and storing it would be best at this point. However, as a result of this, the Search and Farm code had to access data they didn’t need and the code ended up slightly slower.

### Can we make it smaller?

This next optimization is fairly simple. The scale and sprite indices are stored as 32 bit floats. We don’t need that much space for them. Instead, we’ll store them in 2 16 bit normalized integers.

The idea behind 16 bit normalized integers is simple, instead of storing the values between 0 and the Max value (11 for sprite indices, 1 for scale) we divide the value by the range and multiply by INT16_MAX. This will scale our normalized value to the whole range of int16_t.

The math would look like this:

int16_t toInt16(float v, float min, float max)
{
return ((v - min) / (max - min)) * INT16_MAX;
}


This will allow us to store our sprite index and scale in half the size!

The game instance buffer now looks like this:

typedef struct
{
int16_t spriteIndicesAndScales[GAME_MAX_INSTANCE_COUNT * 2];
float positions[GAME_MAX_INSTANCE_COUNT * 2];
} Game_InstanceBuffer;


And our chrome://tracing results tells us that our game_gen_instance_buffer average is now 1.8ms from 2.2ms.

Looking at VTune confirms this:

game_gen_instance_buffer is now 0.9s instead of 1.8s!

You might think “Why don’t we also do this for positions?”. Well, I did attempt this at first. However, due to the conversion code from float to int16_t in the movement code, the performance actually suffered! ai_tick now ran in 6.9 seconds instead of 4.7s, far worse than we would like.

There’s another cost to this size reduction. Our code is no longer vectorized!

If we take a look at our assembly, we see this:

 lea         ebx,[rcx-3]
mov         word ptr [rsi+rbx*2],0
lea         ebx,[rcx-2]
mov         word ptr [rsi+rbx*2],333h


That’s no good! We don’t need these stores, we can vectorize this much better.

As a result, I re-implemented our vectorization code with the new 16 bit data:

__m128i searchAndScale = _mm_set_epi16(FarmerState_Search, AI_FarmerScale, FarmerState_Search, AI_FarmerScale, FarmerState_Search, AI_FarmerScale, FarmerState_Search, AI_FarmerScale);
for (uint32_t i = 0; i spriteIndicesAndScales[(writeIndex + i) * 2], searchAndScale);
}
writeIndex += AI_FarmerSearchCount;


And our assembly now looks like this:

 vmovups     xmmword ptr [rsi+rdx*2],xmm0


Much better, we’re now storing 8 16 bit values in 1 vmovups instruction.

chrome://tracing now tells us that we’re at 1.6ms per game_gen_instance buffer from 1.8ms and our average tick is now 6.8ms.

And VTune:

Says that we’ve reduced game_gen_instance_buffer another 0.2 seconds.

Link to this change is here.

### Non-Temporal Stores?

Following this optimization I looked into ways to beat memcpy. I was attempting to find some way to improve the performance of game_gen_instance_buffer by replacing memcpy with my own implementation. Despite my best attempts, I failed to beat it. However, I stumbled onto _mm_stream_si128. What’s interesting about _mm_stream_si128 is that it’s a store with a non-temporal memory hint. If you don’t know what that means, don’t worry, I didn’t either. But after some research I have a decent grasp on the concept. When using _mm_store_ps, we have to load the cache line into our main cache. This is problematic, as we don’t need this data in the cache and it will evict data that we might need. Instead, we use _mm_stream_si128 that will actually use it’s own personal cache lines for storing that will not pollute the main cache and (if I understand correctly) not load the memory into cache at all. Instead, it will write directly to memory. (Sources can be found here, here, here and here)

The write code now looks like this:

__m128i searchAndScale = _mm_set_epi16(FarmerState_Search, AI_FarmerScale, FarmerState_Search, AI_FarmerScale, FarmerState_Search, AI_FarmerScale, FarmerState_Search, AI_FarmerScale);
_mm_storeu_si128((__m128i*)&buffer->spriteIndicesAndScales[writeIndex * 2], searchAndScale);
for (uint32_t i = (4 - writeIndex % 4); i spriteIndicesAndScales[(writeIndex + i) * 2], searchAndScale);
}


You might notice the _mm_storeu_si128 and then the _mm_stream_si128. The reason for the unaligned store is because our crops could have unaligned our write index from 16 bytes, so we need to write a few elements unaligned and then write the rest of our aligned memory.

chrome://tracing says that we’re now at an average of 1.5ms for game_gen_instance_buffer and our average tick is now 6.6ms.

VTune says that our game_gen_instance_buffer call is now at 0.4s from 0.7s, excellent! Unfortunately, memcpy is taking up quite a bit of time. We might address this in a future part of the series but for now, we’ve reduced game_gen_instance_buffer from 20s to 2.3s and it’s tick from 12ms to 1.5ms.

Link to this change is here.

### Bringing it all to harvest

That’s it for part 3! We made quite a few changes in this part:

• We improved our memory access for Crops
• We implemented draw commands for crops and tiles
• We split the data from a single buffer into 3 buffers
• We took our data and shrunk it to 2 16 bit integers instead of 2 floats
• We re-SIMD-ed the 16 bit stores
• We used non-temporal stores to improve the performance of our stores

As a result, we’re now at 2.3s for game_gen_instance_buffer instead of 20s! Far better than what we started with.

### Part 4?

In part 4 we’re going to take a look at ai_tick again. This time, we’ll try to improve it with computational usage of SIMD and if we manage to become compute bound, we’ll take a look at the assembly and try to reduce the amount of work done. If we remain back-end bound, we’ll take a look at various attempts to reduce the amount of memory accessed.

Find it on github here!

Until Part 4!

### It begins

In Part 1 we looked at the overall implementation of the farming simulator and ran the baseline benchmark with Chrome://Tracing. In part 1, I skimmed a few changes I made to the original implementation in order to gather more consistent benchmarks. The few changes are:

• Run exactly 1000 frames of the program.
• Force the delta time to 0.016f in order to have consistent AI behaviour even if the benchmark runs quickly or slowly.
• Random is not seeded to time, this will allow the randomness to be consistent across runs of the application.

The program ran at 42ms per tick and 25ms of the ticks were taken up by the AI state machine! Without running VTune, I had a few assumptions as to where the performance of the AI code could be going. Here is the whole AI_Tick function:

Our AI_Farmer structure:

typedef struct
{
Vec2 pos;
AI_FarmerState state;

union
{
struct
{
Vec2 vel;
Field_Tile* tile;
} moveState;

struct
{
float farmTimer;
Field_Tile* tile;
} farmState;

struct
{
float searchTimer;
} searchState;
};
} AI_Farmer;


And our ai_tick code:

for (uint32_t ai = 0; ai < AI_FarmerCount; ++ai)
{
AI_Farmer* farmer = &AI_Farmers[ai];

switch (farmer->state)
{
case FarmerState_Search:
{
farmer->searchState.searchTimer -= delta;
farmer->searchState.searchTimer = math_maxf(farmer->searchState.searchTimer, 0.0f);

if (farmer->searchState.searchTimer <= 0.0f)
{
// Search code...
}

break;
}
case FarmerState_Move:
{
Field_Tile* tile = farmer->moveState.tile;
float currentDist = vec2_mag(vec2_sub(tile->pos, farmer->pos));
Vec2 vel = vec2_mul(farmer->moveState.vel, delta);
vel = vec2_mul(vec2_norm(vel), math_minf(vec2_mag(vel), currentDist));

float dist = vec2_mag(vec2_sub(tile->pos, farmer->pos));
{
// Move code...
}

break;
}
case FarmerState_Farm:
{
farmer->farmState.farmTimer -= delta;
if (farmer->farmState.farmTimer <= 0.0f)
{
// Farm Code...
}
break;
}
}
}


Looking at this code snippet, we can see a few things. Most importantly, branch prediction and cache misses.

### Branch Prediction

That switch statement will hit us hard on branch mispredictions. The state of the farmers will get more and more disjointed as the program execution moves forward, essentially making the state variable random, a branch predictor’s worse nightmare.

A branch predictor is a hardware component of your CPU that guesses at the result of a branch. Why does it need to guess? The answer to that is complex and there are plenty of books and resources that address these topics. I will try my best to summarize the important details.

Modern CPUs use a pipelined approach to executing instructions, this means that they will execute several instructions at once. A great introduction to this topic can be found here. However, there is a problem when one of those instructions is a conditional jump, this conditional jump relies on the execution of a compare instruction. If that compare instruction hasn’t completed yet, the conditional jump has to wait! This stalls our pipeline and our number of overall instructions that were executed is reduced! To address this, CPUs will instead guess at the result of the comparison.

In short, the branch predictor will “guess” at the result of a conditional jump (branch) based on the local and global history of a branch. The CPU will then execute the instructions that follow the correct path through the branch. If the branch predictor guessed wrong, the CPU has to roll back all the work it did and start again from the correct instruction. (A few sources can be found here, here and here) However, if the branch predictor guessed correctly, it’s as if there was never a branch in the first place! (almost) If the branch results are essentially random, the branch predictor will have more misses than we’d like which can impact your performance. (To what extent varies). (A great introduction on branch predictors can be found here.)

### Cache Misses

One good thing to come out of this code is that the farmers are stored in an array and accessed linearly, that’s a win for our memory access speeds!

Why do we care about memory access speeds? Because they’re slow! The ideal case for our memory is to have it in the L1 cache when we’re just about to use it. (A rundown of the memory hierarchy can be found here and here.) Our current implementation facilitates that. Since we’re accessing our farmers linearly, the memory prefetcher will happily prefetch the next farmers. (More information about prefetching can be found here on wikipedia and in the intel optimization guide.) This is great because the latency of waiting for our farmers to be loaded from RAM is hidden by our prefetcher. However, if we look at the size of our AI_Farmer structure we can see that it’s a size of 32 bytes in all cases. At the time of writing, cache lines are generally 64 bytes (Source here) and we’re using up half of it with data that we only use sometimes.

### VTune

With these considerations in place, I ran the program through VTune.

Looking at these sampling results, we can see that the results confirm the chrome://tracing result that ai_tick is taking up a significant portion of our execution time! I was surprised to find that only 6.5% of the execution time of ai_tick was attributed to branch mispredictions. The results do confirm that a significant portion of our execution time is spent waiting on memory.

### Low effort, high reward

We’re going to start addressing some of the problems with our memory access patterns. Digging into the profiling results will show us that a pair of lines has quite a dramatic effect on our performance.

Field_Tile* tile = farmer->moveState.tile;
float currentDist = vec2_mag(vec2_sub(tile->pos, farmer->pos));


Looking at the assembly for this snippet we get

mov         rbx,qword ptr [rbp+rdi+1E84818h]
movss       xmm2,dword ptr [rbx+0Ch]
movss       xmm4,dword ptr [rbx+10h]


VTune tells us that the first movss consumes 49% of our instructions in the backend. Why is that? Well, looking at the C code tells us that we’re dereferencing a tile pointer. If we remember in part 1, we got that tile pointer randomly. There’s no way the memory prefetcher could have guessed that we would access that tile and we’re stuck having to wait for that memory to be loaded into our registers.

There was a very simple solution to this problem that works because we know that the tile won’t be moving. We just cache the tile’s position in our move state instead of accessing the tile directly in our movement code.

struct
{
Vec2 vel;
Vec2 tilePos;
Field_Tile* tile;
} moveState;


This simple change transformed our ai_tick function from being the slowest function to being faster than game_gen_instance_buffer (which we will look at in the future).

Looking at our original snapshot, we used to be 79% backend bound for ai_tick and now we’re 63% and only taking 17s instead of 24s.

A quick run through chrome://tracing and we see that our average tick dropped from 42ms to 30ms! 30ms is 71% of our original 42ms; our ai_tick also dropped from 25ms to 11ms. That’s 44% of our original 25ms! Our ai_tick is now running 2x faster than our original benchmark simply by changing our memory access patterns.

However, there’s a trade-off to this caching. Now our structure has grown by 8 more bytes. It’s now standing at 40 bytes per farmer.

Link to this change is here.

### More effort

If we take a look at the VTune sample, we notice that game_gen_instance buffers has become our slowest function. However, we’ll ignore it for now and keep hacking away at the ai_tick function.

This next optimization will be more work than the previous two. We’re going to split the AI_Farmer structure in 3, one for each state. Once these structures are created, we’re going to store them in seperate buffers. There are many reasons for doing this:

• We will avoid the switch statement altogether, states will be processed together instead of seperately. The best way to avoid branch mispredictions is to not have any branches to predict in the first place.
• We can tightly pack our data instead of having enough space for all of our states. Currently, our farmer takes up the same amount of space no matter how much memory the state might actually need. Splitting them up will allow some states to take up less memory than other states. A win for cache occupancy! We’ll be able to process more farmers with a single cache line.

Our structures now look like this:

typedef struct
{
Vec2 pos;
Vec2 vel;
Vec2 tilePos;
Field_Tile* tile;
} AI_FarmerMoveState;

typedef struct
{
Vec2 pos;
float farmTimer;
Field_Tile* tile;
} AI_FarmerFarmState;

typedef struct
{
Vec2 pos;
float searchTimer;
} AI_FarmerSearchState;


Now we can see that AI_FarmerMoveState takes up 32 bytes, AI_FarmerFarmState takes up 24 bytes and AI_FarmerSearchState takes up 12 bytes, all smaller than our original 40 bytes. A great way to make more use of our cache lines.

You will notice that we’re duplicating data in order to improve our performance. This optimization would not be as straightforward if we had hard memory constraints but thankfully our toy program doesn’t mind at all.

Our new ai_tick function looks roughly like this:

uint32_t previousFarmerMoveCount = AI_FarmerMoveCount;
{
for (uint32_t i = 0; i < AI_FarmerSearchCount; i++)
{

// Search Code...
}
}

{
for (uint32_t i = 0; i < AI_FarmerMoveCount; i++)
{
AI_FarmerMoveState* farmer = &AI_FarmersMove[i];

// Move Code...
}
}

{
for (uint32_t i = 0; i < AI_FarmerFarmCount; i++)
{
AI_FarmerFarmState* farmer = &AI_FarmersFarm[i];

// Farm Code...
}
}


You will notice that there’s no more switch statement! That’s right, it’s completely unnecessary at this point. Our number of branch mispredictions should drop significantly.

Now we can take out our trusty profiler and see that our ai_tick performance has once again improved.

We are still heavily back-end bound however our branch mispredictions dropped to 1.4% from 13% and our CPI (Cycles per instruction) improved as well! Our ai_tick now takes 11s instead of 17s.

If we take a look at chrome://tracing, we see that our average tick transitioned from 30ms to 25ms and our ai_tick transitioned from 11ms to 8ms! Pretty good savings for simple changes like these!

Link to this change is here.

### A song of ice and fire

Now that we split the AI_Farmer structure into multiple structures, we can push this even further. What we want to do next, is split the data between “hot” and “cold” data. The hot data is the data that we use at every iteration (Timers, velocity, target position) and the cold data is data that we only use when a condition is met. (We’ve reached a tile, we’re done farming, we’ve found a tile)

We want to do this in order to improve our cache line usage. If we’re only using 4 bytes of 12 (search) or 24 (farm) in a hot loop,  we’re wasting plenty of cache space for data that we might not even need!

We’re incurring the cost of a cache miss when we reach the cold data as a trade-off for faster access to our hot data.

typedef struct
{
Vec2 pos;
Vec2 vel;
Vec2 tilePos;
} AI_FarmerMoveStateHot;

typedef struct
{
Field_Tile* tile;
} AI_FarmerMoveStateCold;

typedef struct
{
float farmTimer;
} AI_FarmerFarmStateHot;

typedef struct
{
Vec2 pos;
Field_Tile* tile;
} AI_FarmerFarmStateCold;

typedef struct
{
float searchTimer;
} AI_FarmerSearchStateHot;

typedef struct
{
Vec2 pos;
} AI_FarmerSearchStateCold;


This change should give us slight savings by improving our cache line utilization.

Looking at our VTune profile, we see that we only saved 1 second off of our original 11 seconds. Not nearly as much as we saved before, however this change will allow us to leverage SIMD so we’re going to keep it.

Running the program with chrome://tracing we see that our program is still at 25ms per tick but our ai_tick dropped to 7.5ms.

Link to this change is here.

### Cutting out the excess

Finally, we’re going to take a look directly at AI_FarmerMoveStateHot:

typedef struct
{
Vec2 pos;
Vec2 vel;
Vec2 tilePos;
} AI_FarmerMoveStateHot;


Position represents our farmer’s position, velocity represents our farmer’s constant velocity towards the tile and tilePosition represents our target tile’s position. What you might notice if you look back at the original ai_tick code, is that all farmer’s have the same velocity. With a small change to our movement code, we can completely remove the velocity from this structure, saving even more cache memory!

typedef struct
{
Vec2 pos;
Vec2 tilePos;
} AI_FarmerMoveStateHot;


Now, taking a look at our movement code:

Field_Tile* tile = farmer->moveState.tile;
float currentDist = vec2_mag(vec2_sub(tile->pos, farmer->pos));
Vec2 vel = vec2_mul(farmer->moveState.vel, delta);
vel = vec2_mul(vec2_norm(vel), math_minf(vec2_mag(vel), currentDist));

float dist = vec2_mag(vec2_sub(tile->pos, farmer->pos));
{
// Switch to farm state
}


We can simplify this a lot. We have 3 vec2_mag calls and 1 vec2_norm call that internally calls vec2_mag. vec2_mag internally invokes sqrt, most of the time, sqrt isn’t a huge performance bottleneck, however when you’re processing 1+ million sqrts per tick, you start seeing a performance impact.

Vec2 tilePos = farmer->tilePos;

float velMag = AI_FarmerSpeed * delta;
Vec2 dirVec = vec2_sub(tilePos, farmer->pos);
float mag = vec2_mag(dirVec);

Vec2 vel = vec2_mul(dirVec, (1.0f / mag) * velMag);

// If our velocity is larger than our magnitude, we're going to move past the tile
// This is a good enough metric to determine that we've reached the tile
if (velMag > mag)
{
// Hardset our position to the tile, this is better than a tile radius!
farmer->pos = farmer->tilePos;
// Switch to farm state
}


We modified our code to have the same functionality but we’ve simplified the math significantly by combining terms, noticing redudent calculations and removing unnecessary logic. Our movement code has now become significantly simpler than before with only 1 mag call instead of 4 and we’re also using up less of our cache space! Let’s take a look at VTune for our results.

ai_tick is now lower than both game_tick and game_gen_instance_buffer. Running in only 5s instead of 10s from our last change and only 50% back-end bound and our CPI dropped to lower than 1!

Running this through chrome://tracing we see that our average tick is now 21ms and our average ai_tick is 4.5ms!

This is the end of our focus on ai_tick for now. We’ve done a good job in improving it’s performance from 32s to 6s a 5x improvement!

Link to this change is here.

### Phew! What did we do again?

We did 4 things to improve the performance of the ai_tick function:

• Cached the target tile’s position instead of accessing memory that we wouldn’t have loaded in the cache.
• Split the AI_Farmer struct into 3 different structs. One for each state, this improved our branch mispredictions and also improved our cache utilization due to the smaller struct sizes.
• Split the state structs in hot and cold data, hot data is the data that we access every loop iteration, cold data is data that we access infrequently. This is a nice setup for future SIMD work.
• Removed the velocity vector from the movement struct, this reduced the size of that struct by 8 bytes, allowing us to use even more of our cache line for more farmers.

### What does the future look like?

In Part 3, we’re going to take a look at game_gen_instance_buffers. As we saw in our profiling results, it’s dominating our performance at 21s compared to 8s for game_tick and 6s for ai_tick!

A little birdie told me that we were slowed down by both reading and writing memory, so we’re going to look into ways of reducing the cost of writing all that data to memory. Some tricks that we’re going to look at are compression, dirty commands, SIMD and some more data manipulation. It’s going to be fun!

Find it on github here!

Until Part 3!

### Intro

I recently ran a workshop at my local tech meetup about optimizing a small farming simulator that I wrote. Unfortunately, a few people couldn’t make it to the meetup. As a result, I decided to write a blog series on optimizing this small farming simulator in order to share the insights gleamed at the workshop and also in an attempt to push the optimizations even further than the few hours provided by the workshop.

The goal of this series will be to push this simulator to execute in as small a time frame as possible. Follow along and provide your own insights as we work towards our lofty goal.

### What will we be tackling?

We’ll be touching on various performance topics as we optimize this program and I hope that you’ll be able to gather some valuable insights along the way. (Or maybe you’ll provide valuable insights to me! :o). Some of the technologies we’ll be touching on are SIMD, Cache, Branch Prediction, Assembly and more! (Maybe…)

Unfortunately, we won’t be tackling the rendering performance or introducing multi-threading. Not only because this would cause the series to span far longer than I’d like but also because I am not skilled enough in these disciplines to provide valuable insights.

### Code, code and more code

Let’s take a look at the code shall we?

The program is very simple. Farmer’s use a simple state machine that consists of 3 states. Searching, Moving and Farming.

The farmer structure is simple, it consists of a position, a state enum that dictates which state the farmer is currently in and a union that holds the information for each state.

typedef struct
{
Vec2 pos;
AI_FarmerState state;

union
{
struct
{
Vec2 vel;
Field_Tile* tile;
} moveState;

struct
{
float farmTimer;
Field_Tile* tile;
} farmState;

struct
{
float searchTimer;
} searchState;
};
} AI_Farmer;


Search Code:

The search code simply decrements a timer, picks a random field tile and if that tile hasn’t been planted yet, targets it and finally changes to the move state. If the farmer failed to find an un-planted tile, it simply restarts the timer.

farmer->searchState.searchTimer -= delta;
farmer->searchState.searchTimer = math_maxf(farmer->searchState.searchTimer, 0.0f);

if (farmer->searchState.searchTimer <= 0.0f)
{
uint32_t tileIndex = rand_range(0U, Field_Width * Field_Height);
Field_Tile* tile = &Field_Tiles[tileIndex];

if (tile->stage != FieldStage_Planted)
{
farmer->state = FarmerState_Move;
farmer->moveState.tile = tile;

farmer->moveState.vel = vec2_mul(vec2_norm(vec2_sub(tile->pos, farmer->pos)), AI_FarmerSpeed);
}
else
{
farmer->searchState.searchTimer = rand_rangef(AI_FarmerSearchSpeedMin, AI_FarmerSearchSpeedMax);
}
}


Move Code:

The move code is also simple, the farmer will move towards it’s target and once it enters a radius from the tile, it moves to the farm state to begin farming the tile.

Field_Tile* tile = farmer->moveState.tile;
float currentDist = vec2_mag(vec2_sub(tile->pos, farmer->pos));
Vec2 vel = vec2_mul(farmer->moveState.vel, delta);
vel = vec2_mul(vec2_norm(vel), math_minf(vec2_mag(vel), currentDist));

float dist = vec2_mag(vec2_sub(tile->pos, farmer->pos));
{
farmer->state = FarmerState_Farm;
farmer->farmState.farmTimer = rand_rangef(AI_FarmerFarmSpeedMin, AI_FarmerFarmSpeedMax);
farmer->farmState.tile = tile;
}


Farm Code:

The farm code decrements a farming timer. Once the timer is complete, the farmer will determine what to do based on the state of the tile. If the tile is already grown, it will release the crop memory (harvesting). If the next state of the field tile was planted, the farmer will allocate memory for the crop and assign it a random crop type. Finally, the farmer moves back to the search state to find another tile.

farmer->farmState.farmTimer -= delta;
if (farmer->farmState.farmTimer <= 0.0f)
{
Field_Tile* tile = farmer->farmState.tile;

if (tile->stage == FieldStage_Grown)
{
free(tile->crop);
tile->crop = NULL;
}

tile->stage = math_max((tile->stage + 1) % FieldState_Max, FieldStage_Fallow);

if (tile->stage == FieldStage_Planted)
{
tile->crop = (Field_Crop*)malloc(sizeof(Field_Crop));
tile->crop->cropType = rand_range(0, Crop_MaxCropType);
}

farmer->state = FarmerState_Search;
farmer->searchState.searchTimer = rand_rangef(AI_FarmerSearchSpeedMin, AI_FarmerSearchSpeedMax);
}


Finally, crops use a very simple timer to determine if they have completed their growth cycle:

for (uint32_t i = 0; i < Field_Width * Field_Height; ++i)
{
if (Field_Tiles[i].stage == FieldStage_Planted)
{
Field_Crop* crop = Field_Tiles[i].crop;

{
Field_Tiles[i].stage = FieldStage_Grown;
}
}
}


### Technologies

My IDE of choice is Visual Studio 2017 community, however instead of using the Visual Studio Compiler, I will be using Clang with the -Ofast flag as my optimizing compiler. Additionally, I will be using VTune to gather performance metrics as well as my home-grown trace profiling library Mist_Profile that utilizes Chrome://tracing for trace profiling. VTune will be used for our profiling work, but Mist_Profile will allow us to get a feel for where our performance is every frame. The project also makes use of sokol for graphics, windowing and timings, as well as stb_image for well… images. We’ll also be coding in C11 instead of C++. This shouldn’t look too different from simple C++ therefore those who are comfortable with C++ should still feel at home in this series.

My CPU is an Intel Core i7 4702HQ with 4 cores running at 2.20GHZ, 4 x 32kb L1 cache (8 way associative), 4 x 32kb L1 icache (8-way), 4 x 256kb L2 cache (8 way) and 6mb L3 cache. I also have 8gb of DDR3 DRAM.

### What’s next?

Some might already see plenty of optimization opportunities. But before we even determine what needs to be optimized, we’re going to run the program through VTune and determine what we should tackle first. That will be our first step in the next part of this series.

In order to get a benchmark for future work, I executed a quick run of the program. Looking at the trace can determine that our average tick time is 57.676ms. Quite far from an interactive 16ms (without rendering!). Let’s see if we can bring it down to less than 16ms as the series goes on! (JSON Trace here, plug it into Chrome://Tracing to take a look!)

In Part 2 we’ll be running VTune and tackling the AI performance which currently runs at an average of 36.166ms per frame. In order to improve it, we’ll be improving our cache utilization as well as reducing the number of branch mispredictions.

Edit: While working on part 2, I noticed that my laptop was running on power saver mode. This was rectified and I determined that the average tick is of 42ms and the average AI tick is of 25ms! …oops…