Managing C++ Compile Times

Preamble

Jumping into a large project that compiles in 30 minutes, you might be faced with one of two thoughts: either “Wow! This project compiles in only 30 minutes!” or “What?!? This project takes 30 minutes to compile?” Either way, you likely don’t want to be iterating on your changes at a rate of 30 minutes per compilation. Here are a few things you can do to improve your iteration times.

Hack Your Prototypes – Short Iterations

This is likely where you spend 90% of your time when solving a problem. In this phase you might be searching for a bug, testing different solutions, or just investigating the behaviour of the program. In any of these scenarios, you want to iterate quickly without much downtime. Here are a few things you can do:

Don’t modify your headers

Modifying your headers is a surefire way to trigger multiple files to recompile (or, if you’re unlucky, the whole project). As a result, do anything in your power to avoid modifying a header while you’re in this phase.

Work in a single cpp file

Don’t add static variables, private functions, or type declarations to your header. Add utility functions anywhere in your cpp file but not in the header.

Make ugly globals. (It’s fine, I won’t tell)

If you need some sort of static state for a class, consider making it a global in the cpp file instead. This holds even outside of the prototyping phase. Make sure to mark it static; we don’t need pesky external linkage for these variables.
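
A minimal sketch of what that looks like; the file and names here are made up for illustration:

// Widget.cpp, a hypothetical example.
// Instead of declaring a static member in Widget.h (which forces every
// includer to recompile when it changes), keep the state here. Marking
// it static gives it internal linkage, private to this file.
static int s_widgetCount = 0;

void NoteWidgetCreated()
{
	++s_widgetCount;
}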

Manually add a function declaration to your cpp

If you need a function to be available to multiple cpp files, consider simply adding its declaration to the cpp files that need it. This wouldn’t fly for a final submission, but we’re prototyping here.
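
For example (names made up; this is a sketch, not a recommendation for shipped code):

// Consumer.cpp, a hypothetical example.
#include <cstdio>

// Hand-written declaration instead of a shared header; the linker
// resolves it against the definition living in another cpp file.
int WidgetCount();

void PrintWidgetCount()
{
	std::printf("%d widgets\n", WidgetCount());
}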

Compile a single file

Out of habit, you might build the whole solution, or a specific project, to see if you’ve introduced any compile errors. Instead, consider whether you can afford to compile only the single file you’re working in. You can typically afford this when making modifications local to a cpp file. This allows you to avoid dealing with the linker until you absolutely need to. It might even be worthwhile to bind a convenient shortcut for it.

Use a smaller project

Is there a smaller project you can test in instead of the primary projects? If so, consider using it. Your iteration time will improve simply because the project has a lot less to handle.

In the same vein, consider if you can compile just the project your files are in without building the larger projects that are likely to include a large number of files.

Disable optimizations locally

If you’re running an optimized build and the optimizer is removing variables you’d like to investigate, consider disabling optimizations in your file instead of switching the project to a debug configuration. This only requires recompiling the single file instead of the whole project.
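
Most compilers expose a pragma for this. A sketch; double-check the exact form against your compiler’s documentation:

// At the top of the cpp file you're debugging:
#if defined(__clang__)
#pragma clang optimize off
#elif defined(_MSC_VER)
#pragma optimize("", off)  // MSVC: disables optimizations until turned back on
#elif defined(__GNUC__)
#pragma GCC optimize("O0") // GCC: applies to subsequently defined functions
#endif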

Be Nice To The Compiler And Yourself – Long Term Changes

At this point, we’re done iterating; we’ve solidified our solution and now we need to clean it up for submission. At this stage, you’re probably not compiling as frequently, and a lot of your time is spent writing all the necessary bits to link it all together. Decisions made here will impact not only you, but all users of the project. We want our compile times to be short; long compile times impact everyone, and that’s no fun. Here are some things to consider at this stage.

Does this change really need a new file?

If you’re considering creating a new file, consider if it’s really worth it. A new file implies a new header, a new header implies new includes, new includes mean more dependencies, and more dependencies mean more compilation.

This is a balancing act. Adding a new cpp/h pair might actually allow you to reduce compile times by only including the relevant bits without needing to modify a highly used header. But it might also spawn a monstrous chain of includes. It’s a matter of tradeoffs; consider them carefully.

New class != new file

Creating a new class does not imply that a new file should be created. Does your class really need to have a header? Is it used by any other class, or is it more of a utility for a single file? If your class is only in use by a single cpp file, consider putting it in that cpp file. If it needs to become public, we can make that change later. (Your coding standard might disagree here in some contexts; if it does, follow the standard.)
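
A minimal sketch (file and class names are made up); an anonymous namespace keeps the helper class private to the translation unit:

// Parser.cpp, a hypothetical example.
#include <cstddef>
#include <vector>

namespace
{
	// Only used by this file, so it never needs a header. The anonymous
	// namespace keeps it private to this translation unit.
	class TokenBuffer
	{
	public:
		void Push(char c) { m_data.push_back(c); }
		std::size_t Size() const { return m_data.size(); }

	private:
		std::vector<char> m_data;
	};
}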

Don’t add unnecessary includes to your headers

Headers are not your friend. Do everything you can to reduce the number of includes that you have to add to a header. The more headers required, the bigger the compilation chain grows. Consider whether you can forward declare the required types, or whether you’re simply including the header for convenience. A simple rule is to only include a header if you absolutely need it to compile. Avoid “general purpose” headers that include a large chunk of what you need. They’ll eat you alive (and your compile times).
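
A sketch of the forward declaration approach, with hypothetical names:

// Renderer.h, a hypothetical example.
#pragma once

// Forward declarations are enough for references, pointers, and
// function signatures; no #include "Mesh.h" or "Camera.h" needed here.
class Mesh;
class Camera;

class Renderer
{
public:
	void Draw(Mesh const& mesh, Camera const& camera);
};
// Renderer.cpp includes the real Mesh.h and Camera.h instead.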

Do you really need this function to be a private function?

When adding a private function to a class, consider whether it has better use as a utility function embedded in the cpp file. Adding private methods to a header means that other files have to recompile when you modify functionality they had no access to in the first place. It’s bad encapsulation. This is especially true for private static functions and private static variables.
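
A sketch of the swap, with made-up names:

// Widget.cpp, a hypothetical example.
struct Widget
{
	float size;
};

// A file-local helper instead of a private method declared in Widget.h:
// changing its implementation only recompiles this file.
static float ComputeWeight(Widget const& widget)
{
	return widget.size * 0.5f;
}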

Can you remove some includes from that header?

Are you working in a header and notice some unnecessary includes? See if you can remove them! Every little bit helps.

Standard headers are a good target for removal. They tend to be heavily templated, and trimming them can reduce compile times significantly.

If you can forward declare a type instead of including its header in your header, consider doing that.

Use the appropriate include type

Using a relative include such as #include “../../File.h” reduces the load on the preprocessor when the path is correct relative to the including file. Use this format when you reference a file from the same project. Use #include <Project/File.h> when the included file is part of a different project than the current one. Avoid #include “Project/File.h” or #include “../../File.h” when the file is not actually at that location relative to the including file; this causes the preprocessor to search relative to every file in the include chain, which can be quite a bit of work.

Does this need to be a template?

Be very wary of making a template that lives in your header. Templates suffer from a number of problems. Is your template needed by anyone other than your cpp? If not, consider embedding it there. Does your type really need to be a template? Consider whether you have an actual use case for it. If you absolutely need the template, consider de-templatifying the functionality that doesn’t require the generic type and putting that functionality either in a base class or in a set of utility functions.

What’s problematic about templates? Templates require you to write their implementation in your header so that your compiler can instantiate them correctly. This means that any change to the internal functionality of your template requires recompiling all the files that include it. Templates can also be tremendously complex, causing a significant amount of work for your compiler. (Template metaprogramming anyone?) If you need your functionality to be expressed at compile time, consider constexpr; compilers are more efficient at processing it than template recursion. Also consider that templates require your compiler to create a new instantiation for every cpp that uses them. (If one cpp instantiates vector<int> and another also instantiates vector<int>, the compiler has to do the work twice!) You might consider extern templates in this situation, but I would strongly recommend dropping the template altogether if you can.
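
A sketch of the extern template mechanism with a made-up Pool type:

// Pool.h, a hypothetical example.
template <typename T>
struct Pool
{
	T* Allocate() { return new T(); }
};

// Instantiation declaration: every includer is told *not* to
// instantiate Pool<int> itself.
extern template struct Pool<int>;

// Pool.cpp then provides the single explicit instantiation the linker
// will use:
// template struct Pool<int>;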

Tools

There are some tools that can help you improve your compile times in a variety of ways. Here’s a few of them.

Ccache

Ccache is a compilation cache tool that maintains a cache of compilations of your cpp files. This works across full rebuilds, different project versions (maybe you have 2 similar branches), and even across users.

https://ccache.dev/

Compiler Time Tracking Tools

Compilers are starting to incorporate tools to track where time is being spent in compilation. Clang and GCC have “-ftime-report”, and MSVC has Build Insights: https://devblogs.microsoft.com/cppblog/introducing-c-build-insights/

An excellent blog post from Aras Pranckevičius details these more thoroughly https://aras-p.info/blog/2019/01/12/Investigating-compile-times-and-Clang-ftime-report/

Show Include List

MSVC has an option to list everything that’s included in a file. Use it to find out what’s being included and what doesn’t need to be.

https://docs.microsoft.com/en-us/cpp/build/reference/showincludes-list-include-files?view=vs-2019

Clang/GCC have the -M and -H options. https://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Preprocessor-Options.html#Preprocessor-Options

Include What You Use

Include What You Use is a tool that determines which includes can be removed and which can be replaced with forward declarations.

https://include-what-you-use.org/

Links

https://stackoverflow.com/questions/8130602/using-extern-template-c11

https://en.cppreference.com/w/cpp/preprocessor/include

http://virtuallyrandom.com/c-compilation-fixing-it/

https://aras-p.info/blog/2019/01/12/Investigating-compile-times-and-Clang-ftime-report/

https://docs.microsoft.com/en-us/cpp/build/reference/showincludes-list-include-files?view=vs-2019

https://stackoverflow.com/questions/5834778/how-to-tell-where-a-header-file-is-included-from

https://include-what-you-use.org/

https://www.goodreads.com/book/show/1370617.Large_Scale_C_Software_Design


Diary of a Path Tracer – BVH Construction Using SAH

Index: https://cranberryking.com/2020/08/22/diary-of-a-path-tracer-the-beginning/

Intro

In our last post, we looked at our SIMD-friendly math library. In this post, we’ll be taking a look at how cranberray builds its BVH. For a primer on BVHs, I recommend PBRT’s treatment of the topic [2].

Cranberray’s BVH is built recursively as a binary tree. It starts with a top-level node containing everything and recursively splits it into 2 children at each level. Originally, the BVH was split by simply selecting the left and right halves of the child nodes. The next approach was to split according to the median of a particular axis (typically the largest one), and now it makes use of the surface area heuristic.

SAH

The Surface Area Heuristic, or SAH, is a well-known strategy for building an efficient BVH that tries to minimize the number of intersection tests applied to the tree.

PBRT has an excellent treatment on the topic [2], and I’ve linked one of the original papers referencing the idea [1]. I will try to instead provide an intuitive picture of why and how SAH works.

The idea behind SAH is to discover an efficient partition of a collection that will try to minimize the cost of traversing the tree for any ray.

The SAH function is presented as:

C_o = C_{trav} + P_A\sum C_{isect}(a_i) + P_B\sum C_{isect}(b_i)

This cost model is based on probabilities. C_{trav} is the cost of traversing the parent node to determine if any of the two children nodes intersect the ray. In cranberray, this would be the cost of testing the ray against both children AABBs and determining if it intersects either. C_{isect} is the cost of testing the shape itself against the ray. In cranberray, this would be the cost of testing the triangles against the ray. Finally, P_A and P_B are the probabilities that the bounding volumes containing the children nodes intersect the ray.

Now, we’ll make a slight modification and reformulate P_A and P_B as P(A|C) and P(B|C), where C is our parent node. P(A|C) represents the probability of the event A happening given that the event C has already happened. In this context, it is the probability that we intersect our volume A given that we’ve already intersected our parent volume C.

With this change, our function is now:

C_o = C_{trav} + P(A|C)\sum C_{isect}(a_i) + P(B|C)\sum C_{isect}(b_i)

Now, let’s imagine that we’re a lonesome ray traveling through our BVH. We arrive at a node and we test both of its children to see if we intersect them. This takes us 10 units of time (C_{trav}). At this point, we would intersect A with a probability of P(A|C) and B with a probability of P(B|C). In this example, we only intersect A, and so we test our ray against its contained shapes. Let’s say that A has 10 shapes; as a result, we have to pay the cost of testing the intersection with all 10 shapes. This is \sum_{i=1}^{10} C_{isect}(a_i). Finally, let’s say that the cost of C_{isect} is 15 for every child shape.

This gives us the final actual cost of:

C_o = 10+\sum_{i=1}^{10} 15 = 10+150 = 160

What our original cost function is doing, is calculating the expected cost of doing this operation for any possible ray.

As a result, if we were to build with these values:

C_{trav}=10

C_{isect}=15

P(A|C)=1

P(B|C)=0

A_n = 10

B_n = 0

Then our expected cost for this node would be:

C_o = 10 + 1*\sum_{i=1}^{10} 15 + 0*\sum_{i=1}^{0} 15 = 10+150 = 160

Which is exactly the cost of our example above: if our A node has a probability of 1, we always experience that scenario.

Of course, if we were to take some more likely value such as:

P(A|C)=0.5

P(B|C)=0.6

A_n = 10

B_n = 12

C_o = 10 + 0.5*\sum_{i=1}^{10} 15 + 0.6*\sum_{i=1}^{12} 15 = 10+0.5*150+0.6*180 = 193

Which would be the expected cost of our node. Notice that our probabilities can add up to more or less than 1; it is entirely possible to intersect both volumes, or neither.

Now we need to determine how we want to define P(A|C). This probability is the probability of a random ray intersecting with our volume given that we’ve intersected with our parent volume.

P(A|C) is given as the surface area of our volume A divided by the surface area of our parent volume C. As a result, we can define P(A|C) = \frac{S_A}{S_C} as the probability of intersecting A given that we’ve intersected C.

Derivation In 2D

In 2 dimensions this makes sense: imagine a rectangle embedded in another rectangle. The probability of a random point landing in the child rectangle, given that it is contained in the parent rectangle, is equal to the area of the embedded rectangle divided by the area of the parent rectangle.

[Figure: RectangleArea]

Now we’ll imagine a uniformly random ray instead of a random point. To select a random ray, we can select a random direction and then select a random ray along the plane defined by that direction.

[Figure: DirectionSelection]
The plane defines our ray direction and our rays are generated across that line.

With this in mind, we want to calculate the probability that our ray will intersect the children volume given that it has intersected the parent volume.

To do this, we want to calculate the average projected area of our shape.

Imagine that we’ve selected our random direction for our ray; we can see that our ray will have intersected our shape if its projection onto the plane intersects our random ray.

[Figure: VolumeProjection]
Notice that if a ray intersects our volume it also intersects its projection.

Now, if we add a second volume embedded in the first, we can see that the probability of intersecting the embedded volume is given by its projected area divided by the projected area of the parent volume.

[Figure: ChildVolumeProjection]
Notice how our child’s projection is a fraction of our parent’s projection.

This example only works for a single direction. However, since we’re looking for a probability given any ray direction, we need to calculate the average of this projection across the circle of all directions.

In 2 dimensions, the projected area of a circle is always its diameter. We will use this fact to check our work.

To calculate the average projected area of an arbitrary convex shape[4][5], we first want to split that shape into a series of small discrete patches.

[Figure: Integration]
These patches would get smaller and smaller approaching the shape of our circle.

This will be our integration domain.

Once we’ve split our shape into discrete patches, we want to calculate the average projected area of a single patch.

Now we’ll integrate across the circle of directions containing the patch and divide it by the area of our integration. This will effectively give us the average projected area for a single patch of our shape. This works because we will have calculated the projection for every single direction.

[Figure: ProjectionInDifferentDirections]
As we test different directions, we get different planes giving us different projections for the same line.

dA_p = \frac{1}{2\pi}\int_{\Omega} |cos\theta| dA d\theta

Integrating with an absolute value is tricky. Instead, due to the symmetry of the problem, we can integrate over only a quarter of our circle, multiply by 4, and remove the absolute value.

dA_p = \frac{2}{\pi} dA \int_{0}^{\frac{\pi}{2}} cos\theta d\theta

Since \int_{0}^{\frac{\pi}{2}} cos\theta d\theta = 1, our expression reduces to

dA_p = \frac{2}{\pi} dA

Finally, we want to divide our value by 2 because we’ve counted our projected area twice.

[Figure: DualShadow]
Notice how there are 2 valid projections. The projection given the left side of the line and the projection given the right side of the line.

dA_p = \frac{1}{\pi} dA

Using this formulation, we can calculate the average projected area of our shape by adding all of our average projected areas together.

A_p = \int_A dA_p

A_p = \int_A \frac{1}{\pi} dA

A_p = \frac{1}{\pi} A

And there you have it, the average projected area of a 2 dimensional convex object is \frac{1}{\pi} the area of said object.

Notice that if we apply this to the perimeter of a circle, 2\pi r, we get A_p = \frac{2\pi r}{\pi} = 2r, which is our diameter and matches our expectation.

Derivation In 3D

We can take the same steps in 3D.

Calculate the average projection over our sphere of directions:

dA_p = \frac{1}{4\pi} \int_0^{2\pi} \int_0^{\pi} |cos\theta| sin\theta dA d\theta d\phi

Integrating across the positive hemisphere and multiplying by 2

dA_p = \frac{1}{2\pi} dA \int_0^{2\pi} \int_0^{\frac{\pi}{2}} cos\theta sin\theta d\theta d\phi

Since \int_0^{\frac{\pi}{2}} cos\theta sin\theta d\theta = \frac{1}{2}

dA_p = \frac{1}{2\pi} dA \int_0^{2\pi} \frac{1}{2} d\phi

dA_p = \frac{1}{4\pi} dA \int_0^{2\pi} d\phi

dA_p = \frac{1}{2} dA

Finally, dividing by 2 for our double projection

dA_p = \frac{1}{4} dA

And plugging into our surface area calculation

A_p = \int_A dA_p

A_p = \frac{1}{4} \int_A dA

A_p = \frac{1}{4} A

Putting It Together

Finally, we can see that the average projected area of a 3-dimensional convex object is \frac{1}{4} its surface area.

To calculate our probability, we simply divide our child’s projected area (A) by our parent’s projected area (C):

P(A|C) = \frac{A_p}{C_p}

P(A|C) = \frac{0.25*A}{0.25*C}

P(A|C) = \frac{A}{C}

Where A and C are the surface areas of our volumes. And voilà, that’s how we got that original formula.

In Depth View

Now that we have all the pieces, we can take a look at the construction of our BVH.

Cranberray builds its BVH from the top down, starting from a containing bounding volume and splitting it in 2 recursively.

Cranberray keeps a queue of bounding volumes to process as a ring buffer. This makes popping from the front and pushing to the back very simple, but makes resizing the queue trickier. As a result, Cranberray simply allocates a large buffer. We could likely allocate the maximum possible number of elements in the ring buffer instead (something along the lines of 2n-1, where n is the next largest power of 2).
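
A minimal sketch of that kind of fixed-capacity ring buffer (hypothetical; not cranberray’s exact code):

#include <cstdint>

// Fixed-capacity ring buffer of node indices. A power-of-two capacity
// turns the wrap-around into a cheap mask instead of a modulo. This
// sketch assumes we never hold more than 1024 unprocessed items.
struct WorkQueue
{
	uint32_t items[1024];
	uint32_t head = 0; // next slot to pop from
	uint32_t tail = 0; // next slot to push to

	void Push(uint32_t v) { items[tail++ & 1023] = v; }
	uint32_t Pop() { return items[head++ & 1023]; }
	bool Empty() const { return head == tail; }
};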

Cranberray then selects the axis with the widest breadth of centroids. The code for this looks like so:

// Track the centroid extents along each axis (x holds the min, y the max).
// (Requires float.h for FLT_MAX.)
cv2 axisSpan[3];
for (uint32_t axis = 0; axis < 3; axis++)
{
	// Start with an inverted span so the first centroid initializes it.
	axisSpan[axis].x = FLT_MAX;
	axisSpan[axis].y = -FLT_MAX;
	for (uint32_t i = 0; i < count; i++)
	{
		axisSpan[axis].x = fminf(axisSpan[axis].x, caabb_centroid(start[i].bound, axis));
		axisSpan[axis].y = fmaxf(axisSpan[axis].y, caabb_centroid(start[i].bound, axis));
	}
}

// Pick the axis with the widest centroid span.
uint32_t axis;
if (axisSpan[0].y - axisSpan[0].x > axisSpan[1].y - axisSpan[1].x && axisSpan[0].y - axisSpan[0].x > axisSpan[2].y - axisSpan[2].x)
{
	axis = 0;
}
else if (axisSpan[1].y - axisSpan[1].x > axisSpan[2].y - axisSpan[2].x)
{
	axis = 1;
}
else
{
	axis = 2;
}

Once we’ve selected our axis, we split it into 12 distinct buckets. (See PBRT [2] for more info on this approach.)

We then calculate the cost of each separation by adding up the cost of all the volumes to the left of the split and all the volumes to the right of it.

[Figure: BucketSelection]
Here we’ve selected the first bucket as our splitting bucket and calculated the cost of this split.

We then store the cost of each separation and select the separation with the minimal cost as our split.
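
A self-contained sketch of that cost evaluation (names are mine; see PBRT [2] for the real thing):

#include <cfloat>
#include <cstdint>

// Evaluates C_o = C_trav + P(L|C)*N_L*C_isect + P(R|C)*N_R*C_isect for
// every split and returns the cheapest one. leftArea[i] and rightArea[i]
// are assumed to hold the surface area of the merged bounds of buckets
// [0, i) and [i, bucketCount) respectively.
uint32_t PickSplit(
	uint32_t const* bucketShapeCounts, float const* leftArea, float const* rightArea,
	uint32_t bucketCount, float parentArea, float travCost, float isectCost)
{
	float bestCost = FLT_MAX;
	uint32_t bestSplit = 1;
	for (uint32_t split = 1; split < bucketCount; split++)
	{
		uint32_t leftCount = 0, rightCount = 0;
		for (uint32_t i = 0; i < split; i++) leftCount += bucketShapeCounts[i];
		for (uint32_t i = split; i < bucketCount; i++) rightCount += bucketShapeCounts[i];

		// Surface area ratios stand in for P(L|C) and P(R|C).
		float cost = travCost
			+ (leftArea[split] / parentArea) * (float)leftCount * isectCost
			+ (rightArea[split] / parentArea) * (float)rightCount * isectCost;
		if (cost < bestCost)
		{
			bestCost = cost;
			bestSplit = split;
		}
	}
	return bestSplit;
}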

We continue this recursion until we’ve run out of items in our queue (when we’ve partitioned all of our child volumes into leaf nodes).

Finally, our BVH is restructured somewhat for an improvement in memory usage.

Our BVH is stored in 2 arrays, a “jump index” array and a bounds array. This allows us to load the bounds array without having to load the jump indices into memory until we absolutely need them.

We read from our bounds memory much more frequently than our jump memory, as a result, splitting them allows us to make more effective use of our caches.

Our BVH structure looks like this:


typedef struct
{
	union
	{
		// Internal node: indices of the two children.
		struct
		{
			uint32_t left;
			uint32_t right;
		} jumpIndices;

		// Leaf node: index of the primitives it references.
		struct
		{
			uint32_t index;
		} leaf;
	};
} bvh_jump_t;

typedef struct
{
	caabb* bounds;     // hot: read on every traversal step
	bvh_jump_t* jumps; // cold: only read once a bound test passes
	uint32_t count;
	uint32_t leafCount;
} bvh_t;

The final special property of our BVH is that all the leaf nodes in our tree are stored at the end of our node array. This allows us to test whether a node is a leaf by simply comparing its index against the size of the array minus the number of leaves in the tree. We get to use data that we’ve already loaded instead of loading extra data for the branch and introducing a data dependency.
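
As a sketch (helper and parameter names are mine, mirroring the bvh_t fields above):

#include <cstdint>

// Leaf nodes are packed at the end of the node array, so a node is a
// leaf exactly when its index lands in the last leafCount slots. This
// reuses values we already have on hand; no extra memory is touched.
bool BvhIsLeaf(uint32_t nodeIndex, uint32_t nodeCount, uint32_t leafCount)
{
	return nodeIndex >= nodeCount - leafCount;
}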

You can find the source for BVH construction here: https://github.com/AlexSabourinDev/cranberries/blob/5fe9c25e1df23d558b7ef8b5475717d9e67a19fc/cranberray.c#L963

We’ll be taking a look at the BVH traversal next. Until then, happy coding!

Future Work

The primary addition to this BVH construction algorithm would likely be to parallelize its construction. [6]

References

[1] https://authors.library.caltech.edu/79167/1/04057175.pdf

[2] http://www.pbr-book.org/3ed-2018/Primitives_and_Intersection_Acceleration/Bounding_Volume_Hierarchies.html

[3] http://mathforum.org/library/drmath/view/62924.html

[4] https://arxiv.org/pdf/1109.0595.pdf

[5] https://math.stackexchange.com/questions/3222317/average-area-of-the-shadow-of-a-convex-shape

[6] https://meistdan.github.io/publications/phr/paper.pdf

Diary of a Path Tracer – Math Library

Index: https://cranberryking.com/2020/08/22/diary-of-a-path-tracer-the-beginning/

Intro

This is the first in a series of posts detailing the functionality of my hobby path tracer. We will be looking at the underpinnings of it all: the math library, cranberry_math.

The goals of cranberry_math are simple: make SIMD-accelerated mathematics easier and fun to use. There is already a huge swath of math libraries out there that do things very well (DirectXMath and GLM, to name a few); however, I decided to write my own for fun!

Ideology

The basis of cranberry_math is the concept of lanes. If you’re not familiar with SIMD, the concept is relatively simple: you can imagine that a SIMD register (or vector register) is a set of 4 “lanes”. These lanes are generally independent of each other, but you can operate on them using another SIMD register. For the purpose of demonstration, we will write a SIMD register as v_n (v_0, v_1, v_2) and an element of one of those registers as v_ne_m (v_0e_0), where v_0e_0 is element 0 of SIMD register 0.

I.e. if I take v_0 and add it to v_1 the result is the equivalent of v_2 = (v_0e_0+v_1e_0, v_0e_1+v_1e_1, v_0e_2+v_1e_2, v_0e_3+v_1e_3)
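
In SSE terms, that lane-wise add is a single intrinsic. A minimal sketch (the function name is mine, not part of cranberry_math):

#include <xmmintrin.h>

// Adds the four lanes of a and b pairwise:
// result = (a0+b0, a1+b1, a2+b2, a3+b3)
__m128 AddLanes(__m128 a, __m128 b)
{
	return _mm_add_ps(a, b);
}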

[Figure: VectorExample]

On its own, this might not be particularly useful, but vector register operations have a number of desirable properties.

Let’s imagine we have a fantasy processor that takes 1 cycle for every addition. This means that adding 4 pairs of numbers together takes roughly 4 cycles, or 1 cycle per add. Now, if our vector addition takes 4 cycles to complete, that’s still 1 cycle per add. However, consider what happens if by some magic our vector addition takes only 3 cycles. Then overall we only spend 3/4 = 0.75 cycles per add! In our fantasy processor, we can make this as low as 1 cycle for 4 adds, or 1/4 = 0.25 cycles per add. You may ask how our instruction can run in less than 4 cycles even though we’re still doing 4 adds. In our fantasy processor, we can do this by simply running 4 adders at once on our data. Although every adder takes 1 cycle to complete, they run in parallel, making the overall throughput 1 cycle per 4 adds.

This is a highly simplified version of how your processor might implement vector adds.

Now, with this added information, we can see why we would want to use these operations efficiently, we could reduce our program latency by at most 4x! (In reality this is unlikely to happen, but we can get fairly close).

Cranberry_math takes this concept and tries to make the best use of it while still making it approachable to the user.

An intuitive and simple use of these registers would be to store the X, Y, Z components of a vector in the first three lanes, such as v_0 = (x, y, z, unused), and work with them in that manner. In some common operations such as adding vectors, this can give you a nice hypothetical improvement of 3x over single-element addition. However, we’re still leaving 25% of our optimal performance on the table! This approach also causes some issues with common operations such as the dot product.

Let’s say that in our fantasy processor, our dot product operation takes 2 cycles to complete when applied to 1 vector. That means we get a throughput of 2 cycles per dot product. I propose (as has been proposed elsewhere as well) that you instead use each register to store a single component of several vectors, such as v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3), etc. With this in mind, we can achieve the dot product by computing v_x*v_x+v_y*v_y+v_z*v_z, which is a total of 5 operations that each take 1 cycle in our fantasy processor, for 4 vectors, or a throughput of 1.25 cycles per dot product.
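
A sketch of that component-wise (structure-of-arrays) dot product in raw SSE; the names are mine, simplified relative to cranberry_math:

#include <xmmintrin.h>

// Four dot products at once. Each register holds one component of four
// vectors, e.g. ax = (x_0, x_1, x_2, x_3). Lane i of the result is
// dot(a_i, b_i): 3 multiplies and 2 adds for 4 dot products.
__m128 Dot4(
	__m128 ax, __m128 ay, __m128 az,
	__m128 bx, __m128 by, __m128 bz)
{
	return _mm_add_ps(_mm_mul_ps(ax, bx),
		_mm_add_ps(_mm_mul_ps(ay, by), _mm_mul_ps(az, bz)));
}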

For SSE, according to the Intel intrinsics guide, a multiply operation has a latency of 4 cycles on Skylake and an add also has a latency of 4 cycles. As a result, on a fantasy processor with no out-of-order execution or multiple issue, our wide dot product would take roughly 20 cycles to complete, or 20/4 = 5 cycles per dot product, while the dedicated dot product instruction has a latency of 11 cycles. (See [1] and [2] for info on reading the Intel intrinsics guide values.) This is not indicative of real performance numbers; modern processors are significantly more complex than what I’m implying here.

Another benefit of storing vectors this way, is that it scales trivially to SIMD registers of size 8, or 16! You only need to throw more data at it!

Now, this approach to storing vectors in SIMD registers is not foolproof. It introduces difficulty if we want to add similar components together (x with x, y with y) within one vector. However, I have not found this to be a problem while writing my path tracer.

This isn’t a silver bullet, but I recommend considering adding a path for this type of storage if you can afford it.

In Depth View

With that out of the way, here are the primary types for our library:

#define cran_lane_count 4
// A single SIMD register viewed as 4 float lanes.
cran_align(16) typedef union
{
	float f[cran_lane_count];
	__m128 sse;
} cfl;

// A plain 3 component vector.
typedef union
{
	struct
	{
		float x, y, z;
	};

} cv3;

// 4 vectors stored component-wise: one lane register per component.
typedef struct
{
	cfl x;
	cfl y;
	cfl z;
} cv3l;

The theme of this library is that SIMD registers are referred to as lanes. Every element is expected to stay within its respective lane. cfl stands for “cranberry float lane”, cv3 for “cranberry vector 3” and cv3l for “cranberry vector 3 lanes”. Notice how every component in cv3l has its own set of lanes? That is a direct implementation of the layout presented above.

cran_forceinline cv3l cv3l_add(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_sub(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_mul(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_min(cv3l l, cv3l r);
cran_forceinline cv3l cv3l_max(cv3l l, cv3l r);

The API for cv3l is simple; it looks exactly the same as its cv3 counterpart. This makes it relatively easy to switch from one to the other simply by loading the appropriate data correctly.

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount);

One function that we should take a particular look at is cv3l_indexed_load. It’s likely you don’t want to store your data in cv3l format; it doesn’t lend itself particularly well to general processing and works much better in batch processing. As a result, your data needs to be transformed from x,y,z,x,y,z to x,x,x,y,y,y,z,z,z to facilitate maximum throughput.

One way to shuffle our data would be to load each element individually and store it in the appropriate location. However, this is surprisingly slow (I forget the exact reasons; a reference for that would be awesome).

Instead, what you can do is load your vectors as a set of vectors:

v_0 = (x_0, y_0, z_0, u), v_1 = (x_1, y_1, z_1, u), v_2 = (x_2, y_2, z_2, u), v_3 = (x_3, y_3, z_3, u)

and then shuffle the values into a set of temporary registers

v_{xy0} = (x_0, y_0, x_1, y_1), v_{xy1} = (x_2,y_2,x_3,y_3), v_{z0} = (z_0, u, z_1, u), v_{z1} = (z_2, u, z_3, u)

and finally shuffle them into the final registers

v_x = (x_0, x_1, x_2, x_3), v_y = (y_0, y_1, y_2, y_3), v_z =(z_0, z_1, z_2, z_3)

[Figure: VectorShuffle]

(There’s a mistake with the arrows for Z_2 and Z_3: they should point to the 4th register.)

Here’s the source for this operation:

cran_forceinline cv3l cv3l_indexed_load(void const* vectors, uint32_t stride, uint32_t offset, uint32_t* indices, uint32_t indexCount)
{
	// Gather up to 4 vectors, each loaded as (x, y, z, u).
	__m128 loadedVectors[cran_lane_count];
	for (uint32_t i = 0; i < indexCount && i < cran_lane_count; i++)
	{
		uint8_t const* vectorData = (uint8_t*)vectors;
		loadedVectors[i] = _mm_load_ps((float const*)(vectorData + indices[i] * stride + offset));
	}

	// First pass: pair up the XY components and the Z components.
	__m128 XY0 = _mm_shuffle_ps(loadedVectors[0], loadedVectors[1], _MM_SHUFFLE(1, 0, 1, 0));
	__m128 XY1 = _mm_shuffle_ps(loadedVectors[2], loadedVectors[3], _MM_SHUFFLE(1, 0, 1, 0));
	__m128 Z0 = _mm_shuffle_ps(loadedVectors[0], loadedVectors[1], _MM_SHUFFLE(3, 2, 3, 2));
	__m128 Z1 = _mm_shuffle_ps(loadedVectors[2], loadedVectors[3], _MM_SHUFFLE(3, 2, 3, 2));

	// Second pass: split into one register per component.
	return (cv3l)
	{
		.x = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(2, 0, 2, 0))},
		.y = {.sse = _mm_shuffle_ps(XY0, XY1, _MM_SHUFFLE(3, 1, 3, 1))},
		.z = {.sse = _mm_shuffle_ps(Z0, Z1, _MM_SHUFFLE(2, 0, 2, 0))}
	};
}

With all this, we can see that I can write a vectorized ray/AABB intersection using this new API:

cran_forceinline uint32_t caabb_does_ray_intersect_lanes(cv3 rayO, cv3 rayD, float rayMin, float rayMax, cv3l aabbMin, cv3l aabbMax)
{
	cv3l rayOLanes = cv3l_replicate(rayO);
	cv3l invD = cv3l_replicate(cv3_rcp(rayD));
	cv3l t0s = cv3l_mul(cv3l_sub(aabbMin, rayOLanes), invD);
	cv3l t1s = cv3l_mul(cv3l_sub(aabbMax, rayOLanes), invD);

	cv3l tsmaller = cv3l_min(t0s, t1s);
	cv3l tbigger  = cv3l_max(t0s, t1s);

	cfl rayMinLane = cfl_replicate(rayMin);
	cfl rayMaxLane = cfl_replicate(rayMax);
	cfl tmin = cfl_max(rayMinLane, cfl_max(tsmaller.x, cfl_max(tsmaller.y, tsmaller.z)));
	cfl tmax = cfl_min(rayMaxLane, cfl_min(tbigger.x, cfl_min(tbigger.y, tbigger.z)));
	cfl result = cfl_less(tmin, tmax);
	return cfl_mask(result);
}

That’s the gist of it! Writing vectorized BVH traversal using this API has been a breeze. You can find the rest of the library here: https://github.com/AlexSabourinDev/cranberries/blob/cranberray/cranberry_math.h

Here is the library in use for BVH AABB testing:

uint32_t activeLaneCount = min((uint32_t)(testQueueEnd - testQueueIter), cran_lane_count);
cv3l boundMins = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, min), testQueueIter, activeLaneCount);
cv3l boundMaxs = cv3l_indexed_load(bvh->bounds, sizeof(caabb), offsetof(caabb, max), testQueueIter, activeLaneCount);
uint32_t intersections = caabb_does_ray_intersect_lanes(rayO, rayD, rayMin, rayMax, boundMins, boundMaxs);

We’ll be taking a deeper look at the SSE optimized BVH traversal in the future. For now, happy coding!

References:

[1] https://stackoverflow.com/questions/40203254/intel-intrinsics-guide-latency-and-throughput

[2] https://stackoverflow.com/questions/35859449/why-are-some-haswell-avx-latencies-advertised-by-intel-as-3x-slower-than-sandy-b

[3] https://users.ece.cmu.edu/~franzf/teaching/slides-18-645-simd.pdf

[4] https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf

Diary of a Path Tracer – The Beginning

Within the past few months I’ve been implementing a path tracer during my weekends and evenings. Having reached what I consider to be an acceptable stopping point, I thought it might be interesting to detail my current implementation alongside the references that inspired me. Let’s jump in!

This series will touch on a variety of features that were implemented in the path tracer.

These posts will try to explain every aspect of the path tracer as it stands, but they might not be as comprehensive as the sources referenced to create it. As a result, I will strive to include all relevant sources for each feature.

You can find the path tracer here: https://github.com/AlexSabourinDev/cranberries/tree/cranberray

Next we’ll talk about the underpinning of the path tracer, the cranberry_math library. See you then!

Derivation – Importance Sampling The Cosine Lobe

Introduction

I’ve recently been diving into the world of importance sampling and I decided to share my derivation for importance sampling the cosine lobe.

Shade

When we’re shading a point in path tracing, we typically shoot a ray from our surface in a uniformly random direction on a hemisphere centered about our normal. This has the downside of introducing quite a bit of variance into our renders.

Imagine that we have a very bright light that only occupies a very small projected area from our shaded point.


We would be very likely to miss this light with most of our samples and our render could turn out much darker than we would expect.

This is where importance sampling comes in.

Imagine that your illumination is a polar function.


If we were to sample a random variable with a distribution that matches this function, we would be much more likely to hit the important points of our function. (Hence importance sampling.)

I won’t dive deeply into this topic, as there are a variety of excellent resources detailing this topic. [1]

The essence of it, however, is that you want to find a Probability Density Function (PDF) that matches the shape of your illumination function. Once you’ve defined this PDF, you can sample it using the Cumulative Distribution Function (CDF).

Derivation

Our cosine lobe illumination function looks like a lobe centered about the normal.


We will use this as the basis to derive our distribution since we’re most likely to get the most light from directions arriving parallel to our normal.

Thankfully, our cosine lobe has an analytical formula that we can use as our PDF.

PDF(\omega) = C*cos(\theta) (1)

Our PDF must integrate to 1, so we integrate it across our hemisphere:

\int_{\Omega}PDF(\omega)d\omega

\int_0^{2\pi}\int_0^{\frac{\pi}{2}}PDF(\omega)sin\theta d\theta d\phi

Plug in (1)

\int_0^{2\pi}\int_0^{\frac{\pi}{2}}C*cos\theta sin\theta d\theta d\phi

C*\int_0^{2\pi}\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta d\phi

\int cos\theta sin\theta d\theta = -\frac{1}{4}cos2\theta

\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta

-\frac{1}{4}cos\pi+ \frac{1}{4}cos0

\frac{1}{4}+\frac{1}{4}

\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta=\frac{1}{2} (2)

Plug in (2)

C*\int_0^{2\pi}\frac{1}{2} d\phi

C*\frac{1}{2}*2\pi

C*\int_0^{2\pi}\int_0^{\frac{\pi}{2}}cos\theta sin\theta d\theta d\phi=C*\pi (3)

Since our PDF has to integrate to 1,

\int_0^{2\pi}\int_0^{\frac{\pi}{2}}PDF(\omega)sin\theta d\theta d\phi = 1

Plug in (3),

C*\pi=1

C=\frac{1}{\pi} (4)

Finally, plug in (4) into our PDF,

PDF(\omega) = \frac{cos(\theta)}{\pi} (5)

Now that we have our PDF, we can calculate our PDF in terms of \theta and \phi.

PDF(\theta,\phi)d\theta d\phi = PDF(\omega)d\omega

PDF(\theta,\phi)d\theta d\phi = PDF(\omega)sin\theta d\theta d\phi

PDF(\theta,\phi)=PDF(\omega)sin\theta

PDF(\theta,\phi)=\frac{cos\theta sin\theta}{\pi} (6)

Now we integrate with respect to \phi to get PDF(\theta)

\int_0^{2\pi}\frac{cos\theta sin\theta}{\pi}d\phi = 2cos\theta sin\theta

PDF(\theta)=2cos\theta sin\theta

And then to get PDF(\phi),

\frac{PDF(\theta,\phi)}{PDF(\theta)}=PDF(\phi)

\frac{cos\theta sin\theta}{2cos\theta sin\theta \pi}=\frac{1}{2\pi}

PDF(\phi)=\frac{1}{2\pi}

Now we want to calculate the CDF of each function,

CDF(\theta)=\int_0^\theta PDF(\theta) d\theta

CDF(\theta)=\int_0^\theta 2cos(\theta)sin(\theta) d\theta

CDF(\theta)=\int_0^\theta sin(2\theta) d\theta

CDF(\theta)=\frac{1}{2}-\frac{cos(2\theta)}{2}

CDF(\phi)=\int_0^\phi PDF(\phi) d\phi

CDF(\phi)=\int_0^\phi\frac{1}{2\pi} d\phi

CDF(\phi)=\frac{\phi}{2\pi}

Now we want to invert our CDF to sample it using our random variable y,

y=CDF(\theta)

y=\frac{1}{2}-\frac{cos(2\theta)}{2}

\frac{1}{2}-y=\frac{cos(2\theta)}{2}

1-2y=cos(2\theta)

\frac{cos^{-1}(1-2y)}{2}=\theta (7)

For \phi,

y=CDF(\phi)

y=\frac{\phi}{2\pi}

y*2\pi = \phi (8)

Now we have our CDFs and our PDFs, we can finally calculate our direction.

In pseudo code you can simply do:

\theta=\frac{cos^{-1}(1-2rand01())}{2}

\phi=rand01()*2\pi

With these directions, you can now sample your scene:

\frac{SampleScene(SphericalTo3D(\theta, \phi))}{PDF(\omega)}

Plug in (5)

\frac{SampleScene(SphericalTo3D(\theta, \phi))\pi}{cos\theta}
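
As a concrete sketch of equations (7) and (8), here’s a hypothetical C++ helper; SampleScene and SphericalTo3D stand in for your renderer’s own functions:

#include <cmath>
#include <cstdlib>

// Uniform random number in [0, 1].
static float Rand01()
{
	return (float)std::rand() / (float)RAND_MAX;
}

// Cosine-lobe importance sample, straight from equations (7) and (8).
static void SampleCosineLobe(float* theta, float* phi)
{
	*theta = std::acos(1.0f - 2.0f * Rand01()) * 0.5f;
	*phi = 2.0f * 3.14159265f * Rand01();
}
// The sample's contribution is then SampleScene(SphericalTo3D(theta, phi))
// multiplied by pi / cos(theta), i.e. divided by PDF(omega) = cos(theta)/pi
// from equation (5).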

Conclusion

That’s it! These formulas will sample the hemisphere where it is receiving more light defined by the cosine lobe. The results are pretty awesome.

Bibliography

[1] https://www.scratchapixel.com/lessons/3d-basic-rendering/global-illumination-path-tracing/global-illumination-path-tracing-practical-implementation

A quest towards intuition – Where does that cosine come from?

I’ve recently been learning more about BRDFs and their properties, and I stumbled across the concept of energy conservation. Energy conservation simply states that your BRDF should not emit more energy than it receives. To determine if a BRDF is energy conserving, you accumulate all the light emitted by a surface across a hemisphere centered at the normal of that surface and ensure that the accumulated light is less than or equal to the amount of received light.

The mathematical equation for this is:

\int_{\Omega}BRDF*cos(i)*Li*d\omega<=Li

Where Li is the incoming light and i is the angle from our normal to our current accumulation direction.

Since Li is a constant, we can remove it and we’re left with:

\int_{\Omega}BRDF*cos(i)*d\omega<=1

Faced with this equation, I was struck. Where did that cosine come from?

Where did you come from? Where did you go? Where did you come from Cosine I Joe.

My first instinct was to brush it off. “It’s just the Lambert cosine factor”. False! The Lambert cosine factor was applied in Li which we removed.

I then began my slow decline into madness.


I was very confused. With a lot of stumbling I came up with various answers, each not as satisfying as I would like them to be.

Radiant exitance I yelped! Is it related to this “physics” and this “poynting vector” that you find when researching radiant flux? It might be, but I’m not proficient in the dark magic of physics. And so I moved on.

Can we instead think of it using the Helmholtz reciprocity principle that says that we can reverse our view and light vectors? This would allow us to flip the meaning of cos(i) from our viewing direction to our light direction and then simply say it’s the Lambertian cosine factor. But this didn’t feel right. We’re not talking about Lambertian materials, we’re talking about accumulating light from a generalized material. I decided to move on from this theory.

Finally, it struck me. The circle had done it! It’s always the circle.

Let’s forget everything we talked about. Forget about incoming radiance. Forget about 3D. Forget about Helmholtz. Forget about physics.

It’s just geometry

Let’s take a look at what we’re doing when we’re calculating how much light is being emitted by a surface in 2D.

If we start with the integration of the BRDF across a 2D circle without the cosine factor, we get:

\int_{\Omega}BRDF*dS

Where dS is our integration segment.

What this means is that we’re splitting the circle into equal segments and multiplying our emitted light by the width of each viewing segment.


Let’s take a look at each integration part in succession. Our integration method implies that we want to multiply our current light value by the width of our integration; as a result, we’ll draw 2 parallel lines for each line segment to represent the shape that results from our multiplication.

When viewed top down, everything is great!

[Figure: TopDownIntegration]
What about at an angle?

[Figure: AngledIntegration]
Oops! That doesn’t line up with our observed area; our perceived area contains more than our desired dA. What do we do about it?

Let’s look at it from the perspective of dA projecting onto a line at different angles


Notice how our area is getting smaller as our angle gets steeper? That’s what our integration segment should be doing to remove the extra area that we observed.

Let’s figure out the proportion of that shrinking area.

Notice that we can calculate the length of our projection using a right angle triangle (It always comes back to triangles).


[Figure: Triangle]

Our hypotenuse is dA, our angle is i, and our projected area is dP.
Remembering our identities (SOH CAH TOA), the sine of our angle is equal to the opposite over the hypotenuse. This gives us dP/dA = sin(i), which we can reformulate as dP = sin(i)*dA.

dP here is the area of our integration segment; it is the value by which we want to multiply our accumulated light. Now that we know how to get dP, we want to formulate dA in terms of our integration segment dS.

Since we know that dS from the top down (when i is \pi/2) matches dA, we can say that dS = sin(\pi/2)*dA, which simplifies to dS = dA. As a result, we can replace dA in the above to get dP = sin(i)*dS.

We can therefore modify our integral to be \int_{\Omega}BRDF*dP, and substituting dP we get \int_{\Omega}BRDF*sin(i)*dS. Close! But why is it a sine and not a cosine? This is because i is measured from the surface plane, but the information we usually have is the angle from our normal to the viewing direction (using N dot V). Let’s call that angle \theta. We know that \theta=\pi/2-i. Solving for i gives i=\pi/2-\theta, and substituting we get \int_{\Omega}BRDF*sin(\pi/2-\theta)*dS, which finally becomes \int_{\Omega}BRDF*cos(\theta)*dS!

[Figure: CorrectIntegration]

We’ve reached the end. This is where our cosine comes from!

Final thoughts

I spent quite some time trying to unravel this mystery, but it’s always exciting to get just that little bit of extra intuition. Hopefully this helps you understand it just a little bit more!

Side notes

Q: Why don’t we always do this cosine when integrating over a hemisphere?

A: Because we’re not always integrating a differential area (dA). It would be wrong to apply that rule if we’re instead integrating a smaller hemisphere or some other geometrical concept.

Q: Why don’t we multiply our light by this cosine factor when we’re rendering?

A: Because it’s implicit in our perspective projection. When we’re viewing the surface it already covers less of our viewing surface when viewed at an angle. If we applied the cosine when rendering we would be applying it twice. We apply it here because we don’t have any perspective projection applied when doing our integration.


Think Piece – Expecting Failure

Premise

I’ve recently been thinking about different methods to improve my code quality. In this case, quality equates to the number of failure states (crashes) a program has. One option I’ve been considering is programming through what I affectionately call Expected Failure, or EF (pronounced F!).
