Something interesting came up about memory alignment yesterday, so I figured it would be neat to write up a quick synopsis of the topic.
What is memory alignment?
Memory alignment is the requirement that the address of an object must be a multiple of its alignment.
e.g. a 4 byte integer has a memory alignment of 4 bytes, which means that you can find the integer at addresses 0, 4, 8, 12, 16, 20, 24, etc.
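Here's a tiny sketch of what that looks like in code (the exact values are what you'd typically see on x86-64; other platforms may differ):

#include <cstddef>   // std::size_t
#include <cstdint>   // std::int32_t, std::uintptr_t
#include <cstdio>

int main() {
    std::int32_t value = 42;
    // On typical x86-64 targets a 4 byte int has a 4 byte alignment
    // requirement, so its address is always a multiple of 4.
    std::printf("alignof(std::int32_t) = %zu\n", alignof(std::int32_t));
    std::printf("&value mod 4 = %zu\n",
                static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(&value) % 4));
}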
Unaligned memory accesses used to be a pretty significant problem, since CPUs could only load memory on particular alignment boundaries. (You could only load a 4 byte word at addresses that were a multiple of 4 bytes; I assume that compilers could likely handle the case where memory was not aligned at the price of some performance – needs a reference.)
However, x86-64’s load instruction (mov) supports unaligned accesses [1]. This is not necessarily the case for other architectures, so be careful!
Why do we care?
Usually we don’t!
Because:
- Most memory will be aligned on at least an 8 byte boundary if you are using a standard allocator (standard allocators align every allocation at least as strictly as max_align_t [2]; see the snippet after this list).
- Most memory can be loaded from an unaligned address without any problems anyway (sort of).
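Here's a quick sketch of what that first guarantee looks like in practice (the numbers you get depend on your platform and allocator, so treat the output as illustrative):

#include <cstddef>   // std::max_align_t, std::size_t
#include <cstdint>   // std::uintptr_t
#include <cstdio>
#include <cstdlib>   // std::malloc, std::free

int main() {
    // Standard allocators promise at least this much alignment.
    std::printf("alignof(std::max_align_t) = %zu\n", alignof(std::max_align_t));

    void* p = std::malloc(40);
    std::printf("malloc returned %p (address mod 16 = %zu)\n", p,
                static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) % 16));
    std::free(p);
}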
However there is a case that can cause problems.
Welcome to SIMD
SIMD stands for Single Instruction, Multiple Data. I’ll be glossing over it for now, feel free to reach out if you’d like to talk about it some more.
The important bit about SIMD on x86-64 is that its primary type, __m128, is 16 bytes large and has an alignment requirement of 16 bytes.
This type is not loaded using the standard x86 load instruction. It uses a dedicated instruction (movaps) that _requires_ the address to be aligned on a 16 byte boundary. If the address being loaded isn’t a multiple of 16 bytes, then your application will raise an exception and crash.
This means that if your standard allocator allocates on an 8 byte boundary, you are likely to crash if you allocate a __m128 without telling the allocator to align on a 16 byte boundary.
There is an instruction that allows you to load this type without alignment (movups) and the compiler _can_ emit this instruction.
However, this is easy to break.
Consider this small program.
#include <cstdint>       // uint8_t
#include <xmmintrin.h>   // __m128, _mm_add_ps

uint8_t bytes[40];
__m128& vector1 = *(__m128*)&bytes[8];   // 16 bytes starting at bytes + 8
__m128& vector2 = *(__m128*)&bytes[24];  // 16 bytes starting at bytes + 24
__m128 vector = _mm_add_ps(vector1, vector2);
MSVC will generate this (edited to make it more readable):
movups xmm0, xmmword ptr [vector1]
addps  xmm0, xmmword ptr [vector2]
Notice that the first instruction is an unaligned load (yay!). HOWEVER! We’re also referencing vector2 directly as a memory operand of the addps instruction.
Das bad.
addps requires its memory operand to be aligned to a 16 byte boundary, and our memory very likely is not. (24 is not a multiple of 16!)
This will crash.
The worst part? It crashes inconsistently. It all depends on the original address of “bytes”. If “bytes” was at address 8, then vector1 will be at address 8+8=16 and vector2 will be at address 8+24=32: no crash. However, if it was at address 0, well…….
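If you really do need to read __m128 data from addresses you don’t control, one option (just a sketch, and the function name is made up for illustration) is to go through the explicit unaligned load/store intrinsics, so the generated code never assumes 16 byte alignment:

#include <cstdint>
#include <xmmintrin.h>   // _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

float sum_unaligned(const std::uint8_t* bytes) {
    // _mm_loadu_ps is an explicitly unaligned load, so these are safe
    // no matter where "bytes" happens to live.
    __m128 a = _mm_loadu_ps(reinterpret_cast<const float*>(bytes + 8));
    __m128 b = _mm_loadu_ps(reinterpret_cast<const float*>(bytes + 24));
    __m128 sum = _mm_add_ps(a, b);

    float out[4];
    _mm_storeu_ps(out, sum);   // unaligned store, same idea
    return out[0] + out[1] + out[2] + out[3];
}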
More things!
How can we break this alignment?
Packing into a buffer
Let’s say I allocate a large block of memory, maybe 1000TB… or 32 bytes for simplicity.
[................................]
Maybe what I want to do is add a series of objects into that buffer to make them nice and packed.
First I add a byte into it.
[1...............................]
Then maybe I want to add a 4 byte integer into it
[12222...........................]
Then maybe I want to add our __m128 object
[122223333333333333333...........]
Uh oh! Now our __m128 object sits at offset 5 within the buffer, which definitely isn’t aligned to a 16 byte boundary.
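The usual fix is to pad before the object rather than pack it tightly. Here’s a minimal sketch of a bump-style packer that does this (BumpBuffer, align_up and push are made-up names for illustration): before handing out the next slot, round the running offset up to the object’s alignment.

#include <cstddef>
#include <cstdint>

// Round "offset" up to the next multiple of "alignment" (a power of 2).
std::size_t align_up(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

struct BumpBuffer {
    alignas(16) std::uint8_t storage[64];  // the buffer itself starts 16 byte aligned
    std::size_t offset = 0;

    // Reserve "size" bytes aligned to "alignment"; returns nullptr when full.
    void* push(std::size_t size, std::size_t alignment) {
        std::size_t aligned = align_up(offset, alignment);
        if (aligned + size > sizeof(storage)) return nullptr;
        offset = aligned + size;
        return storage + aligned;
    }
};

With this, pushing the 1 byte object, then the 4 byte integer (which itself gets nudged to offset 4), then the __m128 puts the __m128 at offset 16 instead of offset 5.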
Custom allocation
Another possibility is to allocate our __m128 object using a standard allocator.
Say I do this:
__m128* myVector = new __m128{};
This can easily cause a crash later on, since standard allocators are only guaranteed to align to max_align_t, which on some platforms (MSVC, for example) is only 8 bytes. (It seems this might change in C++17 [3].)
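There are a couple of easy ways to get correctly aligned storage instead; this is just a sketch, not the only approach:

#include <xmmintrin.h>   // __m128, _mm_load_ps, _mm_setzero_ps, and (on GCC/Clang)
                         // _mm_malloc/_mm_free; MSVC declares those two in <malloc.h>

int main() {
    // Automatic storage: alignas is honoured by the compiler.
    alignas(16) float values[4] = {1.f, 2.f, 3.f, 4.f};
    __m128 v = _mm_load_ps(values);   // the aligned load is safe here
    (void)v;

    // Heap storage: _mm_malloc takes the alignment explicitly.
    __m128* myVector = static_cast<__m128*>(_mm_malloc(sizeof(__m128), 16));
    *myVector = _mm_setzero_ps();
    _mm_free(myVector);
}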
What are the benefits of alignment?
There are a few that I can think of:
CPU Architectural simplicity.
Disclaimer: This is pure speculation. If I can guarantee that all my memory loads are aligned at particular boundaries, I can likely make my CPU much simpler, which takes up less space, uses less power, and produces less heat. More space is more good.
Cross cache loads/cross page loads.
If I have memory aligned at powers of 2 such as 2, 4, 8, 16, etc., then I can guarantee that a small data type will never cross a cache line boundary.
Notice that with an 8 byte cache line, I can fit one 8 byte object in a single line. However, if the object is unaligned, it could span 2 cache lines!
[........][........]
[11111111][........] // Good
[....1111][1111....] // Bad
Our CPUs load memory one cache line at a time; needing to load 2 cache lines for a single object can introduce some undesired latency in a hot loop. [6]
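A common trick to keep a hot object from straddling cache lines is to over-align it to the cache line size; here’s a tiny sketch (64 bytes is the usual x86-64 cache line size, but that’s an assumption about your hardware):

#include <cstdint>

// alignas(64) pins the struct to a cache line boundary, so its 8 bytes
// of hot data never straddle two lines (assuming 64 byte cache lines).
struct alignas(64) HotCounter {
    std::uint64_t value;
};

static_assert(alignof(HotCounter) == 64, "cache line aligned");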
This gets even worse with memory pages; however, due to a bout of extreme laziness, I will not be going into details about it.
Conclusion
And that’s it! Thanks for coming by, hope this was informative! Got any questions? Please reach out!
Resources:
[1] https://blog.quarkslab.com/unaligned-accesses-in-cc-what-why-and-solutions-to-do-it-properly.html
[2] https://en.cppreference.com/w/cpp/types/max_align_t
[4] https://c9x.me/x86/html/file_module_x86_id_180.html
[5] https://blog.ngzhian.com/sse-avx-memory-alignment.html
[6] https://bits.houmus.org/2020-01-28/this-goes-to-eleven-pt1