Why the nested vectors are about the same speed as flat in your microbenchmark, after fixing the indexing order: You'd expect the flat array to be faster (see Tobias's answer about potential locality problems, and my other answer for why nested vectors suck in general, but not too badly for sequential access). But your specific test is doing so many things that let out-of-order execution hide the overhead of using nested vectors, and/or that just slow things down so much that the extra overhead is lost in measurement noise.
I put your performance-bugfixed source code up on Godbolt so we can look at the asm of the inner loop as compiled by g++5.2, with -O3
. (Apple's fork of clang might be similar to clang3.7, but I'll just look at the gcc version.) There's a lot of code from C++ functions, but you can right-click on a source line to scroll the asm windows to the code for that line. Also, mouseover a source line to bold the asm that implements that line, or vice versa.
gcc's inner two loops for the nested version are as follows (with some comments added by hand):
## outer-most loop not shown
.L213: ## middle loop (over `y`)
test rbp, rbp # Z
je .L127 # inner loop runs zero times if Z==0
mov rax, QWORD PTR [rsp+80] # MEM[(struct vector * *)&vec3D], MEM[(struct vector * *)&vec3D]
xor r15d, r15d # z = 0
mov rax, QWORD PTR [rax+r12] # MEM[(struct vector * *)_195], MEM[(struct vector * *)_195]
mov rdx, QWORD PTR [rax+rbx] # D.103857, MEM[(double * *)_38]
## Top of inner-most loop.
.L128:
lea rdi, [rsp+5328] # tmp511, ## function arg: pointer to the RNG object, which is a local on the stack.
lea r14, [rdx+r15*8] # D.103851, ## r14 = &(vec3D[x][y][z])
call double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&) #
addsd xmm0, xmm0 # D.103853, D.103853 ## return val *= 2.0: [0.0, 2.0]
mov rdx, QWORD PTR [rsp+80] # MEM[(struct vector * *)&vec3D], MEM[(struct vector * *)&vec3D] ## redo the pointer-chasing from vec3D.data()
mov rdx, QWORD PTR [rdx+r12] # MEM[(struct vector * *)_150], MEM[(struct vector * *)_150]
subsd xmm0, QWORD PTR .LC6[rip] # D.103859, ## and subtract 1.0: [-1.0, 1.0]
mov rdx, QWORD PTR [rdx+rbx] # D.103857, MEM[(double * *)_27]
movsd QWORD PTR [r14], xmm0 # *_155, D.103859 # store into vec3D[x][y][z]
movsd xmm0, QWORD PTR [rsp+64] # D.103853, tmp1 # reload volatile tmp1
addsd xmm0, QWORD PTR [rdx+r15*8] # D.103853, *_62 # add the value just stored into the array (r14 = rdx+r15*8 because nothing else modifies the pointers in the outer vectors)
add r15, 1 # z,
cmp rbp, r15 # Z, z
movsd QWORD PTR [rsp+64], xmm0 # tmp1, D.103853 # spill tmp1
jne .L128 #,
#End of inner-most loop
.L127: ## middle-loop
add r13, 1 # y,
add rbx, 24 # sizeof(std::vector<> == 24) == the size of 3 pointers.
cmp QWORD PTR [rsp+8], r13 # %sfp, y
jne .L213 #,
## outer loop not shown.
And for the flat loop:
## outer not shown.
.L214:
test rbp, rbp # Z
je .L135 #,
mov rax, QWORD PTR [rsp+280] # D.103849, vec1D._Y
mov rdi, QWORD PTR [rsp+288] # D.103849, vec1D._Z
xor r15d, r15d # z
mov rsi, QWORD PTR [rsp+296] # D.103857, MEM[(double * *)&vec1D + 24B]
.L136: ## inner-most loop
imul rax, r12 # D.103849, x
lea rax, [rax+rbx] # D.103849,
imul rax, rdi # D.103849, D.103849
lea rdi, [rsp+5328] # tmp520,
add rax, r15 # D.103849, z
lea r14, [rsi+rax*8] # D.103851, # &vec1D(x,y,z)
call double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&) #
mov rax, QWORD PTR [rsp+280] # D.103849, vec1D._Y
addsd xmm0, xmm0 # D.103853, D.103853
mov rdi, QWORD PTR [rsp+288] # D.103849, vec1D._Z
mov rsi, QWORD PTR [rsp+296] # D.103857, MEM[(double * *)&vec1D + 24B]
mov rdx, rax # D.103849, D.103849
imul rdx, r12 # D.103849, x # redo address calculation a 2nd time per iteration
subsd xmm0, QWORD PTR .LC6[rip] # D.103859,
add rdx, rbx # D.103849, y
imul rdx, rdi # D.103849, D.103849
movsd QWORD PTR [r14], xmm0 # MEM[(double &)_181], D.103859 # store into the address calculated earlier
movsd xmm0, QWORD PTR [rsp+72] # D.103853, tmp2
add rdx, r15 # tmp374, z
add r15, 1 # z,
addsd xmm0, QWORD PTR [rsi+rdx*8] # D.103853, MEM[(double &)_170] # tmp2 += vec1D(x,y,z). rsi+rdx*8 == r14, so this is a reload of the store this iteration.
cmp rbp, r15 # Z, z
movsd QWORD PTR [rsp+72], xmm0 # tmp2, D.103853
jne .L136 #,
.L135: ## middle loop: increment y
add rbx, 1 # y,
cmp r13, rbx # Y, y
jne .L214 #,
## outer loop not shown.
Your MacBook Pro (Late 2012) has an Intel IvyBridge CPU, so I'm using numbers for that microarchitecture from Agner Fog's instruction tables and microarch guide. Things should be mostly the same on other Intel/AMD CPUs.
The only 2.5GHz mobile IvB i5 is the i5-3210M, so your CPU has 3MiB of L3 cache. This means even your smallest test case (100^3 * 8B per double
~= 7.63MiB) is larger than your last-level cache, so none of your test cases fit in cache at all. That's probably a good thing, because you allocate and default-initialize both nested and flat before testing either of them. However, you do test in the same order you allocate, so if the nested array is still cache after zeroing the flat array, the flat array may still be hot in L3 cache after the timing loop over the nested array.
If you'd used a repeat-loop to loop over the same array multiple times, you could have got times large enough to measure for smaller array sizes.
You're doing several things here that are super-weird and make this so slow that out-of-order execution can hide the extra latency of changing y
, even if your inner z
vectors are not perfectly contiguous.
You run a slow PRNG inside the timed loop. std::uniform_real_distribution<double> urd(-1, 1);
is extra overhead on top of std::mt19937 rng{rd()};
, which is already slow compared to FP-add latency (3 cycles), or compared to the L1D cache load throughput of 2 per cycle. All this extra time running the PRNG gives out-of-order execution a chance to run the array-indexing instructions so the final address is ready by the time the data is. Unless you have a lot of cache misses, you're mostly just measuring PRNG speed, because it produces results much slower than 1 per clock cycle.
g++5.2 doesn't fully inline the urd(rng)
code, and the x86-64 System V calling convention has no call-preserved XMM registers. So tmp1
/tmp2
have to be spilled/reloaded for every element, even if they weren't volatile
.
It also loses its place in the Z vector, and has to redo the outer 2 levels of indirection before