Even at `-O0`, gcc doesn't emit definitions for `static inline` functions unless there's a caller. In that case, it doesn't actually inline: instead it emits a stand-alone definition. So I guess your disassembly is from that.
Are you using a really old gcc version? gcc 4.6.4 puts the vars in that order on the stack, but 4.7.3 and later use the other order:
```asm
    movb    $1, -5(%rbp)    #, tmp
    movl    $0, -4(%rbp)    #, i
```
In your asm, they're stored in order of initialization rather than declaration, but I think that's just by chance, since the order changed with gcc 4.7. Also, tacking on an initializer like `int i=1;` doesn't change the allocation order, which completely torpedoes that theory.
Remember that gcc is designed around a series of transformations from source to asm, so `-O0` doesn't mean "no optimization". You should think of `-O0` as leaving out some things that `-O3` normally does. There is no option that tries to make a literal-as-possible translation from source to asm.
Once gcc does decide which order to allocate space for them:

- the `char` at `rbp-1`: that's the first available location that can hold a `char`. If there was another `char` that needed storing, it could go at `rbp-2`.
- the `int` at `rbp-8`: since the 4 bytes from `rbp-1` to `rbp-4` aren't all free, the next available naturally-aligned location is `rbp-8`.

Or with gcc 4.7 and newer, `-4` is the first available spot for an `int`, and `-5` is the next byte below that.
RE: space saving:

It's true that putting the `char` at `-5` makes the lowest touched address `%rsp-5`, instead of `%rsp-8`, but this doesn't save anything.
The stack pointer is 16B-aligned in the AMD64 SysV ABI. (Technically, `%rsp+8` (the start of stack args) is aligned on function entry, before you push anything.) The only way for `%rbp-8` to touch a new page or cache-line that `%rbp-5` wouldn't is for the stack to be less than 4B-aligned. This is extremely unlikely, even in 32-bit code.
As far as how much stack is "allocated" or "owned" by the function: in the AMD64 SysV ABI, the function "owns" the red zone of 128B below `%rsp` (that size was chosen because a one-byte displacement can go down to `-128`). Signal handlers and any other asynchronous users of the user-space stack will avoid clobbering the red zone, which is why the function can write to memory below `%rsp` without decrementing `%rsp`. So from that perspective, it doesn't matter how much of the red zone we use; the chance of a signal handler running out of stack is unaffected.
In 32-bit code, where there's no red zone, gcc reserves space on the stack with `sub $16, %esp` for either order. (Try it with `-m32` on Godbolt.) So again, it doesn't matter whether we use 5 or 8 bytes, because we reserve in units of 16.
When there are many `char` and `int` variables, gcc packs the `char`s into 4B groups instead of losing space to fragmentation, even when the declarations are mixed together:
```c
void many_vars(void) {
    char tmp = 1; int i  = 1;
    char t2  = 2; int i2 = 2;
    char t3  = 3; int i3 = 3;
    char t4  = 4;
}
```
Compiled with gcc 4.6.4 `-O0 -fverbose-asm`, which helpfully labels which store is for which variable (one reason compiler asm output is preferable to disassembly):
```asm
    pushq   %rbp            #
    movq    %rsp, %rbp      #,
    movb    $1, -4(%rbp)    #, tmp
    movl    $1, -16(%rbp)   #, i
    movb    $2, -3(%rbp)    #, t2
    movl    $2, -12(%rbp)   #, i2
    movb    $3, -2(%rbp)    #, t3
    movl    $3, -8(%rbp)    #, i3
    movb    $4, -1(%rbp)    #, t4
    popq    %rbp            #
    ret
```
I think variables go in either forward or reverse order of declaration, depending on gcc version, at `-O0`.
I made a version of your `read_array` function that works with optimization on:
```c
// Assumes size is non-zero.  Use a while() instead of do{}while()
// if you want extra code to check for that case.
void read_array_good(const char* array, size_t size) {
    const volatile char *vp = array;
    do {
        (void) *vp;  // counts as accessing the volatile memory, with gcc/clang at least
        vp += CACHE_LINE_SIZE / sizeof(vp[0]);
    } while (vp < array + size);
}
```
Compiles to the following with gcc 5.3 `-O3 -march=haswell`:

```asm
    addq    %rdi, %rsi      # array, D.2434
.L11:
    movzbl  (%rdi), %eax    # MEM[(const char *)array_1], D.2433
    addq    $64, %rdi       #, array
    cmpq    %rsi, %rdi      # D.2434, array
    jb      .L11            #,
    ret
```
Casting an expression to `void` is the canonical way to tell the compiler that a value is used. E.g. to suppress unused-variable warnings, you can write `(void)my_unused_var;`.
For gcc and clang, doing that with a `volatile` pointer dereference does generate a memory access, with no need for a tmp variable. The C standard is very non-specific about what constitutes an access to something that's `volatile`, so this probably isn't perfectly portable. Another way is to `xor` the values you read into an accumulator, and then store that to a global. As long as you don't use whole-program optimization, the compiler doesn't know that nothing reads the global, so it can't optimize away the calculation.
See the `vmtouch` source code for an example of this second technique. (It actually uses a global variable for the accumulator itself, which makes clunky code. Of course, that hardly matters since it's touching pages, not just cache lines, so it very quickly bottlenecks on TLB misses and page faults, even with a memory read-modify-write in the loop-carried dependency chain.)
I tried and failed to write something that gcc or clang would compile to a function with no prologue (which assumes that `size` is initially non-zero). GCC always wants to `add rsi,rdi` for a `cmp/jcc` loop condition, even with `-march=haswell` where `sub rsi,64`/`jae` can macro-fuse just as well as `cmp/jcc`. But in general on AMD, where only `cmp`/`test` can macro-fuse with `jcc`, what GCC emits has fewer uops inside the loop.
```asm
read_array_handtuned_haswell:
.L0:
    movzx   eax, byte [rdi]  ; overwrite the full RAX to avoid any partial-register false deps from writing AL
    add     rdi, 64
    sub     rsi, 64
    jae     .L0              ; or ja, depending on what semantics you want
    ret
```
Godbolt Compiler Explorer link with all my attempts and trial versions
I can get similar code if the loop-termination condition is `je`, in a loop like `do { ... } while( size -= CL_SIZE );`. But I can't seem to convince gcc to catch unsigned borrow when subtracting: it wants to subtract and then `cmp $-64`/`jb` to detect underflow. It's not that hard to get compilers to check the carry flag after an add to detect carry, though. :/
It's also easy to get compilers to make a 4-insn loop, but not without a prologue: e.g. calculate an end pointer (`array+size`) and increment a pointer until it's greater or equal. Fortunately this is not a big deal; the loop we do get is good.