It depends on the platform. For a Xenon PowerPC, for example, it can be an order of magnitude difference due to a load-hit-store issue with passing data on the stack. I empirically timed the overhead of a cdecl
function at about 45 cycles compared to ~4 for a fastcall
.
For an out-of-order x86 (Intel and AMD), the impact may be much less, because the registers are all shadowed and renamed anyway.
The answer really is that you need to benchmark it yourself on the particular platform you care about.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…