vextracti128
and vextractf128
have not only the same functionality, parameters, and return values. They have the same instruction length. And they have the same throughput (according to Agner Fog's optimization manuals).
What is not completely clear is their latency values (performance in tight loops with dependency chains). Latency of instructions themselves is 3 cycles. But after reading section 2.1.3 ("Execution Engine") of Intel Optimization Manual we may suspect that vextracti128
should get additional 1 clock delay when working with floating point data and vextractf128
should get additional 1 clock delay when working with integer data. Measurements show that this is not true and latency always remains exactly 3 cycles (at least for Haswell processors). And as far as I know this is not documented anywhere in the Optimization Manual.
Still instruction set is only an interface to processor. Haswell is the only implementation of this interface containing both these instructions (for now). We could ignore the fact that implementations of these instructions are (most likely) identical. And use these instructions as intended - vextracti128
for integer data and vextractf128
for FP data. (If we only need to reorder data without performing any int/FP operations, the obvious choice is vextractf128
as it is supported by several older processors). Also experience shows that Intel sometimes decreases performance of some instructions in next generations of CPUs, so it would be wise to observe these instructions' affinity to avoid any possible speed degradation in the future.
Since Intel Optimization Manual is not very detailed describing relationship between int/FP domains for SIMD instructions, I've made some more measurements (on Haswell) and got some interesting results:
Shuffle instructions
There is no additional delay for any transitions between SSE integer and shuffle instructions. And there is no additional delay for any transitions between SSE FP and shuffle instructions. (Though I didn't test every instruction). For example you could insert such "obviously integer" instruction as pshufb
between two FP instructions with no extra delay. Inserting shufpd
in the middle of integer code also gives no extra delay.
Since vextracti128
and vextractf128
are executed by shuffle unit, they also have this "no delay" property.
This may be useful to optimize mixed int+FP code. If you need to reinterpret FP data as integers and at the same time shuffle the register, just make sure all FP instructions stand before the shuffle and all integer instructions are after it.
FP logical instructions
andps
and other FP logical instructions also have the property of ignoring FP/int domains.
If you add integer logical instruction (like pand
) into FP code, you get additional 2 cycle delay (one to get to int domain and other one to get back to FP). So the obvious choice for SIMD FP code is andps
. The same andps
may be used in the middle of integer code without any delays. Even better is to use such instructions right in between int and FP instructions. Interestingly, FP logical instructions are using the same port number 5 as all shuffle instructions.
Register access
Intel Optimization Manual describes bypass delays between producer and consumer micro-ops. But it does not say anything how micro-ops interact with registers.
This piece of code needs only 3 clocks per iteration (just as required by vaddps
):
vxorps ymm7, ymm7, ymm7
_benchloop:
vaddps ymm0, ymm0, ymm7
jmp _benchloop
But this one needs 2 clocks per iteration (1 more than needed for vpaddd
):
vpxor ymm7, ymm7, ymm7
_benchloop:
vpaddd ymm0, ymm0, ymm7
jmp _benchloop
The only difference here are calculations in integer domain instead of FP domain. To get 1 clock/iteration we need to add an instruction:
vpxor ymm7, ymm7, ymm7
_benchloop:
vpand ymm6, ymm7, ymm7
vpaddd ymm0, ymm0, ymm6
jmp _benchloop
Which hints that (1) all values stored in SIMD registers belong to FP domain, and (2) reading from SIMD register increases integer operation's latency by one. (The difference between {ymm0, ymm6} and ymm7 here is that ymm7 is stored in some scratch memory and works as real "register" while ymm0 and ymm6 are temporary and are represented by state of internal CPU's interconnections rather than some permanent storage, so ymm0 and ymm6 are not "read" but just passed between micro-ops).