Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

x86 - What's the difference between vextracti128 and vextractf128?

vextracti128 and vextractf128 have the same functionality, parameters, and return values. The only difference I can see is that one (vextractf128) belongs to the AVX instruction set while the other (vextracti128) belongs to AVX2. What is the difference?

1 Reply

vextracti128 and vextractf128 share more than functionality, parameters, and return values. They also have the same instruction length and the same throughput (according to Agner Fog's optimization manuals).

What is less clear is their latency (performance in tight loops with dependency chains). The latency of the instructions themselves is 3 cycles. After reading section 2.1.3 ("Execution Engine") of the Intel Optimization Manual, one might suspect that vextracti128 incurs 1 extra cycle of delay when working with floating-point data and that vextractf128 incurs 1 extra cycle when working with integer data. Measurements show that this is not the case: latency always remains exactly 3 cycles (at least on Haswell processors). As far as I know, this is not documented anywhere in the Optimization Manual.
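A minimal sketch of how such a latency test could look (not necessarily the exact loop used for the measurements above; the register choices are arbitrary assumptions). The "integer" extract is placed on a dependency chain with an FP add, so any domain-crossing delay would lengthen each iteration:

    vxorps ymm0, ymm0, ymm0
    vxorps ymm7, ymm7, ymm7
_benchloop:
    vaddps       ymm0, ymm0, ymm7    ; FP producer, depends on the extract below
    vextracti128 xmm0, ymm0, 1       ; integer-flavored extract of FP data; writing xmm0 also zeroes the upper half of ymm0
    jmp _benchloop

If the 3-cycle latencies quoted in this answer for vaddps and vextracti128 hold and there is no extra delay, this loop should run at about 6 clocks per iteration. Anything above that would point to a domain-crossing penalty.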

Still, an instruction set is only an interface to the processor. Haswell is (for now) the only implementation of this interface that contains both instructions. We can ignore the fact that their implementations are (most likely) identical and use the instructions as intended: vextracti128 for integer data and vextractf128 for FP data. (If we only need to reorder data without performing any int/FP operations, the obvious choice is vextractf128, since it is also supported by several older processors.) Experience also shows that Intel sometimes reduces the performance of certain instructions in later CPU generations, so it is wise to respect each instruction's domain affinity to avoid possible slowdowns in the future.

Since the Intel Optimization Manual is not very detailed about the relationship between the int and FP domains for SIMD instructions, I made some more measurements (on Haswell) and got some interesting results:


Shuffle instructions

There is no additional delay for any transition between SSE integer instructions and shuffle instructions, and none for any transition between SSE FP instructions and shuffle instructions (though I didn't test every instruction). For example, you can insert an "obviously integer" instruction such as pshufb between two FP instructions with no extra delay. Inserting shufpd in the middle of integer code also costs no extra delay.

Since vextracti128 and vextractf128 are executed by the shuffle unit, they also have this "no delay" property.

This may be useful for optimizing mixed int+FP code. If you need to reinterpret FP data as integers and shuffle the register at the same time, just make sure that all FP instructions come before the shuffle and all integer instructions come after it.
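As an illustration of that ordering (a sketch only; the registers and the particular add instructions are arbitrary assumptions), the extract sits exactly at the boundary between the FP part and the integer part:

    vaddps       ymm0, ymm0, ymm1    ; FP work on the full 256-bit register
    vextractf128 xmm2, ymm0, 1       ; shuffle unit: take the upper 128 bits
    vpaddd       xmm2, xmm2, xmm3    ; integer work on the same bits, reinterpreted

According to the measurements described here, vextracti128 in the middle position should behave the same way on Haswell.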


FP logical instructions

andps and other FP logical instructions also ignore the FP/int domain boundary.

If you add an integer logical instruction (like pand) to FP code, you get an additional 2-cycle delay (one cycle to get into the int domain and another to get back to FP). So the obvious choice for SIMD FP code is andps. The same andps may be used in the middle of integer code without any delay. Even better is to place such instructions right between int and FP instructions. Interestingly, FP logical instructions use the same port 5 as all shuffle instructions.
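A sketch of a loop that could be used to compare the two cases (an assumption about the test setup, not the original benchmark; ymm6 and ymm7 are assumed to be initialized before the loop, and their contents do not matter for timing):

_benchloop:
    vaddps ymm0, ymm0, ymm7
    vandps ymm0, ymm0, ymm6      ; FP logical: no extra delay expected
    ; vpand ymm0, ymm0, ymm6     ; integer logical (AVX2): swap in for vandps to compare
    jmp _benchloop

With vpand in place of vandps, the 2-cycle penalty described above should show up as two more clocks per iteration.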


Register access

The Intel Optimization Manual describes bypass delays between producer and consumer micro-ops, but it says nothing about how micro-ops interact with registers.

This piece of code needs only 3 clocks per iteration (exactly the latency of vaddps):

    vxorps ymm7, ymm7, ymm7      ; zero ymm7 once, outside the loop
_benchloop:
    vaddps ymm0, ymm0, ymm7      ; loop-carried dependency through ymm0
    jmp _benchloop

But this one needs 2 clocks per iteration (1 more than the latency of vpaddd):

    vpxor ymm7, ymm7, ymm7       ; zero ymm7 once, outside the loop
_benchloop:
    vpaddd ymm0, ymm0, ymm7      ; loop-carried dependency through ymm0
    jmp _benchloop

The only difference here is that the calculations are done in the integer domain instead of the FP domain. To get 1 clock per iteration, we need to add an instruction:

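    vpxor ymm7, ymm7, ymm7       ; zero ymm7 once, outside the loop
_benchloop:
    vpand ymm6, ymm7, ymm7       ; copy ymm7 into a temporary; not on the ymm0 chain
    vpaddd ymm0, ymm0, ymm6      ; loop-carried dependency through ymm0
    jmp _benchloop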

This hints that (1) all values stored in SIMD registers belong to the FP domain, and (2) reading from a SIMD register increases an integer operation's latency by one cycle. (The difference between {ymm0, ymm6} and ymm7 here is that ymm7 is stored in some scratch memory and works as a real "register", while ymm0 and ymm6 are temporary and are represented by the state of the CPU's internal interconnections rather than by permanent storage, so ymm0 and ymm6 are not "read" but just passed between micro-ops.)

