Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.0k views
in Technique[技术] by (71.8m points)

How to compare elements in array in assembly

How to compare all the elements in an array in Assembly language x86? I have to compare all the elements in an array and print the biggest one

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I'm going to pretend this wasn't a terrible trivial question, and actually discuss the interesting parts of doing this in assembly language (instead of letting a compiler optimize it for you).


In asm, you could do it like you would in any other language. However, if you're programming for a machine with vector instructions, you can and should use them. Compilers will usually do this for you, but in asm you have to do it yourself.

Since the main reason for writing code in asm is high performance, let's consider some of the issues:


Without vector instructions, it might or might not be a good idea to use a conditional-move to do the usual x=max(x, a[i]). cmov would introduce a loop-carried dependency, which might hurt performance more than the occasional branch mispredict. (google for more about this).

Branch mispredicts are probably not frequent when finding the max, unless your values are noisy but on average increasing. (e.g. a new max is seen every 1 to 10 elements would be near worst-case.) Otherwise, you probably tend to go for a long time either always seeing a new max or never seeing a new max.


x86 has vector min/max instructions that work like a cmp/cmov on a per-element basis.

So if your array is made up of 32bit signed integers, you could use start by loading the first 4 elements into a vector register (say xmm0), then use add rsi, 16 / PMAXSD xmm0, [rsi] inside a loop to do 4 packed x=max(x,src) operations. PMAXSD in English is: Packed(integer) Max of Signed DWord elements. See the links in the wiki for reference guides. PMAXSD is part of SSE4.1, so it's only supported on CPUs with that feature-bit.

If your array was made up of uint8_t elements, you'd use PMINUB (Packed(int) Min of Unsigned Byte elements). PMIN/MAXUB and PMIN/MAXSW are in SSE2, so they're baseline for x86-64 (and for x86-32 on OSes which require new enough hardware with SSE2 support).

After looping over the array (maybe using PALIGNR or PSRLDQ to handle the last non-multiple-of-16B bit of the array), you'll have 4 elements in your accumulator vector. Each one is the max of every 4th element, for four different offsets. To get the overal max, you need to horizontally find the max element in the vector. Do this by shuffling it (e.g. byte-shifting it right, so the high two elements move to the position of the low two elements), then and comparing it using PMAXSD against the un-shuffled value. Then repeat the process, to get the max of the last two elements.

Now you can store that 32bit int to memory, or use movd to transfer it directly from xmm0 to eax as a function return value.


There's some room for improvement here, since even though pmaxsd has a latency of one cycle (Intel Haswell for example), it has a throughput of 2 per cycle. So ideally we can sustain a throughput of two PMAX per clock, with memory operands. (Since Intel SnB and onward have two load ports, L1 cache can keep up with this.) We need to use multiple accumulators to allow parallel operation. (And then PMAX all the accumulators together at the end, before doing the horizontal operation).

;;; probably buggy, use at your own risk.  edits welcome
global max_array
max_array:
    ; function args: int *rsi, uint64_t rdi
    ; requirements: src is aligned on a 16B boundary, size is a multiple of 32bytes (8 elements), and >=8 on entry
    ; TODO: support unaligned with some startup code, and a partial final iteration with some cleanup

    lea rdx, [rsi + 4*rdi]   ; end pointer

    movdqa  xmm0, [rsi]      ; two accumulators
    movdqa  xmm1, [rsi + 16]  

    add   rsi, 32
    cmp   rsi, rdx
    jae .out        ; early exit if we shouldn't run the loop even once.  unsigned compare for addresses

.loop:
    pmaxsd  xmm0, [rsi]
    pmaxsd  xmm1, [rsi+16]
    add     rsi,  32
    cmp     rsi, rdx   ;; loop is 4 uops on Intel, since this cmp/branch macro-fuses
    jb   .loop

.out:
    ;; TODO: cleanup code to handle any non-multiple-of-8 iterations.

    pmaxsd  xmm0, xmm1
    movhlps xmm1, xmm0    ; xmm0 = { d, c, b, a}.  xmm1 = { d, c, d, c }
    pmaxsd  xmm0, xmm1    ; xmm0 = { d, c, max(d,b), max(c, a) }
                          ; if we were using AVX 3-operand instructions, we'd use PSRLDQ and another pmax because it's easy.

    ; do the final stage of horizontal MAX in integer registers, just for fun.
    ; pshufd/pmax to do the last level would be faster than this shld/cmp/cmov.

    movq    rax, xmm0     ; rax = { max(d,b), max(c,a) }
             ;  two-reg shift to unpack rax into edx:eax (with garbage in the high half of both)
    shld    rdx, rax, 32  ; rax = unchanged (eax=max(c,a)),  edx = max(d,b).
    cmp     edx, eax
    cmovg   eax, edx      ; eax = max( max(c,a), max(d,b) )
    ret

In theory, this runs at one iteration per clock on Intel SnB-family microarchitectures. 4 fused-domain uops per clock saturates the pipe, but unrolling more (and using more accumulators) just makes the cleanup code for a non-toy version of this a bigger headache.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...