Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
196 views
in Technique[技术] by (71.8m points)

c++ - Is there some meaningful statistical data to justify keeping signed integer arithmetic overflow undefined?

The C Standard explicitly specifies signed integer overflow as having undefined behavior. Yet most CPUs implement signed arithmetics with defined semantics for overflow (except maybe for division overflow: x / 0 and INT_MIN / -1).

Compilers writers have been taking advantage of the undefinedness of such overflows to add more aggressive optimisations that tend to break legacy code in very subtle ways. For example this code may have worked on older compilers but does not anymore on current versions of gcc and clang:

/* Tncrement a by a value in 0..255, clamp a to positive integers.
   The code relies on 32-bit wrap-around, but the C Standard makes
   signed integer overflow undefined behavior, so sum_max can now 
   return values less than a. There are Standard compliant ways to
   implement this, but legacy code is what it is... */
int sum_max(int a, unsigned char b) {
    int res = a + b;
    return (res >= a) ? res : INT_MAX;
}

Is there hard evidence that these optimisations are worthwhile? Are there comparative studies documenting the actual improvements on real life examples or even on classical benchmarks?

I came up with this question as I was watching this: C++Now 2018: John Regehr “Closing Keynote: Undefined Behavior and Compiler Optimizations”

I am tagging c and c++ as the problem is similar in both languages but the answers might be different.

question from:https://stackoverflow.com/questions/56047702/is-there-some-meaningful-statistical-data-to-justify-keeping-signed-integer-arit

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I don't know about studies and statistics, but yes, there are definitely optimizations taking this into account that compilers actually do. And yes, they are very important (tldr loop vectorization for example).

Besides the compiler optimizations, there is another aspect to be taken into account. With UB you get C/C++ signed integers to behave arithmetically as you would expect mathematically. For instance x + 10 > x holds true now (for valid code of course), but would not on a wrap-around behavior.

I've found an excellent article How undefined signed overflow enables optimizations in GCC from Krister Walfridsson’s blog listing some optimizations that take signed overflow UB into account. The following examples are from it. I am adding c++ and assembly examples to them.

If the optimizations look too simple, uninteresting or unimpactful, remember that these optimization are just steps in a much much larger chain of optimizations. And the butterfly effect does happen as a seemingly unimportant optimization at an earlier step can trigger a much more impactful optimization at a later step.

If the examples look nonsensical (who would write x * 10 > 0) keep in mind that you can very easily get to this kind of examples in C and C++ with constants, macros, templates. Besides the compiler can get to this kind of examples when applying transformations and optimizations in its IR.

Signed integer expression simplification

  • Eliminate multiplication in comparison with 0

    (x * c) cmp 0   ->   x cmp 0 
    
    bool foo(int x) { return x * 10 > 0 }
    
    foo(int):
            test    edi, edi
            setg    al
            ret
    
  • Eliminate division after multiplication

    (x * c1) / c2 -> x * (c1 / c2) if c1 is divisible by c2

    int foo(int x) { return (x * 20) / 10; }
    
    foo(int):
            lea     eax, [rdi+rdi]
            ret
    
  • Eliminate negation

    (-x) / (-y) -> x / y

    int foo(int x, int y) { return (-x) / (-y); }
    
    foo(int, int):
            mov     eax, edi
            cdq
            idiv    esi
            ret
    
  • Simplify comparisons that are always true or false

    x + c < x       ->   false
    x + c <= x      ->   false
    x + c > x       ->   true
    x + c >= x      ->   true
    
    bool foo(int x) { return x + 10 >= x; }
    
    foo(int):
            mov     eax, 1
            ret
    
  • Eliminate negation in comparisons

    (-x) cmp (-y)   ->   y cmp x
    
    bool foo(int x, int y) { return -x < -y; }
    
    foo(int, int):
            cmp     edi, esi
            setg    al
            ret
    
  • Reduce magnitude of constants

    x + c > y       ->   x + (c - 1) >= y
    x + c <= y      ->   x + (c - 1) < y
    
    bool foo(int x, int y) { return x + 10 <= y; }
    
    foo(int, int):
            add     edi, 9
            cmp     edi, esi
            setl    al
            ret
    
  • Eliminate constants in comparisons

    (x + c1) cmp c2         ->   x cmp (c2 - c1)
    (x + c1) cmp (y + c2)   ->   x cmp (y + (c2 - c1)) if c1 <= c2
    

    The second transformation is only valid if c1 <= c2, as it would otherwise introduce an overflow when y has the value INT_MIN.

    bool foo(int x) { return x + 42 <= 11; }
    
    foo(int):
            cmp     edi, -30
            setl    al
            ret
    

Pointer arithmetic and type promotion

If an operation does not overflow, then we will get the same result if we do the operation in a wider type. This is often useful when doing things like array indexing on 64-bit architectures — the index calculations are typically done using 32-bit int, but the pointers are 64-bit, and the compiler may generate more efficient code when signed overflow is undefined by promoting the 32-bit integers to 64-bit operations instead of generating type extensions.

One other aspect of this is that undefined overflow ensures that a[i] and a[i+1] are adjacent. This improves analysis of memory accesses for vectorization etc.

This is a very important optimization as loop vectorization one of the most efficient and effective optimization algorithms.

This is an example when changing an index from an unsigned index to a signed improves the generated assembly:

Unsigned version

#include <cstddef>

auto foo(int* v, std::size_t start)
{
    int sum = 0;

    for (std::size_t i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}

With unsigned the case where start + 4 wraps around must be taken into account and a branch is generated to deal with this case (branches are bad for performance):

; gcc on x64 with -march=skylake

foo1(int*, unsigned long):
        cmp     rsi, -5
        ja      .L3
        vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
        vpsrldq xmm1, xmm0, 8
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret
.L3:
        xor     eax, eax
        ret
; clang on x64 with -march=skylake

foo1(int*, unsigned long):                             # @foo1(int*, unsigned long)
        xor     eax, eax
        cmp     rsi, -4
        jae     .LBB0_2
        vpbroadcastq    xmm0, qword ptr [rdi + 4*rsi + 8]
        vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
.LBB0_2:
        ret

As a side note, using a narrower type would result in even worst assembly, inhibiting the use of SSE vectorized instructions:

#include <cstddef>

auto foo(int* v, unsigned start)
{
    int sum = 0;

    for (unsigned i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}
; gcc on x64 with -march=skylake

foo(int*, unsigned int):
        cmp     esi, -5
        ja      .L3
        mov     eax, esi
        mov     eax, DWORD PTR [rdi+rax*4]
        lea     edx, [rsi+1]
        add     eax, DWORD PTR [rdi+rdx*4]
        lea     edx, [rsi+2]
        add     eax, DWORD PTR [rdi+rdx*4]
        lea     edx, [rsi+3]
        add     eax, DWORD PTR [rdi+rdx*4]
        ret
.L3:
        xor     eax, eax
        ret
; clang on x64 with -march=skylake

foo(int*, unsigned int):                              # @foo(int*, unsigned int)
        xor     eax, eax
        cmp     esi, -5
        ja      .LBB0_3
        mov     ecx, esi
        add     esi, 4
        mov     eax, dword ptr [rdi + 4*rcx]
        lea     rdx, [rcx + 1]
        cmp     rdx, rsi
        jae     .LBB0_3
        add     eax, dword ptr [rdi + 4*rcx + 4]
        add     eax, dword ptr [rdi + 4*rcx + 8]
        add     eax, dword ptr [rdi + 4*rcx + 12]
.LBB0_3:
        ret

Signed version

Using a signed index however results in nice vectorized branchless code:

#include <cstddef>

auto foo(int* v, std::ptrdiff_t start)
{
    int sum = 0;

    for (std::ptrdiff_t i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}
; gcc on x64 with -march=skylake

foo(int*, long):
        vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
        vpsrldq xmm1, xmm0, 8
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret
; clang on x64 with -march=skylake

foo(int*, long):                              # @foo(int*, long)
        vpbroadcastq    xmm0, qword ptr [rdi + 4*rsi + 8]
        vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret

Vectorized instruction are still used when using a narrower signed type:

#include <cstddef>

auto foo(int* v, int start)
{
    int sum = 0;

    for (int i = start; i < start + 4; ++i)
        sum += v[i];

    return sum;
}
; gcc on x64 with -march=skylake

foo(int*, int):
        movsx   rsi, esi
        vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
        vpsrldq xmm1, xmm0, 8
        vpaddd  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret
; clang on x64 with -march=skylake

foo(int*, int):                              # @foo(int*, int)
        movsxd  rax, esi
        vpbroadcastq    xmm0, qword ptr [rdi + 4*rax + 8]
        vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rax]
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        ret

Value range calculations

The compiler keeps track of the variables' range of possible values at each point in the program, i.e. for code such as

int x = foo();
if (x > 0) {
  int y = x + 5;
  int z = y / 4;

it determines that x has the range [1, INT_MAX] after the if-statement, and can thus determine that y has the range [6, INT_MAX] as overflow is not allowed. And the next line can be optimized to int z = y >> 2; as the compiler knows that y is non-negative.

auto foo(int x)
{
    if (x <= 0)
        __builtin_unreachable();
    
    return (x + 5) / 4;
}
foo(int):
        lea     eax, [rdi+5]
        sar     eax, 2
        ret

The undefined overflow helps optimizations that need to compare two values (as the wrapping case would give possible values of the form [INT_MIN, (INT_MIN+4)] or [6, INT_MAX] that prevents all useful comparisons with < or >), such as

  • Changing comparisons x<y to true or false if the ranges for x and y does not overlap
  • Changing min(x,y) or max(x,y) to x or y if the ranges do not overlap
  • Changing abs(x) to x or -x if the range does not cross 0
  • Changing

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...