c - Why is uint_least16_t faster than uint_fast16_t for multiplication in x86_64?

The C standard is quite unclear about the uint_fast*_t family of types. On a gcc-4.4.4 Linux x86_64 system, the types uint_fast16_t and uint_fast32_t are both 8 bytes in size. However, multiplication of 8-byte numbers seems to be noticeably slower than multiplication of 4-byte numbers. The following piece of code demonstrates that:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main ()
{
  uint_least16_t p, x;
  int count;

  p = 1;
  for (count = 100000; count != 0; --count)
    for (x = 1; x != 50000; ++x)
      p*= x;

  printf("%"PRIuLEAST16, p);
  return 0;
}

Running the time command on the program, I get

real 0m7.606s
user 0m7.557s
sys  0m0.019s

If I change the type to uint_fast16_t (and the printf modifier), the timing becomes

real 0m12.609s
user 0m12.593s
sys  0m0.009s

So, would it not be much better if the stdint.h header defined uint_fast16_t (and also uint_fast32_t) to be a 4-byte type?
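
For reference, the size claim above can be verified with a small check along these lines (just sizeof applied to the types from <stdint.h>):

#include <stdio.h>
#include <stdint.h>

/* Print the widths the implementation actually chose for the
   "least" and "fast" 16/32-bit types on this system. */
int main (void)
{
  printf("uint_least16_t: %zu bytes\n", sizeof(uint_least16_t));
  printf("uint_fast16_t:  %zu bytes\n", sizeof(uint_fast16_t));
  printf("uint_least32_t: %zu bytes\n", sizeof(uint_least32_t));
  printf("uint_fast32_t:  %zu bytes\n", sizeof(uint_fast32_t));
  return 0;
}

On the x86_64 system described above, this should report 8 bytes for both of the fast types.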


1 Reply


AFAIK, compilers only define their own versions of the (u)int_(fast/least)XX_t types if these are not already defined by the system. That is because it is very important that these types are defined identically across all libraries and binaries on a single system. Otherwise, if different compilers defined those types differently, a library built with CompilerA might have a different uint_fast32_t type than a binary built with CompilerB, yet the binary could still be linked against the library; there is no formal standard requirement that all executable code on a system be built by the same compiler (in fact, on some systems, e.g. Windows, it is rather common for code to have been compiled by all kinds of different compilers). If that binary then calls a function of the library, things will break!
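
To illustrate the kind of breakage I mean (a hypothetical library interface, not any real API), consider a header that passes uint_fast32_t across the library boundary:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical shared library header.  The prototype is textually
   identical on both sides, but its meaning depends on how each
   compiler defined uint_fast32_t. */
void fill_ids (uint_fast32_t *dst, size_t n);

/* If the library was built where uint_fast32_t is 8 bytes, fill_ids()
   writes n 8-byte elements.  A caller built where uint_fast32_t is
   only 4 bytes allocates n 4-byte elements for the very same
   prototype, so the call links and runs -- and fill_ids() overruns
   the caller's buffer. */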

So the question is: is it really GCC defining uint_fast16_t here, or is it actually Linux (I mean the kernel) or perhaps even the standard C library (glibc in most cases) that defines those types? If Linux or glibc defines them, GCC built for that system has no choice but to adopt whatever conventions they have established. The same is true for all other variable-width types: char, short, int, long, long long; all of these have only a minimum guaranteed bit width in the C standard (for int it is actually 16 bits, so on platforms where int is 32 bits, it is already much wider than the standard requires).
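
For what it is worth, on a glibc-based system it is indeed glibc's <stdint.h> that makes this choice; the unsigned fast types are picked by word size, roughly along these lines (paraphrased, not the exact header text):

/* Paraphrased sketch of glibc's <stdint.h>: with a 64-bit word size,
   the 16- and 32-bit "fast" types widen to unsigned long (8 bytes). */
#if __WORDSIZE == 64
typedef unsigned long int  uint_fast16_t;
typedef unsigned long int  uint_fast32_t;
#else
typedef unsigned int       uint_fast16_t;
typedef unsigned int       uint_fast32_t;
#endif

That matches the 8-byte sizes observed in the question on x86_64.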


Other than that, I actually wonder what is wrong with your CPU/compiler/system. On my system, 64-bit multiplication is just as fast as 32-bit multiplication. I modified your code to test 16, 32, and 64 bits:

#include <time.h>
#include <stdio.h>
#include <inttypes.h>

#define RUNS 100000

#define TEST(type)                                  \
    static type test ## type ()                     \
    {                                               \
        int count;                                  \
        type p, x;                                  \
                                                    \
        p = 1;                                      \
        for (count = RUNS; count != 0; count--) {   \
            for (x = 1; x != 50000; x++) {          \
                p *= x;                             \
            }                                       \
        }                                           \
        return p;                                   \
    }

TEST(uint16_t)
TEST(uint32_t)
TEST(uint64_t)

#define CLOCK_TO_SEC(clockTime) ((double)(clockTime) / CLOCKS_PER_SEC)

#define RUN_TEST(type)                             \
    {                                              \
        clock_t clockTime;                         \
        unsigned long long result;                 \
                                                   \
        clockTime = clock();                       \
        result = test ## type ();                  \
        clockTime = clock() - clockTime;           \
        printf("Test %s took %2.4f s. (%llu)\n",   \
            #type, CLOCK_TO_SEC(clockTime), result \
        );                                         \
    }

int main ()
{
    RUN_TEST(uint16_t)
    RUN_TEST(uint32_t)
    RUN_TEST(uint64_t)
    return 0;
}

Using unoptimized code (-O0), I get:

Test uint16_t took 13.6286 s. (0)
Test uint32_t took 12.5881 s. (0)
Test uint64_t took 12.6006 s. (0)

Using optimized code (-O3), I get:

Test uint16_t took 13.6385 s. (0)
Test uint32_t took 4.5455 s. (0)
Test uint64_t took 4.5382 s. (0)

The second output is quite interesting. @R.. wrote in a comment above:

On x86_64, 32-bit arithmetic should never be slower than 64-bit arithmetic, period.

The second output shows that the same cannot be said for 32- vs. 16-bit arithmetic: 16-bit arithmetic can be significantly slower on a 32/64-bit CPU, even though my x86 CPU can natively perform 16-bit arithmetic (unlike some other CPUs, such as a PPC, which can only perform 32-bit arithmetic). However, this only seems to apply to multiplication on my CPU; when I change the code to do addition/subtraction/division instead, there is no longer a significant difference between 16 and 32 bits.
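
For completeness, the change I mean is nothing more than swapping the operator in the test body; a sketch that drops into the benchmark above (the TEST_ADD name is mine, just so both variants can coexist; RUNS is the constant defined earlier):

/* Identical to TEST(type) above except for the inner operation. */
#define TEST_ADD(type)                              \
    static type testadd_ ## type ()                 \
    {                                               \
        int count;                                  \
        type p, x;                                  \
                                                    \
        p = 1;                                      \
        for (count = RUNS; count != 0; count--) {   \
            for (x = 1; x != 50000; x++) {          \
                p += x;          /* was: p *= x; */ \
            }                                       \
        }                                           \
        return p;                                   \
    }

TEST_ADD(uint16_t)
TEST_ADD(uint32_t)

Timing testadd_uint16_t() and testadd_uint32_t() the same way as above (RUN_TEST would need a testadd_-prefixed counterpart), the 16-bit and 32-bit versions come out about the same on my machine.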

The results above are from an Intel Core i7 (2.66 GHz), but if anyone is interested, I can also run this benchmark on an Intel Core 2 Duo (one CPU generation older) and on a Motorola PowerPC G4.

