c - Efficiently dividing unsigned value by a power of two, rounding up

Question

Welcome To Ask or Share your Answers For Others

c - Efficiently dividing unsigned value by a power of two, rounding up

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

c - Efficiently dividing unsigned value by a power of two, rounding up

I want to implement unsigned^a integer division by an arbitrary power of two, rounding up, efficiently. So what I want, mathematically, is ceiling(p/q)⁰. In C, the strawman implementation, which doesn't take advantage of the restricted domain of q could something like the following function¹:

/** q must be a power of 2, although this version works for any q */
uint64_t divide(uint64_t p, uint64_t q) {
  uint64_t res = p / q;
  return p % q == 0 ? res : res + 1;
}

... of course, I don't actually want to use division or mod at the machine level, since that takes many cycles even on modern hardware. I'm looking for a strength reduction that uses shifts and/or some other cheap operation(s) - taking advantage of the fact that q is a power of 2.

You can assume we have an efficient lg(unsigned int x) function, which returns the base-2 log of x, if x is a power-of-two.

Undefined behavior is fine if q is zero.

Please note that the simple solution: (p+q-1) >> lg(q) doesn't work in general - try it with p == 2^64-100 and q == 256² for example.

Platform Details

I'm interested in solutions in C, that are likely to perform well across a variety of platforms, but for the sake of concreteness, awarding the bounty and because any definitive discussion of performance needs to include a target architecture, I'll be specific about how I'll test them:

Skylake CPU
gcc 5.4.0 with compile flags -O3 -march=haswell

Using gcc builtins (such as bitscan/leading zero builtins) is fine, and in particular I've implemented the lg() function I said was available as follows:

inline uint64_t lg(uint64_t x) {
  return 63U - (uint64_t)__builtin_clzl(x);
}

inline uint32_t lg32(uint32_t x) {
  return 31U - (uint32_t)__builtin_clz(x);
}

I verified that these compile down to a single bsr instruction, at least with -march=haswell, despite the apparent involvement of a subtraction. You are of course free to ignore these and use whatever other builtins you want in your solution.

Benchmark

I wrote a benchmark for the existing answers, and will share and update the results as changes are made.

Writing a good benchmark for a small, potentially inlined operation is quite tough. When code is inlined into a call site, a lot of the work of the function may disappear, especially when it's in a loop³.

You could simply avoid the whole inlining problem by ensuring your code isn't inlined: declare it in another compilation unit. I tried to that with the bench binary, but really the results are fairly pointless. Nearly all implementations tied at 4 or 5 cycles per call, but even a dummy method that does nothing other than return 0 takes the same time. So you are mostly just measuring the call + ret overhead. Furthermore, you are almost never really going to use the functions like this - unless you messed up, they'll be available for inlining and that changes everything.

So the two benchmarks I'll focus the most on repeatedly call the method under test in a loop, allowing inlining, cross-function optmization, loop hoisting and even vectorization.

There are two overall benchmark types: latency and throughput. The key difference is that in the latency benchmark, each call to divide is dependent on the previous call, so in general calls cannot be easily overlapped⁴:

uint32_t bench_divide_latency(uint32_t p, uint32_t q) {
    uint32_t total = p;                         
    for (unsigned i=0; i < ITERS; i++) {                
      total += divide_algo(total, q);                       
      q = rotl1(q);                         
    }
    return total;
  }

Note that the running total depends so on the output of each divide call, and that it is also an input to the divide call.

The throughput variant, on the other hand, doesn't feed the output of one divide into the subsequent one. This allows work from one call to be overlapped with a subsequent one (both by the compiler, but especially the CPU), and even allows vectorization:

uint32_t bench_divide_throughput(uint32_t p, uint32_t q) { 
    uint32_t total = p;                         
    for (unsigned i=0; i < ITERS; i++) {                
      total += fname(i, q);                     
      q = rotl1(q);                     
    }                                   
    return total;                           
  }

Note that here we feed in the loop counter as the the dividend - this is variable, but it doesn't depend on the previous divide call.

Furthermore, each benchmark has three flavors of behavior for the divisor, q:

Compile-time constant divisor. For example, a call to divide(p, 8). This is common in practice, and the code can be much simpler when the divisor is known at compile time.
Invariant divisor. Here the divisor is not know at compile time, but is constant for the whole benchmarking loop. This allows a subset of the optimizations that the compile-time constant does.
Variable divisor. The divisor changes on each iteration of the loop. The benchmark functions above show this variant, using a "rotate left 1" instruction to vary the divisor.

Combining everything you get a total of 6 distinct benchmarks.

Results

Overall

For the purposes of picking an overall best algorithm, I looked at each of 12 subsets for the proposed algorithms: (latency, throughput) x (constant a, invariant q, variable q) x (32-bit, 64-bit) and assigned a score of 2, 1, or 0 per subtest as follows:

The best algorithm(s) (within 5% tolerance) receive a score of 2.
The "close enough" algorithms (no more than 50% slower than the best) receive a score of 1.
The remaining algorithms score zero.

Hence, the maximum total score is 24, but no algorithm achieved that. Here are the overall total results:

╔═══════════════════════╦═══════╗
║       Algorithm       ║ Score ║
╠═══════════════════════╬═══════╣
║ divide_user23_variant ║    20 ║
║ divide_chux           ║    20 ║
║ divide_user23         ║    15 ║
║ divide_peter          ║    14 ║
║ divide_chrisdodd      ║    12 ║
║ stoke32               ║    11 ║
║ divide_chris          ║     0 ║
║ divide_weather        ║     0 ║
╚═══════════════════════╩═══════╝

So the for the purposes of this specific test code, with this specific compiler and on this platform, user2357112 "variant" (with ... + (p & mask) != 0) performs best, tied with chux's suggestion (which is in fact identical code).

Here are all the sub-scores which sum to the above:

╔══════════════════════════╦═══════╦════╦════╦════╦════╦════╦════╗
║                          ║ Total ║ LC ║ LI ║ LV ║ TC ║ TI ║ TV ║
╠══════════════════════════╬═══════╬════╬════╬════╬════╬════╬════╣
║ divide_peter             ║     6 ║  1 ║  1 ║  1 ║  1 ║  1 ║  1 ║
║ stoke32                  ║     6 ║  1 ║  1 ║  2 ║  0 ║  0 ║  2 ║
║ divide_chux              ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
║ divide_user23            ║     8 ║  1 ║  1 ║  2 ║  2 ║  1 ║  1 ║
║ divide_user23_variant    ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
║ divide_chrisdodd         ║     6 ║  1 ║  1 ║  2 ║  0 ║  0 ║  2 ║
║ divide_chris             ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
║ divide_weather           ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
║                          ║       ║    ║    ║    ║    ║    ║    ║
║ 64-bit Algorithm         ║       ║    ║    ║    ║    ║    ║    ║
║ divide_peter_64          ║     8 ║  1 ║  1 ║  1 ║  2 ║  2 ║  1 ║
║ div_stoke_64             ║     5 ║  1 ║  1 ║  2 ║  0 ║  0 ║  1 ║
║ divide_chux_64           ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
║ divide_user23_64         ║     7 ║  1 ║  1 ║  2 ║  1 ║  1 ║  1 ║
║ divide_user23_variant_64 ║    10 ║  2 ║  2 ║  2 ║  1 ║  2 ║  1 ║
║ divide_chrisdodd_64      ║     6 ║  1 ║  1 ║  2 ║  0 ║  0 ║  2 ║
║ divide_chris_64          ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
║ divide_weather_64        ║     0 ║  0 ║  0 ║  0 ║  0 ║  0 ║  0 ║
╚══════════════════════════╩═══════╩════╩════╩════╩════╩════╩════╝

Here, each test is named like XY, with X in {Latency, Throughput} and Y in {Constant Q, Invariant Q, Variable Q}. So for example, LC is "Latency test with constant q".

Analysis

At the highest level, the solutions can be roughly divided into two categories: fast (the top 6 finishers) and slow (the bottom two). The difference is larger: all of the fast algorithms were the fastest on at least two subtests and in general when they didn't finish first they fell into the "close enough" category (they only exceptions being failed vectorizations in the case of stoke and chrisdodd). The slow algorithms however scored 0 (not even close) on every test. So you can mostly eliminate the slow algorithms from further consideration.

Auto-vectorization

Among the fast algorithms, a large differentiator was the ability to auto-vectorize.

None of the algorithms were able to auto-vectorize in the latency tests, which makes sense since the latency tests are designed to feed their result directly into the next iteration. So you can really only calculate results in a serial fashion.

For the throughput tests, however, many algorithms were able to auto-vectorize for the constant Q and invariant Q case. In both of these tests tests the divisor q is loop-invariant (and in the former case it is a compile-time constant). The dividend is the loop counter, so it is variable, but predicable (and in particular a vector of dividends can be trivially calculated by adding 8 to the previous input vector: [0, 1, 2, ..., 7] + [8, 8, ..., 8] == [8, 9, 10, ..., 15]).

In this scenario, gcc was able to vectorize peter, stoke, chux, user23 and user23_variant. It wasn't able to vectorize chrisdodd for some reason, likely because it included a branch (but conditionals don't strictly prevent vectorization since many other solutions have conditional elements but still vectorized). The impact was huge: algorithms that vectorized showed about an 8x improvement in throughput over variants that didn't but were otherwise fast.

Vectorization isn't free, though! Here are the function sizes for the "constant" variant of each function, with the Vec? column showing whether a function vectorized or not:

Size Vec? Name
 045    N bench_c_div_stoke_64
 049    N bench_c_divide_chrisdodd_64
 059    N bench_c_stoke32_64
 212    Y bench_c_divide_chux_64
 227    Y bench_c_divide_peter_64
 220    Y bench_c_divide_user23_64
 212    Y bench_c_divide_use

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:38:56+0000

This answer is about what's ideal in asm; what we'd like to convince the compiler to emit for us. (I'm not suggesting actually using inline asm, except as a point of comparison when benchmarking compiler output. https://gcc.gnu.org/wiki/DontUseInlineAsm).

I did manage to get pretty good asm output from pure C for ceil_div_andmask, see below. (It's worse than a CMOV on Broadwell/Skylake, but probably good on Haswell. Still, the user23/chux version looks even better for both cases.) It's mostly just worth mentioning as one of the few cases where I got the compiler to emit the asm I wanted.

It looks like Chris Dodd's general idea of return ((p-1) >> lg(q)) + 1 with special-case handling for d=0 is one of the best options. I.e. the optimal implementation of it in asm is hard to beat with an optimal implementation of anything else. Chux's (p >> lg(q)) + (bool)(p & (q-1)) also has advantages (like lower latency from p->result), and more CSE when the same q is used for multiple divisions. See below for a latency/throughput analysis of what gcc does with it.

If the same e = lg(q) is reused for multiple dividends, or the same dividend is reused for multiple divisors, different implementations can CSE more of the expression. They can also effectively vectorize with AVX2.

Branches are cheap and very efficient if they predict very well, so branching on d==0 will be best if it's almost never taken. If d==0 is not rare, branchless asm will perform better on average. Ideally we can write something in C that will let gcc make the right choice during profile-guided optimization, and compiles to good asm for either case.

Since the best branchless asm implementations don't add much latency vs. a branchy implementation, branchless is probably the way to go unless the branch would go the same way maybe 99% of the time. This might be likely for branching on p==0, but probably less likely for branching on p & (q-1).

It's hard to guide gcc5.4 into emitting anything optimal. This is my work-in-progress on Godbolt).

I think the optimal sequences for Skylake for this algorithm are as follows. (Shown as stand-alone functions for the AMD64 SysV ABI, but talking about throughput/latency on the assumption that the compiler will emit something similar inlined into a loop, with no RET attached).

Branch on carry from calculating d-1 to detect d==0, instead of a separate test & branch. Reduces the uop count nicely, esp on SnB-family where JC can macro-fuse with SUB.

ceil_div_pjc_branch:
 xor    eax,eax          ; can take this uop off the fast path by adding a separate xor-and-return block, but in reality we want to inline something like this.
 sub    rdi, 1
 jc    .d_was_zero       ; fuses with the sub on SnB-family
 tzcnt  rax, rsi         ; tzcnt rsi,rsi also avoids any false-dep problems, but this illustrates that the q input can be read-only.
 shrx   rax, rdi, rax
 inc    rax
.d_was_zero:
 ret

Fused-domain uops: 5 (not counting ret), and one of them is an xor-zero (no execution unit)
HSW/SKL latency with successful branch prediction:
- (d==0): No data dependency on d or q, breaks the dep chain. (control dependency on d to detect mispredicts and retire the branch).
- (d!=0): q->result: tzcnt+shrx+inc = 5c
- (d!=0): d->result: sub+shrx+inc = 3c
Throughput: probably just bottlenecked on uop throughput

I've tried but failed to get gcc to branch on CF from the subtract, but it always wants to do a separate comparison. I know gcc can be coaxed into branching on CF after subtracting two variables, but maybe this fails if one is a compile-time constant. (IIRC, this typically compiles to a CF test with unsigned vars: foo -= bar; if(foo>bar) carry_detected = 1;)

Branchless with ADC / SBB to handle the d==0 case. Zero-handling adds only one instruction to the critical path (vs. a version with no special handling for d==0), but also converts one other from an INC to a sbb rax, -1 to make CF undo the -= -1. Using a CMOV is cheaper on pre-Broadwell, but takes extra instructions to set it up.

ceil_div_pjc_asm_adc:
 tzcnt  rsi, rsi
 sub    rdi, 1
 adc    rdi, 0          ; d? d-1 : d.  Sets CF=CF
 shrx   rax, rdi, rsi
 sbb    rax, -1         ; result++ if d was non-zero
 ret

Fused-domain uops: 5 (not counting ret) on SKL. 7 on HSW
SKL latency:
- q->result: tzcnt+shrx+sbb = 5c
- d->result: sub+adc+shrx(dep on q begins here)+sbb = 4c
Throughput: TZCNT runs on p1. SBB, ADC, and SHRX only run on p06. So I think we bottleneck on 3 uops for p06 per iteration, making this run at best one iteration per 1.5c.

If q and d become ready at the same time, note that this version can run SUB/ADC in parallel with the 3c latency of TZCNT. If both are coming from the same cache-miss cache line, it's certainly possible. In any case, introducing the dep on q as late as possible in the d->result dependency chain is an advantage.

Getting this from C seems unlikely with gcc5.4. There is an intrinsic for add-with-carry, but gcc makes a total mess of it. It doesn't use immediate opera

Categories

c - Efficiently dividing unsigned value by a power of two, rounding up

c - Efficiently dividing unsigned value by a power of two, rounding up

Platform Details

Benchmark

Results

Overall

Analysis

Auto-vectorization

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags