Compiler: MinGW/GCC
Issues: No GPL/LGPL code allowed (GMP or any bignum library for that matter, is overkill for this problem, as I already have the class implemented).
I have constructed my own 128-bit fixed-size big integer class (intended for use in a game engine but may be generalized to any usage case) and I find the performance of the current multiply and divide operations to be quite abysmal (yes, I have timed them, see below), and I'd like to improve on (or change) the algorithms that do the low-level number crunching.
When it comes to the multiply and divide operators, they are unbearably slow compared to just about everything else in the class.
These are the approximate measurements relative to my own computer:
Raw times as defined by QueryPerformanceFrequency:
1/60sec 31080833u
Addition: ~8u
Subtraction: ~8u
Multiplication: ~546u
Division: ~4760u (with maximum bit count)
As you can see, just doing the multiplication is many, many times slower than add or subtract. Division is about 10 times slower than multiplication.
I'd like to improve the speed of these two operators because there may be a very high amount of computations being done per frame (dot products, various collision detection methods, etc).
The structure (methods omitted) looks somewhat like:
class uint128_t
{
public:
unsigned long int dw3, dw2, dw1, dw0;
//...
}
Multiplication is currently done using the typical long-multiplication method (in assembly so that I can catch the EDX
output) while ignoring the words that go out of range (that is, I'm only doing 10 mull
's compared to 16).
Division uses the shift-subtract algorithm (speed depends on bit counts of the operands). However, it is not done in assembly. I found that a little too difficult to muster and decided to let the compiler optimize it.
I have Google'd around for several days looking at pages describing algorithms such as Karatsuba Multiplication, high-radix division, and Newton-Rapson Division but the math symbols are a little too far over my head. I'd like to use some of these advanced methods to speed up my code, but I'd have to translate the "Greek" into something comprehensible first.
For those that may deem my efforts "premature optimization"; I consider this code to be a bottleneck because the very elementary-math operations themselves become slow. I can ignore such types of optimization on higher-level code, but this code will be called/used enough for it to matter.
I'd like suggestions on which algorithm I should use to improve multiply and divide (if possible), and a basic (hopefully easy to understand) explanation on how the suggested algorithm works would be highly appreciated.
EDIT: Multiply Improvements
I was able to improve the multiply operation by inlining code into operator*= and it seems as fast as possible.
Updated raw times:
1/60sec 31080833u
Addition: ~8u
Subtraction: ~8u
Multiplication: ~100u (lowest ~86u, highest around ~256u)
Division: ~4760u (with maximum bit count)
Here's some bare-bones code for you to examine (note that my type names are actually different, this was edited for simplicity):
//File: "int128_t.h"
class int128_t
{
uint32_t dw3, dw2, dw1, dw0;
// Various constrctors, operators, etc...
int128_t& operator*=(const int128_t& rhs) __attribute__((always_inline))
{
int128_t Urhs(rhs);
uint32_t lhs_xor_mask = (int32_t(dw3) >> 31);
uint32_t rhs_xor_mask = (int32_t(Urhs.dw3) >> 31);
uint32_t result_xor_mask = (lhs_xor_mask ^ rhs_xor_mask);
dw0 ^= lhs_xor_mask;
dw1 ^= lhs_xor_mask;
dw2 ^= lhs_xor_mask;
dw3 ^= lhs_xor_mask;
Urhs.dw0 ^= rhs_xor_mask;
Urhs.dw1 ^= rhs_xor_mask;
Urhs.dw2 ^= rhs_xor_mask;
Urhs.dw3 ^= rhs_xor_mask;
*this += (lhs_xor_mask & 1);
Urhs += (rhs_xor_mask & 1);
struct mul128_t
{
int128_t dqw1, dqw0;
mul128_t(const int128_t& dqw1, const int128_t& dqw0): dqw1(dqw1), dqw0(dqw0){}
};
mul128_t data(Urhs,*this);
asm volatile(
"push %%ebp
movl %%eax, %%ebp
movl $0x00, %%ebx
movl $0x00, %%ecx
movl $0x00, %%esi
movl $0x00, %%edi
movl 28(%%ebp), %%eax #Calc: (dw0*dw0)
mull 12(%%ebp)
addl %%eax, %%ebx
adcl %%edx, %%ecx
adcl $0x00, %%esi
adcl $0x00, %%edi
movl 24(%%ebp), %%eax #Calc: (dw1*dw0)
mull 12(%%ebp)
addl %%eax, %%ecx
adcl %%edx, %%esi
adcl $0x00, %%edi
movl 20(%%ebp), %%eax #Calc: (dw2*dw0)
mull 12(%%ebp)
addl %%eax, %%esi
adcl %%edx, %%edi
movl 16(%%ebp), %%eax #Calc: (dw3*dw0)
mull 12(%%ebp)
addl %%eax, %%edi
movl 28(%%ebp), %%eax #Calc: (dw0*dw1)
mull 8(%%ebp)
addl %%eax, %%ecx
adcl %%edx, %%esi
adcl $0x00, %%edi
movl 24(%%ebp), %%eax #Calc: (dw1*dw1)
mull 8(%%ebp)
addl %%eax, %%esi
adcl %%edx, %%edi
movl 20(%%ebp), %%eax #Calc: (dw2*dw1)
mull 8(%%ebp)
addl %%eax, %%edi
movl 28(%%ebp), %%eax #Calc: (dw0*dw2)
mull 4(%%ebp)
addl %%eax, %%esi
adcl %%edx, %%edi
movl 24(%%ebp), %%eax #Calc: (dw1*dw2)
mull 4(%%ebp)
addl %%eax, %%edi
movl 28(%%ebp), %%eax #Calc: (dw0*dw3)
mull (%%ebp)
addl %%eax, %%edi
pop %%ebp
"
:"=b"(this->dw0),"=c"(this->dw1),"=S"(this->dw2),"=D"(this->dw3)
:"a"(&data):"%ebp");
dw0 ^= result_xor_mask;
dw1 ^= result_xor_mask;
dw2 ^= result_xor_mask;
dw3 ^= result_xor_mask;
return (*this += (result_xor_mask & 1));
}
};
As for division, examining the code is rather pointless, as I will need to change the mathematical algorithm to see any substantial benefits. The only feasible choice seems to be high-radix division, but I have yet to iron out (in my mind) just how it will work.
See Question&Answers more detail:
os