c# - Fastest implementation of log2(int) and log2(float)

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

The question is

Are there any other (and/or faster) implementations of a basic log2?

Applications

The log2(int) and log2(float) operations are very useful in a lot of different contexts. To name a few: compression algorithms, 3D engines and machine learning. In almost all of these contexts they are used in low-level code that is called billions of times. The log2(int) operation is especially useful.

Because I find myself using log2 all the time, I don't want to single out one specific application here. What they all have in common is that log2 is a real performance drain (as shown by performance tests of various applications). For me it's key to get this as fast as possible.

The complete source code to test all implementations is added at the bottom, so you can see for yourself.

And of course: run your tests at least 3 times and make sure the counters are big enough that each test runs for multiple seconds. I also add every result to a running total to ensure the whole loop isn't removed by the JIT compiler. So let's get started with the real work.
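The measuring advice above can be sketched as a small helper. This is only an illustration: BenchSketch and BestOf are hypothetical names, not part of the test code in this post. Calling through a delegate also adds overhead of its own, so for the real measurements the loops are written out inline, as in the complete test code.

```csharp
using System;
using System.Diagnostics;

public static class BenchSketch
{
    // Hypothetical helper: times f over i = 1 .. count-1 and returns the best
    // (smallest) wall-clock time in seconds across `runs` repetitions.
    public static double BestOf(Func<int, int> f, int count, int runs, out long checksum)
    {
        double best = double.MaxValue;
        checksum = 0;
        for (int run = 0; run < runs; ++run)
        {
            long sum = 0;
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 1; i < count; ++i)
            {
                sum += f(i); // consume the result so the loop cannot be dropped
            }
            sw.Stop();
            checksum = sum; // returned to the caller so the work stays observable
            best = Math.Min(best, sw.Elapsed.TotalSeconds);
        }
        return best;
    }
}
```

A call such as BenchSketch.BestOf(SomeLog2, 100000000, 3, out sum), where SomeLog2 is any int-to-int implementation, repeats the measurement three times and keeps the fastest run.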

Trivial implementation

The trivial implementation of log2 in C# is:

(int)(Math.Log(x) / Math.Log(2))

This implementation is trivial, but also very slow. It requires 2 Log operations, which are themselves quite slow already. Of course, we can optimize this by making 1.0/Math.Log(2) a constant.

Note that, because of floating-point errors, we need to either adjust this constant slightly or add a small number to the result to get correct values. I chose the latter, but it doesn't really matter: the end result is slow in all cases.
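As a minimal sketch of the constant-plus-small-number variant: the class name Log2Constant is made up for illustration, and the epsilon is the same 0.00000000001 used in TestConstant in the complete test code.

```csharp
using System;

public static class Log2Constant
{
    // 1 / ln(2), computed once instead of on every call.
    private static readonly double InvLn2 = 1.0 / Math.Log(2.0);

    // floor(log2(x)) for x > 0. The small epsilon guards against a product that
    // lands just below an exact integer and would otherwise truncate one too low.
    public static int Log2(int x)
    {
        return (int)(1e-11 + Math.Log(x) * InvLn2);
    }
}
```

The epsilon is small enough that it never pushes a value such as log2(1023) up to the next integer, yet far larger than the rounding error of a double in this range.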

Table lookup

A faster solution is to use a lookup table. While the table can have any power-of-2 size, I usually use a table of 256 or 64K entries.

First we create the lookup table:

lookup = new int[256];
for (int i = 1; i < 256; ++i)
{
    lookup[i] = (int)(Math.Log(i) / Math.Log(2));
}

Next, we implement log2 as follows:

private static int LogLookup(int i)
{
    if (i >= 0x1000000) { return lookup[i >> 24] + 24; }
    else if (i >= 0x10000) { return lookup[i >> 16] + 16; }
    else if (i >= 0x100) { return lookup[i >> 8] + 8; }
    else { return lookup[i]; }
}

As you can see, table lookups are a much, much faster implementation. The downside is that this approach cannot be used to calculate log2(float).
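For the 64K-entry table size, a sketch could look like the following. The names Log2Table16, BuildTable and Table are hypothetical, and the table is filled with integer arithmetic instead of Math.Log so that no floating-point rounding is involved; that is a choice made for this sketch, not what the code above does.

```csharp
public static class Log2Table16
{
    // 65536 entries, one byte each: Table[i] = floor(log2(i)) for i >= 1.
    private static readonly byte[] Table = BuildTable();

    private static byte[] BuildTable()
    {
        byte[] t = new byte[1 << 16];
        // t[0] and t[1] stay 0 (log2(0) is undefined, log2(1) = 0).
        // Every other entry is one more than the entry for i / 2.
        for (int i = 2; i < t.Length; ++i)
        {
            t[i] = (byte)(t[i >> 1] + 1);
        }
        return t;
    }

    // floor(log2(i)) for i > 0, with a single comparison instead of three.
    public static int Log2(int i)
    {
        return i >= 0x10000 ? Table[i >> 16] + 16 : Table[i];
    }
}
```

The larger table trades memory (and cache pressure) for fewer comparisons, so whether it beats the 256-entry version has to be measured for the workload at hand.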

Branch removal

As we all know, processors aren't very good at branching, so I figured that table lookups could be improved by removing the branches. Instead of the chain of ifs, I introduced a second table holding the shift amounts and use bit shifts to find the right entry in it:

nobranch = new int[16] { 0, 0, 8, 8, 16, 16, 16, 16, 24, 24, 24, 24, 24, 24, 24, 24 };

private static int LogDoubleLookup(int i)
{
    int n = (i | (i >> 4));
    n = (n | (n >> 2));
    n = (n | (n >> 1));
    n = ((n & 0x1000000) >> 21) | ((n & 0x10000) >> 14) | ((n & 0x100) >> 7) | (n & 1);
    int br = nobranch[n];
    return lookup[i >> br] + br;
}

If you run this test, you will find that it is actually slower than the if-then-else solution.

And then there was the Intel 80386

Intel understood years ago that this is an important operation, so they implemented Bit Scan Reverse (BSR), which returns the index of the highest set bit, in their processors (its sibling Bit Scan Forward, BSF, finds the lowest set bit instead). Other processors have similar instructions. This is by far the fastest way to do a log2 that I know of, but unfortunately I know of no way to use these instructions from C#. I don't like the idea of having an implementation that stops working when a new tablet or phone hits the market, and I don't know of any cross-platform solution that lets me use this instruction directly.
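On runtimes where System.Numerics.BitOperations is available (.NET Core 3.0 and later; treat that availability as an assumption about your target platform), BitOperations.Log2 offers a portable way to get the same result, and the runtime can map it to a hardware instruction where one exists. A minimal sketch, with Log2Intrinsic as a made-up wrapper name:

```csharp
using System.Numerics;

public static class Log2Intrinsic
{
    // floor(log2(x)) for x > 0. By convention BitOperations.Log2 returns 0
    // for an input of 0, because log2(0) is undefined.
    public static int Log2(int x)
    {
        return BitOperations.Log2((uint)x);
    }
}
```

Whether this beats the table lookups on a given machine is something to measure with the same test loop as the other implementations.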

Other implementations

As l4V pointed out (thanks!) there are a couple of other implementations, specifically:

Trivial loop. I omitted this at first because it's trivial and not really fast. Implemented in TestTrivialLoop.
64-bit IEEE double / int union trick. Implemented in TestFloat.
DeBruijn lookup tables. Implemented in TestDeBruijn.
Binary search. Implemented in TestBinary.

Apart from the fact that I like the name, the DeBruijn lookup tables are just as fast as the normal lookup tables, making them one of the fastest algorithms here. All the other algorithms I've tried are much slower.
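The IEEE idea from the list above also covers the log2(float) half of the question, at least for the integer part: the exponent field of a float already stores floor(log2) for positive, normal values. A minimal sketch, assuming BitConverter.SingleToInt32Bits is available (.NET Core 2.0 and later); Log2FloatSketch and FloorLog2 are hypothetical names, and zero, negative, subnormal, infinite and NaN inputs are not handled.

```csharp
using System;

public static class Log2FloatSketch
{
    // floor(log2(f)) for positive, normal, finite floats: read the 8-bit biased
    // exponent out of the IEEE 754 single-precision bit pattern and remove the bias.
    public static int FloorLog2(float f)
    {
        int bits = BitConverter.SingleToInt32Bits(f);
        return ((bits >> 23) & 0xFF) - 127;
    }
}
```

For example, FloorLog2(10.0f) yields 3 and FloorLog2(0.5f) yields -1. A fractional result would additionally need the mantissa bits, which this sketch ignores.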

Complete test code

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

public class Log2Test
{
    public static void TestNaive()
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        int n = 0;
        for (int i = 1; i < 100000000; ++i)
        {
            n += (int)(Math.Log(i) / Math.Log(2.0));
        }
        sw.Stop();
        Console.WriteLine("Result: {0} - naive implementation took {1:0.000}s", n, sw.Elapsed.TotalSeconds);
    }

    public static int LogTrivialLoop(int v)
    {
        int r = 0;
        while ((v >>= 1) > 0) // unroll for more speed...
        {
            r++;
        }
        return r;
    }

    public static void TestTrivialLoop()
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        int n = 0;
        for (int i = 1; i < 100000000; ++i)
        {
            n += LogTrivialLoop(i);
        }
        sw.Stop();
        Console.WriteLine("Result: {0} - loop implementation took {1:0.000}s", n, sw.Elapsed.TotalSeconds);
    }

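    public static int LogFloat(int v)
    {
        // Overlay two ints on a double (assumes little-endian layout). With the high
        // word set to 0x43300000, the double equals 2^52 + v for non-negative v.
        Helper h = new Helper() { U1 = v, U2 = 0x43300000 };
        // Subtracting 2^52 leaves v as a normalized double.
        h.D -= 4503599627370496.0;
        // The biased exponent sits in bits 20..30 of the high word; remove the bias (1023).
        return (h.U2 >> 20) - 0x3FF;
    }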

    public static void TestFloat()
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        int n = 0;
        for (int i = 1; i < 100000000; ++i)
        {
            n += LogFloat(i);
        }
        sw.Stop();
        Console.WriteLine("Result: {0} - IEEE float implementation took {1:0.000}s", n, sw.Elapsed.TotalSeconds);
    }

    [StructLayout(LayoutKind.Explicit)]
    private struct Helper
    {
        [FieldOffset(0)]
        public int U1;
        [FieldOffset(4)]
        public int U2;
        [FieldOffset(0)]
        public double D;
    }

    public static void TestConstant()
    {
        double c = 1.0 / Math.Log(2.0);
        Stopwatch sw = new Stopwatch();
        sw.Start();
        int n = 0;
        for (int i = 1; i < 100000000; ++i)
        {
            n += (int)(0.00000000001 + Math.Log(i) * c);
        }
        sw.Stop();
        Console.WriteLine("Result: {0} - naive 2 implementation took {1:0.000}s", n, sw.Elapsed.TotalSeconds);
    }

    private static int LogLookup(int i)
    {
        if (i >= 0x1000000) { return lookup[i >> 24] + 24; }
        else if (i >= 0x10000) { return lookup[i >> 16] + 16; }
        else if (i >= 0x100) { return lookup[i >> 8] + 8; }
        else { return lookup[i]; }
    }

    public static void TestLookup()
    {
        lookup = new int[256];
        for (int i = 1; i < 256; ++i)
        {
            lookup[i] = (int)(Math.Log(i) / Math.Log(2));
        }
        Stopwatch sw = new Stopwatch();
        sw.Start();
        int n = 0;
        for (int i = 1; i < 100000000; ++i)
        {
            n += LogLookup(i);
        }
        sw.Stop();
        Console.WriteLine("Result: {0} - table lookup implementation took {1:0.000}s", n, sw.Elapsed.TotalSeconds);
    }

    private static int LogDoubleLookup(int i)
    {
        int n = (i | (i >> 4));
        n = (n | (n >> 2));
        n = (n | (n >> 1));
        n = ((n & 0x1000000) >> 21) | ((n & 0x10000) >> 14) | ((n & 0x100) >> 7) | (n & 1);
        int br = nobranch[n];
        return lookup[i >> br] + br;
    }

    public static void TestDoubleLookup()
    {
        // Relies on the lookup table built in TestLookup, which must run first
        Stopwatch sw = new Stopwatch();
        sw.Start();
        int n = 0;
        for (int i = 1; i < 100000000; ++i)
        {
            n += LogDoubleLookup(i);
        }
        sw.Stop();
        Console.WriteLine("Result: {0} - double table lookup implementation took {1:0.000}s", n, sw.Elapsed.TotalSeconds);
    }

    private static int LogBinary(int v)
    {
        /* This is the worst implementation ever... - apparently C# is a slow-branching language

        int[] b = { 0x2, 0xC, 0xF0, 0xFF00, 0x7FFF0000 };
        int[] S = { 1, 2, 4, 8, 16 };

        int r = 0; // result of log2(v) will go here
        for (int i = 4; i >= 0; i--) // unroll for speed...
        {
            if ((v & b[i]) != 0)
            {
                v >>= S[i];
                r |= S[i];
            }
        }
        return r;

         */

        int r = (((v > 0xFFFF)) ? 0x10 : 0); 
        v >>= r;
        int shift = ((v > 0xFF) ? 0x8 : 0); 
        v >>= shift; 
        r |= shift;
        shift = ((v > 0xF) ? 0x4 : 0); 
        v >>= shift;
        r |= shift;
        shift = ((v > 0x3) ? 0x2 : 0); 
        v >>= shift;
        r |= shift;
        r |= (v >> 1);
        return r;
    }

    public static void TestBinary()
    {
        // Uses no lookup table, so there is nothing to initialize
        Stopwatch sw = new Stopwatch();
        sw.Start();
        int n = 0;
        for (int i = 1; i < 100000000; ++i)
        {
            n += LogBinary(i);
        }
        sw.Stop();
        Console.WriteLine("Result: {0} - binary search implementation took {1:0.000}s", n, sw.Elapsed.TotalSeconds);
    }

    private static readonly int[] MultiplyDeBruijnBitPosition = new int[32]
    {
        0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
        8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
    };

    private static int LogDeBruijn(int v)
    {
        v |= v >> 1; // first round down to one less than a power of 2 
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;

        return MultiplyDeBruijnBitPosition[(uint)(v * 0x07C4ACDDU) >> 27];
    }

    public static void TestDeBruijn()
    {
        // Uses only the static de Bruijn table, so there is nothing to initialize
        Stopwatch sw = new Stopwatch();
        sw.Start();
        int n = 0;
        for (int i = 1; i < 100000000; ++i)
        {
            n += LogDeBruijn(i);
        }
        sw.Stop();
        Console.WriteLine("Result: {0} - de Bruijn implementation took {1:0.000}s", n, sw.Elapsed.TotalSeconds);
    }

    private static int[] lookup;
    private static readonly int[] nobranch = new int[16] { 0, 0, 8, 8, 16, 16, 16, 16, 24, 24, 24, 24, 24, 24, 24, 24 };

    static void Main(string[] args)
    {
        TestConstant();
        TestNaive();
        TestDeBruijn();
        TestBinary();
        TestFloat();
        TestTrivialLoop();
        TestLookup();
        TestDoubleLookup();
        Console.ReadLine();
    }
}

1 Reply

深蓝 · Answer 1 · 2021-10-17T02:47:41+0000

I took the binary search solution already mentioned and removed the branching. In my tests it turned out to be about 1.3 times faster than the DeBruijn version.

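public static int Log2(int v)
{
    // Valid for v > 0. (limit - v) is negative exactly when v > limit, so the
    // arithmetic shift by 31 yields -1 (all ones) in that case and 0 otherwise;
    // the mask then selects the shift amount without a branch.
    int r = ((0xFFFF - v) >> 31) & 0x10;
    v >>= r;
    int shift = ((0xFF - v) >> 31) & 0x8;
    v >>= shift;
    r |= shift;
    shift = ((0xF - v) >> 31) & 0x4;
    v >>= shift;
    r |= shift;
    shift = ((0x3 - v) >> 31) & 0x2;
    v >>= shift;
    r |= shift;
    // v is now in 1..3; its top bit contributes the last bit of the result.
    r |= (v >> 1);
    return r;
}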
