Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
117 views
in Technique[技术] by (71.8m points)

python - How is `min` of two integers just as fast as 'bit hacking'?

I was watching a lecture series on 'Bit Hacking' and came across the following optimization for finding the minimum of two integers:

return x ^ ((y ^ x) & -(x > y))

Which said to be faster than:

if x < y:
    return x
else:
    return y

Since the min function can handle more than just two integers (floats, strings, lists, and even custom objects) I assumed that calling min(x, y) would take longer than the optimized bit hack above. To my surprise, they were nearly identical:

>>> python -m timeit "min(4, 5)"
1000000 loops, best of 3: 0.203 usec per loop

>>> python -m timeit "4 ^ ((5 ^ 4) & -(4 > 5))"
10000000 loops, best of 3: 0.19 usec per loop

This is true even for numbers greater than 255 (pre allocated python integer objects)

>>> python -m timeit "min(15456, 54657)"
10000000 loops, best of 3: 0.191 usec per loop

python -m timeit "15456 ^ ((54657 ^ 15456) & -(54657 > 15456))"
10000000 loops, best of 3: 0.18 usec per loop

How is it that a function so versatile like min can still be so fast and optimized?

Note: I ran the above code using Python 3.5. I'm assuming that this is the same for Python 2.7+ but haven't tested


I've created the following c module:

#include <Python.h>

static PyObject * my_min(PyObject *self, PyObject *args){
    const long x;
    const long y;

    if (!PyArg_ParseTuple(args, "ll", &x, &y))
        return NULL;

    return PyLong_FromLong(x ^ ((y ^ x) & -(x > y)));
}

static PyMethodDef MyMinMethods[] = 
{
    { "my_min", my_min, METH_VARARGS, "bit hack min"
    },
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC
initmymin(void)
{
    PyObject *m;

    m = Py_InitModule("mymin", MyMinMethods);
    if (m == NULL)
        return;

}

Compiled it, and installed it onto my system (an ubuntu VM machine). I then ran the following:

>>> python -m timeit 'min(4, 5)'
10000000 loops, best of 3: 0.11 usec per loop

>>> python -m timeit -s 'import mymin' 'mymin.my_min(4,5)'
10000000 loops, best of 3: 0.129 usec per loop

While I understand that this is a VM machine, shouldn't there still be a greater gap in execution time with the 'bit hacking' being offloaded into native c?

question from:https://stackoverflow.com/questions/33784519/how-is-min-of-two-integers-just-as-fast-as-bit-hacking

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Lets do a slightly deeper dive here to find out the real reason behind this weirdness (if any).

Lets create 3 methods and look at their python bytecode and runtimes...

import dis

def func1(x, y):
    return min(x, y)

def func2(x, y):
    if x < y:
        return x
    return y

def func3(x, y):
    return x ^ ((y ^ x) & -(x > y))

print "*" * 80
dis.dis(func1)
print "*" * 80
dis.dis(func2)
print "*" * 80
dis.dis(func3)

The output from this program is...

*****************************************************
  4           0 LOAD_GLOBAL              0 (min)
              3 LOAD_FAST                0 (x)
              6 LOAD_FAST                1 (y)
              9 CALL_FUNCTION            2
             12 RETURN_VALUE        
*****************************************************
  7           0 LOAD_FAST                0 (x)
              3 LOAD_FAST                1 (y)
              6 COMPARE_OP               0 (<)
              9 POP_JUMP_IF_FALSE       16

  8          12 LOAD_FAST                0 (x)
             15 RETURN_VALUE        

  9     >>   16 LOAD_FAST                1 (y)
             19 RETURN_VALUE        
*****************************************************
 12           0 LOAD_FAST                0 (x)
              3 LOAD_FAST                1 (y)
              6 LOAD_FAST                0 (x)
              9 BINARY_XOR          
             10 LOAD_FAST                0 (x)
             13 LOAD_FAST                1 (y)
             16 COMPARE_OP               4 (>)
             19 UNARY_NEGATIVE      
             20 BINARY_AND          
             21 BINARY_XOR          
             22 RETURN_VALUE        

Here are the running times of each of these functions

%timeit func1(4343,434234)
1000000 loops, best of 3: 282 ns per loop

%timeit func2(23432, 3243424)
10000000 loops, best of 3: 137 ns per loop

%timeit func3(928473, 943294)
1000000 loops, best of 3: 246 ns per loop

func2 is the fastest because it has the least amount of work to do in the python interpreter. How?. Looking at the bytecode for func2, we see that in either case of x > y or x < y, the python interpreter will execute 6 instructions.

func3 will execute 11 instructions (and is thus almost twice as slow as func2... in fact, its extremely close to 137.0 * 11 / 6 = 251 ns).

func1 has just 5 python instructions, and by the logic in the previous 2 points, we might think that func1 should probably be the fastest. However, there is a CALL_FUNCTION in there... and function calls have a lot of overhead in Python (because it creates a new eval frame for the function call - that's the thing that we see in the python stacktrace - a stack of eval frames).

More details : Because python is interpreted, each python bytecode instruction takes a lot longer than a single C/asm statement. In fact, you can take a look at the python interpreter source code to see that each instruction has an overhead of 30 or so C statements (this is from a very rough look at ceval.c python main interpreter loop). The for (;;) loop executes one python instruction per loop cycle (ignoring optimizations).

https://github.com/python/cpython/blob/master/Python/ceval.c#L1221

So, with so much overhead for each instruction, there is no point in comparing 2 tiny C code snippets in python. One will take 34 and the other will take 32 cpu cycles, because the python interpreter adds 30 cycles overhead for each instruction.

In OP's C module, if we loop inside the C function to do the comparison a million times, that loop will not have the python interpreter's overhead for each instruction. It will probably run 30 to 40 times faster.

Tips for python optimization...

Profile your code to find hotspots, refactor hot code into its own function (write tests for hotspot before that to make sure refactor does not break stuff), avoid function calls from the hot code (inline functions if possible), use the dis module on new function to find ways to reduce the number of python instructions (if x is faster than if x is True... surprised?), and lastly modify your algorithm. Finally, if python speedup is not enough, reimplement your new function in C.

ps : The explanation above is simplified to keep the answer within reasonable size. For example, not all python instructions take the same amount of time, and there are optimizations, so not every instruction has the same overhead... and lot more things. Please ignore such omissions for the sake of brevity.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...