Lets do a slightly deeper dive here to find out the real reason behind this weirdness (if any).
Lets create 3 methods and look at their python bytecode and runtimes...
import dis
def func1(x, y):
return min(x, y)
def func2(x, y):
if x < y:
return x
return y
def func3(x, y):
return x ^ ((y ^ x) & -(x > y))
print "*" * 80
dis.dis(func1)
print "*" * 80
dis.dis(func2)
print "*" * 80
dis.dis(func3)
The output from this program is...
*****************************************************
4 0 LOAD_GLOBAL 0 (min)
3 LOAD_FAST 0 (x)
6 LOAD_FAST 1 (y)
9 CALL_FUNCTION 2
12 RETURN_VALUE
*****************************************************
7 0 LOAD_FAST 0 (x)
3 LOAD_FAST 1 (y)
6 COMPARE_OP 0 (<)
9 POP_JUMP_IF_FALSE 16
8 12 LOAD_FAST 0 (x)
15 RETURN_VALUE
9 >> 16 LOAD_FAST 1 (y)
19 RETURN_VALUE
*****************************************************
12 0 LOAD_FAST 0 (x)
3 LOAD_FAST 1 (y)
6 LOAD_FAST 0 (x)
9 BINARY_XOR
10 LOAD_FAST 0 (x)
13 LOAD_FAST 1 (y)
16 COMPARE_OP 4 (>)
19 UNARY_NEGATIVE
20 BINARY_AND
21 BINARY_XOR
22 RETURN_VALUE
Here are the running times of each of these functions
%timeit func1(4343,434234)
1000000 loops, best of 3: 282 ns per loop
%timeit func2(23432, 3243424)
10000000 loops, best of 3: 137 ns per loop
%timeit func3(928473, 943294)
1000000 loops, best of 3: 246 ns per loop
func2 is the fastest because it has the least amount of work to do in the python interpreter. How?. Looking at the bytecode for func2, we see that in either case of x > y
or x < y
, the python interpreter will execute 6 instructions.
func3 will execute 11 instructions (and is thus almost twice as slow as func2... in fact, its extremely close to 137.0 * 11 / 6 = 251 ns).
func1 has just 5 python instructions, and by the logic in the previous 2 points, we might think that func1 should probably be the fastest. However, there is a CALL_FUNCTION
in there... and function calls have a lot of overhead in Python (because it creates a new eval frame for the function call - that's the thing that we see in the python stacktrace - a stack of eval frames).
More details : Because python is interpreted, each python bytecode instruction takes a lot longer than a single C/asm statement. In fact, you can take a look at the python interpreter source code to see that each instruction has an overhead of 30 or so C statements (this is from a very rough look at ceval.c python main interpreter loop). The for (;;)
loop executes one python instruction per loop cycle (ignoring optimizations).
https://github.com/python/cpython/blob/master/Python/ceval.c#L1221
So, with so much overhead for each instruction, there is no point in comparing 2 tiny C code snippets in python. One will take 34 and the other will take 32 cpu cycles, because the python interpreter adds 30 cycles overhead for each instruction.
In OP's C module, if we loop inside the C function to do the comparison a million times, that loop will not have the python interpreter's overhead for each instruction. It will probably run 30 to 40 times faster.
Tips for python optimization...
Profile your code to find hotspots, refactor hot code into its own function (write tests for hotspot before that to make sure refactor does not break stuff), avoid function calls from the hot code (inline functions if possible), use the dis
module on new function to find ways to reduce the number of python instructions (if x
is faster than if x is True
... surprised?), and lastly modify your algorithm. Finally, if python speedup is not enough, reimplement your new function in C.
ps : The explanation above is simplified to keep the answer within reasonable size. For example, not all python instructions take the same amount of time, and there are optimizations, so not every instruction has the same overhead... and lot more things. Please ignore such omissions for the sake of brevity.