TL:DR: a slice compare is some Python overhead + a highly-optimized memcmp
(unless there's UTF-8 processing?). Ideally, use slice compares to find the first mismatch to within less than 128 bytes or something, then loop a char at a time.
Or if it's an option and the problem is important, make a modified copy of an asm-optimized memcmp
that returns the position of the first difference, instead of equal/not-equal; it will run as fast as a single ==
of the whole strings. Python has ways to call native C / asm functions in libraries.
It's a frustrating limitation that the CPU can do this blazingly fast, but Python doesn't (AFAIK) give you access to an optimized compare loop that tells you the mismatch position instead of just equal / greater / less.
It's totally normal that interpreter overhead dominates the cost of the real work in a simple Python loop, with CPython. Building an algorithm out of optimized building blocks is worth it even if it means doing more total work. This is why NumPy is good, but looping over a matrix element-by-element is terrible. The speed difference might be something like a factor of 20 to 100, for CPython vs. a compiled C (asm) loop for comparing one byte at a time (made up numbers, but probably right to within an order of magnitude).
Comparing blocks of memory for equality is probably one of the biggest mismatches between Python loops vs. operating on a whole list / slice. It's a common problem with highly-optimized solutions (e.g. most libc implementations (including OS X) have a manually-vectorized hand-coded asm memcmp
that uses SIMD to compare 16 or 32 bytes in parallel, and runs much faster than a byte-at-a-time loop in C or assembly). So there's another factor of 16 to 32 (if memory bandwidth isn't a bottleneck) multiplying the factor of 20 to 100 speed difference between Python and C loops. Or depending on how optimized your memcmp
is, maybe "only" 6 or 8 bytes per cycle.
With data hot in L2 or L1d cache for medium-sized buffers, it's reasonable to expect 16 or 32 bytes per cycle for memcmp
on a Haswell or later CPU. (i3/i5/i7 naming started with Nehalem; i5 alone is not sufficient to tell us much about your CPU.)
I'm not sure if either or both of your comparisons are having to process UTF-8 and check for equivalency classes or different ways to encode the same character. The worst case is if your Python char-at-a-time loop has to check for potentially-multi-byte characters but your slice compare can just use memcmp
.
Writing an efficient version in Python:
We're just totally fighting against the language to get efficiency: your problem is almost exactly the same as the C standard library function memcmp
, except you want the position of the first difference instead of a - / 0 / + result telling you which string is greater. The search loop is identical, it's just a difference in what the function does after finding the result.
Your binary search is not the best way to use a fast compare building block. A slice compare still has O(n)
cost, not O(1)
, just with a much smaller constant factor. You can and should avoid re-comparing the starts of the buffers repeatedly by using slices to compare large chunks until you find a mismatch, then go back over that last chunk with a smaller chunk size.
# I don't actually know Python; consider this pseudo-code
# or leave an edit if I got this wrong :P
chunksize = min(8192, len(smaller))
# possibly round chunksize down to the next lowest power of 2?
start = 0
while start+chunksize < len(smaller):
if smaller[start:start+chunksize] == bigger[start:start+chunksize]:
start += chunksize
else:
if chunksize <= 128:
return char_at_a_time(smaller[start:start+chunksize], bigger[start:start+chunksize])
else:
chunksize /= 8 # from the same start
# TODO: verify this logic for corner cases like string length not a power of 2
# and/or a difference only in the last character: make sure it does check to the end
I chose 8192 because your CPU has a 32kiB L1d cache, so the total cache footprint of two 8k slices is 16k, half your L1d. When the loop finds a mismatch, it will re-scan the last 8kiB in 1k chunks, and these compares will loop over data that's still hot in L1d. (Note that if ==
found a mismatch, it probably only touched data up to that point, not the whole 8k. But HW prefetch will keep going somewhat beyond that.)
A factor of 8 should be a good balance between using large slices to localize quickly vs. not needing many passes over the same data. This is a tunable parameter of course, along with chunk size. The bigger the mismatch between Python and asm, the smaller this factor should be to reduce Python loop iterations.)
Hopefully 8k is big enough to hide the Python loop / slice overhead; hardware prefetching should still be working during the Python overhead between memcmp
calls from the interpreter so we don't need the granularity to be huge. But for really big strings, if 8k doesn't saturate memory bandwidth then maybe make it 64k (your L2 cache is 256kiB; i5 does tell us that much).
How exactly is memcmp
so fast:
I am running this on Intel Core i5 but I would imagine I would get the same results on most modern CPUs.
Even in C, Why is memcmp so much faster than a for loop check? memcmp
is faster than a byte-at-a-time compare loop, because even C compilers aren't great at (or totally incapable of) auto-vectorizing search loops.
Even without hardware SIMD support, an optimized memcmp
could check 4 or 8 bytes at a time (word size / register width) even on a simple CPU without 16-byte or 32-byte SIMD.
But most modern CPUs, and all x86-64, have SIMD instructions. SSE2 is baseline for x86-64, and available as an extension in 32-bit mode.
An SSE2 or AVX2 memcmp
can use pcmpeqb
/ pmovmskb
to compare 16 or 32 bytes in parallel. (I'm not going to go into detail about how to write memcmp in x86 asm or with C intrinsics. Google that separately, and/or look up those asm instructions in an x86 instruction-set reference. like http://felixcloutier.com/x86/index.html. See also the x86 tag wiki for asm and performance links. e.g. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? has some info about single-core memory bandwidth limitations.)
I found an old version from 2005 of Apple's x86-64 memcmp
(in AT&T syntax assembly language) on their opensource web site. It could definitely be better; for large buffers it should align one source pointer and only use movdqu
for the other one, allowing movdqu
then pcmpeqb
with a memory operand instead of 2x movdqu
, even if the strings are misaligned relative to each other. xorl $0xFFFF,%eax
/ jnz
is also not optimal on CPUs where cmp/jcc
can macro fuse but xor / jcc
can't.
Unrolling to check a whole 64-byte cache line at once would also hide loop overhead. (This is the same idea of a large chunk and then looping back over it when you find a hit). Glibc's AVX2-movbe
version does this with vpand
to combine compare results in the main large-buffer loop, with the final combine being a vptest
that also sets flags from the result. (Smaller code-size but no fewer uops than vpand
/vpmovmskb
/cmp
/jcc
; but no downside and maybe lower latency to reduce branch-mispredict penalties on loop exit). Glibc does dynamic CPU dispatching at dynamic link time; it picks this version on CPUs that support it.
Hopefully Apple's memcmp
is better these days; I don't see source for it at all in the most recent Libc
directory, though. Hopefully they dispatch at runtime to an AVX2 version for Haswell and later CPUs.
The LLoopOverChunks
loop in the version I linked would only run at 1 iteration (16 bytes from each input) per ~2.5 cycles on Haswell; 10 fused-domain uops. But that's still much faster than 1 byte per cycle for a naive C loop, or much much worse than that for a Python loop.
Glibc's L(loop_4x_vec):
loop is 18 fused-domain uops, and can thus run at just slightly less than 32 bytes (from each input) per clock cycle, when data is hot in L1d cache. Otherwise it will bottleneck on L2 bandwidth. It could have been 17 uops if they hadn't used an extra instruction inside the loop decrementing a separate loop counter, and calculated an end-pointer outside the loop.
Finding instructions / hot spots in the Python interpreter's own code
How could I drill down to find the C instructions and CPU instructions that my code calls?
On Linux you could run perf record python ...
then perf report -Mintel
to see which functions the CPU was spending the most time in, and which instructions in those functions were the hottest. You'll get results something like I posted here: Why is float() faster than int()?. (Drill down into any function to see the actual machine instructions that ran, shown as assembly language because perf
has a disassembler built in.)
For a more nuanced view that samples the call-graph on each event, see linux perf: how to interpret and find hotspots.
(When you're looking to actually optimize a program, you want to know which function calls are expensive so you can try to avoid them in the first place. Profiling for just "self" time will find hot spots, but you won't always know which different callers caused a given loop to run most of the iterations. See Mike Dunlavey's answer on that perf question.)
But for this specific case, profiling the interpreter running a slice-compare version over big strings should hopefully find the memcmp
loop where I think it will be spending most of its time. (Or for the char-at-a-time version, find the interpreter code that's "hot".)
Then you can directly see what asm instructions are in the loop. From the function names, assuming your binary has any symbols, you can probably find the source. Or if you have a version of Python built with debug info , you can get to the source directly from profile info. (Not a debug build with optimization disabled, just with full symbols).