As a baseline for Python, I wrote this in C (and then decided to use C++ for more convenient array I/O). It compiles efficiently for x86-64 with clang++. This runs 82x faster than CPython3.6.2 with the code in the question, on a Skylake x86, so even your faster Python versions are still a couple factors away from keeping up with near-optimal machine code. (Yes, I looked at the compiler's asm output to check that it did a good job).
Letting a good JIT or ahead-of-time compiler see the loop logic is key for performance here. The problem logic is inherently serial, so there's no scope for getting Python to run already-compiled C to do something over the whole array (like NumPy), because there won't be compiled C for this specific problem unless you use Cython or something. Having each step of the problem go back to the CPython interpreter is death for performance, when its slowness isn't hidden by memory bottlenecks or anything.
Update: transforming the array of offsets into an array of pointers speeds it up by another factor of 1.5x (simple addressing mode + removing an add
from the critical path loop-carried dependency chain, bringing it down to just the 4 cycle L1D load-use latency for a simple addressing mode (when the pointer comes from another load), not 6c = 5c + 1c for an indexed addressing mode + add
latency).
But I think we can be generous to Python and not expect it to keep up with algorithm transformations to suit modern CPUs :P (Especially because I used 32-bit pointers even in 64-bit mode to make sure the 4585 element array was still only 18kiB so it fits in the 32kiB L1D cache. Like the Linux x32 ABI does, or the AArch64 ILP32 ABI.)
Also, an alternate expression for the update gets gcc to compile it branchless, like clang does. (Commented out and original perf stat
output left in this answer, because it's interesting the see the effect of branchless vs. branchy with mispredicts.)
unsigned jumps(int offset[], unsigned size) {
unsigned location = 0;
unsigned counter = 0;
do {
//location += offset[location]++; // simple version
// >=3 conditional version below
int off = offset[location];
offset[location] += (off>=3) ? -1 : 1; // branchy with gcc
// offset[location] = (off>=3) ? off-1 : off+1; // branchless with gcc and clang.
location += off;
counter++;
} while (location < size);
return counter;
}
#include <iostream>
#include <iterator>
#include <vector>
int main()
{
std::ios::sync_with_stdio(false); // makes cin faster
std::istream_iterator<int> begin(std::cin), dummy;
std::vector<int> values(begin, dummy); // construct a dynamic array from reading stdin
unsigned count = jumps(values.data(), values.size());
std::cout << count << '
';
}
With clang4.0.1 -O3 -march=skylake
, the inner loop is branchless; it uses a conditional-move for the >=3
condition. I used ? :
because that's what I was hoping the compiler would do. Source + asm on the Godbolt compiler explorer
.LBB1_4: # =>This Inner Loop Header: Depth=1
mov ebx, edi ; silly compiler: extra work inside the loop to save code outside
mov esi, dword ptr [rax + 4*rbx] ; off = offset[location]
cmp esi, 2
mov ecx, 1
cmovg ecx, r8d ; ecx = (off>=3) ? -1 : 1; // r8d = -1 (set outside the loop)
add ecx, esi ; off += -1 or 1
mov dword ptr [rax + 4*rbx], ecx ; store back the updated off
add edi, esi ; location += off (original value)
add edx, 1 ; counter++
cmp edi, r9d
jb .LBB1_4 ; unsigned compare against array size
Here's the output of perf stat ./a.out < input.txt
(for the clang version), on my i7-6700k Skylake CPU:
21841249 # correct total, matches Python
Performance counter stats for './a.out':
36.843436 task-clock (msec) # 0.997 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
119 page-faults # 0.003 M/sec
143,680,934 cycles # 3.900 GHz
245,059,492 instructions # 1.71 insn per cycle
22,654,670 branches # 614.890 M/sec
20,171 branch-misses # 0.09% of all branches
0.036953258 seconds time elapsed
The average instructions-per-clock is much lower than 4 because of the data dependency in the loop. The load address for the next iteration depends on the load+add for this iteration, and out-of-order execution can't hide that. It can overlap all the work of updating the value of the current location, though.
Changing from int
to short
has no performance downside (as expected; movsx
has the same latency as mov
on Skylake), but halves the memory consumption so a larger array could fit in L1D cache if needed.
I tried compiling the array into the program (as int offsets[] = { file contents with commas added };
so it doesn't have to be read and parsed. It also made the size a compile-time constant. This reduced the run time to ~36.2 +/- 0.1 milliseconds, down from ~36.8, so the version that reads from a file was still spending most of its time on the actual problem, not parsing the input. (Unlike Python, C++ has negligible startup overhead, and my Skylake CPU ramps up to max clock speed in well under a millisecond thanks to hardware P-state management in Skylake.)
As described earlier, pointer-chasing with a simple addressing mode like [rdi]
instead of [rdi + rdx*4]
has 1c lower latency, and avoids the add
(index += offset
becomes current = target
). Intel since IvyBridge has zero-latency integer mov
between registers so that doesn't lengthen the critical path. Here's <a href="https://gcc.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAKxAEZSBnVAV2OUxAHIBSAJgGY8AO2QAbZlgDU3fgGEGBfEIIA6BDOzcADAEFtOgPQAqI5ICSyzMryohAQ1GT%2BvALQAjQpIAOqYQUzEkgh2yADWwsCSdgiYduiSJiYG%2BgYGkm6YyHbMDJjmkgDudsqSBDFRxMR2AJ6lqOl5tAAc4QBCksIKsfGoAGZOAGxtKWlEkr2ewk68oZIAMrQAIpJZyDEq%2BiOSAPJCorUFJKEMhYQIkn29uQTcAKytd8tTZXmiqAXT7oQnfVHo6MRMAwTgwvCFMFtMCpgCpJABbOyhPKeOySYBvNwOKJCeJoOFePCiPIFM5o5DISQuXpCVAuAl5Fw0ul4CG6VIXQJiYqRKlM%2BmUmn0zZstIAdVJqJ8fgCpWCBEk6FQQKEYE48om8qmflIW08JNEjgyF2YBBcfRcVSEwDyxDwwAQ8rsRWqIBWLGUMkWtGFehFOyE8Nqc2EzAAHpIGNUunDSB15cEvF4rCcxodiLMSWUohHFCAQAA3TJEMh1VBbDJZHJ5F4K6r2OF4CkON5ZAg2AMEVhCH4muoNDod8pvD7OdLfDa6La0FS3cO9OxdQK2SQAZVC1VEiLyEGK8S8xFQmLc%2B0kAAkFwVMAbYwjqkazPnWgBKLYAFSHqFQXhcWUqLPiWBJjiVjILUeAnFQOTylKljED%2BwQMBEpxZqiiH4kSfwAkCiFWvCGDbgALBSm7%2BCI1Qvn6%2BYMLCtwUr0JDYgOWChpg8RvHE0i8O0tB0QxxSYQOsp5KgLyBIWxCIbYE46Mw3Z2kIrGSJB%2BIMAA%2Bl4BDEBAfgXL0VyYDc9yPLGsmIcACnxIhABemBPtIADsDyTloACc7JlOBkhvF%2BUQmrShbIMWeA2ScPgkPssKSK%2Bl6IkhaDMKI8R4PiB6FnGsZuL2irKqq8oIgQ/jENJ2gufRgQQKZ8mKXg0j8MsWi1e0eAyPIwV5DITV8A8XF2dwjnCi5JXuV2kixGsun6Zqyj1KiEDOF88qabJrasXZMGFX2NYEHYxDWvKV6YHCVg3M5g2uZc1x3K0zW3MsnqSBVcnmdVyhPhdBlXTdixGI9ZkWQOBBPtZmB9BA72Ga0WiPL1XEPZVz1Ja9FV%2BBpxCqYDfADODn2PB1A19Ys%2BiSMTkjuSJza1DS8oMHYvTCYEgJ4kdwHoLGZhKVBhQHv4srgQNZNogZQmSKGTQDFEDBwvkRSWPEvQHlLwDkqTKybrhzwIJ5WS5GAYADfD/1GKjJz3b9VXxEYb16Zd/BOb6LkG4pRhWOgXi1csxuce0wN46djsW8grCAiU93G77vr2/7bqyRt91Q7bPolYqDl2yVJVR0Y227ULpv%2B5bcMo5p6NPkYgeVMdHUk6TaQ2QeLiYKGpHxGMkq%2BLB%2BOuRnzBeOgdg86bWd7bVmh1SsQcV1x/C9fwABipQ7XtLi0JIrqDx9XHegnp0laX48h6PZsI0%2ByPKKjxfd73/iV8T7IX33eQ7VUBwSp8HjU8WrL2yVZfB/K91rzcLeX9XLxXblxLq4c072WWAUTWGEIA/2OsPUOmkTa8AGJjMe5d96yFGjiLwL4gFDTSHiAkRIUz1F4BGTAYIqg8zcJaNYQIQD8zSLYascpJA0mIAiA0tR5wGhTAgA8zB7Re2zNtJEAZvLfgYcUNYrDJCAgKLaNsuEaxoBxIQds0dErwhqEaYAeA0o5EYnMbAOhxHIHxAYKg5JFGZXlKeFcooDBrg3FuFY/EETIDrr0KspRDjpBCKEFwRB3AhJWDYqgbhvB2DwBJYqp1ASdmIAGUBhVIHQJ9FHZSXgGDaRKNjIyt0TJPX%2BsDXq/U/blMUi2PuOi46QM7rU3E7pY6j3jqnU6yc%2BrdLOm5Aw9S2zLggaPYprRhntlxuA8Btsq4LJJuyNCXgMLiUkkIROrkSo6UuO7Cal17hTNsLjIh2yWl/UUqgRKqkWz7OOZss5AzUgTOuegW5qBkCPC9qbS4GhPRT1qnPJeK9JCb3aIs9kfBeCaPwCM%2BwohoWSHWdM7e51rYfXuG8j5Xzbr7LBnpf5dVAUyDnpcEFro/kb2vkstIciRAIHIScTM5wlbIHsioXgRNFleU%2BQ00Zczlh/LOcAlyGSAgQLmf0gmhQ4HbgebVXBlTmn2xSSNcVxAsmE2cnoAQwgxASHanIGwChAR2DhP8yceqRDiCkC1QgAQ%2B4kEtbqwQNrDWKoCsWF1%2BgdIImEBACiehqn2wUOgPMJq8yRhEKpFlqkw02AgAI3IhCIW0v0UiE4yApjzkXANMNEbTWxDhKpB1dDnXGo9PwbADQjFCAgAWlAwgnyxnQMwOEcJqjNJco2r1FbZB%2BA0MihwzAgQQAyHW1t7bO2pp9FXW%2BrTo772WHkgp%2BYR1AhUJfOwgbYzrvEJu4GgbZ01MuW0mO%2BzV3qU0hAfdo7qLbt3cOg91Ej1PhPaG3MKAWB/zkC1Jdv7ZD/tVHcWQKoeBEOyboTgLbRBcFuJwUgQguBaEQ6gLgQHZnexYGwdqAh