You should read time(7). Be aware that, even if written in assembler, your program will be rescheduled at arbitrary moments (perhaps with a context switch every millisecond; look also into /proc/interrupts and see proc(5)). So the raw reading of any hardware timer is unreliable. Even using the RDTSC x86-64 machine instruction to read the hardware timestamp counter is useless, since after any context switch its delta would be wrong, and the Linux kernel does preemptive scheduling, which can happen at any time.
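For illustration, here is a minimal sketch (assuming GCC or Clang on x86-64) of reading the TSC through the __rdtsc() intrinsic; as explained above, the measured delta is only trustworthy if no context switch or CPU migration happened in between, which you cannot guarantee from user space:

```c
/* Sketch only: reading the timestamp counter with __rdtsc()
 * (GCC/Clang, x86-64).  The delta is meaningful only if no context
 * switch or CPU migration happened in between. */
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    unsigned long long t0 = __rdtsc();
    /* ... the code you want to measure would go here ... */
    unsigned long long t1 = __rdtsc();
    printf("elapsed: %llu cycles (unreliable across context switches)\n",
           t1 - t0);
    return 0;
}
```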
You should consider clock_gettime(2). It is really fast (about 3.5 to 4 nanoseconds per call on my i5-4690S, when measuring thousands of calls to it) thanks to vdso(7). BTW, it is nominally a system call, so you could code the corresponding assembler instructions directly; I don't think that is worth the trouble (and it could even be slower than the vDSO call).
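Here is a minimal sketch of how you could time a code fragment with clock_gettime(2) and CLOCK_MONOTONIC; do_work() is just a hypothetical placeholder for whatever you want to measure:

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime with -std=c99 */
#include <stdio.h>
#include <time.h>

/* hypothetical stand-in for whatever you want to measure */
static void do_work(void)
{
    volatile double x = 0.0;
    for (int i = 0; i < 1000; i++)
        x += i * 0.5;
}

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do_work();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("elapsed: %.9f s\n", elapsed);
    return 0;
}
```

On glibc older than 2.17 you would also need to link with -lrt.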
BTW, any kind of profiling or benchmarking is somewhat intrusive.
Also, if your benchmarked function runs very quickly (much less than a microsecond), cache misses become significant and even dominant (remember that an L3 cache miss, requiring an actual access to the DRAM modules, lasts several hundred nanoseconds: enough time to run hundreds of machine instructions out of the L1 instruction cache). You might (and probably should) benchmark several hundred consecutive calls, as sketched below. But you still won't be able to measure with perfect precision and accuracy.
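Something like the following (NCALLS and do_work() being arbitrary, illustrative choices) amortizes the timer overhead and scheduling noise over many calls:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define NCALLS 1000   /* arbitrary; use enough calls to dwarf the timer overhead */

/* hypothetical stand-in for the fast function under test */
static void do_work(void)
{
    volatile double x = 0.0;
    for (int i = 0; i < 100; i++)
        x += i * 0.5;
}

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = now_sec();
    for (int i = 0; i < NCALLS; i++)
        do_work();
    double t1 = now_sec();
    printf("mean per call: %.1f ns\n", (t1 - t0) / NCALLS * 1e9);
    return 0;
}
```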
Hence I believe that you cannot do much better than using clock_gettime, and I don't understand why it is not good enough for your case. BTW, clock(3) calls clock_gettime with CLOCK_PROCESS_CPUTIME_ID, so IMHO it should be enough, and it is simpler.
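For completeness, a minimal sketch using clock(3); note that it measures per-process CPU time (so it excludes time spent blocked), and you divide by CLOCKS_PER_SEC to get seconds:

```c
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t c0 = clock();
    volatile double x = 0.0;
    for (long i = 0; i < 10000000L; i++)   /* some CPU-bound work */
        x += i * 0.5;
    clock_t c1 = clock();
    printf("CPU time: %.6f s\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
    return 0;
}
```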
In other words, I believe that avoiding function calls altogether is a misconception on your part. Remember that function call overhead is a lot cheaper than a cache miss!
See this answer to a related question (as unclear as yours); consider also using perf(1), gprof(1), oprofile(1), and time(1). See this.
Finally, you should consider asking your compiler for more optimization. Have you tried compiling and linking with g++ -O3 -flto -march=native (which enables link-time optimization)?
If your code is of a numerical and vectorial nature (so obviously and massively parallelisable), you could even consider spending months of development time porting its core code (the numerical compute kernels) to your GPGPU with OpenCL or CUDA. But are you sure it is worth such an effort? You would need to retune and partly redevelop that code whenever the hardware changes!
You could also redesign your application to use multi-threading, JIT compilation, partial evaluation and metaprogramming techniques, multiprocessing, or cloud computing (with inter-process communication, e.g. through socket(7)-s, perhaps using 0mq or some other messaging library). That could take years of development. There is No Silver Bullet.
(Don't forget to take development costs into account; prefer algorithmic improvements when possible.)