Intro
Section Old Question contains the initial question (Further Investigation and Conclusion have been added since).
Skip to the section Further Investigation below for a detailed comparison of the different timing methods (rdtsc
, clock_gettime
and QueryThreadCycleTime
).
I believe the erratic behaviour of CGT can be attributed to either a buggy kernel or a buggy CPU (see section Conclusion).
The code used for testing is at the bottom of this question (see section Appendix).
Apologies for the length.
Old Question
In short: I am using clock_gettime
to measure the execution time of many code segments. I am experiencing very inconsistent measurements between separate runs. The method has an extremely high standard deviation when compared to other methods (see Explanation below).
Question: Is there a reason why clock_gettime
would give so inconsistent measurements when compared to other methods? Is there an alternative method with the same resolution that accounts for thread idle time?
Explanation: I am trying to profile a number of small parts of C code. The execution time of each of the code segments is not more than a couple of microseconds. In a single run, each of the code segments will execute some hundreds of times, which produces runs × hundreds
of measurements.
I also have to measure only the time the thread actually spends executing (which is why rdtsc
is not suitable). I also need a high resolution (which is why times
is not suitable).
I've tried the following methods:
rdtsc
(on Linux and Windows),
clock_gettime
(with 'CLOCK_THREAD_CPUTIME_ID'; on Linux), and
QueryThreadCycleTime
(on Windows).
Methodology: The analysis was performed on 25 runs. In each run, separate code segments repeat a 101 of times. Therefore I have 2525 measurements. Then I look at a histogram of the measurements, and also calculate some basic stuff (like the mean, std.dev., median, mode, min, and max).
I do not present how I measured the 'similarity' of the three methods, but this simply involved a basic comparison of proportion of times spent in each code segment ('proportion' means that the times are normalised). I then look at the pure differences in these proportions. This comparison showed that all 'rdtsc', 'QTCT', and 'CGT' measure the same proportions when averaged over the 25 runs. However, the results below show that 'CGT' has a very large standard deviation. This makes it unusable in my use case.
Results:
A comparison of clock_gettime
with rdtsc
for the same code segment (25 runs of 101 measurements = 2525 readings):
Discussion:
rdtsc
gives very similar results on both Linux and Windows. It has an acceptable standard deviation--it is actually quite consistent/stable. However, it does not account for thread idle time. Therefore, context switches make the measurements erratic (on Windows I have observed this quite often: a code segment with an average of 1000 ticks or so will take ~30000 ticks every now and then--definitely because of pre-emption).
QueryThreadCycleTime
gives very consistent measurements--i.e. much lower standard deviation when compared to rdtsc
. When no context switches happen, this method is almost identical to rdtsc
.
clock_gettime
, on the other hand, is producing extremely inconsistent results (not just between runs, but also between measurements). The standard deviations are extreme (when compared to rdtsc
).
I hope the statistics are okay. But what could be the reason for such a discrepancy in the measurements between the two methods? Of course, there is caching, CPU/core migration, and other things. But none of this should be responsible for any such differences between 'rdtsc' and 'clock_gettime'. What is going on?
Further Investigation
I have investigated this a bit further. I have done two things:
Measured the overhead of just calling clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t)
(see code 1 in Appendix), and
in a plain loop called clock_gettime
and stored the readings into an array (see code 2 in Appendix). I measure the delta times (difference in successive measurement times, which should correspond a bit to the overhead of the call of clock_gettime
).
I have measured it on two different computers with two different Linux Kernel versions:
CGT:
CPU: Core 2 Duo L9400 @ 1.86GHz
Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386
Results:
- Estimated
clock_gettime
overhead: between 690-710 ns
Delta times:
- Average: 815.22 ns
- Median: 713 ns
- Mode: 709 ns
- Min: 698 ns
- Max: 23359 ns
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
697 < x ≤ 800 -> 78111 <-- cached?
800 < x ≤ 1000 -> 16412
1000 < x ≤ 1500 -> 3
1500 < x ≤ 2000 -> 4836 <-- uncached?
2000 < x ≤ 3000 -> 305
3000 < x ≤ 5000 -> 161
5000 < x ≤ 10000 -> 105
10000 < x ≤ 15000 -> 53
15000 < x ≤ 20000 -> 8
20000 < x -> 5
CPU: 4 × Dual Core AMD Opteron Processor 275
Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
Results:
- Estimated
clock_gettime
overhead: between 279-283 ns
Delta times:
- Average: 320.00
- Median: 1
- Mode: 1
- Min: 1
- Max: 3495529
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
--------------------+-----------
x ≤ 1 -> 86738 <-- cached?
282 < x ≤ 300 -> 13118 <-- uncached?
300 < x ≤ 440 -> 78
2000 < x ≤ 5000 -> 52
5000 < x ≤ 30000 -> 5
3000000 < x -> 8
RDTSC:
Related code rdtsc_delta.c
and rdtsc_overhead.c
.
CPU: Core 2 Duo L9400 @ 1.86GHz
Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386
Results:
- Estimated overhead: between 39-42 ticks
Delta times:
- Average: 52.46 ticks
- Median: 42 ticks
- Mode: 42 ticks
- Min: 35 ticks
- Max: 28700 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
34 < x ≤ 35 -> 16240 <-- cached?
41 < x ≤ 42 -> 63585 <-- uncached? (small difference)
48 < x ≤ 49 -> 19779 <-- uncached?
49 < x ≤ 120 -> 195
3125 < x ≤ 5000 -> 144
5000 < x ≤ 10000 -> 45
10000 < x ≤ 20000 -> 9
20000 < x -> 2
CPU: 4 × Dual Core AMD Opteron Processor 275
Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
Results:
- Estimated overhead: between 13.7-17.0 ticks
Delta times:
- Average: 35.44 ticks
- Median: 16 ticks
- Mode: 16 ticks
- Min: 14 ticks
- Max: 16372 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
13 < x ≤ 14 -> 192
14 < x ≤ 21 -> 78172 <-- cached?
21 < x ≤ 50 -> 10818
50 < x ≤ 103 -> 10624 <-- uncached?
5825 < x ≤ 6500 -> 88
6500 < x ≤ 8000 -> 88
8000 < x ≤ 10000 -> 11
10000 < x ≤ 15000 -> 4
15000 < x ≤ 16372 -> 2
QTCT:
Related code qtct_delta.c
and qtct_overhead.c
.
CPU: Core 2 6700 @ 2.66GHz
Kernel: Windows 7 64-bit
Results:
- Estimated overhead: between 890-940 ticks
Delta times:
- Average: 1057.30 ticks
- Median: 890 ticks
- Mode: 890 ticks
- Min: 880 ticks
- Max: 29400 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
879 < x ≤ 890 -> 71347 <-- cached?
895 < x ≤ 1469 -> 844
1469 < x ≤ 1600 -> 27613 <-- uncached?
1600 < x ≤ 2000 -> 55
2000 < x ≤ 4000 -> 86
4000 < x ≤ 8000 -> 43
8000 < x ≤ 16000 -> 10
16000 < x -> 1
Conclusion
I believe the answer to my question would be a buggy implementation on my machine (the one with AMD CPUs with an old Linux kernel).
The CGT results of the AMD machine with the old kernel show some extreme readings. If we look at the delta times, we'll see that the most frequent delta is 1 ns. This means t