TL;DR
rdtscp and lfence/rdtsc have exactly the same upstream serialization properties on Intel processors. On AMD processors with a dispatch-serializing lfence, both sequences also have the same upstream serialization properties. With respect to later instructions, rdtsc in the lfence/rdtsc sequence may be dispatched for execution simultaneously with later instructions. This may be undesirable if you also want to time those later instructions precisely. It is generally not a problem, because the reservation station (RS) scheduler prioritizes older uops for dispatch as long as there are no structural hazards. After lfence retires, the rdtsc uops would be the oldest in the RS, probably with no structural hazards, so they will be dispatched immediately (possibly together with some later uops). You could also put an lfence after rdtsc.
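As a minimal sketch of the sequences compared above, assuming GCC or Clang on x86-64 and the intrinsics from x86intrin.h: the helper names tsc_begin and tsc_end are hypothetical and only illustrate where the fences go. On AMD, the ordering relies on lfence being configured as dispatch-serializing.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86-64 assumed) */

/* Hypothetical helper: lfence/rdtsc/lfence.
 * The first lfence keeps rdtsc from executing before earlier instructions
 * have completed; the second keeps later instructions from starting
 * before the TSC has been read. */
static inline uint64_t tsc_begin(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Hypothetical helper: rdtscp/lfence.
 * rdtscp itself waits for earlier instructions, so only the trailing
 * lfence is needed. The IA32_TSC_AUX value is read but ignored here. */
static inline uint64_t tsc_end(void) {
    unsigned aux;
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}
```

The code to be timed would sit between a call to tsc_begin() and a call to tsc_end(); the difference of the two return values is the elapsed TSC tick count, including the overhead of the fences themselves.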
The Intel manual V2 says the following about rdtscp (emphasis mine):
The RDTSCP instruction is not a serializing instruction, but it does
wait until all previous instructions have executed and all previous
loads are globally visible. But it does not wait for previous stores
to be globally visible, and subsequent instructions may begin execution before the read operation is performed.
The "read operation" part here refers to reading the time-stamp counter. This suggests that rdtscp internally works like lfence followed by rdtsc plus a read of IA32_TSC_AUX. That is, lfence is performed first, and then the two register reads are executed (possibly at the same time).
On most Intel and AMD processors that support these instructions, lfence/rdtsc has a slightly larger number of uops than rdtscp. The number of lfence uops listed in Agner's tables is for the case where lfence instructions are executed back-to-back, which makes it appear that lfence is decoded into fewer uops (1 or 2) than a single lfence is actually decoded into (5 or 6 uops). Usually, lfence is used without other back-to-back lfences. That's why lfence/rdtsc contains more uops than rdtscp. Agner's tables also show that on some processors, rdtsc and rdtscp have the same number of uops, which I'm not sure is correct. It makes more sense for rdtscp to have at least one more uop than rdtsc. That said, the latency may be more important than the difference in the number of uops, because latency is what directly impacts the measurement overhead.
In terms of portability, rdtsc is older than rdtscp; rdtsc was first supported on the Pentium processors, while the first processors that support rdtscp were released in 2005-2006 (see: What is the gcc cpu-type that includes support for RDTSCP?). But most Intel and AMD processors in use today support rdtscp. Another point of comparison between the two sequences is that rdtscp clobbers one more register (ECX) than rdtsc.
In summary, if you don't care about reading the IA32_TSC_AUX MSR, there is no particularly strong reason to choose one over the other. I would use rdtscp and fall back to lfence/rdtsc (or lfence/rdtsc/lfence) on processors that don't support it. If you want maximum timing precision, use the method discussed in Memory latency measurement with time stamp counter.
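The fallback described above could look roughly like the following sketch. It assumes GCC or Clang on x86-64, uses __get_cpuid from cpuid.h, and tests the RDTSCP feature flag (CPUID leaf 0x80000001, EDX bit 27). The names has_rdtscp and read_tsc_ordered are hypothetical.

```c
#include <cpuid.h>      /* __get_cpuid (GCC/Clang) */
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence */

/* Hypothetical helper: returns 1 if the CPU reports RDTSCP support
 * (CPUID.80000001H:EDX[27]), 0 otherwise. */
static int has_rdtscp(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;  /* extended leaf not available */
    return (int)((edx >> 27) & 1u);
}

/* Hypothetical helper: prefer rdtscp, fall back to lfence/rdtsc.
 * A trailing lfence is issued in both cases so that later instructions
 * do not start before the TSC has been read. */
static uint64_t read_tsc_ordered(void) {
    uint64_t t;
    if (has_rdtscp()) {
        unsigned aux;  /* IA32_TSC_AUX, unused in this sketch */
        t = __rdtscp(&aux);
    } else {
        _mm_lfence();
        t = __rdtsc();
    }
    _mm_lfence();
    return t;
}
```

In real measurement code the CPUID check would be done once and cached, since executing cpuid on every read adds a large and variable overhead.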
As Andreas Abel pointed out, you still need an lfence after the last rdtsc(p), as it is not ordered w.r.t. subsequent instructions:
lfence                    lfence
rdtsc      -- ALLOWED --> B
B                         rdtsc

rdtscp     -- ALLOWED --> B
B                         rdtscp
This is also addressed in the manuals.
Regarding the use of rdtscp, it seems correct to me to think of it as a compact lfence + rdtsc.
The manuals use different terminology for the two instructions (e.g. "completed locally" vs "globally visible" for loads) but the behavior described seems to be the same.
I'm assuming so in the rest of this answer.
However, rdtscp is a single instruction, while lfence + rdtsc are two, which makes the lfence part of the profiled code.
Granted, lfence should be lightweight in terms of backend execution resources (it is just a marker), but it still occupies front-end resources (two uops?) and a slot in the ROB.
rdtscp is decoded into a greater number of uops due to its ability to read IA32_TSC_AUX, so while it saves (part of the) front-end resources, it occupies the backend more.
If the read of the TSC is done before (or concurrently with) the read of the processor ID, then these extra uops are only relevant for the subsequent code.
This could be a reason why it is used at the end but not at the start of the benchmark (where the extra uops would affect the code).
This is enough to bias/complicate some micro-architectural benchmarks.
You cannot avoid the lfence after an rdtsc(p), but you can avoid the one before it by using rdtscp.
This seems unnecessary for the first rdtsc, as the preceding lfence is not profiled anyway.
Another reason to use rdtscp at the end is that it was (according to Intel) meant to detect a migration to a different CPU (that's why it also atomically loads IA32_TSC_AUX), so at the end of the profiled code you may want to check that the code has not been scheduled onto another CPU.
User mode software can use RDTSCP to detect if CPU migration has occurred between successive reads of the TSC.
This, of course, requires having read IA32_TSC_AUX beforehand (to have something to compare against), so one should have an rdpid or rdtscp before the profiled code.
If one can afford to clobber ecx, the first rdtsc can be an rdtscp too (but see above); otherwise (rather than storing the processor ID while in the profiled code), rdpid can be used first (thus having an rdtsc + rdtscp pair around the profiled code).
This is open to the ABA problem, so I don't think Intel has a strong point here (unless we restrict ourselves to code short enough to be rescheduled at most once).
EDIT
As PeterCordes pointed out, from the point of view of the elapsed-time measurement, a migration A->B->A is not an issue, as the reference clock is the same.
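To illustrate the migration check discussed above, here is a minimal sketch that reads IA32_TSC_AUX with rdtscp on both sides of the timed region and compares the two values. It assumes GCC or Clang on x86-64 and an OS that stores a CPU identifier in IA32_TSC_AUX; the type timed_result and the functions time_region and spin_work are hypothetical names used only for this example. Using rdtscp at the start clobbers ecx and adds its extra uops before the timed code, as noted above.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp, _mm_lfence (GCC/Clang, x86-64 assumed) */

/* Hypothetical result type: elapsed TSC ticks plus a migration flag. */
typedef struct {
    uint64_t ticks;
    int aux_changed;  /* 1 if IA32_TSC_AUX differed between the two reads */
} timed_result;

/* Hypothetical helper: times fn(arg) with rdtscp on both sides. */
static timed_result time_region(void (*fn)(void *), void *arg) {
    unsigned aux_before, aux_after;
    timed_result r;

    uint64_t t0 = __rdtscp(&aux_before);
    _mm_lfence();  /* keep fn from starting before the first TSC read */

    fn(arg);

    uint64_t t1 = __rdtscp(&aux_after);  /* waits for earlier instructions */
    _mm_lfence();  /* keep later code from overlapping the second read */

    r.ticks = t1 - t0;
    r.aux_changed = (aux_before != aux_after);
    return r;
}

/* Hypothetical workload: increments a counter 1000 times. */
static void spin_work(void *arg) {
    volatile unsigned *n = arg;
    for (unsigned i = 0; i < 1000; i++)
        (*n)++;
}
```

A caller would discard or repeat any measurement where aux_changed is set. A migration A->B->A leaves the flag clear, which is the ABA case mentioned above.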
More information on why rdtsc(p) is not fully serializing: Why isn't RDTSC a serializing instruction?