The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event on Intel processors. The MEM_TRANS_RETIRED.LOAD_LATENCY_* events are special and can only be counted by using the p modifier. That is, you have to specify mem-loads:p to perf to use the event correctly.
MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event, and it only makes sense to count it at the precise level. According to this Intel article (emphasis mine):
When a user elects to sample one of these events, special hardware is
used that can keep track of a data load from issue to completion.
This is more complicated than simply counting instances of an event
(as with normal event-based sampling), and so only some loads are
tracked. Loads are randomly chosen, the latency determined for each,
and the correct event(s) incremented (latency >4, >8, >16, etc). Due
to the nature of the sampling for this event, only a small percentage
of an application's data loads can be tracked at any one time.
As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means counts the total number of loads, and it is not designed for that purpose at all.
If you want to determine which instructions in your code are issuing load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem, and it achieves its purpose by using this event.
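As a rough sketch of that workflow, the script below samples memory accesses with perf mem and then prints the report. The workload path ./my_app and the WORKLOAD variable are placeholders I made up for illustration; the script only runs perf if both perf and the workload are actually present, and otherwise just prints what it would have done.

```shell
#!/bin/sh
# Hypothetical workload path; replace with your own binary.
WORKLOAD="${WORKLOAD:-./my_app}"

if command -v perf >/dev/null 2>&1 && [ -x "$WORKLOAD" ]; then
    # Sample memory accesses while the workload runs (writes perf.data).
    perf mem record "$WORKLOAD"
    # Show the sampled accesses, including per-sample load latency.
    perf mem report
else
    echo "would run: perf mem record $WORKLOAD && perf mem report"
fi
```

Because only a small, randomly chosen fraction of loads is tracked, the sample counts in the report should be read as a profile of where slow loads come from, not as a total load count.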
If you want to count the total number of load uops retired, then you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.
On the other hand, mem-stores and L1-dcache-stores are mapped to the exact same performance event on all current Intel processors, namely MEM_UOPS_RETIRED.ALL_STORES, which does count all retired store uops.
So in summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These are mapped to the raw events you have used in the answer you posted, but they are more portable because they also work on AMD processors.
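For example, a counting run along these lines might look like the sketch below. Again, ./my_app and the variable names are placeholders of mine, not something from your setup, and the script falls back to printing the command when perf or the workload is missing.

```shell
#!/bin/sh
# Hypothetical workload path; replace with your own binary.
WORKLOAD="${WORKLOAD:-./my_app}"
# Generic cache events that perf maps to the appropriate raw events.
EVENTS="L1-dcache-loads,L1-dcache-stores"

if command -v perf >/dev/null 2>&1 && [ -x "$WORKLOAD" ]; then
    # Count retired loads and stores over the whole run.
    perf stat -e "$EVENTS" -- "$WORKLOAD"
else
    echo "would run: perf stat -e $EVENTS -- $WORKLOAD"
fi
```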