See Edit 3.
I was getting the wrong results because I was measuring without including prefetch-triggered events, as discussed here. That being said, AFAIK I am only seeing a reduction in RFO requests with `rep movsb` as compared to a Temporal Store `memcpy` because of better prefetching on loads and no prefetching on stores, NOT because RFO requests are being optimized out for full-cache-line stores. This kind of makes sense, as we don't see RFO requests optimized out for `vmovdqa` with a `zmm` register, which we would expect if that were really the case for full-cache-line stores. That being said, the lack of prefetching on stores and the lack of non-temporal writes make it hard to see how `rep movsb` has reasonable performance.
Edit: It is possible that the RFO requests from `rep movsb` differ from those for `vmovdqa` in that, for `rep movsb`, it might not request the data, just take the line in exclusive state. This could also be the case for stores with a `zmm` register. I don't see any perf metrics to test this, however. Does anyone know of any?
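One quick way to hunt for a suitable counter is to grep perf's event list for RFO-related events on your own machine (a sketch; which event names show up depends on the CPU and perf version):

```shell
# List every RFO-related hardware event perf knows about on this CPU.
# Event names differ between microarchitectures, so grep the list
# rather than hard-coding a name.
perf list 2>/dev/null | grep -i rfo
```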
Questions
- Why am I not seeing a reduction in RFO requests when I use `rep movsb` for `memcpy`, as compared to a `memcpy` implemented with `vmovdqa`?
- Why am I seeing more RFO requests when I use `rep movsb` for `memcpy`, as compared to a `memcpy` implemented with `vmovdqa`?
Two separate questions, because I believe I should be seeing a reduction in RFO requests with `rep movsb`; but even if that is not the case, should I really be seeing an increase?
Background
CPU - Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
I was trying to test the number of RFO requests when using different methods of `memcpy`, including:

- Temporal Stores -> `vmovdqa`
- Non-Temporal Stores -> `vmovntdq`
- Enhanced REP MOVSB -> `rep movsb`
I have been unable to see a reduction in RFO requests using `rep movsb`. In fact, I have been seeing more RFO requests with `rep movsb` than with Temporal Stores. This is counter-intuitive, given that the consensus understanding seems to be that for Ivy Bridge and newer, `rep movsb` is able to avoid RFO requests and in turn save memory bandwidth:
> When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:
>
> - Avoiding the RFO request when it knows the entire cache line will be overwritten.

and:

> Note that on Ivybridge and Haswell, with buffers too large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs an RFO into LLC, rep movsb does not
I wrote a simple test program to verify this but was unable to do so.
Test Program
```c
#include <assert.h>
#include <errno.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))

#define TEMPORAL          0
#define NON_TEMPORAL      1
#define REP_MOVSB         2
#define NONE_OF_THE_ABOVE 3

#define TODO 1

#if TODO == NON_TEMPORAL
#define store(x, y) _mm256_stream_si256((__m256i *)(x), y)
#else
#define store(x, y) _mm256_store_si256((__m256i *)(x), y)
#endif
#define load(x) _mm256_load_si256((__m256i *)(x))

void *
mmapw(uint64_t sz) {
    void * p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    assert(p != MAP_FAILED);
    return p;
}

void BENCH_ATTR
bench() {
    uint64_t len       = 64UL * (1UL << 22);
    uint64_t len_alloc = len;
    char *   dst_alloc = (char *)mmapw(len);
    char *   src_alloc = (char *)mmapw(len);
    for (uint64_t i = 0; i < len; i += 4096) {
        // page in before testing. perf metrics appear to still come through
        dst_alloc[i] = 0;
        src_alloc[i] = 0;
    }
    uint64_t dst     = (uint64_t)dst_alloc;
    uint64_t src     = (uint64_t)src_alloc;
    uint64_t dst_end = dst + len;

    asm volatile("lfence" : : : "memory");
#if TODO == REP_MOVSB
    // test rep movsb
    asm volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(len) : : "memory");
#elif TODO == TEMPORAL || TODO == NON_TEMPORAL
    // test vmovntdq or vmovdqa
    for (; dst < dst_end;) {
        __m256i lo = load(src);
        __m256i hi = load(src + 32);
        store(dst, lo);
        store(dst + 32, hi);
        dst += 64;
        src += 64;
    }
#endif
    asm volatile("lfence\n\tmfence" : : : "memory");

    assert(!munmap(dst_alloc, len_alloc));
    assert(!munmap(src_alloc, len_alloc));
}

int
main(int argc, char ** argv) {
    bench();
}
```
- Build (assuming the file name is rfo_test.c):

```
gcc -O3 -march=native -mtune=native rfo_test.c -o rfo_test
```

- Run (assuming the executable is rfo_test):

```
perf stat -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test
```
Test Data
Note: data with less noise is in Edit 2 below.
TODO = TEMPORAL (`vmovdqa`):

```
583,912,867      cpu-cycles
9,352,817        l2_rqsts.all_rfo
188,343,479      offcore_requests_outstanding.cycles_with_demand_rfo
11,560,370       offcore_requests.demand_rfo

0.166557783 seconds time elapsed
0.044670000 seconds user
0.121828000 seconds sys
```

TODO = NON_TEMPORAL (`vmovntdq`):

```
560,933,296      cpu-cycles
7,428,210        l2_rqsts.all_rfo
123,174,665      offcore_requests_outstanding.cycles_with_demand_rfo
8,402,627        offcore_requests.demand_rfo

0.156790873 seconds time elapsed
0.032157000 seconds user
0.124608000 seconds sys
```

TODO = REP_MOVSB (`rep movsb`):

```
566,898,220      cpu-cycles
11,626,162       l2_rqsts.all_rfo
178,043,659      offcore_requests_outstanding.cycles_with_demand_rfo
12,611,324       offcore_requests.demand_rfo

0.163038739 seconds time elapsed
0.040749000 seconds user
0.122248000 seconds sys
```

TODO = NONE_OF_THE_ABOVE (setup only):

```
521,061,304      cpu-cycles
7,527,122        l2_rqsts.all_rfo
123,132,321      offcore_requests_outstanding.cycles_with_demand_rfo
8,426,613        offcore_requests.demand_rfo

0.139873929 seconds time elapsed
0.007991000 seconds user
0.131854000 seconds sys
```
Test Results
The baseline RFO request count, with the setup but without the `memcpy`, comes from TODO = NONE_OF_THE_ABOVE: 7,527,122 RFO requests. With TODO = TEMPORAL (using `vmovdqa`) we see 9,352,817 RFO requests. This is lower than TODO = REP_MOVSB (using `rep movsb`), which has 11,626,162 RFO requests: roughly 2 million more RFO requests with `rep movsb` than with Temporal Stores. The only case where I was able to see RFO requests avoided was TODO = NON_TEMPORAL (using `vmovntdq`), which has 7,428,210 RFO requests, about the same as the baseline, indicating none from the `memcpy` itself.
I played around with different sizes for the `memcpy`, thinking I might need to decrease/increase the size before `rep movsb` would make that optimization, but I have been seeing the same general results. For all sizes I tested, the number of RFO requests is ordered NON_TEMPORAL < TEMPORAL < REP_MOVSB.
Theories
- [Unlikely] Something new on Icelake?

Edit: @PeterCordes was able to reproduce the results on Skylake.

I don't think this is an Icelake-specific thing, as the only change I could find in the Intel Manual on `rep movsb` for Icelake is:

> Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long. Support for fast-short REP MOVSB is enumerated by the CPUID feature flag: CPUID [EAX=7H, ECX=0H).EDX.FAST_SHORT_REP_MOVSB[bit 4] = 1. There is no change in the REP STOS performance.

This should not be a factor in my test program, given that `len` is well above 128 bytes.
- [Likelier] My test program is broken

I don't see any issues, but this is a very surprising result. At the very least, I verified that the compiler is not optimizing out the tests here.

Edit: Fixed the build instructions to use `g++` instead of `gcc`, and the file suffix from `.c` to `.cc`.
Edit2: Back to C and `gcc`, and a better perf recipe:

```
perf stat --all-user -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test
```

Numbers with the better perf recipe (same trend but less noise):
```
161,214,341      cpu-cycles
1,984,998        l2_rqsts.all_rfo
61,238,129       offcore_requests_outstanding.cycles_with_demand_rfo
3,161,504        offcore_requests.demand_rfo

0.169413413 seconds time elapsed
0.044371000 seconds user
0.125045000 seconds sys
```

```
142,689,742      cpu-cycles
3,106            l2_rqsts.all_rfo
4,581            offcore_requests_outstanding.cycles_with_demand_rfo
30               offcore_requests.demand_rfo

0.166300952 seconds time elapsed
```