The answer to this question strongly depends upon the architecture and the cache level, along with where the threads are actually running.
For example, recent Intel multi-core CPUs have L1 caches that are per-core, and an L2 cache that is shared among cores in the same CPU package; different CPU packages have their own L2 caches. (On the newest designs the L2 is also per-core and the shared last-level cache is the L3, but the principle is the same.)
Even when your threads are running on two cores within the same package, if both threads access data within the same cache line, that cache line will bounce back and forth between the two L1 caches. This is very inefficient, and you should design your algorithm to avoid this situation.
A few comments have asked about how to go about avoiding this problem.
At heart, it's really not particularly complicated - you just want to avoid having two threads simultaneously access data that is located on the same cache line, where at least one thread is writing to the data. (As long as all the threads are only reading the data, there's no problem - on most architectures, read-only data can be present in multiple caches.)
To do this, you need to know the cache line size - this varies by architecture, but currently most x86 and x86-64 family chips use a 64 byte cache line (consult your architecture manual for other architectures). You will also need to know the size of your data structures.
If you ask your compiler to align the shared data structure of interest to a 64-byte boundary (for example, your array `output`), then you know that it will start at the start of a cache line, and you can also calculate where the subsequent cache line boundaries are. If your `int` is 4 bytes, then each cache line will contain exactly 16 `int` values. As long as the array starts on a cache line boundary, `output[0]` through `output[15]` will be on one cache line, and `output[16]` through `output[31]` on the next. In this case, you would design your algorithm such that each thread works on a block of adjacent `int` values that is a multiple of 16.
If you are storing complicated `struct` types rather than plain `int`, the `pahole` utility will be of use. It analyses the `struct` types in your compiled binary, and shows you the layout (including padding) and total size. You can then adjust your `struct`s using this output - for example, you may want to manually add some padding so that your `struct` is a multiple of the cache line size.
On POSIX systems, the `posix_memalign()` function is useful for allocating a block of memory with a specified alignment.