You are asking a few things, but I'll focus on the ordering constraints used by a typical shared_ptr
implementation because I think that covers the key part of your question.
An atomic operation is always atomic with respect to the variable (or POD) it applies to; modifications to a single variable will become visible to all threads in a consistent order.
The way relaxed atomic operations work is described in your question:
std::memory_order_relaxed
is still atomic with respect to the same variable but does not provide any other constraints about loads and stores with respect to other data
The following are 2 typical scenario's whereby ordering constraints on an atomic operation can be omitted (i.e. by using std::memory_order_relaxed
):
Memory ordering is not necessary because there are no dependencies on other operations, or as a commenter puts it, (..) is not part of an invariant involving other memory locations.
A common example is an atomic counter, incremented by multiple threads to keep track of the number of times a particular event has occurred.
The increment operation (fetch_add
) can be relaxed if the counter represents a value that has no dependency on other operations.
I find the example given by cppreference not very convincing because the shared_ptr
reference count does have a dependency; i.e. memory is deleted once its value becomes zero.
A better example is a web server keeping track of the number of incoming requests only for reporting purposes.
Memory ordering is necessary, but there is no need to use ordering constraints because the required synchronization has already passed
(IMO this explains better why shared_ptr
's reference count increment can be relaxed, see example below).
The shared_ptr
copy/move constructor can only be called while it has a synchronized view of the (reference to the) copied/moved-from instance (or it would be undefined behavior)
and as such, no additional ordering is necessary.
The following example illustrates how memory ordering is typically used by a shared_ptr
implementation to modify its reference count. Assume that all threads run in parallel
after sp_main
has been released (the shared_ptr
reference count is then 10).
int main()
{
std::vector<std::thread> v;
auto sp_main = std::make_shared<int>(0);
for (int i = 1; i <= 10; ++i)
{
// sp_main is passed by value
v.push_back(thread{thread_func, sp_main, i});
}
sp_main.reset();
for (auto &t : v) t.join();
}
void thread_func(std::shared_ptr<int> sp, int n)
{
// 10 threads are created
if (n == 7)
{
// Only thread #7 modifies the integer
*sp = 42;
}
// The only thead with a synchronized view of the managed integer is #7
// All other threads cannot read/write access the integer without causing a race
// 'sp' going out of scope -> destructor called
}
Thread creation guarantees an (inter-thread) happens-before relationship between make_shared
(in main
) and sp
's copy/move-constructor (inside each thread).
Therefore, shared_ptr
's constructor has a synchronized view of memory and can safely increment ref_count
with no additional ordering:
ctrlblk->ref_count.fetch_add(1, std::memory_order_relaxed);
For the destruction part, since only thread #7
writes to the shared integer, the others 9 threads are not allowed to access the same memory location without causing a race.
This creates a problem because all threads are destructed at about the same time (assume reset
in main
has been called earlier)
and only one thread is going to delete the shared integer (the one decrementing ref_count
from 1 to 0).
It is imperative that the last thread has a synchronized memory view before it deletes the integer, but since 9 out of 10 threads do not have a synchronized view, additional ordering is necessary.
The destructor may contain something like:
if (ctrlblk->ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
{
// delete managed memory
}
The atomic ref_count
has a single modification order and therefore all atomic modifications occur in some order.
Let's say the threads (in this example) that perform the last 3 decrements on ref_count
are thread #7
(3 → 2), #5
(2 → 1) and #3
(1 → 0).
Both decrements performed by threads #7
and #5
come earlier in the modification order than the one performed by #3
.
The release sequence becomes:
#7
(store release) → #5
(read-modify-write, no ordering required) → #3
(load acquire)
The end result is that the release operation performed by thread #7
has synchronized with the acquire operation performed by #3
and the integer modification (by #7
) is guaranteed to have
happened before the integer destruction (by #3
).
Technically, only the threads that have accessed the managed memory location have to perform a release operation, but since a library implementer does not know about thread actions,
all threads perform a release operation upon destruction.
For the final destruction of shared memory, technically only the last thread needs to perform an acquire operation and therefore a shared_ptr
library implementer could optimize by setting a standalone fence
that is only called by the last thread.
if (ctrlblk->ref_count.fetch_sub(1, std::memory_order_release) == 1)
{
std::atomic_thread_fence(std::memory_order_acquire);
// delete managed memory
}