I think this mostly comes down to one thing: your test is pretty much meaningless. All those blocks are supposed to do something, and use multiple cores and asynchronous operations to do that.
Also, in your test, it's likely that a lot of time is spent on synchronization. With more realistic code, each delegate will take some time to execute, so there will be less contention and the actual overhead will be smaller than what you measured.
But to actually answer your question: yes, you're overlooking some performance tweaks. Specifically, you can set SingleProducerConstrained on ExecutionDataflowBlockOptions, which tells the block that only one producer will use it at a time, so data structures with less locking can be used. If I set this on both blocks (the BufferBlock is completely useless here, you can safely remove it), the rate rises from about 3 to 4 million items per second to more than 5 million on my computer.
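To make that concrete, here is a minimal sketch of the setting, assuming the pipeline is a TransformBlock linked directly to an ActionBlock with no BufferBlock in front. The class name PipelineSketch, the Run helper and the trivial delegates are hypothetical stand-ins, not the code from the question, and the sketch only shows where the option goes; it does not measure throughput. On .NET Core and later it needs the System.Threading.Tasks.Dataflow NuGet package.

```csharp
using System;
using System.Threading.Tasks.Dataflow;

public static class PipelineSketch
{
    // Hypothetical helper: posts 1..count through a two-block pipeline
    // and returns the sum of the doubled values.
    public static long Run(int count)
    {
        // Promise the blocks that only one producer uses them at a time,
        // which lets them use cheaper internal data structures.
        var options = new ExecutionDataflowBlockOptions
        {
            SingleProducerConstrained = true
        };

        long sum = 0;
        var transform = new TransformBlock<int, int>(x => x * 2, options);
        // The default MaxDegreeOfParallelism is 1, so this delegate never
        // runs concurrently with itself and sum needs no lock.
        var action = new ActionBlock<int>(x => { sum += x; }, options);

        // No BufferBlock: TransformBlock already has its own input queue.
        transform.LinkTo(action, new DataflowLinkOptions { PropagateCompletion = true });

        // A single thread posts, so the single-producer promise holds.
        for (int i = 1; i <= count; i++)
        {
            transform.Post(i);
        }

        transform.Complete();
        action.Completion.Wait();
        return sum;
    }

    public static void Main()
    {
        Console.WriteLine(Run(1000)); // prints 1001000
    }
}
```

Only set the option when the promise really holds: if several threads call Post on the same block concurrently, the block is no longer safe to use with SingleProducerConstrained enabled.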