Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


multithreading - Is duplication of state resources considered optimal for hyper-threading?

This question has an answer that says:

Hyper-threading duplicates internal resources to reduce context switch time. Resources can be: Registers, arithmetic unit, cache.

Why did CPU designers end up with duplication of state resources for simultaneous multithreading (or hyper-threading on Intel)?

Why wouldn't tripling (quadrupling, and so on) those same resources give us three logical cores and, therefore, even faster throughput?

Is the duplication that researchers arrived at in some sense optimal, or is it just a reflection of what is currently feasible (transistor size, etc.)?



1 Reply


The answer you're quoting sounds wrong. Hyperthreading competitively shares the existing ALUs, cache, and physical register file.

Running two threads at once on the same core lets it find more parallelism to keep those execution units fed with work, instead of sitting idle during cache misses, long-latency operations, and branch mispredictions. (See Modern Microprocessors A 90-Minute Guide! for very useful background, including a section on SMT. Also see this answer for more about how modern superscalar / out-of-order CPUs find and exploit instruction-level parallelism to run more than 1 instruction per clock.)

Only a few things need to be physically replicated or partitioned to track the architectural state of two CPUs in one core, and most of them are in the front-end (before the issue/rename stage). David Kanter's Haswell writeup shows how Sandybridge always partitioned the IDQ (the decoded-uop queue that feeds the issue/rename stage), while IvyBridge and Haswell can use it as one big queue when only a single thread is active. He also describes how cache is competitively shared between threads. The physical register file is shared the same way: a Haswell core has 168 physical integer registers, but the architectural state of each logical CPU only needs 16. (Out-of-order execution for each thread of course benefits from lots of registers, which is why register renaming onto a big physical register file is done in the first place.)
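As a rough illustration of the register numbers above, here is the arithmetic as a small Python sketch. The 168 and 16 figures come from the answer. Treating the architectural state as exactly 16 integer registers per logical CPU is a simplification (it ignores flags and vector registers), and the function name is made up for this example.

```python
def rename_headroom(physical_regs: int, arch_regs_per_thread: int, threads: int) -> int:
    """Physical registers left for in-flight (renamed) results once each
    logical CPU's committed architectural state has a home."""
    committed = arch_regs_per_thread * threads
    if committed > physical_regs:
        raise ValueError("not enough physical registers for the architectural state")
    return physical_regs - committed


# Haswell integer register file, using the figures quoted in the answer.
one_thread = rename_headroom(168, 16, threads=1)   # 152
two_threads = rename_headroom(168, 16, threads=2)  # 136
print(one_thread, two_threads)
```

Under these assumptions, adding the second logical CPU only pins down 16 more registers. The rest of the file is still one shared pool for renaming, which is why "duplication" is a misleading way to describe it.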

Some things are statically partitioned, like the ROB, to stop one thread from filling up the back-end with work dependent on a cache-miss load.
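To see why a static split protects the other thread, here is a toy Python simulation. It is only a sketch under made-up assumptions, not a model of any real ROB: thread A is stuck behind a cache-miss load for the whole run (so its entries never retire), thread B allocates one entry per cycle and retires it a cycle later, and the capacity of 8 entries is arbitrary.

```python
def retired_by_victim(capacity: int, cycles: int, partitioned: bool) -> int:
    """Count instructions retired by thread B while thread A is stalled.

    Toy model: A's entries are allocated but never retire (cache miss);
    B allocates one entry per cycle and retires it one cycle later.
    """
    limit = capacity // 2 if partitioned else capacity  # per-thread cap
    a_entries = 0
    b_entries = 0
    b_retired = 0
    for _ in range(cycles):
        # Retire: only B makes progress; A's entries wait on the miss.
        if b_entries > 0:
            b_entries -= 1
            b_retired += 1
        # Allocate: A first, then B, each at most one entry per cycle.
        if a_entries < limit and a_entries + b_entries < capacity:
            a_entries += 1
        if b_entries < limit and a_entries + b_entries < capacity:
            b_entries += 1
    return b_retired


shared = retired_by_victim(capacity=8, cycles=100, partitioned=False)
split = retired_by_victim(capacity=8, cycles=100, partitioned=True)
print(shared, split)
```

With these parameters B retires 7 instructions in the shared case (A eventually owns every entry) and 99 in the partitioned case. Real misses end eventually, so the real effect is a slowdown rather than a complete stop, but the direction is the same.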


Modern Intel CPUs have so many execution units that you can only barely saturate them with carefully tuned code that doesn't have any stalls and runs 4 fused-domain uops per clock. This is very rare in practice, outside something like a matrix multiply in a hand-tuned BLAS library.

Most code benefits from HT because it can't saturate a full core on its own, so the existing resources of a single core can run two threads at faster than half speed each (usually significantly faster than half).
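The "faster than half speed each" claim can be sketched with a deliberately crude Python model. It assumes two identical threads, a core that peaks at the 4 uops per clock mentioned above, and no contention for caches or queues, so treat the results as an optimistic upper bound. The function name and the sample throughputs are invented for the example.

```python
def smt_per_thread_speed(solo_uops_per_clock: float, core_peak: float = 4.0) -> float:
    """Speed of each of two identical threads sharing a core, relative to
    running alone. Upper bound: ignores cache and queue contention."""
    if not 0 < solo_uops_per_clock <= core_peak:
        raise ValueError("solo throughput must be in (0, core_peak]")
    # Together the two threads can never exceed what the core can issue.
    combined = min(core_peak, 2 * solo_uops_per_clock)
    return (combined / 2) / solo_uops_per_clock


for solo in (1.0, 2.0, 3.0, 4.0):
    print(solo, round(smt_per_thread_speed(solo), 2))
```

In this model a thread that manages 2 uops per clock or fewer on its own loses nothing when it shares the core, one that manages 3 runs at about two thirds speed, and only a thread that already saturates the core drops to exactly half speed, which is the case where HT gains nothing.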

But when only a single thread is running, the full power of a big core is available for that thread. This is what you lose out on if you design a multicore CPU that has lots of small cores. If Intel CPUs didn't implement hyperthreading, they would probably not include quite so many execution units for a single thread. Having that many execution units helps a few single-thread workloads, but it helps a lot more with HT. So you could argue that it is a case of replicating ALUs because the design supports HT, but it's not essential.

Pentium 4 didn't really have enough execution resources to run two full threads without losing more than you gained. Part of this might be the trace cache, but it also didn't have nearly as many execution units. P4 with HT made it useful to use prefetch threads that do nothing but prefetch data from an array the main thread is looping over, as described/recommended in What Every Programmer Should Know About Memory (which is otherwise still useful and relevant). A prefetch thread has a small trace-cache footprint and fetches into the L1D cache used by the main thread. This is what happens when you implement HT without enough execution resources to really make it good.


HT doesn't help at all for code that achieves very high throughput with a single thread per physical core: for example, code that saturates the front-end bandwidth of 4 uops per clock cycle without ever stalling, or code that bottlenecks only on a core's peak FMA throughput (keeping 10 FMAs in flight with 10 vector accumulators).

HT can even hurt code that slows down a lot from the extra cache misses caused by competing with another thread for space in the L1D and L2 caches (and also in the uop cache and L1I cache).

Saturating the FMAs and doing something with the results typically takes some instructions other than vfma... so high-throughput FP code is often close to saturating the front-end as well.
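The figure of 10 accumulators mentioned above is just latency times throughput. A minimal Python sketch, assuming a 5-cycle FMA latency and 2 FMAs issued per clock (Haswell-like numbers that are not stated in the answer itself); the function names are hypothetical:

```python
def accumulators_needed(fma_latency_cycles: int, fmas_per_clock: int) -> int:
    """Independent dependency chains needed to keep every FMA unit busy."""
    return fma_latency_cycles * fmas_per_clock


def sustained_fmas_per_clock(accumulators: int, fma_latency_cycles: int,
                             fmas_per_clock: int) -> float:
    """Each accumulator can start a new FMA only after its previous one
    finishes, so throughput is capped at chains / latency."""
    return min(float(fmas_per_clock), accumulators / fma_latency_cycles)


# Assumed Haswell-like figures: 5-cycle latency, 2 FMAs per clock.
print(accumulators_needed(5, 2))
for chains in (1, 4, 10):
    print(chains, sustained_fmas_per_clock(chains, 5, 2))
```

With a single accumulator the loop is latency-bound at one FMA every 5 cycles. Only with 10 independent chains does one thread reach the peak on its own, and at that point a second hyperthread has no idle FMA slots left to use.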

Agner Fog's microarch pdf says the same thing about very carefully tuned code not benefiting from HT, or even being hurt by it.

Paul Clayton's comments on the question also make some good points about SMT designs in general.


If you have different threads doing different things, SMT can still be helpful. For example, high-throughput FP code sharing a core with a thread that does mostly integer work and stalls a lot on branch mispredictions and cache misses could gain significant overall throughput. The low-throughput thread leaves most of the core unused most of the time, so running another thread that uses the remaining (say) 80% of the core's front-end and back-end resources can be very good.

