Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
74 views
in Technique[技术] by (71.8m points)

.net - Unable to use more than one processor group for my threads in a C# app

According to MSDN documentation and Stephen Toub answer, my C# app should use every Logical Processor of every Processor Group because it is configured as required (see my App.config below).

I run my app on a windows server 2012 with a NUMA architecture: 2 x [cpu Xeon E5-2697 v3 at 14 cores each with Hyper Thread activated] => 2 x 14 x 2 = 56 Logical Processors.

My app start 80 threads either from "Thread Class" or "Parallel.For" and in both case it only takes 28 Logical Processors, all from the same Processor Group.

Why does the Task scheduler assign my threads on only one Processor Group?

My code is available at GitHub or the executable could be downloaded at my Home website

I've already asked this question on social.msdn.microsoft.com without any answers.

  • Update 2015-01-26: I reported a bug at connect.microsoft.com

  • Update 2015-01-30: I added CoreInfo dump as additional references.

  • Update 2015-01-30: The problem occurs also with prime95 where it only offer to select 28 threads (not c# related)

  • Update 2015-01-30: My tool now show more information like Processor Mask per node. It sounds like I do not have access to the other node (the node I do not run in)

  • Update 2015-02-02, We do NOT have Citrix installed on this particular server (sorry, I was wrong)

  • Update 2015-02-02, We contacted HP...

  • Update 2015-02-03, Added more information to my program to display processorGroup per thread and few more little gadgets.

  • Update 2015-02-17, After talked to HP dev team, I updated my answer to reflect what I learned. (Just want to mention that I received EXCELLENT support from HP)

  • Update 2015-05-13, HP confirmed the problem in a "Customer Advisory" note. See this linked document id: c04650594

I set my .Net 4.5 (or 4.5.1) App.Config to:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
    <runtime>
        <Thread_UseAllCpuGroups enabled="true"></Thread_UseAllCpuGroups>
        <GCCpuGroup enabled="true"></GCCpuGroup>
        <gcServer enabled="true"></gcServer>
    </runtime>
    <startup> 
        <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.5.1"/>
    </startup>
</configuration>

This is the dump of CoreInfo from Microsoft:

Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
Microcode signature: 00000023
HTT         *   Hyperthreading enabled
HYPERVISOR  -   Hypervisor is present
VMX         *   Supports Intel hardware-assisted virtualization
SVM         -   Supports AMD hardware-assisted virtualization
X64         *   Supports 64-bit mode

SMX         *   Supports Intel trusted execution
SKINIT      -   Supports AMD SKINIT

NX          *   Supports no-execute page protection
SMEP        *   Supports Supervisor Mode Execution Prevention
SMAP        -   Supports Supervisor Mode Access Prevention
PAGE1GB     *   Supports 1 GB large pages
PAE         *   Supports > 32-bit physical addresses
PAT         *   Supports Page Attribute Table
PSE         *   Supports 4 MB pages
PSE36       *   Supports > 32-bit address 4 MB pages
PGE         *   Supports global bit in page tables
SS          *   Supports bus snooping for cache operations
VME         *   Supports Virtual-8086 mode
RDWRFSGSBASE    *   Supports direct GS/FS base access

FPU         *   Implements i387 floating point instructions
MMX         *   Supports MMX instruction set
MMXEXT      -   Implements AMD MMX extensions
3DNOW       -   Supports 3DNow! instructions
3DNOWEXT    -   Supports 3DNow! extension instructions
SSE         *   Supports Streaming SIMD Extensions
SSE2        *   Supports Streaming SIMD Extensions 2
SSE3        *   Supports Streaming SIMD Extensions 3
SSSE3       *   Supports Supplemental SIMD Extensions 3
SSE4a       -   Supports Streaming SIMDR Extensions 4a
SSE4.1      *   Supports Streaming SIMD Extensions 4.1
SSE4.2      *   Supports Streaming SIMD Extensions 4.2

AES         *   Supports AES extensions
AVX         *   Supports AVX intruction extensions
FMA         *   Supports FMA extensions using YMM state
MSR         *   Implements RDMSR/WRMSR instructions
MTRR        *   Supports Memory Type Range Registers
XSAVE       *   Supports XSAVE/XRSTOR instructions
OSXSAVE     *   Supports XSETBV/XGETBV instructions
RDRAND      *   Supports RDRAND instruction
RDSEED      -   Supports RDSEED instruction

CMOV        *   Supports CMOVcc instruction
CLFSH       *   Supports CLFLUSH instruction
CX8         *   Supports compare and exchange 8-byte instructions
CX16        *   Supports CMPXCHG16B instruction
BMI1        *   Supports bit manipulation extensions 1
BMI2        *   Supports bit manipulation extensions 2
ADX         -   Supports ADCX/ADOX instructions
DCA         *   Supports prefetch from memory-mapped device
F16C        *   Supports half-precision instruction
FXSR        *   Supports FXSAVE/FXSTOR instructions
FFXSR       -   Supports optimized FXSAVE/FSRSTOR instruction
MONITOR     *   Supports MONITOR and MWAIT instructions
MOVBE       *   Supports MOVBE instruction
ERMSB       *   Supports Enhanced REP MOVSB/STOSB
PCLMULDQ    *   Supports PCLMULDQ instruction
POPCNT      *   Supports POPCNT instruction
LZCNT       *   Supports LZCNT instruction
SEP         *   Supports fast system call instructions
LAHF-SAHF   *   Supports LAHF/SAHF instructions in 64-bit mode
HLE         -   Supports Hardware Lock Elision instructions
RTM         -   Supports Restricted Transactional Memory instructions

DE          *   Supports I/O breakpoints including CR4.DE
DTES64      *   Can write history of 64-bit branch addresses
DS          *   Implements memory-resident debug buffer
DS-CPL      *   Supports Debug Store feature with CPL
PCID        *   Supports PCIDs and settable CR4.PCIDE
INVPCID     *   Supports INVPCID instruction
PDCM        *   Supports Performance Capabilities MSR
RDTSCP      *   Supports RDTSCP instruction
TSC         *   Supports RDTSC instruction
TSC-DEADLINE    *   Local APIC supports one-shot deadline timer
TSC-INVARIANT   *   TSC runs at constant rate
xTPR        *   Supports disabling task priority messages

EIST        *   Supports Enhanced Intel Speedstep
ACPI        *   Implements MSR for power management
TM          *   Implements thermal monitor circuitry
TM2         *   Implements Thermal Monitor 2 control
APIC        *   Implements software-accessible local APIC
x2APIC      *   Supports x2APIC

CNXT-ID     -   L1 data cache mode adaptive or BIOS

MCE         *   Supports Machine Check, INT18 and CR4.MCE
MCA         *   Implements Machine Check Architecture
PBE         *   Supports use of FERR#/PBE# pin

PSN         -   Implements 96-bit processor serial number

PREFETCHW   *   Supports PREFETCHW instruction

Maximum implemented CPUID leaves: 0000000F (Basic), 80000008 (Extended).

Logical to Physical Processor Map:
Physical Processor 0 (Hyperthreaded):
**------------------------------------------------------
Physical Processor 1 (Hyperthreaded):
--**----------------------------------------------------
Physical Processor 2 (Hyperthreaded):
----**--------------------------------------------------
Physical Processor 3 (Hyperthreaded):
------**------------------------------------------------
Physical Processor 4 (Hyperthreaded):
--------**----------------------------------------------
Physical Processor 5 (Hyperthreaded):
----------**--------------------------------------------
Physical Processor 6 (Hyperthreaded):
------------**------------------------------------------
Physical Processor 7 (Hyperthreaded):
--------------**----------------------------------------
Physical Processor 8 (Hyperthreaded):
----------------**--------------------------------------
Physical Processor 9 (Hyperthreaded):
------------------**------------------------------------
Physical Processor 10 (Hyperthreaded):
--------------------**----------------------------------
Physical Processor 11 (Hyperthreaded):
----------------------**--------------------------------
Physical Processor 12 (Hyperthreaded):
------------------------**------------------------------
Physical Processor 13 (Hyperthreaded):
--------------------------**----------------------------
Physical Processor 14 (Hyperthreaded):
----------------------------**--------------------------
Physical Processor 15 (Hyperthreaded):
------------------------------**------------------------
Physical Processor 16 (Hyperthreaded):
--------------------------------**----------------------
Physical Processor 17 (Hyperthreaded):
----------------------------------**--------------------
Physical Processor 18 (Hyperthreaded):
------------------------------------**------------------
Physical Processor 19 (Hyperthreaded):
--------------------------------------**----------------
Physical Processor 20 (Hyperthreaded):
----------------------------------------**--------------
Physical Processor 21 (Hyperthreaded):
------------------------------------------**------------
Physical Processor 22 (Hyperthreaded):
--------------------------------------------**----------
Physical Processor 23 (Hyperthreaded):
----------------------------------------------**--------
Physical Processor 24 (Hyperthreaded):
------------------------------------------------**------
Physical Processor 25 (Hyperthreaded):
--------------------------------------------------**----
Physical Processor 26 (Hyperthreaded):
----------------------------------------------------**--
Physical Processor 27 (Hyperthreaded):
------------------------------------------------------**

Logical Processor to Socket Map:
Socket 0:
****************************----------------------------
Socket 1:
----------------------------****************************

Logical Processor to NUMA Node Map:
NUMA Node 0:
****************************----------------------------
NUMA Node 1:
----------------------------****************************
Calculating Cross-NUMA Node Access Cost...

Approximate Cross-NUMA Node Access Cost (relative to fastest):
     00  01
00: 1.0 1.1
01: 1.1 1.1

Logical Processor to Cache Map:
Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
**------------------------------------------------------
Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The bug has been fixed by a new (yet unpublished) HP Bios (at the time of writing this).

The new Bios (targeting HP Proliant DL360 and DL380 Gen9) introduce a new setting: "NUMA Group Size Optimization" with choice of [Clustered - default] or [Flat]. HP says to set it to flat.

The sceenshot part of this answer has been conducted on a DL380 instead of a DL360 because of server availability. But I expect same behavior on DL360. The problem disapeared, we had only one group.

As far as I know, the OS communicate with the BIOS to know the CPU(s) configuration. The Bios play an important role in how the OS will present the logical processors available to applications (Processor Group, Affinity, etc).

About the Microsoft documentation Supporting Systems That Have More Than 64 Processors and Processor Groups it is clearly stated that more than one processor group will only be created when the Logical Processor (LC) count is >64. On our server (56 LC) with Numa Architecture set to "Clustered" we had 2 processor groups. A hardware engineer working at HP Bios dev team explained me that when set to "Clustered", the Bios is fooling Windows by padding the real number of logical processor to 72 Logical Processor (the max number of Logical Processor for the E5 v3 Family). The real number of LC is 56 in our DL360. That's the reason why we add 2 groups instead of 1. The Microsoft documentation seems accurate. I personally think that it would be better to create 1 group per numa node whenever possible but in our case, there is a bug. What is faulty is hard to know between HP or Microsoft when the HP Bios setting is set to Clustered (default) but Microsoft seems to not support that option which seems to cause our problem.

On HP Bios for DL360 and DL380, The Bios configuration "Numa Configuration" set to "Clustered" (default) will create 2 groups although there is only 56 Logical Processors (when hyperthreaded). The result is that only one processor is visible at a time for any application. Probably also due to HP fooling Windows by padding fake number of Logical Processors. It sounds like Microsoft does not expect that. Our C# app can't run on the 2 groups. It's hard to blame Microsoft on that behavior where HP does something they can't anticipated. Perhaps we will see, one day, Windows supporting many groups when LC <= 64.

About Prime95. This CPU stress test software has good documentation on Wikipedia that clearly state that it will load into only one processor group (in Limits section).

Running in Numa Architecture set to Flat


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...