Windows QueryPerformanceCounter() has logic to determine the number of processors and invoke syncronization logic if necessary. It attempts to use the TSC register but for multiprocessor systems this register is not guaranteed to be syncronized between processors (and more importantly can vary greatly due to intelligent downclocking and sleep states).
MSDN says that it doesn't matter which processor this is called on so you may be seeing extra syncronization code for such a situation cause overhead. Also remember that it can invoke a bus transfer so you may be seeing bus contention delays.
Try using SetThreadAffinityMask() if possible to bind it to a specific processor. Otherwise you might just have to live with the delay or you could try a different timer (for example take a look at http://en.wikipedia.org/wiki/High_Precision_Event_Timer).
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…