Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
104 views
in Technique[技术] by (71.8m points)

Avoiding the overhead of C# virtual calls

I have a few heavily optimized math functions that take 1-2 nanoseconds to complete. These functions are called hundreds of millions of times per second, so call overhead is a concern, despite the already-excellent performance.

In order to keep the program maintainable, the classes that provide these methods inherit an IMathFunction interface, so that other objects can directly store a specific math function and use it when needed.

public interface IMathFunction
{
  double Calculate(double input);
  double Derivate(double input);
}

public SomeObject
{
  // Note: There are cases where this is mutable
  private readonly IMathFunction mathFunction_; 

  public double SomeWork(double input, double step)
  {
    var f = mathFunction_.Calculate(input);
    var dv = mathFunction_.Derivate(input);
    return f - (dv * step);
  }
}

This interface is causing an enormous overhead compared to a direct call due to how the consuming code uses it. A direct call takes 1-2ns, whereas the virtual interface call takes 8-9ns. Evidently, the presence of the interface and its subsequent translation of the virtual call is the bottleneck for this scenario.

I would like to retain both maintainability and performance if possible. Is there a way I can resolve the virtual function to a direct call when the object is instantiated so that all subsequent calls are able to avoid the overhead? I assume this would involve creating delegates with IL, but I wouldn't know where to start with that.

question from:https://stackoverflow.com/questions/53785910/avoiding-the-overhead-of-c-sharp-virtual-calls

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

So this has obvious limitations and should not be used all the time anywhere you have an interface, but if you have a place where perf really needs to be maximized you can use generics:

public SomeObject<TMathFunction> where TMathFunction: struct, IMathFunction 
{
  private readonly TMathFunction mathFunction_;

  public double SomeWork(double input, double step)
  {
    var f = mathFunction_.Calculate(input);
    var dv = mathFunction_.Derivate(input);
    return f - (dv * step);
  }
}

And instead of passing an interface, pass your implementation as TMathFunction. This will avoid vtable lookups due to an interface and also allow inlining.

Note the use of struct is important here, as generics will otherwise access the class via the interface.

Some implementation:

I made a simple implementation of IMathFunction for testing:

class SomeImplementationByRef : IMathFunction
{
    public double Calculate(double input)
    {
        return input + input;
    }

    public double Derivate(double input)
    {
        return input * input;
    }
}

... as well as a struct version and an abstract version.

So, here's what happens with the interface version. You can see it is relatively inefficient because it performs two levels of indirection:

    return obj.SomeWork(input, step);
sub         esp,40h  
vzeroupper  
vmovaps     xmmword ptr [rsp+30h],xmm6  
vmovaps     xmmword ptr [rsp+20h],xmm7  
mov         rsi,rcx
vmovsd      qword ptr [rsp+60h],xmm2  
vmovaps     xmm6,xmm1
mov         rcx,qword ptr [rsi+8]          ; load mathFunction_ into rcx.
vmovaps     xmm1,xmm6  
mov         r11,7FFED7980020h              ; load vtable address of the IMathFunction.Calculate function.
cmp         dword ptr [rcx],ecx  
call        qword ptr [r11]                ; call IMathFunction.Calculate function which will call the actual Calculate via vtable.
vmovaps     xmm7,xmm0
mov         rcx,qword ptr [rsi+8]          ; load mathFunction_ into rcx.
vmovaps     xmm1,xmm6  
mov         r11,7FFED7980028h              ; load vtable address of the IMathFunction.Derivate function.
cmp         dword ptr [rcx],ecx  
call        qword ptr [r11]                ; call IMathFunction.Derivate function which will call the actual Derivate via vtable.
vmulsd      xmm0,xmm0,mmword ptr [rsp+60h] ; dv * step
vsubsd      xmm7,xmm7,xmm0                 ; f - (dv * step)
vmovaps     xmm0,xmm7  
vmovaps     xmm6,xmmword ptr [rsp+30h]  
vmovaps     xmm7,xmmword ptr [rsp+20h]  
add         rsp,40h  
pop         rsi  
ret  

Here's an abstract class. It's a little more efficient but only negligibly:

        return obj.SomeWork(input, step);
 sub         esp,40h  
 vzeroupper  
 vmovaps     xmmword ptr [rsp+30h],xmm6  
 vmovaps     xmmword ptr [rsp+20h],xmm7  
 mov         rsi,rcx  
 vmovsd      qword ptr [rsp+60h],xmm2  
 vmovaps     xmm6,xmm1  
 mov         rcx,qword ptr [rsi+8]           ; load mathFunction_ into rcx.
 vmovaps     xmm1,xmm6  
 mov         rax,qword ptr [rcx]             ; load object type data from mathFunction_.
 mov         rax,qword ptr [rax+40h]         ; load address of vtable into rax.
 call        qword ptr [rax+20h]             ; call Calculate via offset 0x20 of vtable.
 vmovaps     xmm7,xmm0  
 mov         rcx,qword ptr [rsi+8]           ; load mathFunction_ into rcx.
 vmovaps     xmm1,xmm6  
 mov         rax,qword ptr [rcx]             ; load object type data from mathFunction_.
 mov         rax,qword ptr [rax+40h]         ; load address of vtable into rax.
 call        qword ptr [rax+28h]             ; call Derivate via offset 0x28 of vtable.
 vmulsd      xmm0,xmm0,mmword ptr [rsp+60h]  ; dv * step
 vsubsd      xmm7,xmm7,xmm0                  ; f - (dv * step)
 vmovaps     xmm0,xmm7
 vmovaps     xmm6,xmmword ptr [rsp+30h]  
 vmovaps     xmm7,xmmword ptr [rsp+20h]  
 add         rsp,40h  
 pop         rsi  
 ret  

So both an interface and an abstract class rely heavily on branch target prediction to have acceptable performance. Even then, you can see there's quite a lot more going into it, so the best-case is still relatively slow while the worst-case is a stalled pipeline due to a mispredict.

And finally here's the generic version with a struct. You can see it's massively more efficient because everything has been fully inlined so there's no branch prediction involved. It also has the nice side effect of removing most of the stack/parameter management that was in there too, so the code becomes very compact:

    return obj.SomeWork(input, step);
push        rax  
vzeroupper  
movsx       rax,byte ptr [rcx+8]  
vmovaps     xmm0,xmm1  
vaddsd      xmm0,xmm0,xmm1  ; Calculate - got inlined
vmulsd      xmm1,xmm1,xmm1  ; Derivate - got inlined
vmulsd      xmm1,xmm1,xmm2  ; dv * step
vsubsd      xmm0,xmm0,xmm1  ; f - 
add         rsp,8  
ret  

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...