c++ - Expression templates and C++11

Question

Welcome To Ask or Share your Answers For Others

c++ - Expression templates and C++11

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

c++ - Expression templates and C++11

Let's look at one particular benefit of expression templates: ETs can be used to avoid vector-sized temporaries in memory which occur in overloaded operators like:

template<typename T>
std::vector<T> operator+(const std::vector<T>& a, const std::vector<T>& b)
{
  std::vector<T> tmp;   // vector-sized temporary
  for_each(...);
  return tmp;
}

In C++11 the return statement of this function applies move semantics. No copy of the vector. That's a win.

However, if I look at a simple expression like

d = a + b + c;

I see that the above function gets called twice (for both operator+) while the final assignment can be done with move semantics.

In total 2 loops are executed. Means that I put out a temporary and read it back in right after. For big vectors this falls out of cache. That's worse than expression templates. They can do the whole thing in just 1 loop. ETs can execute the above code equivalent to:

for(int i=0 ; i < vec_length ; ++i)
  d[i] = a[i] + b[i] + c[i];

I was wondering whether lambdas together with move semantics or any other new feature can do as good as ETs. Any thoughts?

Edit:

Basically, using the ET technique the compiler builds a parse tree that resembles the algebraic expression with it's type system. This tree consists of inner nodes and leaf nodes. The inner nodes represent operations (addition, multiplication, etc.) and the leaf nodes represent references to the data objects.

I tried to think of the whole computation process in the fashion of a stack machine: Take an operation from an operation stack and pull the next arguments from the argument stack and evaluate the operation. Put the result back on the stack waiting for the operation.

To represent these two different objects (operation stack and data leaf stack) I bundled together a std::tuple for the operations and a std::tuple for the data leaves into a std::pair<>. Initially I used a std:vector but that resulted in runtime overhead.

The whole process goes in two phases: Stack machine initialisation where the operation and argument stack are initialised. And the evaluation phase which is triggered by assigning the paired containers to the vector.

I created a class Vec which holds a private array<int,5> (the payload) and which features an overloaded assignment operator that takes the "expression".

The global operator* is overloaded for all combinations of taking Vec and "expression" to enable the correct handling also in the case where we have more than just a*b. (Notice, I switched for this educational example to the multiplication - basically to quickly spot the imull in the assembler.)

What is done first before the evaluation starts is "extracting" the values out of the involved Vec objects and initializing the argument stack. That was necessary to not have different kinds of objects lying around: Indexable vectors and non-indexable results. This is what the Extractor is for. Nice thing again: Variadic templates are used which in this case results in no run-time overhead (all this is done at compile time).

The whole thing works. The expression is nicely evaluated (I also added the addition, but that is left out here to fit the code). Below you can see the assembler output. Just raw compuation, exactly as you want it to be: En-par with ET technique.

Upshot. The new language features of C++11 offer the variadic templates which (along with template meta-programming) open up the area of compile time computation. I showed here how the benefits of variadic templates can be used to produce code as good as with the traditional ET technique.

#include<algorithm>
#include<iostream>
#include<vector>
#include<tuple>
#include<utility>
#include<array>



template<typename Target,typename Tuple, int N, bool end>
struct Extractor {
  template < typename ... Args >
  static Target index(int i,const Tuple& t, Args && ... args)
  {
    return Extractor<Target, Tuple,  N+1, 
             std::tuple_size<Tuple>::value == N+1>::
      index(i, t , std::forward<Args>(args)..., std::get<N>(t).vec[i] );
  }
};

template < typename Target, typename Tuple, int N >
struct Extractor<Target,Tuple,N,true>
{
    template < typename ... Args >
    static Target index(int i,Tuple const& t, 
            Args && ... args) { 
      return Target(std::forward<Args>(args)...); }
};

template < typename ... Vs > 
std::tuple<typename std::remove_reference<Vs>::type::type_t...>
extract(int i , const std::tuple<Vs...>& tpl)
{
  return Extractor<std::tuple<typename std::remove_reference<Vs>::type::type_t...>,
           std::tuple<Vs...>, 0,
           std::tuple_size<std::tuple<Vs...> >::value == 0>::index(i,tpl);
}


struct Vec {
  std::array<int,5> vec;
  typedef int type_t;

  template<typename... OPs,typename... VALs>
  Vec& operator=(const std::pair< std::tuple<VALs...> , std::tuple<OPs...> >& e) {
    for( int i = 0 ; i < vec.size() ; ++i ) {
      vec[i] = eval( extract(i,e.first) , e.second );
    }
  }
};




template<int OpPos,int ValPos, bool end>
struct StackMachine {
  template<typename... OPs,typename... VALs>
  static void eval_pos( std::tuple<VALs...>& vals , const std::tuple<OPs...> & ops )
  {
    std::get<ValPos+1>( vals ) =
      std::get<OpPos>(ops).apply( std::get<ValPos>( vals ) , 
                  std::get<ValPos+1>( vals ) );
    StackMachine<OpPos+1,ValPos+1,sizeof...(OPs) == OpPos+1>::eval_pos(vals,ops);
  }
};

template<int OpPos,int ValPos>
struct StackMachine<OpPos,ValPos,true> {
  template<typename... OPs,typename... VALs>
  static void eval_pos( std::tuple<VALs...>& vals , 
            const std::tuple<OPs...> & ops )
  {}
};



template<typename... OPs,typename... VALs>
int eval( const std::tuple<VALs...>& vals , const std::tuple<OPs...> & ops )
{
  StackMachine<0,0,false>::eval_pos(const_cast<std::tuple<VALs...>&>(vals),ops);
  return std::get<sizeof...(OPs)>(vals);
}




struct OpMul {
  static int apply(const int& lhs,const int& rhs)  {
    return lhs*rhs;
  }
};

std::pair< std::tuple< const Vec&, const Vec& > , std::tuple<OpMul> >
operator*(const Vec& lhs,const Vec& rhs)
{
  return std::make_pair( std::tuple< const Vec&, const Vec& >( lhs , rhs ) , 
             std::tuple<OpMul>( OpMul() ) );
}

template<typename... OPs,typename... VALs>
std::pair< std::tuple< const Vec&, VALs... > , std::tuple<OPs...,OpMul> >
operator*(const Vec& lhs,const std::pair< std::tuple< VALs... > , std::tuple<OPs...> >& rhs)
{
  return std::make_pair( std::tuple_cat( rhs.first , std::tuple< const Vec& >(lhs)  ) , 
             std::tuple_cat( rhs.second , std::tuple<OpMul>( OpMul() )  ) );
}

template<typename... OPs,typename... VALs>
std::pair< std::tuple< const Vec&, VALs... > , std::tuple<OPs...,OpMul> >
operator*(const std::pair< std::tuple< VALs... > , std::tuple<OPs...> >& lhs,
      const Vec& rhs)
{
  return std::make_pair( std::tuple_cat( lhs.first , std::tuple< const Vec& >(rhs)  ) , 
             std::tuple_cat( lhs.second , std::tuple<OpMul>( OpMul() ) ) );
}

int main()
{
  Vec d,c,b,a;


  for( int i = 0 ; i < d.vec.size() ; ++i ) {
    a.vec[i] = 10+i;
    b.vec[i] = 20+i;
    c.vec[i] = 30+i;
    d.vec[i] = 0;
  }

  d = a * b * c * a;

  for( int i = 0 ; i < d.vec.size() ; ++i ) 
    std::cout << d.vec[i] << std::endl;
}

Assembler generated with g++-4.6 -O3 (I had to put some runtime dependence into the vector initialization so that the compiler doesn't calculate the whole thing at compile time and you actually see the imull instaructions.)

imull   %esi, %edx
imull   32(%rsp), %edx
imull   %edx, %esi
movl    68(%rsp), %edx
imull   %ecx, %edx
movl    %esi, (%rsp)
imull   36(%rsp), %edx
imull   %ecx, %edx
movl    104(%rsp), %ecx
movl    %edx, 4(%rsp)
movl    72(%rsp), %edx
imull   %ecx, %edx
imull   40(%rsp), %edx
imull   %ecx, %edx
movl    108(%rsp), %ecx
movl    %edx, 8(%rsp)
movl    76(%rsp), %edx
imull   %ecx, %edx
imull   44(%rsp), %edx
imull   %ecx, %edx
movl    112(%rsp), %ecx
movl    %edx, 12(%rsp)
movl    80(%rsp), %edx
imull   %ecx, %edx
imull   %eax, %edx
imull   %ecx, %edx
movl    %edx, 16(%rsp)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:26:56+0000

I was wondering whether lambdas together with move semantics or any other new feature can do as good as ETs. Any thoughts?

Quick Answer

Move semantics are not a total panacea on their own --techniques such as expression templates (ETs) are still needed in C++11 to eliminate overheads such as moving data around! So, to answer your question quickly before diving into the rest of my answer, move semantics, etc. doesn't completely replace ETs as my answer illustrates below.

Detailed Answer

ETs typically return proxy objects to defer evaluation until later, so there is no immediate apparent benefit of C++11 language features until the code that triggers the computation. That said, one would not want to write ET code, however, that triggers run-time code generation during the building of the expression tree with the proxies. Nicely, C++11's move semantics and perfect forwarding can help avoid such overheads should that otherwise occur. (Such would not have been possible in C++03.)

Essentially, when writing ETs one wants to exploit the language features in a way to generate optimal code once the member function(s) of the involved proxy objects are invoked. In C++11 this will include using perfect forwarding, move semantics over copying, etc. if such is actually still needed over and above what the compiler can already do. The name of the game is to minimize the run-time code generated and/or maximize the run-time speed and/or minimize the run-time overhead.

I wanted to actually try some ETs with C++11 features to see if I could elide ALL intermediate temporary instance types with a a = b + c + d; expression. (As this was just a fun break from my normal activities so I did not compare it to or write ET code purely using C++03. Also I did not worry about all aspects of code polishing that appears below.)

To start with, I did not use lambdas --as I preferred to use explicit types and functions-- so I won't argue for/against lambdas with respect to your question. My guess is that they would be similar to using functors and performing no better than the non-ET code below (i.e., moves would be required) --at least until compilers can automatically optimize lambdas using their own internal ETs for such. The code I wrote, however, exploits move semantics and perfect forwarding. Here's what I did starting with the results and then finally presenting the code.

I created a math_vector<N> class where N==3 and it defines an internal private instance of std::array<long double, N>. The members are a default constructor, copy and move constructors and assignments, an initializer list constructor, a destructor, a swap() member, operator [] to access elements of the vector and operator +=. Used without any expression templates, this code:

{
  cout << "CASE 1:
";
  math_vector<3> a{1.0, 1.1, 1.2};
  math_vector<3> b{2.0, 2.1, 2.2};
  math_vector<3> c{3.0, 3.1, 3.2};
  math_vector<3> d{4.0, 4.1, 4.2};
  math_vector<3> result = a + b + c + d;
  cout << '[' << &result << "]: " << result << "
";
}

outputs (when compiled with clang++ 3.1 or g++ 4.8 with -std=c++11 -O3):

CASE 1:
0x7fff8d6edf50: math_vector(initlist)
0x7fff8d6edef0: math_vector(initlist)
0x7fff8d6ede90: math_vector(initlist)
0x7fff8d6ede30: math_vector(initlist)
0x7fff8d6edd70: math_vector(copy: 0x7fff8d6edf50)
0x7fff8d6edda0: math_vector(move: 0x7fff8d6edd70)
0x7fff8d6eddd0: math_vector(move: 0x7fff8d6edda0)
0x7fff8d6edda0: ~math_vector()
0x7fff8d6edd70: ~math_vector()
[0x7fff8d6eddd0]: (10,10.4,10.8)
0x7fff8d6eddd0: ~math_vector()
0x7fff8d6ede30: ~math_vector()
0x7fff8d6ede90: ~math_vector()
0x7fff8d6edef0: ~math_vector()
0x7fff8d6edf50: ~math_vector()

i.e., the four explicit constructed instances using initializer lists (i.e., the initlist items), the result variable (i.e., 0x7fff8d6eddd0), and, also makes an additional three objects copying and moving.

To only focus on temporaries and moving, I created a second case that only creates result as a named variable --all others are rvalues:

{
  cout << "CASE 2:
";
  math_vector<3> result =
    math_vector<3>{1.0, 1.1, 1.2} +
    math_vector<3>{2.0, 2.1, 2.2} +
    math_vector<3>{3.0, 3.1, 3.2} +
    math_vector<3>{4.0, 4.1, 4.2}
  ;
  cout << '[' << &result << "]: " << result << "
";
}

which outputs this (again when ETs are NOT used):

CASE 2:
0x7fff8d6edcb0: math_vector(initlist)
0x7fff8d6edc50: math_vector(initlist)
0x7fff8d6edce0: math_vector(move: 0x7fff8d6edcb0)
0x7fff8d6edbf0: math_vector(initlist)
0x7fff8d6edd10: math_vector(move: 0x7fff8d6edce0)
0x7fff8d6edb90: math_vector(initlist)
0x7fff8d6edd40: math_vector(move: 0x7fff8d6edd10)
0x7fff8d6edb90: ~math_vector()
0x7fff8d6edd10: ~math_vector()
0x7fff8d6edbf0: ~math_vector()
0x7fff8d6edce0: ~math_vector()
0x7fff8d6edc50: ~math_vector()
0x7fff8d6edcb0: ~math_vector()
[0x7fff8d6edd40]: (10,10.4,10.8)
0x7fff8d6edd40: ~math_vector()

which is better: only extra move objects are created.

But I wanted better: I wanted zero extra temporaries and to have the code as if I hard-coded it with the one normal coding caveat: all explicitly instantiated types would still be created (i.e., the four initlist constructors and result). To accomplish this I then added expression template code as follows:

a proxy math_vector_expr<LeftExpr,BinaryOp,RightExpr> class was created to hold an expression not computed yet,
a proxy plus_op class was created to hold the addition operation,
a constructor was added to math_vector to accept a math_vector_expr object, and,
"starter" member functions were added to trigger the creation of the expression template.

The results using ETs are wonderful: no extra temporaries in either case! The previous two cases above now output:

CASE 1:
0x7fffe7180c60: math_vector(initlist)
0x7fffe7180c90: math_vector(initlist)
0x7fffe7180cc0: math_vector(initlist)
0x7fffe7180cf0: math_vector(initlist)
0x7fffe7180d20: math_vector(expr: 0x7fffe7180d90)
[0x7fffe7180d20]: (10,10.4,10.8)
0x7fffe7180d20: ~math_vector()
0x7fffe7180cf0: ~math_vector()
0x7fffe7180cc0: ~math_vector()
0x7fffe7180c90: ~math_vector()
0x7fffe7180c60: ~math_vector()

CASE 2:
0x7fffe7180dd0: math_vector(initlist)
0x7fffe7180e20: math_vector(initlist)
0x7fffe7180e70: math_vector(initlist)
0x7fffe7180eb0: math_vector(initlist)
0x7fffe7180d20: math_vector(expr: 0x7fffe7180dc0)
0x7fffe7180eb0: ~math_vector()
0x7fffe7180e70: ~math_vector()
0x7fffe7180e20: ~math_vector()
0x7fffe7180dd0: ~math_vector()
[0x7fffe7180d20]: (10,10.4,10.8)
0x7fffe7180d20: ~math_vector()

i.e., exactly 5 constructor calls and 5 destructor calls in each case. In fact, if you ask the compiler to generate the assembler code between the 4 initlist constructor calls and the outputting of result one gets this beautiful string of assembler code:

fldt    128(%rsp)
leaq    128(%rsp), %rdi
leaq    80(%rsp), %rbp
fldt    176(%rsp)
faddp   %st, %st(1)
fldt    224(%rsp)
faddp   %st, %st(1)
fldt    272(%rsp)
faddp   %st, %st(1)
fstpt   80(%rsp)
fldt    144(%rsp)
fldt    192(%rsp)
faddp   %st, %st(1)
fldt    240(%rsp)
faddp   %st, %st(1)
fldt    288(%rsp)
faddp   %st, %st(1)
fstpt   96(%rsp)
fldt    160(%rsp)
fldt    208(%rsp)
faddp   %st, %st(1)
fldt    256(%rsp)
faddp   %st, %st(1)
fldt    304(%rsp)
faddp   %st, %st(1)
fstpt   112(%rsp)

with g++ and clang++ outputs similar (even smaller) code. No function calls, etc. --just a bunch of adds which is EXACTLY what one wants!

The C++11 code to achieve this follows. Simply #define DONT_USE_EXPR_TEMPL to not use ETs or don't define it at all to use ETs.

#include <array>
#include <algorithm>
#include <initializer_list>
#include <type_traits>
#include <iostream>

//#define DONT_USE_EXPR_TEMPL

//===========================================================================

template <std::size_t N> class math_vector;

template <
  typename LeftExpr,
  typename BinaryOp,
  typename RightExpr
>
class math_vector_expr
{
  public:
    math_vector_expr() = delete;

    math_vector_expr(LeftExpr l, RightExpr r) : 
      l_(std::forward<LeftExpr>(l)), 
      r_(std::forward<RightExpr>(r))
    {
    }

    // Prohibit copying...
    math_vector_expr(math_vector_expr const&) = delete;
    math_vector_expr& operator =(math_vector_expr const&) = delete;

    // Allow moves...
    math_vector_expr(math_vector_expr&&) = default;
    math_vector_expr& operator =(math_vector_expr&&) = default;

    template <typename RE>
    auto operator +(RE&& re) const ->
      math_vector_expr<
        math_vector_expr<LeftExpr,BinaryOp,RightExpr> const&,
        BinaryOp,
        decltype(std::forward<RE>(re))
      >
    {
      return 
        math_vector_expr<
          math_vector_expr<LeftExpr,BinaryOp,RightExpr> const&,
          BinaryOp,
          decltype(std::forward<RE>(re))
        >(*this, std::forward<RE>(re))
      ;
    }

    auto le() -> 
      typename std::add_lvalue_reference<LeftExpr>::type
      { return l_; }

    auto le() const ->
      typename std::add_lvalue_reference<
        typename std::add_const<LeftExpr>::type
      >::type
      { return l_; }

    auto re() -> 
      typename std::add_lvalue_reference<RightExpr>::type
      { return r_; }

    auto re() const -> 
      typename std::add_lvalue_reference<
        typename std::add_const<RightExpr>::type
      >::type
      { return r_; }

    auto operator [](std::size_t index) const ->
      decltype(
        BinaryOp::apply(this->le()[index], this->re()[index])
      )
    {
      return BinaryOp::apply(le()[index], re()[index]);
    }

  private:
    LeftExpr l_;
    RightExpr r_;
};

//===========================================================================

template <typename T>
struct plus_op
{
  static T apply(T const& a, T const& b)
  {
    return a + b;
  }

  static T apply(T&& a, T const& b)
  {
    a += b;
    return std::move(a);
  }

  static T apply(T const& a, T&& b)
  {
    b += a;
    return std::move(b);
  }

  static T apply(T&& a, T&& b)
  {
    a += b;
    return std::move(a);
  }
};

//===========================================================================

template <std::size_t N>
class math_vector
{
  using impl_type = std::array<long double, N>;

  public:
    math_vector()
    {
      using namespace std;
      fill(begin(v_), end(v_), impl_type{});
      std::cout << this << ": math_vector()" << endl;
    }

    math_vect

Categories

c++ - Expression templates and C++11

c++ - Expression templates and C++11

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags