Let's look at one particular benefit of expression templates: ETs can be used to avoid vector-sized temporaries in memory which occur in overloaded operators like:
template<typename T>
std::vector<T> operator+(const std::vector<T>& a, const std::vector<T>& b)
{
std::vector<T> tmp; // vector-sized temporary
for_each(...);
return tmp;
}
In C++11 the return statement of this function applies move semantics. No copy of the vector. That's a win.
However, if I look at a simple expression like
d = a + b + c;
I see that the above function gets called twice (for both operator+
) while the final assignment can be done with move semantics.
In total 2 loops are executed. Means that I put out a temporary and read it back in right after. For big vectors this falls out of cache. That's worse than expression templates. They can do the whole thing in just 1 loop. ETs can execute the above code equivalent to:
for(int i=0 ; i < vec_length ; ++i)
d[i] = a[i] + b[i] + c[i];
I was wondering whether lambdas together with move semantics or any other new feature can do as good as ETs. Any thoughts?
Edit:
Basically, using the ET technique the compiler builds a parse tree
that resembles the algebraic expression with it's
type system. This tree consists of inner nodes and leaf nodes. The
inner nodes represent operations (addition, multiplication, etc.) and
the leaf nodes represent references to the data objects.
I tried to think of the whole computation process in the fashion of a
stack machine: Take an operation from an operation stack and pull
the next arguments from the argument stack and evaluate the operation.
Put the result back on the stack waiting for the operation.
To represent these two different objects (operation stack and data
leaf stack) I bundled together a std::tuple
for the operations and a
std::tuple
for the data leaves into a std::pair<>
. Initially I
used a std:vector
but that resulted in runtime overhead.
The whole process goes in two phases: Stack machine initialisation
where the operation and argument stack are initialised. And the
evaluation phase which is triggered by assigning the paired containers
to the vector.
I created a class Vec
which holds a private array<int,5>
(the
payload) and which features an overloaded assignment operator that
takes the "expression".
The global operator*
is overloaded for all combinations of taking
Vec
and "expression" to enable the correct handling also in the case
where we have more than just a*b
. (Notice, I switched for this
educational example to the multiplication - basically to quickly spot
the imull
in the assembler.)
What is done first before the evaluation starts is "extracting" the
values out of the involved Vec
objects and initializing the argument
stack. That was necessary to not have different kinds of objects lying
around: Indexable vectors and non-indexable results. This is what the
Extractor
is for. Nice thing again: Variadic templates are used which
in this case results in no run-time overhead (all this is done at
compile time).
The whole thing works. The expression is nicely evaluated (I also
added the addition, but that is left out here to fit the code). Below
you can see the assembler output. Just raw compuation, exactly as you
want it to be: En-par with ET technique.
Upshot. The new language features of C++11 offer the variadic
templates which (along with template meta-programming) open up the
area of compile time computation. I showed here how the benefits of
variadic templates can be used to produce code as good as with the
traditional ET technique.
#include<algorithm>
#include<iostream>
#include<vector>
#include<tuple>
#include<utility>
#include<array>
template<typename Target,typename Tuple, int N, bool end>
struct Extractor {
template < typename ... Args >
static Target index(int i,const Tuple& t, Args && ... args)
{
return Extractor<Target, Tuple, N+1,
std::tuple_size<Tuple>::value == N+1>::
index(i, t , std::forward<Args>(args)..., std::get<N>(t).vec[i] );
}
};
template < typename Target, typename Tuple, int N >
struct Extractor<Target,Tuple,N,true>
{
template < typename ... Args >
static Target index(int i,Tuple const& t,
Args && ... args) {
return Target(std::forward<Args>(args)...); }
};
template < typename ... Vs >
std::tuple<typename std::remove_reference<Vs>::type::type_t...>
extract(int i , const std::tuple<Vs...>& tpl)
{
return Extractor<std::tuple<typename std::remove_reference<Vs>::type::type_t...>,
std::tuple<Vs...>, 0,
std::tuple_size<std::tuple<Vs...> >::value == 0>::index(i,tpl);
}
struct Vec {
std::array<int,5> vec;
typedef int type_t;
template<typename... OPs,typename... VALs>
Vec& operator=(const std::pair< std::tuple<VALs...> , std::tuple<OPs...> >& e) {
for( int i = 0 ; i < vec.size() ; ++i ) {
vec[i] = eval( extract(i,e.first) , e.second );
}
}
};
template<int OpPos,int ValPos, bool end>
struct StackMachine {
template<typename... OPs,typename... VALs>
static void eval_pos( std::tuple<VALs...>& vals , const std::tuple<OPs...> & ops )
{
std::get<ValPos+1>( vals ) =
std::get<OpPos>(ops).apply( std::get<ValPos>( vals ) ,
std::get<ValPos+1>( vals ) );
StackMachine<OpPos+1,ValPos+1,sizeof...(OPs) == OpPos+1>::eval_pos(vals,ops);
}
};
template<int OpPos,int ValPos>
struct StackMachine<OpPos,ValPos,true> {
template<typename... OPs,typename... VALs>
static void eval_pos( std::tuple<VALs...>& vals ,
const std::tuple<OPs...> & ops )
{}
};
template<typename... OPs,typename... VALs>
int eval( const std::tuple<VALs...>& vals , const std::tuple<OPs...> & ops )
{
StackMachine<0,0,false>::eval_pos(const_cast<std::tuple<VALs...>&>(vals),ops);
return std::get<sizeof...(OPs)>(vals);
}
struct OpMul {
static int apply(const int& lhs,const int& rhs) {
return lhs*rhs;
}
};
std::pair< std::tuple< const Vec&, const Vec& > , std::tuple<OpMul> >
operator*(const Vec& lhs,const Vec& rhs)
{
return std::make_pair( std::tuple< const Vec&, const Vec& >( lhs , rhs ) ,
std::tuple<OpMul>( OpMul() ) );
}
template<typename... OPs,typename... VALs>
std::pair< std::tuple< const Vec&, VALs... > , std::tuple<OPs...,OpMul> >
operator*(const Vec& lhs,const std::pair< std::tuple< VALs... > , std::tuple<OPs...> >& rhs)
{
return std::make_pair( std::tuple_cat( rhs.first , std::tuple< const Vec& >(lhs) ) ,
std::tuple_cat( rhs.second , std::tuple<OpMul>( OpMul() ) ) );
}
template<typename... OPs,typename... VALs>
std::pair< std::tuple< const Vec&, VALs... > , std::tuple<OPs...,OpMul> >
operator*(const std::pair< std::tuple< VALs... > , std::tuple<OPs...> >& lhs,
const Vec& rhs)
{
return std::make_pair( std::tuple_cat( lhs.first , std::tuple< const Vec& >(rhs) ) ,
std::tuple_cat( lhs.second , std::tuple<OpMul>( OpMul() ) ) );
}
int main()
{
Vec d,c,b,a;
for( int i = 0 ; i < d.vec.size() ; ++i ) {
a.vec[i] = 10+i;
b.vec[i] = 20+i;
c.vec[i] = 30+i;
d.vec[i] = 0;
}
d = a * b * c * a;
for( int i = 0 ; i < d.vec.size() ; ++i )
std::cout << d.vec[i] << std::endl;
}
Assembler generated with g++-4.6 -O3
(I had to put some runtime dependence into the vector initialization so that the compiler doesn't calculate the whole thing at compile time and you actually see the imull
instaructions.)
imull %esi, %edx
imull 32(%rsp), %edx
imull %edx, %esi
movl 68(%rsp), %edx
imull %ecx, %edx
movl %esi, (%rsp)
imull 36(%rsp), %edx
imull %ecx, %edx
movl 104(%rsp), %ecx
movl %edx, 4(%rsp)
movl 72(%rsp), %edx
imull %ecx, %edx
imull 40(%rsp), %edx
imull %ecx, %edx
movl 108(%rsp), %ecx
movl %edx, 8(%rsp)
movl 76(%rsp), %edx
imull %ecx, %edx
imull 44(%rsp), %edx
imull %ecx, %edx
movl 112(%rsp), %ecx
movl %edx, 12(%rsp)
movl 80(%rsp), %edx
imull %ecx, %edx
imull %eax, %edx
imull %ecx, %edx
movl %edx, 16(%rsp)
See Question&Answers more detail:
os