Say you have a collection of elements, how can you pick out those with duplicates and put them into each group with least amount of comparison? preferably in C++, but algorithm is more important than the language.
For Example
given {E1,E2,E3,E4,E4,E2,E6,E4,E3}, I wish to extract out {E2,E2}, {E3,E3}, {E4,E4,E4}.
what data structure and algorithm you will choose? Please also include the cost of setting up the data structure, say, if it's a pre-sorted one like std::multimap
Updates
To make things clearer as suggested. there's one constraint: the elements must be compared by themselves to be certain they are duplicates.
So hashes do not apply, because virtually they shift the comparison to from heavy elements(e.g. chunks of data) to light elements(integers), and reduce some comparison, but not do away with them, and in the end, we are back to our original problem, when are inside one collision bucket.
Pretend you have a bunch of potentials duplicate files of GBs each, they bear the same hash value by every hash-algorithm human beings know. Now you are going to spot the real duplicates.
No, it can't be a real-life problem(even MD5 is enough to generate unique hash for real-life files). But just pretend so that we can focus on finding the data structure + algorithm that involves least amount of comparison.
What I am doing is to
represent into a STL std::list data structure(in that 1) its element-deletion is cheaper than, say, a vector 2) its insertion is cheaper, not requiring sort.)
pop out one element and compare it with the rest, if a duplicate is found, it's pulled out of the list. once the end of the list is reached, one group of duplication is found, if any.
repeat the above 2 steps until the list is empty.
It needs N-1 in the best case, but (N-1)! in the worse case.
what are the better alternatives?
My code using method explained above:
// algorithm to consume the std::list container,
// supports: list<path_type>,list< pair<std::string, paths_type::const_iterater>>
template<class T>
struct consume_list
{
groups_type operator()(list<T>& l)
{
// remove spurious identicals and group the rest
// algorithm:
// 1. compare the first element with the remaining elements,
// pick out all duplicated files including the first element itself.
// 2. start over again with the shrinked list
// until the list contains one or zero elements.
groups_type sub_groups;
group_type one_group;
one_group.reserve(1024);
while(l.size() > 1)
{
T front(l.front());
l.pop_front();
item_predicate<T> ep(front);
list<T>::iterator it = l.begin();
list<T>::iterator it_end = l.end();
while(it != it_end)
{
if(ep.equals(*it))
{
one_group.push_back(ep.extract_path(*(it))); // single it out
it = l.erase(it);
}
else
{
it++;
}
}
// save results
if(!one_group.empty())
{
// save
one_group.push_back(ep.extract_path(front));
sub_groups.push_back(one_group);
// clear, memory allocation not freed
one_group.clear();
}
}
return sub_groups;
}
};
// type for item-item comparison within a stl container, e.g. std::list
template <class T>
struct item_predicate{};
// specialization for type path_type
template <>
struct item_predicate<path_type>
{
public:
item_predicate(const path_type& base)/*init list*/
{}
public:
bool equals(const path_type& comparee)
{
bool result;
/* time-consuming operations here*/
return result;
}
const path_type& extract_path(const path_type& p)
{
return p;
}
private:
// class members
};
};
Thanks for the answer below, however they seem to be misled by my example that it's about integers. In fact the elements are type agnostic(not necessarily integers, strings or any other PODs), and the equal predicates are self-defined, that is the comparison can be very heavy.
So maybe my question should be: using which data structure + algorithm involves fewer comparisons.
Using a pre-sorted container like multiset, multimap is not better according to my test, since
- the sorting while inserting already does the comparisons,
- the following adjacent finding does comparison again,
- these data structure prefer less-than operations to equal operations, they perform 2 less-than(a
I do not see how it can save comparisons.
one more thing that's ignored by some answers below, I need to differentiate the duplicate groups from one another, not just keep them in the container.
Conclusion
After all the discussion, there seem to be 3 ways
- my original naive method as explained above
- Start with a linear container like
std::vector
, sort it and then locate the equal ranges
- start with an associated container like
std::map<Type, vector<duplicates>>
, pick out the duplicates during the setup of associated container as suggested by Charles Bailey.
I've coded a sample to test all the methods as posted below.
the number of duplicates and when they are distributed may influence the best choice.
- Method 1 is best when they fall heavily at the front, and is worst when at the end. Sort will not change the distribution, but the endian.
- Method 3 has the most average performance
- Method 2 is never the best choice
Thanks for all who participating in the discussion.
one output with 20 sample items from the code below.
Test with [ 20 10 6 5 4 3 2 2 2 2 1
1 1 1 1 1 1 1 1 1 ]
and [ 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3
4 5 6 10 20 ] respectively
using std::vector -> sort() ->
adjacent_find():
comparisons: [ '<' = 139, '==' = 23
]
comparisons: [ '<' = 38, '==' = 23 ]
using std::list -> sort() -> shrink
list:
comparisons: [ '<' = 50, '==' = 43 ]
comparisons: [ '<' = 52, '==' = 43 ]
using std::list -> shrink list:
comparisons: [ '<' = 0, '==' = 121 ]
comparisons: [ '<' = 0, '==' = 43 ]
using std::vector -> std::map>:
comparisons: [ '<' = 79, '==' = 0 ]
comparisons: [ '<' = 53, '==' = 0 ]
using std::vector ->
std::multiset ->
adjacent_find():
comparisons: [ '<' = 79, '==' = 7 ]
comparisons: [ '<' = 53, '==' = 7 ]
Code
// compile with VC++10: cl.exe /EHsc
#include <vector>
#include <deque>
#include <list>
#include <map>
#include <set>
#include <algorithm>
#include <iostream>
#include <sstream>
#include <boost/foreach.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/format.hpp>
using namespace std;
struct Type
{
Type(int i) : m_i(i){}
bool operator<(const Type& t) const
{
++number_less_than_comparison;
return m_i < t.m_i;
}
bool operator==(const Type& t) const
{
++number_equal_comparison;
return m_i == t.m_i;
}
public:
static void log(const string& operation)
{
cout
<< "comparison during " <<operation << ": [ "
<< "'<' = " << number_less_than_comparison
<< ", "
<< "'==' = " << number_equal_comparison
<< " ]
";
reset();
}
int to_int() const
{
return m_i;
}
private:
static void reset()
{
number_less_than_comparison = 0;
number_equal_comparison = 0;
}
public:
static size_t number_less_than_comparison;
static size_t number_equal_comparison;
private:
int m_i;
};
size_t Type::number_less_than_comparison = 0;
size_t Type::number_equal_comparison = 0;
ostream& operator<<(ostream& os, const Type& t)
{
os << t.to_int();
return os;
}
template< class Container >
struct Test
{
void recursive_run(size_t n)
{
bool reserve_order = false;
for(size_t i = 48; i < n; ++i)
{
run(i);
}
}
void run(size_t i)
{
cout <<
boost::format("
Test %1% sample elements
using method%2%:
")
% i
% Description();
generate_sample(i);
sort();
locate();
generate_reverse_sample(i);
sort();
locate();
}
private:
void print_me(const string& when)
{
std::stringstream ss;
ss << when <<" = [ ";
BOOST_FOREACH(const Container::value_type& v, m_container)
{
ss << v << " ";
}
ss << "]
";
cout << ss.str();
}
void generate_sample(size_t n)
{
m_container.clear();
for(size_t i = 1; i <= n; ++i)
{
m_container.push_back(Type(n/i));
}
print_me("init value");
Type::log("setup");
}
void generate_reverse_sample(size_t n)
{
m_container.clear();
for(size_t i = 0; i < n; ++i)
{
m_container.push_back(Type(n/(n-i)));
}
print_me("init value(reverse order)");
Type::log("setup");
}
void sort()
{
sort_it();
Type::log("sort");
print_me("after sort");
}
void locate()
{
locate_duplicates();
Type::log("locate duplicate");
}
protected:
virtual string Description() = 0;
virtual void sort_it() = 0;
virtual void locate_duplicates() = 0;
protected:
Container m_container;
};
struct Vector : Test<vector<Type> >
{
string Description()
{
return "std::vector<Type> -> sort() -> adjacent_find()";
}
private:
void sort_it()
{
std::sort(m_container.begin(), m_container.end());
}
void locate_duplicates()
{
using std::adjacent_find;
typedef vector<Type>::iterator ITR;
typedef vector<Type>::value_type VALUE;
typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
typedef vector<TUPLE> V_TUPLE;
V_TUPLE results;
ITR itr_begin(m_container.begin());
ITR itr_end(m_container.end());
ITR itr(m_container.begin());
ITR itr_range_begin(m_container.begin());
while(itr_begin != itr_end)
{
// fin