This isn't "an answer" so much as a spur to think harder ;-) For concreteness, I'll wrap the OP's code, slightly respelled, in a function that also weeds out duplicates:
def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from itertools import chain, product
    from collections import Counter

    assert all(0 <= i < len(pools) for i in ixs)
    seen = set()
    cnt = Counter(ixs)  # map index to count
    blocks = [cwr(pools[i], count) for i, count in cnt.items()]
    for t in product(*blocks):
        t = tuple(sorted(chain(*t)))
        if t not in seen:
            seen.add(t)
            yield t
I don't fear sorting here - it's memory-efficient, and for small tuples is likely faster than all the overheads involved in creating a Counter object.
But regardless of that, the point here is to emphasize the real value the OP got by reformulating the problem to use combinations_with_replacement (cwr). Consider these inputs:
N = 64
pools = [[0, 1]]
ixs = [0] * N
There are only 65 unique results, and the function generates them instantaneously, with no internal duplicates at all. On the other hand, the essentially identical
pools = [[0, 1]] * N
ixs = range(N)
also has the same 65 unique results, but essentially runs forever (as would, e.g., the other answers so far given), slogging through 2**64 possibilities. cwr doesn't help here because each pool index appears only once.
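As a quick sanity check, the generator above can be run on the first spelling directly (the second would take effectively forever, so it's left as a comment):

N = 64
assert len(list(gen([[0, 1]], [0] * N))) == N + 1   # 65 results: one per possible count of 1s
# The second spelling, gen([[0, 1]] * N, range(N)), has the same 65 unique
# results but slogs through all 2**64 raw products - don't try it.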
So there's astronomical room for improvement over any solution that "merely" weeds out duplicates from a full Cartesian product, and some of that can be won by doing what the OP already did.
It seems to me the most promising approach would be to write a custom generator (not one relying primarily on itertools functions) that generated all possibilities in lexicographic order to begin with (so, by construction, no duplicates would ever be created). But that requires some "global" analysis of the input pools first, and the code I started on quickly got more complex than I can make time to wrestle with now.
One based on @user2357112's answer
Combining cwr with @user2357112's incremental de-duplicating gives a brief algorithm that runs fast on all the test cases I have. For example, it's essentially instantaneous for both spellings of the [0, 1] ** 64 examples above, and runs the example at the end of @Joseph Wood's answer approximately as fast as he said his C++ code ran (0.35 seconds on my box under Python 3.7.0, and, yes, found 162295 results):
def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from collections import Counter

    assert all(0 <= i < len(pools) for i in ixs)
    result = {()}
    for i, count in Counter(ixs).items():
        result = {tuple(sorted(old + new))
                  for new in cwr(pools[i], count)
                  for old in result}
    return result
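For instance, both [0, 1]**64 spellings from earlier now finish essentially instantly:

N = 64
assert len(gen([[0, 1]], [0] * N)) == N + 1        # first spelling: one cwr block of 65 combos
assert len(gen([[0, 1]] * N, range(N))) == N + 1   # second spelling: 64 tiny blocks, de-duped as we go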
To make it easier for other Pythonistas to try the last example, here's the input as executable Python:
pools = [[1, 10, 14, 6],
         [7, 2, 4, 8, 3, 11, 12],
         [11, 3, 13, 4, 15, 8, 6, 5],
         [10, 1, 3, 2, 9, 5, 7],
         [1, 5, 10, 3, 8, 14],
         [15, 3, 7, 10, 4, 5, 8, 6],
         [14, 9, 11, 15],
         [7, 6, 13, 14, 10, 11, 9, 4]]
ixs = range(len(pools))
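Then the result count quoted above can be confirmed with:

print(len(gen(pools, ixs)))   # 162295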
However, the OP later added that they typically have about 20 pools, each with some thousands of elements. 1000**20 = 1e60 is waaaaay out of practical reach for any approach that builds the full Cartesian product, no matter how cleverly it weeds out duplicates. It remains clear as mud how many they expect to be duplicates, though, so also clear as mud whether this kind of "incremental de-duplicating" is good enough to be practical.
Ideally we'd have a generator that yielded one result at a time, in lexicographic order.
Lazy lexicographic one-at-a-time generation
Building on the incremental de-duplication, suppose we have a strictly increasing (lexicographic) sequence of sorted tuples, append the same tuple T to each, and sort each again. Then the derived sequence is still in strictly increasing order. For example, in the left column we have the 10 unique pairs from range(4), and in the right column what happens after we append (and sort again) 2 to each:
00 002
01 012
02 022
03 023
11 112
12 122
13 123
22 222
23 223
33 233
They started in sorted order, and the derived triples are also in sorted order. I'll skip the easy proof (sketch: if t1 and t2 are adjacent tuples, t1 < t2, and let i be the smallest index such that t1[i] != t2[i]. Then t1[i] < t2[i] (what "lexicographic <" means). Then if you throw x into both tuples, proceed by cases: is x <= t1[i]? between t1[i] and t2[i]? is x >= t2[i]? In each case it's easy to see that the first derived tuple remains strictly less than the second derived tuple.)
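The property is easy to check mechanically for the example above:

from itertools import combinations_with_replacement as cwr

pairs = list(cwr(range(4), 2))                      # the 10 sorted pairs in the left column
triples = [tuple(sorted(t + (2,))) for t in pairs]  # append 2 and re-sort (right column)
assert triples == sorted(triples)                   # the derived sequence is still sorted
assert len(set(triples)) == len(triples)            # and still has no duplicates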
So supposing we have a sorted sequence result of all unique sorted tuples from some number of pools, what happens when we add elements of a new pool P into the tuples? Well, as above,

[tuple(sorted(old + (P[0],))) for old in result]

is also sorted, and so is

[tuple(sorted(old + (P[i],))) for old in result]

for all i in range(len(P)). These guaranteed already-sorted sequences can be merged via heapq.merge(), and another generator (killdups() below) run on the merge result to weed out duplicates on the fly. There's no need to, e.g., keep a set of all tuples seen so far. Because the output of the merge is non-decreasing, it's sufficient just to check whether the next result is the same as the last result output.
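Here's a toy illustration of that merge-then-dedup idea in isolation (made-up data; the real code follows below):

from heapq import merge

a = [(0, 2), (1, 2), (2, 2)]        # each input stream is already sorted
b = [(0, 2), (0, 3), (2, 3)]
merged = merge(a, b)                # lazily merges the sorted streams

def killdups(xs):
    last = None
    for x in xs:
        if x != last:
            yield x
            last = x

print(list(killdups(merged)))       # [(0, 2), (0, 3), (1, 2), (2, 2), (2, 3)]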
Getting this to work lazily is delicate. The entire result-so-far sequence has to be accessed by each element of the new pool being added, but we don't want to materialize the whole thing in one gulp. Instead itertools.tee() allows each element of the next pool to traverse the result-so-far sequence at its own pace, and automatically frees memory for each result item after all new pool elements have finished with it.
The function build1() (or some workalike) is needed to ensure that the right values are accessed at the right times. For example, if the body of build1() is pasted in inline where it's called, the code will fail spectacularly (the body would access the final values bound to rcopy and new instead of what they were bound to at the time the generator expression was created).
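A minimal illustration of that late-binding pitfall, with toy values (not part of the algorithm):

# Buggy: a bare genexp looks up `new` lazily, so every generator sees its final value.
gens_buggy = []
for new in [(1,), (2,), (3,)]:
    gens_buggy.append(tuple(sorted(old + new)) for old in [(0,)])
print([next(g) for g in gens_buggy])   # [(0, 3), (0, 3), (0, 3)]

# Fixed: passing `new` as a function argument freezes it per generator.
def build1(rcopy, new):
    return (tuple(sorted(old + new)) for old in rcopy)

gens_ok = [build1([(0,)], new) for new in [(1,), (2,), (3,)]]
print([next(g) for g in gens_ok])      # [(0, 1), (0, 2), (0, 3)]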
In all, of course this is somewhat slower, due to layers of delayed generator calls and heap merges. In return, it returns the results in lexicographic order, can start delivering results very quickly, and has lower peak memory burden if for no other reason than that the final result sequence isn't materialized at all (little is done until the caller iterates over the returned generator).
Tech note: don't fear sorted() here. The appending is done via old + new for a reason: old is already sorted, and new is typically a 1-tuple. Python's sort is linear-time in this case, not O(N log N).
def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from itertools import tee
    from collections import Counter
    from heapq import merge

    def killdups(xs):
        last = None
        for x in xs:
            if x != last:
                yield x
                last = x

    def build1(rcopy, new):
        return (tuple(sorted(old + new)) for old in rcopy)

    assert all(0 <= i < len(pools) for i in ixs)
    result = [()]
    for i, count in Counter(ixs).items():
        poolelts = list(cwr(pools[i], count))
        xs = [build1(rcopy, new)
              for rcopy, new in zip(tee(result, len(poolelts)),
                                    poolelts)]
        result = killdups(merge(*xs))
    return result
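Assuming pools and ixs from the executable example above are still defined, a quick check that the lazy version agrees with the eager one:

out = list(gen(pools, ixs))
assert len(out) == 162295        # same count as before
assert out == sorted(set(out))   # already in increasing order, with no duplicates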
2 inputs
Turns out that for the 2-input case there's an easy approach derived from set algebra. If x and y are the same, cwr(x, 2) is the answer. If x and y are disjoint, product(x, y). Else the intersection c of x and y is non-empty, and the answer is the catenation of 4 cross-products obtained from the 3 pairwise-disjoint sets c, x-c, and y-c: cwr(c, 2), product(x-c, c), product(y-c, c), and product(x-c, y-c).

Proof is straightforward but tedious so I'll skip it. For example, there are no duplicates between cwr(c, 2) and product(x-c, c) because every tuple in the latter contains an element from x-c, but every tuple in the former contains elements only from c, and x-c and c are disjoint by construction. There are no duplicates within product(x-c, y-c) because the two inputs are disjoint (if they contained an element in common, that would have been in the intersection of x and y, contradicting that x-c has no element in the intersection). Etc.
Alas, I haven't found a way to generalize this beyond 2 inputs, which surprised me. It can be used on its own, or as a building block in other approaches. For example, if there are many inputs, they can be searched for pairs with large intersections, and this 2-input scheme used to do those parts of the overall products directly.
Even at just 3 inputs, it's not clear to me how to get the right result for

[1, 2], [2, 3], [1, 3]

The full Cartesian product has 2**3 = 8 elements, only one of which repeats: (1, 2, 3) appears twice (as (1, 2, 3) and again as (2, 3, 1)). Each pair of inputs has a 1-element intersection, but the intersection of all 3 is empty.
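Enumerating that tiny case directly shows the single repeat:

from itertools import product

triples = [tuple(sorted(t)) for t in product([1, 2], [2, 3], [1, 3])]
assert len(triples) == 8 and len(set(triples)) == 7   # 8 raw results, 7 distinct
assert triples.count((1, 2, 3)) == 2                  # the one multiset that appears twice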
Here's an implementation:
def pair(x, y):
    from itertools import product, chain
    from itertools import combinations_with_replacement

    x = set(x)
    y = set(y)
    c = x & y
    chunks = []
    if c:
        x -= c
        y -= c
        chunks.append(combinations_with_replacement(c, 2))
        if x:
            chunks.append(product(x, c))
        if y:
            chunks.append(product(y, c))
    if x and y:
        chunks.append(product(x, y))
    return chain.from_iterable(chunks)
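A quick sanity check on toy inputs with a non-empty intersection, comparing pair() against brute-force de-duplication of the full cross product:

from itertools import product

x, y = [1, 2, 3, 4], [3, 4, 5, 6]              # here c = {3, 4}
got = [tuple(sorted(t)) for t in pair(x, y)]
expect = {tuple(sorted(t)) for t in product(x, y)}
assert len(got) == len(set(got))               # no duplicates emitted
assert set(got) == expect                      # and nothing missing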
A Proof-of-Concept Based on Maximal Matching
This blends ideas from @Leon's sketch and an approach @JosephWoods sketched in comments. It's not polished, and can obviously be sped up, but it's reasonably quick on all the cases I tried. Because it's rather complex, it's probably more useful to post it in an already-hard-enough-to-follow un-optimized form anyway!
This doesn't make any attempt to determine the set of "free" pools (as in @Leon's sketch). Primarily because I didn't have code sitting around for that, and partly because it