Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
218 views
in Technique[技术] by (71.8m points)

python - Picking unordered combinations from pools with overlap

I have pools of values and I would like to generate every possible unordered combination by picking from certain pools.

For example, I wanted to pick from pool 0, pool 0, and pool 1:

>>> pools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
>>> part = (0, 0, 1)
>>> list(product(*(pools[i] for i in part)))
[(1, 1, 2), (1, 1, 3), (1, 1, 4), (1, 2, 2), (1, 2, 3), (1, 2, 4), (1, 3, 2), (1, 3, 3), (1, 3, 4), (2, 1, 2), (2, 1, 3), (2, 1, 4), (2, 2, 2), (2, 2, 3), (2, 2, 4), (2, 3, 2), (2, 3, 3), (2, 3, 4), (3, 1, 2), (3, 1, 3), (3, 1, 4), (3, 2, 2), (3, 2, 3), (3, 2, 4), (3, 3, 2), (3, 3, 3), (3, 3, 4)]

This generates every possible combination by picking from pool 0, pool 0, and pool 1.

However order doesn't matter to me, so many of the combinations are actually duplicates. For example, since I used a Cartesian product, both (1, 2, 4) and (2, 1, 4) are generated.

I came up with a simple method to mitigate this issue. For members picked from a single pool, I select without ordering using combinations_with_replacement. I count how many times I want to draw from each pool. The code looks like this:

cnt = Counter()
for ind in part: cnt[ind] += 1
blocks = [combinations_with_replacement(pools[i], cnt[i]) for i in cnt]
return [list(chain(*combo)) for combo in product(*blocks)]

This reduces ordering duplicates if I happen to choose from the same pool multiple times. However all the pools have lots of overlap, and using combinations_with_replacement on multiple pools merged would generate some invalid combinations. Is there a more efficient method to generate unordered combinations?

Edit: Extra info about the inputs: The number of parts and pools is small (~5 and ~20), and for simplicity, each element is an integer. The actual problem I have already solved so this is just for academic interest. Let's say there are thousands hundreds of integers in each pool but some pools are small and only have dozens. So some kind of union or intersection seems to be the way to go.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This isn't "an answer" so much as a spur to think harder ;-) For concreteness, I'll wrap the OP's code, slightly respelled, in a function that also weeds out duplicates:

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from itertools import chain, product
    from collections import Counter

    assert all(0 <= i < len(pools) for i in ixs)
    seen = set()
    cnt = Counter(ixs) # map index to count
    blocks = [cwr(pools[i], count) for i, count in cnt.items()]
    for t in product(*blocks):
        t = tuple(sorted(chain(*t)))
        if t not in seen:
            seen.add(t)
            yield t

I don't fear sorting here - it's memory-efficient, and for small tuples is likely faster than all the overheads involved in creating a Counter object.

But regardless of that, the point here is to emphasize the real value the OP got by reformulating the problem to use combinations_with_replacement (cwr). Consider these inputs:

N = 64
pools = [[0, 1]]
ixs = [0] * N

There are only 65 unique results, and the function generates them instantaneously, with no internal duplicates at all. On the other hand, the essentially identical

pools = [[0, 1]] * N
ixs = range(N)

also has the same 65 unique results, but essentially runs forever (as would, e.g, the other answers so far given), slogging through 2**64 possibilities. cwr doesn't help here because each pool index appears only once.

So there's astronomical room for improvement over any solution that "merely" weeds out duplicates from a full Cartesian product, and some of that can be won by doing what the OP already did.

It seems to me the most promising approach would be to write a custom generator (not one relying primarily on itertools functions) that generated all possibilities in lexicographic order to begin with (so, by construction, no duplicates would be created to begin with). But that requires some "global" analysis of the input pools first, and the code I started on quickly got more complex than I can make time to wrestle with now.

One based on @user2357112's answer

Combining cwr with @user2357112's incremental de-duplicating gives a brief algorithm that runs fast on all the test cases I have. For example, it's essentially instantaneous for both spellings of the [0, 1] ** 64 examples above, and runs the example at the end of @Joseph Wood's answer approximately as fast as he said his C++ code ran (0.35 seconds on my box under Python 3.7.0, and, yes, found 162295 results):

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from collections import Counter

    assert all(0 <= i < len(pools) for i in ixs)
    result = {()}
    for i, count in Counter(ixs).items():
        result = {tuple(sorted(old + new))
                  for new in cwr(pools[i], count)
                  for old in result}
    return result

To make it easier for other Pythonistas to try the last example, here's the input as executable Python:

pools = [[1, 10, 14, 6],
         [7, 2, 4, 8, 3, 11, 12],
         [11, 3, 13, 4, 15, 8, 6, 5],
         [10, 1, 3, 2, 9, 5,  7],
         [1, 5, 10, 3, 8, 14],
         [15, 3, 7, 10, 4, 5, 8, 6],
         [14, 9, 11, 15],
         [7, 6, 13, 14, 10, 11, 9, 4]]
ixs = range(len(pools))

However, the OP later added that they typically have about 20 pools, each with some thousands of elements. 1000**20 = 1e60 is waaaaay out of practical reach for any approach that builds the full Cartesian product, no matter how cleverly it weeds out duplicates. It remains clear as mud how many they expect to be duplicates, though, so also clear as mud whether this kind of "incremental de-duplicating" is good enough to be practical.

Ideally we'd have a generator that yielded one result at a time, in lexicographic order.

Lazy lexicographic one-at-a-time generation

Building on the incremental de-duplication, suppose we have a strictly increasing (lexicographic) sequence of sorted tuples, append the same tuple T to each, and sort each again. Then the derived sequence is still in strictly increasing order. For example, in the left column we have the 10 unique pairs from range(4), and in the right column what happens after we append (and sort again) 2 to each:

00 002
01 012
02 022
03 023
11 112
12 122
13 123
22 222
23 223
33 233

They started in sorted order, and the derived triples are also in sorted order. I'll skip the easy proof (sketch: if t1 and t2 are adjacent tuples, t1 < t2, and let i be the smallest index such that t1[i] != t2[i]. Then t1[i] < t2[i] (what "lexicographic <" means). Then if you throw x into both tuples, proceed by cases: is x <= t1[i]? between t1[i] and t2[i]? is x >= t2[i]? In each case it's easy to see that the first derived tuple remains strictly less then the second derived tuple.)

So supposing we have a sorted sequence result of all unique sorted tuples from some number of pools, what happens when we add elements of a new pool P into the tuples? Well, as above,

[tuple(sorted(old + (P[0],))) for old in result]

is also sorted, and so is

[tuple(sorted(old + (P[i],))) for old in result]

for all i in range(len(P)). These guaranteed already-sorted sequences can be merged via heapq.merge(), and another generator (killdups() below) run on the merge result to weed out duplicates on the fly. There's no need to, e.g., keep a set of all tuples seen so far. Because the output of the merge is non-decreasing, it's sufficient just to check whether the next result is the same as the last result output.

Getting this to work lazily is delicate. The entire result-so-far sequence has to be accessed by each element of the new pool being added, but we don't want to materialize the whole thing in one gulp. Instead itertools.tee() allows each element of the next pool to traverse the result-so-far sequence at its own pace, and automatically frees memory for each result item after all new pool elements have finished with it.

The function build1() (or some workalike) is needed to ensure that the right values are accessed at the right times. For example, if the body of build1() is pasted in inline where it's called, the code will fail spectacularly (the body would access the final values bound to rcopy and new instead of what they were bound to at the time the generator expression was created).

In all, of course this is somewhat slower, due to layers of delayed generator calls and heap merges. In return, it returns the results in lexicographic order, can start delivering results very quickly, and has lower peak memory burden if for no other reason than that the final result sequence isn't materialized at all (little is done until the caller iterates over the returned generator).

Tech note: don't fear sorted() here. The appending is done via old + new for a reason: old is already sorted, and new is typically a 1-tuple. Python's sort is linear-time in this case, not O(N log N).

def gen(pools, ixs):
    from itertools import combinations_with_replacement as cwr
    from itertools import tee
    from collections import Counter
    from heapq import merge

    def killdups(xs):
        last = None
        for x in xs:
            if x != last:
                yield x
                last = x

    def build1(rcopy, new):
        return (tuple(sorted(old + new)) for old in rcopy)

    assert all(0 <= i < len(pools) for i in ixs)
    result = [()]
    for i, count in Counter(ixs).items():
        poolelts = list(cwr(pools[i], count))
        xs = [build1(rcopy, new)
              for rcopy, new in zip(tee(result, len(poolelts)),
                                    poolelts)]
        result = killdups(merge(*xs))
    return result

2 inputs

Turns out that for the 2-input case there's an easy approach derived from set algebra. If x and y are the same, cwr(x, 2) is the answer. If x and y are disjoint, product(x, y). Else the intersection c of x and y is non-empty, and the answer is the catenation of 4 cross-products obtained from the 3 pairwise-disjoint sets c, x-c, and y-c: cwr(c, 2), product(x-c, c), product(y-c, c), and product(x-c, y-c). Proof is straightforward but tedious so I'll skip it. For example, there are no duplicates between cwr(c, 2) and product(x-c, c) because every tuple in the latter contains an element from x-c, but every tuple in the former contains elements only from c, and x-c and c are disjoint by construction. There are no duplicates within product(x-c, y-c) because the two inputs are disjoint (if they contained an element in common, that would have been in the intersection of x and y, contradicting that x-c has no element in the intersection). Etc.

Alas, I haven't found a way to generalize this beyond 2 inputs, which surprised me. It can be used on its own, or as a building block in other approaches. For example, if there are many inputs, they can be searched for pairs with large intersections, and this 2-input scheme used to do those parts of the overall products directly.

Even at just 3 inputs, it's not clear to me how to get the right result for

[1, 2], [2, 3], [1, 3]

The full Cartesian product has 2**3 = 8 elements, only one of which repeats: (1, 2, 3) appears twice (as (1, 2, 3) and again as (2, 3, 1)). Each pair of inputs has a 1-element intersection, but the intersection of all 3 is empty.

Here's an implementation:

def pair(x, y):
    from itertools import product, chain
    from itertools import combinations_with_replacement
    x = set(x)
    y = set(y)
    c = x & y
    chunks = []
    if c:
        x -= c
        y -= c
        chunks.append(combinations_with_replacement(c, 2))
        if x:
            chunks.append(product(x, c))
        if y:
            chunks.append(product(y, c))
    if x and y:
        chunks.append(product(x, y))
    return chain.from_iterable(chunks)

A Proof-of-Concept Based on Maximal Matching

This blends ideas from @Leon's sketch and an approach @JosephWoods sketched in comments. It's not polished, and can obviously be sped up, but it's reasonably quick on all the cases I tried. Because it's rather complex, it's probably more useful to post it in an already-hard-enough-to-follow un-optimized form anyway!

This doesn't make any attempt to determine the set of "free" pools (as in @Leon's sketch). Primarily because I didn't have code sitting around for that, and partly because it


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...