Say I have the following logical relationships:
# ANIMALS
cat = [hobbes, tigger, garfield]
dog = [lassie]
frog = [kermit, hypnotoad]
# HABITATS
tree = [cat, frog]
river = [dog, frog, turtle]
house = [cat, dog]
There will never be duplicates within a set. I want to match them into pairs of (animal, habitat):
(hobbes, tree)
(garfield, tree)
(tigger, tree)
(kermit, tree)
(hypnotoad, tree)
(lassie, river)
(kermit, river)
(hypnotoad, river)
(hobbes, house)
(garfield, house)
(tigger, house)
(lassie, house)
If this were SQL, it would be something like....
CREATE TABLE animal (animal_name, animal_type);
CREATE TABLE habitat_to_animal (habitat_type, animal_type);
SELECT
animal.animal_name,
habitat_to_animal.habitat_type
FROM animal
JOIN habitat_to_animal
ON animal.animal_type = habitat_to_animal.animal_type
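For concreteness, that SQL sketch can be checked end-to-end against an in-memory sqlite3 database (the toy rows below are just the example data from above; SQLite accepts the untyped CREATE TABLE statements as written):

```python
import sqlite3

# In-memory database mirroring the schema sketched above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE animal (animal_name, animal_type);
CREATE TABLE habitat_to_animal (habitat_type, animal_type);
""")
con.executemany("INSERT INTO animal VALUES (?, ?)", [
    ("hobbes", "cat"), ("tigger", "cat"), ("garfield", "cat"),
    ("lassie", "dog"), ("kermit", "frog"), ("hypnotoad", "frog"),
])
con.executemany("INSERT INTO habitat_to_animal VALUES (?, ?)", [
    ("tree", "cat"), ("tree", "frog"),
    ("river", "dog"), ("river", "frog"), ("river", "turtle"),
    ("house", "cat"), ("house", "dog"),
])

# The join from the question; turtle matches nothing and drops out.
pairs = con.execute("""
    SELECT animal.animal_name, habitat_to_animal.habitat_type
    FROM animal
    JOIN habitat_to_animal
      ON animal.animal_type = habitat_to_animal.animal_type
""").fetchall()
```

This yields exactly the twelve (animal, habitat) pairs listed above.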
They're stored in numpy arrays (the actual arrays are numpy.int32) like so:
# animals are sorted by animal_kind
animal_kind = ['cat', 'cat', 'cat', 'dog', 'frog', 'frog']
animal_name = ['hobbes', 'tigger', 'garfield', 'lassie', 'kermit', 'hypnotoad']
# habitats are sorted by habitat_type
habitat_type = ['tree', 'tree', 'river', 'river', 'river', 'house', 'house']
habitat_animal = ['cat', 'frog', 'dog', 'frog', 'turtle', 'cat', 'dog']
What is the fastest way to do this in numpy if arrays are very large (millions of entries each)?
EDIT: This must be numpy only (no Pandas or other libs). As far as data cardinality goes, imagine both sets are ~10 million "animals" and ~100 million "habitats", and 200 million actual pairs output (most habitats will be empty, and some will have 5-10 animals). That means that generating the full cross product and filtering is a no-go. As are any loops; data is way too big.
Real arrays are integer IDs and have nothing to do with animals or habitats, but it's easier to read this way =)
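Since both arrays are already sorted by the join key, one loop-free sketch of the kind of approach I'm after (integer encodings below are my own toy stand-ins: cat=0, dog=1, frog=2, turtle=3; habitats tree=0, river=1, house=2; names hobbes..hypnotoad as 10..15) would be a vectorized sorted-merge join via np.searchsorted plus np.repeat:

```python
import numpy as np

# Toy encoded version of the example data; animal_kind is sorted.
animal_kind    = np.array([0, 0, 0, 1, 2, 2], dtype=np.int32)
animal_name    = np.array([10, 11, 12, 13, 14, 15], dtype=np.int32)
habitat_type   = np.array([0, 0, 1, 1, 1, 2, 2], dtype=np.int32)
habitat_animal = np.array([0, 2, 1, 2, 3, 0, 1], dtype=np.int32)

# Because animal_kind is sorted, each habitat row's matching animals
# occupy a contiguous slice [start, stop) found by binary search.
start = np.searchsorted(animal_kind, habitat_animal, side='left')
stop  = np.searchsorted(animal_kind, habitat_animal, side='right')
counts = stop - start  # matches per habitat row (0 for turtle)

# Repeat each habitat row once per matching animal...
out_habitat = np.repeat(habitat_type, counts)

# ...and compute, for every output slot, its position within its run
# (slot index minus the run's first output index), all without loops.
run_starts = np.cumsum(counts) - counts
offsets = np.arange(counts.sum()) - np.repeat(run_starts, counts)
out_animal = animal_name[np.repeat(start, counts) + offsets]
```

The memory cost is proportional to the output size (here 12 pairs), not to the cross product, which seems to fit the cardinality constraints described above.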
question from:
https://stackoverflow.com/questions/65928382/numpy-how-to-express-a-many-to-many-relationship