Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - numpy: How to express a many-to-many relationship?

Say I have the following logical relationships:

# ANIMALS
cat = [hobbes, tigger, garfield]
dog = [lassie]
frog = [kermit, hypnotoad]

# HABITATS
tree = [cat, frog]
river = [dog, frog, turtle]
house = [cat, dog]

There will never be duplicates within a set. I want to match them into pairs of (animal, habitat):

(hobbes, tree)
(garfield, tree)
(tigger, tree)
(kermit, tree)
(hypnotoad, tree)
(lassie, river)
(kermit, river)
(hypnotoad, river)
(hobbes, house)
(garfield, house)
(tigger, house)
(lassie, house)

If this were SQL, it would be something like:

CREATE TABLE animal (animal_name, animal_type);
CREATE TABLE habitat_to_animal (habitat_type, animal_type);

SELECT
    animal.animal_name,
    habitat_to_animal.habitat_type
FROM animal
JOIN habitat_to_animal
    ON animal.animal_type = habitat_to_animal.animal_type

They're stored in numpy arrays (actual arrays are numpy.int32) like so:

# animals are sorted by animal_kind
animal_kind = ['cat', 'cat', 'cat', 'dog', 'frog', 'frog']
animal_name = ['hobbes', 'tigger', 'garfield', 'lassie', 'kermit', 'hypnotoad']

# habitats are sorted by habitat_type
habitat_type = ['tree', 'tree', 'river', 'river', 'river', 'house', 'house', 'house']
habitat_animal = ['cat', 'frog', 'dog', 'frog', 'turtle', 'cat', 'fish', 'dog']

What is the fastest way to do this in numpy if arrays are very large (millions of entries each)?

EDIT: This must be numpy only (no Pandas or other libs). As for data cardinality, imagine ~10 million "animals", ~100 million "habitats", and about 200 million pairs in the output (most habitats will be empty, and some will have 5-10 animals). That rules out generating the full cross product and filtering it. Loops are out too; the data is way too big.

Real arrays are integer IDs and have nothing to do with animals or habitats, but it's easier to read this way =)

question from:https://stackoverflow.com/questions/65928382/numpy-how-to-express-a-many-to-many-relationship

1 Reply

If you want selection capabilities like SQL, I'm going to suggest you think about Pandas for handling the data. It can happily store NumPy arrays, and it gives you very flexible filtering options for slicing the data in lots of different ways.
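As a rough illustration of that SQL-like flexibility, the JOIN from the question maps onto `DataFrame.merge`. This is only a sketch: the column names follow the question's SQL, and the sample data includes a house/dog row so that (lassie, house) appears as in the expected output.

```python
import pandas as pd

# Same shape as the SQL tables in the question.
animal = pd.DataFrame({
    "animal_name": ["hobbes", "tigger", "garfield", "lassie", "kermit", "hypnotoad"],
    "animal_type": ["cat", "cat", "cat", "dog", "frog", "frog"],
})
habitat_to_animal = pd.DataFrame({
    "habitat_type": ["tree", "tree", "river", "river", "river", "house", "house", "house"],
    "animal_type": ["cat", "frog", "dog", "frog", "turtle", "cat", "fish", "dog"],
})

# Inner join on animal_type, mirroring JOIN ... ON in the SQL.
# Kinds with no animals (turtle, fish) simply drop out.
pairs = animal.merge(habitat_to_animal, on="animal_type", how="inner")
pairs = pairs[["animal_name", "habitat_type"]]
```

An inner join keeps only rows whose `animal_type` appears in both tables, which is what the SQL in the question asks for.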

For what you're asking, I don't know if you even need to do much more than just make the DataFrame and iterate over it. Here's an example that tries to stick close to your SQL example:

import pandas as pd

animal_kind = ['cat', 'cat', 'cat', 'dog', 'frog', 'frog']
animal_name = ['hobbes', 'tigger', 'garfield', 'lassie', 'kermit', 'hypnotoad']

habitat_type = ['tree', 'tree', 'river', 'river', 'river', 'house', 'house', 'house']
habitat_animal = ['cat', 'frog', 'dog', 'frog', 'turtle', 'cat', 'fish', 'dog']

# Animals: one row per animal, indexed by name, with its kind in the "animal" column.
df_animal = pd.DataFrame(data=animal_kind, columns=["animal"], index=animal_name)

# Link table: one row per (animal kind, habitat) relationship.
habitat_to_animal = pd.DataFrame({"animal": habitat_animal, "habitat": habitat_type})

# For each link row, look up every animal of that kind and print the pair.
for i in range(len(habitat_to_animal)):
    animal, habitat = habitat_to_animal.loc[i]
    names = df_animal[df_animal.animal == animal].index
    for name in names:
        print(f"({name}, {habitat})")

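This produces your list as requested. Be aware that the loop scans df_animal once per habitat row, so the cost grows with the product of the two table sizes and it may not hold up at the sizes you mention. There are 3-4 different ways to iterate over rows in Pandas, and the fastest way depends on the data and what you're doing to it.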
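If the numpy-only constraint is firm, the same pairs can be produced without any Python-level loop. The sketch below is one possible approach, not a benchmarked solution. It relies on the question's statement that the animals are sorted by kind, and it uses `np.searchsorted` to find each habitat row's block of matching animals and `np.repeat` to expand the blocks. The helper name `join_pairs` and the integer encoding of the labels are made up for illustration.

```python
import numpy as np

# Hypothetical integer encoding for this sketch:
# kinds: cat=0, dog=1, frog=2, turtle=3, fish=4; habitats: tree=0, river=1, house=2
animal_kind = np.array([0, 0, 0, 1, 2, 2], dtype=np.int32)  # must be sorted
animal_name = np.array(["hobbes", "tigger", "garfield", "lassie", "kermit", "hypnotoad"])
habitat_type = np.array([0, 0, 1, 1, 1, 2, 2, 2], dtype=np.int32)
habitat_animal = np.array([0, 2, 1, 2, 3, 0, 4, 1], dtype=np.int32)
habitat_label = np.array(["tree", "river", "house"])


def join_pairs(sorted_kind, wanted_kind):
    """Return (animal_idx, habitat_idx) for every matching pair.

    sorted_kind must be sorted ascending; wanted_kind may be in any order.
    """
    # Half-open range [lo, hi) of animals whose kind equals each habitat row's kind.
    lo = np.searchsorted(sorted_kind, wanted_kind, side="left")
    hi = np.searchsorted(sorted_kind, wanted_kind, side="right")
    counts = hi - lo  # 0 where no animal has that kind
    total = int(counts.sum())

    # Each habitat row index, repeated once per matching animal.
    habitat_idx = np.repeat(np.arange(len(wanted_kind)), counts)

    # Position of each output element within its own block: 0, 1, ..., count-1.
    block_start = np.cumsum(counts) - counts
    within = np.arange(total) - np.repeat(block_start, counts)

    # Shift by the block's first animal index to get lo, lo+1, ..., hi-1.
    animal_idx = np.repeat(lo, counts) + within
    return animal_idx, habitat_idx


animal_idx, habitat_idx = join_pairs(animal_kind, habitat_animal)
pairs = list(zip(animal_name[animal_idx].tolist(),
                 habitat_label[habitat_type[habitat_idx]].tolist()))
```

The output never exceeds the number of real pairs, so no cross product is materialized. The temporaries are each as long as the output, though, so at around 200 million pairs memory use is worth checking before trusting this at full scale.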

