python - How to avoid double-extracting of overlapping patterns in SpaCy with Matcher?

Question


posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)


I need to extract combinations of items from two lists using the spaCy Matcher. The problem is the following. Suppose we have two lists:

colors=['red','bright red','black','brown','dark brown']
animals=['fox','bear','hare','squirrel','wolf']

I match the sequences with the following code:

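import spacy
from spacy.matcher import Matcher

# Assumption: the original post does not show how nlp was created;
# a blank English pipeline is enough for TEXT-based patterns.
nlp = spacy.blank("en")

# Split two-word colors into first/second word; keep one-word colors separate
first_color=[]
last_color=[]
only_first_color=[]
for color in colors:
    if ' ' in color:
        first_color.append(color.split(' ')[0])
        last_color.append(color.split(' ')[1])
    else:
        only_first_color.append(color)
matcher = Matcher(nlp.vocab)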

pattern1 = [{"TEXT": {"IN": only_first_color}},{"TEXT":{"IN": animals}}]
pattern2 = [{"TEXT": {"IN": first_color}},{"TEXT": {"IN": last_color}},{"TEXT":{"IN": animals}}]

# spaCy v2 signature; in spaCy v3 this is matcher.add("ANIMALS", [pattern1, pattern2])
matcher.add("ANIMALS", None, pattern1,pattern2)

doc = nlp('bright red fox met black wolf')

matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(start, end, span.text)

It gives the output:

0 3 bright red fox
1 3 red fox
4 6 black wolf

How can I extract only 'bright red fox' and 'black wolf'? Should I change the pattern rules or post-process the matches?

Any thoughts are appreciated!


1 Reply

深蓝 · Answer 1 · 2021-10-23T18:21:36+0000

You may use spacy.util.filter_spans:

Filter a sequence of Span objects and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.

Python code:

matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
for span in spacy.util.filter_spans(spans):
    print(span.start, span.end, span.text)

Output:

0 3 bright red fox
4 6 black wolf
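
To see why 'red fox' is dropped, here is a minimal pure-Python sketch of the rule quoted above (longest span first, earlier span on ties, no shared tokens). It works on plain (start, end) pairs rather than Span objects, and filter_overlaps is a hypothetical helper written for illustration, not spaCy's own implementation.

```python
def filter_overlaps(matches):
    """Keep the longest non-overlapping (start, end) pairs.

    Hypothetical helper that mirrors the documented rule: longer spans
    win, and among spans of equal length the earlier one wins.
    """
    # Sort by length (longest first), then by start position (earliest first).
    ordered = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        tokens = range(start, end)
        # Skip any span that shares a token with an already accepted span.
        if seen_tokens.isdisjoint(tokens):
            kept.append((start, end))
            seen_tokens.update(tokens)
    # Return the survivors in document order.
    return sorted(kept)


# (start, end) pairs taken from the question's output
matches = [(0, 3), (1, 3), (4, 6)]
print(filter_overlaps(matches))  # [(0, 3), (4, 6)]
```

The pair (1, 3) is rejected because tokens 1 and 2 already belong to the longer pair (0, 3), which corresponds to 'bright red fox'.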
