python - spacy adding special case tokenization rules by regular expression or pattern

Question

Welcome To Ask or Share your Answers For Others

python - spacy adding special case tokenization rules by regular expression or pattern

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - spacy adding special case tokenization rules by regular expression or pattern

I want to add special case for tokenization in spacy according to the documentation. The documentation shows how specific words can be considered as special cases. I want to be able to specify a pattern (e.g. a suffix). For example, I have a string like this

text = "A sample string with <word-1> and <word-2>"

where <word-i> specifies a single word.

I know I can have it for one special case at a time by the following code. But how can I specify a pattern for that?

import spacy
from spacy.symbols import ORTH
nlp = spacy.load('en', vectors=False,parser=False, entity=False) 
nlp.tokenizer.add_special_case(u'<WORD>', [{ORTH: u'<WORD>'}])

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T20:04:53+0000

You can use regex matches to find bounds of your special case strings, and then use spacy's merge method to merge them as single token. The add_special_case works only for defined words. Here is an example:

>>> import spacy
>>> import re
>>> nlp = spacy.load('en')
>>> my_str = u'Tweet hashtags #MyHashOne #MyHashTwo'
>>> parsed = nlp(my_str)
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#', u'NOUN'), (u'MyHashOne', u'NOUN'), (u'#', u'NOUN'), (u'MyHashTwo', u'PROPN')]
>>> indexes = [m.span() for m in re.finditer('#w+',my_str,flags=re.IGNORECASE)]
>>> indexes
[(15, 25), (26, 36)]
>>> for start,end in indexes:
...     parsed.merge(start_idx=start,end_idx=end)
... 
#MyHashOne
#MyHashTwo
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#MyHashOne', u'NOUN'), (u'#MyHashTwo', u'PROPN')]
>>>

Categories

python - spacy adding special case tokenization rules by regular expression or pattern

python - spacy adding special case tokenization rules by regular expression or pattern

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags