python - nltk regular expression tokenizer

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I tried to implement a regular expression tokenizer with NLTK in Python, but this is the result I get:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]

But the wanted result is this:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Why? Where is the mistake?

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:53:48+0000

You should turn all capturing groups into non-capturing ones:

([A-Z]\.)+ -> (?:[A-Z]\.)+
\w+(-\w+)* -> \w+(?:-\w+)*
\$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?

The issue is that regexp_tokenize appears to use re.findall, which returns the contents of the capturing groups instead of the whole match whenever the pattern contains capturing groups (a list of tuples when there is more than one group). See this nltk.tokenize package reference:

pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:...), instead)
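The behaviour described above can be reproduced with the standard re module alone. This is a minimal sketch that assumes regexp_tokenize delegates to re.findall, as suggested above; the sample string and the variable names are only illustrative.

```python
import re

sample = "U.S.A. poster-print"

# Two capturing groups: findall yields one tuple per match, holding only the
# group contents (an empty string for a group that did not take part).
with_groups = re.findall(r"([A-Z]\.)+|\w+(-\w+)*", sample)

# Same pattern with non-capturing groups: findall yields the whole matches.
without_groups = re.findall(r"(?:[A-Z]\.)+|\w+(?:-\w+)*", sample)

print(with_groups)     # [('A.', ''), ('', '-print')]
print(without_groups)  # ['U.S.A.', 'poster-print']
```

Note that a repeated capturing group keeps only its last repetition, which is why the first tuple holds 'A.' rather than 'U.S.A.'.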

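Also, I doubt you meant to write :-_ inside the character class: it is a range from : to _ that includes all uppercase ASCII letters, among other characters. Put the - at the end of the character class so that it is treated as a literal hyphen.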
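To see what the accidental range covers, the sketch below lists every character between : and _ and compares the two character classes. The shortened classes [:-_] and [:_-] are simplified stand-ins for the full class in the pattern, chosen only for illustration.

```python
import re

# Every character covered by the accidental range ":-_" (code points 0x3A to 0x5F).
in_range = "".join(chr(c) for c in range(ord(":"), ord("_") + 1))
print(in_range)

accidental = re.compile(r"[:-_]")  # a range from ":" to "_"
intended = re.compile(r"[:_-]")    # the literals ":", "_" and "-"

print(bool(accidental.fullmatch("A")))  # True: uppercase letters fall inside the range
print(bool(accidental.fullmatch("-")))  # False: the hyphen itself is not in the range
print(bool(intended.fullmatch("A")))    # False
print(bool(intended.fullmatch("-")))    # True
```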

Thus, use

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
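As a quick check, the corrected pattern can be run against the text from the question. The sketch below uses re.findall as a stand-in so that it runs without NLTK installed; this assumes, as discussed above, that nltk.regexp_tokenize(text, pattern) behaves the same way for this input.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''

# Stand-in for nltk.regexp_tokenize(text, pattern); assumed equivalent here.
tokens = re.findall(pattern, text)
print(tokens)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```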
