Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - nltk regular expression tokenizer

I tried to implement a regular expression tokenizer with NLTK in Python, but this is the result I get:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]

But the wanted result is this:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Why? Where is the mistake?


1 Reply


You should turn all capturing groups to non-capturing:

  • ([A-Z]\.)+ -> (?:[A-Z]\.)+
  • \w+(-\w+)* -> \w+(?:-\w+)*
  • \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?

The issue is that regexp_tokenize seems to use re.findall, which returns a list of tuples of the captured substrings, rather than the full matches, when the pattern defines multiple capturing groups. See this nltk.tokenize package reference:

pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:...), instead)
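To see the effect in isolation, here is a minimal sketch that calls re.findall directly (plain re, not NLTK; the shortened two-alternative pattern and the variable names are only for illustration):

```python
import re

text = "U.S.A. poster-print"

# Two capturing groups: findall returns one tuple per match, holding only
# what each group captured last (or '' if the group did not take part).
capturing = re.findall(r"([A-Z]\.)+|\w+(-\w+)*", text)

# Same pattern with non-capturing groups: findall returns the full matches.
non_capturing = re.findall(r"(?:[A-Z]\.)+|\w+(?:-\w+)*", text)

print(capturing)      # [('A.', ''), ('', '-print')]
print(non_capturing)  # ['U.S.A.', 'poster-print']
```

The tuples in the first result have the same shape as the unwanted output in the question.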

Also, I am not sure you meant to write :-_ there: inside a character class it is a range from : to _, which includes all uppercase letters. Put the - at the end of the character class so that it matches a literal hyphen.
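A quick way to check the range behaviour, as a small sketch with plain re (the sample string "A-Z_:" is made up for the test):

```python
import re

sample = "A-Z_:"

# ':-_' is read as a range from ':' (0x3A) to '_' (0x5F), so it matches
# uppercase letters but not the hyphen itself (0x2D).
range_hits = re.findall(r"[:-_]", sample)

# With the hyphen last, the class holds three literals: ':', '_' and '-'.
literal_hits = re.findall(r"[:_-]", sample)

print(range_hits)    # ['A', 'Z', '_', ':']
print(literal_hits)  # ['-', '_', ':']
```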

Thus, use

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
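
As a sanity check, the corrected pattern can be run through re.findall on the sample sentence. This is only a sketch: it assumes regexp_tokenize behaves like re.findall for this pattern, as guessed above, and it uses plain re so that it runs without NLTK installed.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

pattern = r'''(?x)          # verbose mode: whitespace and comments are ignored
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''

# No capturing groups remain, so each list item is a whole token.
tokens = re.findall(pattern, text)
print(tokens)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

With NLTK available, nltk.regexp_tokenize(text, pattern) would be the call to use in place of re.findall.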
