I tried to implement a regular-expression tokenizer with NLTK in Python, but the result is this:
>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)      # set flag to allow verbose regexps
... ([A-Z]\.)+              # abbreviations, e.g. U.S.A.
... | \w+(-\w+)*            # words with optional internal hyphens
... | \$?\d+(\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
... | \.\.\.                # ellipsis
... | [][.,;"'?():-_`]      # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]
But the result I want is this:
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
Why? Where is the mistake?
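
Update: my current guess is that the tuples come from re.findall, which returns the captured groups rather than the whole match whenever the pattern contains capturing groups. A minimal sketch of that idea, rewriting every group in the same pattern as non-capturing (?:...), does seem to produce the list I want:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)          # set flag to allow verbose regexps
...     (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*            # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.                  # ellipsis
...   | [][.,;"'?():-_`]        # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Is that the actual cause, or is something else going on?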