python - Extracting whole words

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I have a large set of real-world text that I need to pull words out of to input into a spell checker. I'd like to extract as many meaningful words as possible without too much noise. I know there's plenty of regex ninjas around here, so hopefully someone can help me out.

Currently I'm extracting all alphabetical sequences with '[a-z]+'. This is an okay approximation, but it drags a lot of rubbish out with it.

Ideally I would like some regex (doesn't have to be pretty or efficient) that extracts all alphabetical sequences delimited by natural word separators (such as [/-_,.: ] etc.), and ignores any alphabetical sequences with illegal bounds.

However I'd also be happy to just be able to get all alphabetical sequences that ARE NOT adjacent to a number. So for instance 'pie21' would NOT extract 'pie', but 'http://foo.com' would extract ['http', 'foo', 'com'].

I tried lookahead and lookbehind assertions, but they were applied per-character (so for example re.findall(r'(?<!\d)[a-z]+(?!\d)', 'pie21') would return ['pi'] when I want it to return nothing). I tried wrapping the alpha part as a term ((?:[a-z]+)) but it didn't help.

More detail: The data is an email database, so it's mostly plain English with normal numbers, but occasionally there are rubbish strings like GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA and AC7A21C0 that I'd like to ignore completely. I'm assuming any alphabetical sequence with a number in it is rubbish.

1 Reply

深蓝 · Answer 1 · 2021-10-17T01:13:27+0000

If you restrict yourself to ASCII letters, then use (with the re.I option set)

\b[a-z]+\b

\b is a word boundary anchor, matching only at the start and end of alphanumeric "words". So \b[a-z]+\b matches pie, but not pie21 or 21pie.
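As a rough sketch of how the word-boundary pattern behaves on the inputs from the question (the names WORD_RE and extract_words are made up for illustration, and the pattern is written as a raw string so the backslashes reach the regex engine):

```python
import re

# ASCII-only whole words. re.I lets [a-z] match upper-case letters too.
# The raw string keeps \b as a regex word boundary, not a backspace character.
WORD_RE = re.compile(r"\b[a-z]+\b", re.I)

def extract_words(text):
    """Return alphabetic runs that have a word boundary on both sides."""
    return WORD_RE.findall(text)

print(extract_words("pie21"))                # []
print(extract_words("http://foo.com"))       # ['http', 'foo', 'com']
print(extract_words("AC7A21C0 is rubbish"))  # ['is', 'rubbish']
```

One caveat: \b counts the underscore as a word character, so a string like foo_bar yields no matches at all with this pattern. If underscores should act as separators, they would need to be replaced with spaces first or handled with explicit lookarounds.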

To also allow other non-ASCII letters, you can use something like this:

\b[^\W\d_]+\b

which also allows accented characters etc. You may need to set the re.UNICODE option, especially when using Python 2, in order to allow the \w shorthand to match non-ASCII letters.

[^\W\d_] as a negated character class allows any alphanumeric character except for digits and underscore.
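A minimal sketch of the non-ASCII variant, assuming Python 3, where str patterns are Unicode-aware by default and no extra flag is needed (UNICODE_WORD_RE and extract_unicode_words are illustrative names, not from the answer):

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which leaves letters only, including accented and other non-ASCII letters.
UNICODE_WORD_RE = re.compile(r"\b[^\W\d_]+\b")

def extract_unicode_words(text):
    """Return letter-only runs bounded by word boundaries."""
    return UNICODE_WORD_RE.findall(text)

# \u00e9 is e-acute, \u00ef is i-diaeresis; escapes avoid source-encoding issues.
sample = "caf\u00e9 na\u00efve pie21 AC7A21C0"
print(extract_unicode_words(sample))  # the two accented words only
```

On Python 2 the same idea would need a unicode pattern and input together with the re.UNICODE flag, as the answer notes.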