Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Umlauts in regexp matching (via locale?)

I'm surprised that I'm not able to match a German umlaut in a regexp. I tried several approaches, most involving setting locales, but up to now to no avail.

import locale
import re

locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
re.findall(r'\w+', 'abc def g\xfci jkl', re.L)
re.findall(r'\w+', 'abc def g\xc3\xbci jkl', re.L)
re.findall(r'\w+', 'abc def güi jkl', re.L)
re.findall(r'\w+', u'abc def güi jkl', re.L)

None of these versions matches the u-umlaut (ü) correctly with \w+. Removing the re.L flag or prefixing the pattern string with u (to make it unicode) did not help either.

Any ideas? How is the flag re.L used correctly?

1 Reply

Have you tried to use the re.UNICODE flag, as described in the doc?

>>> re.findall(r'\w+', 'abc def güi jkl', re.UNICODE)
['abc', 'def', 'g\xc3\xbci', 'jkl']

A quick search points to this thread that gives some explanation:

re.LOCALE just passes the character to the underlying C library. It really only works on bytestrings which have 1 byte per character. UTF-8 encodes codepoints outside the ASCII range to multiple bytes per codepoint, and the re module will treat each of those bytes as a separate character.
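
To make the byte-versus-codepoint point concrete, here is a minimal sketch. It assumes Python 3 (the snippets above are Python 2), where str patterns are Unicode-aware by default, so neither re.L nor re.UNICODE is needed. The variable names are only illustrative.

```python
import re

# 'g\u00fci' is "güi"; the escape keeps this file independent of its encoding.
text = 'abc def g\u00fci jkl'

# Matching on decoded text (str): \w covers Unicode letters such as ü.
words = re.findall(r'\w+', text)

# Matching on UTF-8 bytes: a bytes pattern only treats ASCII as \w,
# so the two bytes that encode ü split the word in two.
raw = text.encode('utf-8')
byte_words = re.findall(rb'\w+', raw)

# Decoding the bytes first restores the expected result.
decoded_words = re.findall(r'\w+', raw.decode('utf-8'))

print(words)
print(byte_words)
print(decoded_words)
```

In short, decode to text before matching rather than relying on a locale setting. In Python 2 the rough equivalent would be a unicode string combined with re.UNICODE.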

