Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
520 views
in Technique[技术] by (71.8m points)

regex - Matching (e.g.) a Unicode letter with Java regexps

There are many questions and answers here on StackOverflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However with Unicode there are many more characters that most people would regard as a letter (all the Greek letters, Cyrllic .. and many more. Unicode defines many blocks each of which may have "letters".

The Java definition defines Posix classes for things like alpha characters, but that is specified to only work with US-ASCII. The predefined character classes define words to consist of [a-zA-Z_0-9], which also excludes many letters.

So how do you properly match against Unicode strings? Is there some other library that gets this right?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Here you have a very nice explanation:

http://www.regular-expressions.info/unicode.html

Some hints:

"Java and .NET unfortunately do not support X (yet). Use P{M}p{M}* as a substitute. To match any number of graphemes, use (?:P{M}p{M}*)+ instead of X+."

"In Java, the regex token uFFFF only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax uFFFF is also used to insert Unicode characters into literal strings in the Java source code. Pattern.compile("u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles u00E0. Depending on what you're doing, the difference may be significant."


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...