Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others



regex - detect any combining character in Java

I am looking for a way to detect whether a character in a Java string is a combining character. For instance,

String khmerCombiningVowel =
 new String(new byte[]{(byte) 0xe1,(byte) 0x9f,(byte) 0x80}, java.nio.charset.StandardCharsets.UTF_8); // U+17C0

represents a combining Khmer vowel sign. I have tried the regex \p{InCombiningDiacriticalMarks}, but it does not seem to match this particular combining character. Alternatively, if there is a comprehensive list of all Unicode combining character blocks, could I build a regex from them?



1 Reply


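According to the question "Algorithm to check for combining characters in Unicode", combining characters are spread across a number of Unicode blocks, so matching a single block is not enough.

Java has several helpful functions that work on the Unicode general category instead. Try: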

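import java.nio.charset.StandardCharsets;

// U+17C0, decoded from its UTF-8 bytes
String codePointStr = new String(new byte[]{(byte) 0xe1, (byte) 0x9f, (byte) 0x80}, StandardCharsets.UTF_8);
// The backslash must be doubled inside a Java string literal
System.out.println(codePointStr.matches("\\p{Mc}"));
System.out.println(
    Character.COMBINING_SPACING_MARK == Character.getType(codePointStr.codePointAt(0)));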

(prints true in both cases)

In this case, Character.COMBINING_SPACING_MARK and the equivalent regex \p{gc=Mc} both refer to the Unicode general category "Mark, Spacing Combining" (Mc), which covers characters that combine with a preceding character while also adding width.

Another regular expression that may be useful is \p{M}, which matches any kind of mark. To get the same behavior with the Character.getType() constants, check whether the type is COMBINING_SPACING_MARK, ENCLOSING_MARK, or NON_SPACING_MARK.

ENCLOSING_MARK covers marks that surround the character they combine with, such as an enclosing circle. These also add width.

NON_SPACING_MARK includes the combining diacritical marks used with the Latin alphabet, among others: marks that sit above or below the base character and add no width.
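
The two approaches above can be put side by side in a small helper. This is only a sketch: the class name CombiningMarks and the methods isCombiningMark and containsCombiningMark are made up for illustration, while Character.getType, the three mark constants, and the \p{M} pattern are the standard Java APIs already mentioned.

```java
import java.util.regex.Pattern;

public class CombiningMarks {
    // Matches a single code point in any Unicode mark category (Mn, Mc, or Me).
    private static final Pattern ANY_MARK = Pattern.compile("\\p{M}");

    /** Returns true if the code point's general category is Mn, Mc, or Me. */
    static boolean isCombiningMark(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.ENCLOSING_MARK:
                return true;
            default:
                return false;
        }
    }

    /** Returns true if the string contains at least one mark character. */
    static boolean containsCombiningMark(String s) {
        return ANY_MARK.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isCombiningMark(0x17C0)); // Khmer vowel sign, category Mc
        System.out.println(isCombiningMark(0x0301)); // combining acute accent, category Mn
        System.out.println(isCombiningMark(0x20DD)); // combining enclosing circle, category Me
        System.out.println(isCombiningMark('a'));    // plain letter, not a mark
        System.out.println(containsCombiningMark("e\u0301")); // 'e' followed by U+0301
    }
}
```

Taking an int code point rather than a char keeps the check usable for supplementary characters; with a String, pass s.codePointAt(i) as in the snippet above.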

