Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - C# Regular Expressions with \Uxxxxxxxx characters in the pattern

Regex.IsMatch( "foo", "[\U00010000-\U0010FFFF]" )

Throws: System.ArgumentException: parsing "[-]" - [x-y] range in reverse order.

Looking at the hex values for \U00010000 and \U0010FFFF, I get 0xd800 0xdc00 for the first character and 0xdbff 0xdfff for the second.

So I guess I really have just one problem: why are the Unicode characters formed with \U split into two chars in the string?


1 Reply


They're surrogate pairs. Look at the values: they're over 65535. A char is only a 16-bit value. How would you express 65536 in only 16 bits?
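To make the split concrete, here is a minimal sketch (written as a C# top-level program; the variable names are illustrative only) that inspects the UTF-16 code units behind the two \U escapes. It also suggests why the pattern fails: inside the character class the engine sees four separate chars, so the range it tries to parse is \uDC00-\uDBFF, which is in reverse order.

```csharp
using System;

// The two end points of the range used in the question's pattern.
string first = "\U00010000";
string last = "\U0010FFFF";

// Each \U escape above the BMP is stored as two UTF-16 code units (chars).
Console.WriteLine(first.Length);                                // 2
Console.WriteLine($"{(int)first[0]:X4} {(int)first[1]:X4}");    // D800 DC00
Console.WriteLine($"{(int)last[0]:X4} {(int)last[1]:X4}");      // DBFF DFFF

// The framework can convert between a code point and its surrogate pair.
string pair = char.ConvertFromUtf32(0x10000);
int codePoint = char.ConvertToUtf32(pair, 0);
bool isPair = char.IsSurrogatePair(pair, 0);

Console.WriteLine(pair == first);          // True
Console.WriteLine($"{codePoint:X}");       // 10000
Console.WriteLine(isPair);                 // True
```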

Unfortunately, it's not clear from the documentation how (or whether) the regular expression engine in .NET copes with characters outside the Basic Multilingual Plane. (The \uxxxx pattern in the regular expression documentation only covers 0-65535, just like \uxxxx as a C# escape sequence.)

Is your real regular expression bigger, or are you actually just trying to see if there are any non-BMP characters in there?
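If the goal is only to detect non-BMP characters, one possible approach (an assumption on my part, not something confirmed for the asker's real use case) is to match the surrogate pair explicitly: a high surrogate in \uD800-\uDBFF followed by a low surrogate in \uDC00-\uDFFF. The sketch below uses only \uxxxx escapes, which stay within the 0-65535 range; the sample strings and the ContainsNonBmp helper are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// High surrogate followed by low surrogate: together they encode one
// code point in the range U+10000..U+10FFFF.
var nonBmp = new Regex(@"[\uD800-\uDBFF][\uDC00-\uDFFF]");

bool plain = nonBmp.IsMatch("foo");
bool astral = nonBmp.IsMatch("a\U00010000b");

Console.WriteLine(plain);    // False
Console.WriteLine(astral);   // True

// Hypothetical helper: the same check without a regular expression.
static bool ContainsNonBmp(string s)
{
    for (int i = 0; i < s.Length - 1; i++)
    {
        if (char.IsSurrogatePair(s, i))
        {
            return true;
        }
    }
    return false;
}

Console.WriteLine(ContainsNonBmp("foo"));            // False
Console.WriteLine(ContainsNonBmp("a\U0010FFFFb"));   // True
```

Both checks only look for well-formed pairs; a lone surrogate is not reported.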

