Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
420 views
in Technique[技术] by (71.8m points)

java - Regex for matching Unicode pattern

I am trying to validate a file's content when is uploaded and I am stuck at the Unicode encoding. I am not interested to find Unicode special characters, that are not in the ASCII range. I am trying to find if the content of the file contains at least one Unicode pattern, like u0046 for example.

For example, I exclude any file that contains the 'script' word, but what if the file contains this word written in Unicode? Sure, Java decodes it into a normal string when it reads the content, but what if I can't rely on this?

So, as far as I have searched on the Internet, I've seen Unicode characters written like u0046, or like U+0046. Based on this, I have written the following regex:

(\u|U+)....

This means, u or U+ followed by four characters. This pattern accomplishes what I desire, but I wonder if there are any other ways to write a Unicode character. It is always u or U+? Can it be more or less than 4 characters after u or U+?

Thanks

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The notation U+Any-number-of-hex-digits belongs to Unicode will not be functional anywhere in code. In java source code and *.properties u followed by four hex digits is a UTF-16 encoding of Unicode, automatically parsed.

The pattern to search for that:

"\\u[0-9A-Fa-f]{4}"

Or a String.contains on:

"\u"

In other languages than Java Uxxxxxx (six hex chars) is possible, for the full UTF-32 range. Unfortunately upto Java 8 not so.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...