Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


string - Check if byte sequence contains utf-16

I am reading a byte sequence from a stream. Assume, for the sake of argument, that the sequence is of a fixed length and that I read the whole thing into a byte array (in my case it's a vector<char>, but that's not important for this question). This byte sequence contains a string, which may be encoded in either UTF-16 or UTF-8. Unfortunately, there's no indicator of which one it is.

I can verify whether the byte sequence is valid UTF-16, and also whether it is valid UTF-8, but I can also imagine how the same sequence of bytes could be valid UTF-8 and valid UTF-16 at the same time.

So, does that mean there's no way to generically figure out which one it is?


1 Reply


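If the content is expected to be written in a language that uses the Latin script, simply checking for null bytes will detect UTF-16: most Latin-script characters are below U+0100, so one byte of each 16-bit code unit is zero. In UTF-8, a null byte can only decode to the NUL control character, which does not normally appear in text.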

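Text written in other scripts is very unlikely to be fully valid in both UTF-16 and UTF-8 unless it is short or artificially constructed to be so. It is not impossible, though: the single character U+4E2D is the bytes 4E 2D in big-endian UTF-16, which is also the valid UTF-8 string "N-". Treat the result as a heuristic rather than a proof.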

So, first check whether the data is a fully valid UTF-8 sequence on its own:

  • If yes, check for null bytes, and if there are some, it's UTF-16. Otherwise it's UTF-8.
  • If not, it's UTF-16.
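The two-step check above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the names `Encoding`, `is_valid_utf8`, `contains_null` and `guess_encoding` are made up for this example, and the validator is a straightforward hand-written one that rejects overlong forms, surrogates and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
enum class Encoding { Utf8, Utf16 };

// Returns true if the bytes form well-formed UTF-8.
bool is_valid_utf8(const std::vector<char>& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) { ++i; continue; }  // single-byte (ASCII) character

        std::size_t len = 0;
        unsigned long cp = 0;
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong encodings, surrogates and out-of-range values.
        if (len == 2 && cp < 0x80) return false;
        if (len == 3 && cp < 0x800) return false;
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        i += len;
    }
    return true;
}

// Returns true if any byte is zero.
bool contains_null(const std::vector<char>& bytes) {
    for (char c : bytes) {
        if (c == '\0') return true;
    }
    return false;
}

// Heuristic: invalid UTF-8, or valid UTF-8 containing null bytes, is
// assumed to be UTF-16; everything else is assumed to be UTF-8.
Encoding guess_encoding(const std::vector<char>& bytes) {
    if (!is_valid_utf8(bytes)) return Encoding::Utf16;
    return contains_null(bytes) ? Encoding::Utf16 : Encoding::Utf8;
}
```

Calling `guess_encoding` on the bytes read from the stream gives the guess. Note that an empty buffer is reported as UTF-8 here, which is an arbitrary choice of this sketch.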

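If the above results in UTF-16, that is not enough, because you also need to know the endianness. For text in the Latin script, the position of the null bytes tells you: if they fall mostly at odd offsets (counting from 0), the data is little-endian; if they fall mostly at even offsets, it is big-endian.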
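A rough sketch of that byte-order guess, assuming the data has already been classified as UTF-16 and is mostly Latin-script text. The names `ByteOrder` and `guess_utf16_byte_order` are hypothetical, and the simple majority rule is an assumption of this example, not something a standard prescribes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
enum class ByteOrder { Little, Big, Unknown };

// Counts null bytes at even and odd offsets (0-based). For Latin-script
// text the high byte of each code unit is usually zero, so the zeros sit
// at odd offsets in little-endian data and at even offsets in big-endian data.
ByteOrder guess_utf16_byte_order(const std::vector<char>& bytes) {
    std::size_t even_nulls = 0;
    std::size_t odd_nulls = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == '\0') {
            if (i % 2 == 0) ++even_nulls;
            else ++odd_nulls;
        }
    }
    if (odd_nulls > even_nulls) return ByteOrder::Little;
    if (even_nulls > odd_nulls) return ByteOrder::Big;
    return ByteOrder::Unknown;  // no nulls, or an exact tie
}
```

When the function returns `ByteOrder::Unknown`, the null-byte counts give no evidence either way, and some other hint (such as the expected script) would be needed.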

