c# - Understanding Text Encoding (In .Net)

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have done very little with encoding of Text. Truthfully, I don't really even know what it means exactly.

For example, if I have something like:

Dim myStr as String = "Hello"

Is that 'encoded' in memory in a particular format? Does that format depend on what language I'm using?

If I were in another country, like China, for example, and I had a string of Chinese (Mandarin? My apologies if I'm using the wrong words here), would the following code (that I've used fine on English strings) still work the same?

System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
return encoding.GetBytes(str);

Or would it lose all meaning when you convert that .Net string to a UTF8Encoding when that conversion isn't valid?

Finally, I've worked with .Net for a few years now and I've never seen, heard, or had to do anything with Encoding. Am I the exception, or is it not a common thing to do?

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:38:57+0000

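The .NET string class stores text as UTF-16: each char is a 16-bit (2-byte) code unit, and most characters take exactly one. Characters that do not fit in 16 bits are stored as two code units, a so-called "surrogate pair", for 4 bytes in total. This is the same in every .NET language, since they all share the same string type.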

UTF-8, on the other hand, uses a variable number of bytes per Unicode character: one byte for plain ASCII characters, typically three bytes for a Chinese character, and at most four bytes for any character. Both encodings can represent every Unicode character, so there is always a mapping between them; they are simply different binary representations (for storing in memory or on disk) of the same Unicode character set. Your GetBytes call therefore works the same on Chinese text as it does on English text.
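To make the byte counts concrete, here is a minimal sketch that encodes an English and a Chinese string and converts the result back. The class name EncodingDemo and the helpers ToUtf8 and FromUtf8 are made up for illustration; the Encoding calls are the standard System.Text ones. The Chinese string is written with \u escapes so the example does not depend on how the source file is saved.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Encodes a .NET (UTF-16) string as UTF-8 bytes, like the code in the question.
    public static byte[] ToUtf8(string s) => Encoding.UTF8.GetBytes(s);

    // Decodes UTF-8 bytes back into a .NET string.
    public static string FromUtf8(byte[] bytes) => Encoding.UTF8.GetString(bytes);

    public static void Main()
    {
        string english = "Hello";
        string chinese = "\u4F60\u597D"; // two Chinese characters ("ni hao")

        Console.WriteLine(ToUtf8(english).Length);                 // 5: one byte per ASCII character
        Console.WriteLine(ToUtf8(chinese).Length);                 // 6: three bytes per character here
        Console.WriteLine(Encoding.Unicode.GetByteCount(chinese)); // 4: two bytes per UTF-16 code unit
        Console.WriteLine(FromUtf8(ToUtf8(chinese)) == chinese);   // True: the round trip loses nothing
    }
}
```

Encoding.Unicode is the .NET name for UTF-16 (little-endian), so the third line shows how many bytes the same string takes in the in-memory encoding.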

Because not all Unicode characters fit into the 2 bytes originally reserved by UTF-16, the format lets two 16-bit code units combine to represent a single 4-byte character. Such a combination is called a surrogate pair: a pair of 16-bit encoding values that together represent one character.
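As a rough illustration of surrogate pairs, the sketch below builds a string from U+20000, a rare Chinese character that lies outside the 16-bit range, and inspects it. SurrogateDemo and CountCodePoints are hypothetical names chosen for this example; the char and Encoding methods are the standard .NET ones.

```csharp
using System;
using System.Text;

public static class SurrogateDemo
{
    // Counts Unicode characters (code points) rather than UTF-16 code units.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A high surrogate followed by a low surrogate is one character,
            // so skip the second half of the pair.
            if (char.IsSurrogatePair(s, i)) i++;
            count++;
        }
        return count;
    }

    public static void Main()
    {
        // U+20000 does not fit in 16 bits, so .NET stores it as a surrogate pair.
        string rare = char.ConvertFromUtf32(0x20000);

        Console.WriteLine(rare.Length);                          // 2 UTF-16 code units
        Console.WriteLine(char.IsHighSurrogate(rare[0]));        // True
        Console.WriteLine(char.IsLowSurrogate(rare[1]));         // True
        Console.WriteLine(CountCodePoints(rare));                // 1 actual character
        Console.WriteLine(Encoding.Unicode.GetByteCount(rare));  // 4 bytes in UTF-16
        Console.WriteLine(Encoding.UTF8.GetByteCount(rare));     // 4 bytes in UTF-8
    }
}
```

The practical consequence is that string.Length counts code units, not characters, so code that indexes or truncates strings by position can split a pair in half.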

UTF-8 does not need surrogate pairs, since its number of bytes per Unicode character is not fixed to begin with. A good general overview of UTF-8, UTF-16 and BOMs can be found here.

An excellent introduction to Unicode and character encoding is Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".
