So, if I want to deal with Unicode characters, should I use wchar_t?
First of all, note that the encoding does not force you to use any particular type to represent a character. You can use char to hold Unicode text just as well as wchar_t; you only have to remember that one code point may span several code units. With 8-bit char, a code point takes 1 to 4 units in UTF-8, 2 or 4 in UTF-16, and always 4 in UTF-32, while wchar_t needs 1 unit (UTF-32 on Linux, etc.) or up to 2 working together (UTF-16 on Windows).
Next, there is no single definitive Unicode encoding. Some encodings use a fixed width for every code point (UTF-32), while others (UTF-8 and UTF-16) are variable-length: in UTF-8 the letter 'a' takes just 1 byte, but most characters outside the basic Latin alphabet take 2 to 4 bytes.
So you have to decide what kind of characters you want to represent and then choose your encoding accordingly, because this affects the number of bytes your data will take. E.g. using UTF-32 to represent mostly English text will lead to many 0-bytes. UTF-8 is a better choice for many Latin-based languages, while UTF-16 is usually more compact for East Asian languages (most CJK characters take 2 bytes in UTF-16 but 3 in UTF-8).
Once you have decided on this, you should minimize the number of conversions and stay consistent with your decision.
In the next step, you may decide what data type is appropriate to represent the data (or what kind of conversions you may need).
If you want to manipulate or interpret text on a code-point basis, char is certainly not the way to go if you have, e.g., Japanese kanji. But if you just want to pass your data along and regard it as nothing more than a sequence of bytes, you can simply go with char.
The link to UTF-8 Everywhere was already posted as a comment, and I suggest having a look there as well. Another good read is What every programmer should know about encodings.
As of now, there is only rudimentary language support for Unicode in C++ (such as the char16_t and char32_t data types and the u8/u/U literal prefixes). So choosing a library for managing encodings (especially conversions) is certainly good advice.