c++ - Unicode conversion issues

Question

posted Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

Here is a beginner question on Unicode. I'm using Embarcadero C++ Builder 2009, where they supposedly changed the default strings to use Unicode.

I type various symbols in my source editor that aren't part of standard 7-bit ASCII.
My program is using the String type of C++ Builder to fetch user input.
I am also adding input manually by setting a value to a wchar_t.

It would seem that there are conflicts in how the symbols are interpreted. Sometimes I get a symbol with, for example, the code 0x00C7 ('Ç'), but sometimes the same symbol is coded as 0xFFC7, for example in the source code editor. To my understanding, the former is proper Unicode and the latter is "something else". Can someone confirm this?

I wonder where this "something else" encoding is coming from, and how to get rid of it?

EDIT: Further research: it seems that one place where the 0xFF** encoding appears is when I do something like this:

string str = ...;
wchar_t wch = (wchar_t)str[i];

Same result no matter if it is std::string or VCL String. Is wchar_t not the same as Unicode?

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:26:51+0000

I'm guessing the problem is that char is signed in your compiler (the standard allows plain char to be either signed or unsigned; the choice is implementation-defined). As a result, whenever you convert a char that has bit 7 set (values 0x80 through 0xFF) to a larger integer type, it is treated as a negative value and sign-extended to preserve that value. In other words, bit 7 is copied into bit 8, bit 9, and all higher bits of the larger type. So 0xC7 can turn into 0xFFC7 (16-bit) or 0xFFFFFFC7 (32-bit). To prevent that, cast the char to unsigned char first, before converting it to the wider type.
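
As a rough illustration of the cast described above, here is a minimal sketch. The function names (widen_naive, widen_safe, widen_bytes) are made up for this example and are not part of the standard library or the VCL. It also assumes the bytes are Latin-1 (ISO 8859-1), where each byte value equals its Unicode code point; for other encodings a real conversion routine would be needed.

```cpp
#include <string>

// Hypothetical helper: converts char straight to wchar_t.
// Where plain char is signed, a byte such as 0xC7 is negative and the
// conversion sign-extends it, which is the 0xFFC7 effect from the question.
inline wchar_t widen_naive(char c) {
    return static_cast<wchar_t>(c);
}

// Hypothetical helper: goes through unsigned char first, so the value
// stays in the range 0..255 no matter whether plain char is signed.
inline wchar_t widen_safe(char c) {
    return static_cast<wchar_t>(static_cast<unsigned char>(c));
}

// Widens a whole byte string one char at a time using the safe cast.
// Assumption: the input is Latin-1, so byte value == code point.
inline std::wstring widen_bytes(const std::string& s) {
    std::wstring out;
    out.reserve(s.size());
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        out.push_back(widen_safe(s[i]));
    }
    return out;
}
```

With this, widen_safe('\xC7') gives 0x00C7 whether or not char is signed, while widen_naive('\xC7') may give 0xFFC7 on a compiler with signed char and a 16-bit wchar_t.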
