First I did some more tests using your code, and I can confirm that L"Преступление и наказание"
is a correct UTF-16 string. I checked the codes of the individual characters, and they are correctly 0x41f, 0x440, 0x435, 0x441, 0x442, 0x443, 0x43f, 0x43b, 0x435, 0x43d, 0x438, 0x435, 0x20, 0x438, 0x20, 0x43d, 0x430, 0x43a, 0x430, 0x437, 0x430, 0x43d, 0x438, 0x435.
I could not find any reference about it, but it looks like simply calling imbue is not enough. imbue is a method of basic_ios, which is an ancestor of cout and wcout. It does act on numeric conversions, but in all my tests it had no effect on the charset used for output.
By default, the locale used in a C++ (or C) program is ... the "C" locale, which knows nothing about Unicode. All printable ASCII characters (below 128) are output as is, and the others are replaced with a ?. That is exactly what your program does.
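A minimal sketch reproducing that behaviour (no setlocale call, so the default "C" locale is in effect):

// Minimal sketch of the problem: without any setlocale() call, the
// default "C" locale is used, so the non-ASCII wide characters are not
// converted properly for output (on the setup described above they
// show up as ?).
#include <iostream>

int main() {
    std::wcout << L"Преступление и наказание" << std::endl;   // mangled output
    return 0;
}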
To make it work correctly, you have to select a locale that knows about Unicode characters with setlocale. Once this is done, you can change the numeric conversion by calling imbue, and since you selected a Unicode charset everything will be fine.
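A minimal sketch of that order of operations, assuming your environment locale uses a UTF-8 charset: setlocale first so the C library knows the charset, then imbue for the numeric side.

// Sketch: setlocale() enables the charset conversion, imbue() makes the
// stream use locale-aware numeric formatting. Assumes the environment
// locale is a UTF-8 one.
#include <clocale>
#include <iostream>
#include <locale>

int main() {
    std::setlocale(LC_ALL, "");             // charset of the environment locale
    std::wcout.imbue(std::locale(""));      // numeric conversions per locale
    std::wcout << L"Преступление и наказание" << std::endl;
    std::wcout << 1234567 << std::endl;     // thousands grouping per the locale's numpunct
    return 0;
}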
So provided your current locale uses a UTF-8 charset, you only have to add
setlocale(LC_ALL, "");
as the first line of your program, and the output will be as expected:
0: "Преступление"
1: "и"
2: "наказание"
I counted 3 words.
and the last word was "наказание"
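For reference, a hypothetical reconstruction of a program of that shape (your actual code is not repeated here), with setlocale(LC_ALL, ""); added as the first statement:

// Hypothetical reconstruction: split a wide string into words and print
// them. With setlocale(LC_ALL, "") as the first statement, the Cyrillic
// output comes out correctly on a UTF-8 locale.
#include <clocale>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::setlocale(LC_ALL, "");

    std::wstring text = L"Преступление и наказание";
    std::wistringstream in(text);
    std::wstring word, last;
    int count = 0;
    while (in >> word) {                      // splits on whitespace
        std::wcout << count << L": \"" << word << L"\"" << std::endl;
        last = word;
        ++count;
    }
    std::wcout << L"I counted " << count << L" words." << std::endl;
    std::wcout << L"and the last word was \"" << last << L"\"" << std::endl;
    return 0;
}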
If your current locale does not use UTF-8, choose one that is installed on your system and that supports it. I used setlocale(LC_ALL, "fr_FR.UTF-8"); or even setlocale(LC_ALL, "en_US.UTF-8"); and both worked.
Edit:
In fact, the best way to correctly output Unicode to the screen is to use setlocale(LC_ALL, "");. It automatically adapts to the current charset. I tested with a stripped-down variant using the Latin-1 charset (my system natively speaks French, not Russian ...):
#include <clocale>
#include <iostream>
#include <locale>
using namespace std;
int main() {
    setlocale(LC_ALL, "");                // adopt the charset of the environment locale
    wchar_t ws[] = { 0xe8, 0xe9, 0 };     // code points for "èé"
    wcout << ws << endl;
    return 0;
}
I tried it under Linux with the UTF-8 charset and with ISO-8859-1 (Latin-1) (respectively export LANG=fr_FR.UTF-8 and export LANG=fr_FR.ISO-8859-1), and I correctly got èé in the proper charset. I also tried it under Windows XP, with codepage 850 (OEM) and 1252 (ANSI) (respectively chcp 850 and chcp 1252, with the Lucida Console font), and got èé on the console too.
Edit 2:
Of course, you can also set a global C++ locale with locale::global(locale("")); for the default locale, or locale::global(locale("ru_RU.UTF-8")); for the Russian locale, but that does more than simply calling setlocale. According to the documentation of the GNU implementation of the C++ Standard Library about locale: "there is only one relation (of the C++ locale mechanism) to the C locale mechanism: the global C locale is modified if a named C++ locale object is set as the global locale", that is, std::locale::global(std::locale("")); affects the C functions as if the following call was made: std::setlocale(LC_ALL, "");. On the other hand, there is no vice versa; calling setlocale has no effect whatsoever on the C++ locale mechanism, in particular on the working of locale("").
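A minimal sketch of that variant, using only the C++ locale machinery (the imbue call is there because wcout keeps the locale it was constructed with; locale::global does not change existing streams retroactively):

// Sketch using locale::global: setting a named C++ locale as the global
// locale also updates the global C locale, so no separate setlocale()
// call is needed. wcout is imbued explicitly because streams created
// before the call keep their old locale.
#include <iostream>
#include <locale>

int main() {
    std::locale::global(std::locale(""));   // also affects the C locale
    std::wcout.imbue(std::locale());        // copy of the new global locale
    std::wcout << L"Преступление и наказание" << std::endl;
    return 0;
}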
So it really looks like there is an underlying C library mechanism that must first be enabled with setlocale for the imbue conversion to work correctly.