I have narrowed the problem down to the strcoll() function; it is not related to Unicode normalization.

Recap: my minimal example showing that uniq behaves differently depending on the current locale was:
$ echo -e "\xc9\xa2\n\xc9\xac" > test.txt
$ cat test.txt
?
?
$ LC_COLLATE=C uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 uniq -D test.txt
?
?
Obviously, if the locale is en_US.UTF-8, uniq treats the two different lines as duplicates, which shouldn't be the case. I then ran the same commands again under valgrind and inspected both call graphs with kcachegrind.
$ LC_COLLATE=C valgrind --tool=callgrind uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 valgrind --tool=callgrind uniq -D test.txt
$ kcachegrind callgrind.out.5754 &
$ kcachegrind callgrind.out.5763 &
The only difference was that the run with LC_COLLATE=en_US.UTF-8 called strcoll(), whereas the run with LC_COLLATE=C did not. So I came up with the following minimal example for strcoll():
#include <clocale>
#include <cstring>
#include <iostream>

int main()
{
    // Two-byte UTF-8 sequences for U+0262 and U+026C.
    const char* s1 = "\xc9\xa2";
    const char* s2 = "\xc9\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;
    // Compare under the UTF-8 locale, then under the C locale.
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
    std::cout << std::endl;

    // Only the trailing bytes; on their own these are not valid UTF-8.
    s1 = "\xa2";
    s2 = "\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
}
Output:
?
?
0
-1
-10
-1
?
?
0
-1
-10
-1
So, what's wrong here? Why does strcoll() return 0 (equal) for two different characters?