Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

utf 8 - How do locales work in Linux / POSIX and what transformations are applied?

I'm working with huge files of (I hope) UTF-8 text. I can reproduce the problem described below on Ubuntu 13.10 (3.11.0-14-generic) and 12.04.

While investigating a bug I encountered strange behavior:

$ export LC_ALL=en_US.UTF-8   
$ sort part-r-00000 | uniq -d 
? ? ? ? 251
? ɡ ? ? ?       291
? ? ? ? 301
? ?     475
? ?     565

$ export LC_ALL=C
$ sort part-r-00000 | uniq -d 
$ # no duplicates found

The duplicates also appear when running a custom C++ program that reads the file using std::stringstream - it fails due to duplicates when using en_US.UTF-8 locale. C++ seems to be unaffected at least for std::string and input/output.

Why are duplicates found when using a UTF-8 locale and no duplicates are found with the C locale?

What transformations does the locale apply to the text to cause this behavior?

Edit: Here is a small example

$ uniq -D duplicates.small.nfc 
? ? ? ? ?       224
? ? ? ? ?       224
? ? ? ? 251
? ? ? ? 251
? ɡ ? ? ?       291
? ? ? ? ?       291
? ? ? ? 301
? ? ? ? 301
? ? ? ? 301
? ?     475
? ?     475
? ?     565
? ?     565

Output of locale when the problem appears:

$ locale 
LANG=en_US.UTF-8                                                                                                                                                                                               
LC_CTYPE="en_US.UTF-8"                                                                                                                                                                                         
LC_NUMERIC=de_DE.UTF-8                                                                                                                                                                                         
LC_TIME=de_DE.UTF-8                                                                                                                                                                                            
LC_COLLATE="en_US.UTF-8"                                                                                                                                                                                       
LC_MONETARY=de_DE.UTF-8                                                                                                                                                                                        
LC_MESSAGES="en_US.UTF-8"                                                                                                                                                                                      
LC_PAPER=de_DE.UTF-8                                                                                                                                                                                           
LC_NAME=de_DE.UTF-8                                                                                                                                                                                            
LC_ADDRESS=de_DE.UTF-8                                                                                                                                                                                         
LC_TELEPHONE=de_DE.UTF-8                                                                                                                                                                                       
LC_MEASUREMENT=de_DE.UTF-8                                                                                                                                                                                     
LC_IDENTIFICATION=de_DE.UTF-8                                                                                                                                                                                  
LC_ALL=                   

Edit: After normalisation using:

cat duplicates | uconv -f utf8 -t utf8 -x nfc > duplicates.nfc

I still get the same results

Edit: The file is valid UTF-8 according to iconv:

$ iconv -f UTF-8 duplicates -o /dev/null
$ echo $?
0

Edit: It looks like something similar to this: http://xahlee.info/comp/unix_uniq_unicode_bug.html and https://lists.gnu.org/archive/html/bug-coreutils/2012-07/msg00072.html

The same commands work as expected on FreeBSD.


1 Reply

I have boiled down the problem to an issue with the strcoll() function, which is not related to Unicode normalization. Recap: My minimal example that demonstrates the different behaviour of uniq depending on the current locale was:

$ echo -e "\xc9\xa2\n\xc9\xac" > test.txt
$ cat test.txt
ɢ
ɬ
$ LC_COLLATE=C uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 uniq -D test.txt
ɢ
ɬ

Obviously, if the locale is en_US.UTF-8, uniq treats ɢ (U+0262) and ɬ (U+026C) as duplicates, which shouldn't be the case. I then ran the same commands again under valgrind and compared both call graphs with kcachegrind.
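
To confirm that the two lines of test.txt really hold different characters, the byte sequences 0xC9 0xA2 and 0xC9 0xAC can be decoded by hand. The following is a minimal sketch, not part of the original investigation; the function name decode_utf8 is made up for illustration, and the decoder deliberately skips checks for overlong encodings and surrogates.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into Unicode code points.
// Sketch only: it rejects bad lead and continuation bytes, but does not
// check for overlong encodings or surrogate code points.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes that must follow
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the low 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Calling decode_utf8("\xc9\xa2") yields the single code point 0x262, and decode_utf8("\xc9\xac") yields 0x26C, so the two lines are distinct, valid two-byte sequences.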

$ LC_COLLATE=C valgrind --tool=callgrind uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 valgrind --tool=callgrind uniq -D test.txt
$ kcachegrind callgrind.out.5754 &
$ kcachegrind callgrind.out.5763 &

The only difference was that the run with LC_COLLATE=en_US.UTF-8 called strcoll(), whereas the run with LC_COLLATE=C did not. So I came up with the following minimal example on strcoll():

#include <iostream>
#include <cstring>
#include <clocale>

int main()
{
    const char* s1 = "\xc9\xa2";
    const char* s2 = "\xc9\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::cout << std::endl;

    s1 = "\xa2";
    s2 = "\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
}

Output:

ɢ
ɬ
0
-1
-10
-1

?
?
0
-1
-10
-1
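
One way to look at what the locale does to each string is to inspect the sort key produced by strxfrm(): the C standard defines strcoll(a, b) to agree with strcmp() applied to the two transformed strings, so a strcoll() result of 0 means both inputs map to the same key. The sketch below is only an illustration of that relationship; the helper name sort_key is hypothetical, and the result depends on whichever LC_COLLATE locale is active when it is called.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: return the collation key that the current
// LC_COLLATE locale assigns to s. Two strings for which strcoll()
// returns 0 produce identical keys.
std::string sort_key(const std::string& s) {
    // With a size of 0 the destination may be null; the return value is
    // the key length, not counting the terminating NUL.
    std::size_t n = std::strxfrm(nullptr, s.c_str(), 0);
    std::vector<char> buf(n + 1);
    std::strxfrm(buf.data(), s.c_str(), buf.size());
    return std::string(buf.data(), n);
}
```

After std::setlocale(LC_COLLATE, "en_US.UTF-8"), comparing sort_key("\xc9\xa2") with sort_key("\xc9\xac") on an affected system should show whether the two characters collapse to the same key.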

So, what's wrong here? Why does strcoll() return 0 (equal) for two different characters?
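
Whatever the cause, code that must never treat distinct byte strings as duplicates can guard against a 0 from strcoll() by falling back to a bytewise comparison. This is a sketch of that workaround under my own naming (collate_then_bytes is not a library function), not a fix for the underlying locale data.

```cpp
#include <cstring>

// Hypothetical helper: compare with the current LC_COLLATE locale first,
// and break ties with strcmp() so that only byte-identical strings
// compare equal.
int collate_then_bytes(const char* a, const char* b) {
    int c = std::strcoll(a, b);
    return c != 0 ? c : std::strcmp(a, b);
}
```

With this comparator, "\xc9\xa2" and "\xc9\xac" compare unequal under any locale, while the locale still decides the order of strings it can tell apart.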

