Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


c# - Why does string.Compare seem to handle accented characters inconsistently?

If I execute the following statement:

string.Compare("mun", "mün", true, CultureInfo.InvariantCulture)

The result is '-1', indicating that 'mun' sorts before 'mün'.

However, if I execute this statement:

string.Compare("Muntelier, Schweiz", "München, Deutschland", true, CultureInfo.InvariantCulture)

I get '1', indicating that 'Muntelier, Schweiz' should go last.

Is this a bug in the comparison? Or, more likely, is there a rule I should be taking into account when sorting strings containing accented characters?


This is an issue because I'm sorting a list and then doing a manual binary filter that's meant to get every string beginning with 'xxx'.

Previously I was using the LINQ 'Where' method, but now I have to use this custom function written by another person, because he says it performs better.

But the custom function doesn't seem to take into account whatever Unicode rules .NET has. So if I tell it to filter by 'mün', it doesn't find any items, even though there are items in the list beginning with 'mun'.

This seems to be because of the inconsistent ordering of accented characters, depending on what characters go after the accented character.
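For checking what the binary filter ought to return, a slow but simple reference can help. The sketch below is not the custom function from the question (which isn't shown); it is a linear LINQ filter built on CompareInfo.IsPrefix. The class and method names are made up for illustration, and whether accents should be ignored is an assumption left to the caller through the options argument.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class ReferencePrefixFilter
{
    // Linear, culture-aware prefix filter (hypothetical helper name).
    // CompareInfo.IsPrefix applies the same collation rules as string.Compare,
    // so the options decide whether case and accents count as differences.
    public static List<string> StartsWithCultureAware(
        IEnumerable<string> items, string prefix, CompareOptions options)
    {
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        return items
            .Where(item => compareInfo.IsPrefix(item, prefix, options))
            .ToList();
    }
}
```

Passing CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace should make 'mün' match items beginning with 'Mun' as well as 'Mün'; passing CompareOptions.IgnoreCase alone keeps the accent significant.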


OK, I think I've fixed the problem.

Before the filter, I do a sort based on the first n letters of each string, where n is the length of the search string.
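A rough sketch of that fix, in case it helps someone else. It assumes the idea is to sort and search with the same comparison, applied only to the first n characters of each string, so the ordering the binary search relies on always agrees with the test it performs. All names here are hypothetical, and the choice of CompareOptions is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class PrefixSortFilter
{
    // Sorts by the first prefix.Length characters, then binary-searches the
    // contiguous block of items whose head compares equal to the prefix.
    public static List<string> FilterByPrefix(
        IEnumerable<string> items, string prefix, CompareOptions options)
    {
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;

        // Truncation counts UTF-16 chars, so it can split a base letter from a
        // combining mark in decomposed text; normalize input first if that matters.
        string Head(string s) =>
            s.Length <= prefix.Length ? s : s.Substring(0, prefix.Length);

        int CompareHeads(string a, string b) =>
            compareInfo.Compare(Head(a), Head(b), options);

        List<string> sorted = items.ToList();
        sorted.Sort(CompareHeads);

        // Lower bound: first index whose head is not less than the prefix.
        int lo = 0, hi = sorted.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (CompareHeads(sorted[mid], prefix) < 0) lo = mid + 1; else hi = mid;
        }
        int start = lo;

        // Upper bound: first index whose head is greater than the prefix.
        hi = sorted.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (CompareHeads(sorted[mid], prefix) <= 0) lo = mid + 1; else hi = mid;
        }

        return sorted.GetRange(start, lo - start);
    }
}
```

Note that re-sorting for every search string costs O(n log n) per query, so this only beats a plain linear filter if the sorted list can be reused for several searches of the same length.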


1 Reply


There is a tie-breaking algorithm at work; see the Unicode Collation Algorithm report at http://unicode.org/reports/tr10/

"To address the complexities of language-sensitive sorting, a multilevel comparison algorithm is employed. In comparing two words, for example, the most important feature is the base character: such as the difference between an A and a B. Accent differences are typically ignored, if there are any differences in the base letters. Case differences (uppercase versus lowercase), are typically ignored, if there are any differences in the base or accents. Punctuation is variable. In some situations a punctuation character is treated like a base character. In other situations, it should be ignored if there are any base, accent, or case differences. There may also be a final, tie-breaking level, whereby if there are no other differences at all in the string, the (normalized) code point order is used."

So, "Munt..." and "Münc..." are alphabetically different and sort based on the "t" and "c".

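Whereas "mun" and "mün" are alphabetically the same ("u" is equivalent to "ü" in most languages), so the comparison falls through to the tie-breaking levels (accent differences and, ultimately, code point order), which put "mun" first.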
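The levels can be seen from C# by comparing once as in the question and once with accents ignored. This is only a sketch: the class and method names are made up, and using CompareOptions.IgnoreNonSpace to approximate a "base letters only" comparison is my assumption about the closest built-in option.

```csharp
using System;
using System.Globalization;

public static class CollationLevelsDemo
{
    // Sign of the comparison used in the question:
    // case-insensitive, invariant culture, accents significant.
    public static int CompareAsInQuestion(string a, string b) =>
        Math.Sign(string.Compare(a, b, true, CultureInfo.InvariantCulture));

    // Sign of a comparison that also ignores accents (non-spacing marks),
    // which leaves only the base letters to decide the order.
    public static int CompareBaseLetters(string a, string b) =>
        Math.Sign(CultureInfo.InvariantCulture.CompareInfo.Compare(
            a, b, CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace));
}
```

With base letters only, "mun" and "mün" should compare as equal, which is why the accent has to break the tie there. "Muntelier, Schweiz" and "München, Deutschland" already differ at the base letters "t" and "c", so the accent never gets a say and both methods should agree on the order.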

