Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
323 views
in Technique[技术] by (71.8m points)

c# - Why does the default string comparer fail to maintain transitive consistency?

I know this issue has been noted before, more or less concisely, but I still create this new thread because I ran into the issue again when writing a unit test.

The default string comparison (that is the culture-dependent case-sensitive comparison that we get with string.CompareTo(string), Comparer<string>.Default, StringComparer.CurrentCulture, string.Compare(string, string) and others) violates transitivity when the strings contain hyphens (or minus signs, I am talking about plain U+002D characters).

Here is a simple repro:

static void Main()
{
  const string a = "fk-";
  const string b = "-fk";
  const string c = "Fk";

  Console.WriteLine(a.CompareTo(b));  // "-1"
  Console.WriteLine(b.CompareTo(c));  // "-1"
  Console.WriteLine(a.CompareTo(c));  // "1"

  var listX = new List<string> { a, b, c, };
  var listY = new List<string> { c, a, b, };
  var listZ = new List<string> { b, c, a, };
  listX.Sort();
  listY.Sort();
  listZ.Sort();
  Console.WriteLine(listX.SequenceEqual(listY));  // "False"
  Console.WriteLine(listY.SequenceEqual(listZ));  // "False"
  Console.WriteLine(listX.SequenceEqual(listZ));  // "False"
}

In the upper part we see how transitivity fails. a is less than b, and b is less than c, yet a fails to be less than c.

This goes against the documented behavior of Unicode collation which states that:

... for any strings A, B, and C, if A < B and B < C, then A < C.

Now sorting a list with a, b and c is exactly like trying to rank the hands of "Rock", "Paper" and "Scissors" in the well-known intransitive game. An impossible task.

The last part of my code sample above shows that the result of sorting depends on the initial order of the elements (and there are no two elements in the list which compare "equal" (0)).

Linq's listX.OrderBy(x => x) is also affected, of course. This should be a stable sort, but you get strange results when ordering a collection containing a, b and c together with other strings.

I tried this with all the CultureInfos on my machine (since this is a culture-dependent sort), including the "invariant culture", and each and every one has the same problem. I tried this with the .NET 4.5.1 runtime, but I believe older versions have the same bug.

Conclusion: When sorting strings in .NET with the default comparer, results are unpredictable if some strings contain hyphens.

What changes were introduced in .NET 4.0 that caused this behavior?

It has already been observed that this behavior is inconsistent across different versions of the platform: in .NET 3.5, strings with hyphens can be reliably sorted. In all versions of the framework, calling System.Globalization.CultureInfo.CurrentCulture.CompareInfo.GetSortKey provides unique DeyData for these strings, so why aren't they sorted correctly?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Microsoft Connect Discussion Here is some code to workaround:

static int CompareStringUsingSortKey(string s1, string s2)
{
    SortKey sk1 = CultureInfo.InvariantCulture.CompareInfo.GetSortKey(s1);
    SortKey sk2 = CultureInfo.InvariantCulture.CompareInfo.GetSortKey(s2);
    return SortKey.Compare(sk1, sk2);
}

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...