Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
216 views
in Technique[技术] by (71.8m points)

c# - strange string.IndexOf behavour

I wrote the following snippet to get rid of excessive spaces in slabs of text

int index = text.IndexOf("  ");
while (index > 0)
{
    text = text.Replace("  ", " ");
    index = text.IndexOf("  ");
}

Generally this works fine, albeit rather primative and possibly inefficient.

Problem

When the text contains " - " for some bizzare reason the indexOf returns a match! The Replace function doesn't remove anything and then it is stuck in a endless loop.

Any ideas what is going on with the string.IndexOf?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Ah, the joys of text.

What you most likely have there, but got lost when posting on SO, is a "soft hyphen".

To reproduce the problem, I tried this code in LINQPad:

void Main()
{
    var text = "Test1 u00ad Test2";
    int index = text.IndexOf("  ");
    while (index > 0)
    {
        text = text.Replace("  ", " ");
        index = text.IndexOf("  ");
    }
}

And sure enough, the above code just gets stuck in a loop.

Note that u00ad is the Unicode symbol for Soft Hyphen, according to CharMap. You can always copy and paste the character from CharMap as well, but posting it here on SO will replace it with its much more common cousin, the Hyphen-Minus, Unicode symbol u002d (the one on your keyboard.)

You can read a small section in the documentation for the String Class which has this to say on the subject:

String search methods, such as String.StartsWith and String.IndexOf, also can perform culture-sensitive or ordinal string comparisons. The following example illustrates the differences between ordinal and culture-sensitive comparisons using the IndexOf method. A culture-sensitive search in which the current culture is English (United States) considers the substring "oe" to match the ligature "?". Because a soft hyphen (U+00AD) is a zero-width character, the search treats the soft hyphen as equivalent to Empty and finds a match at the beginning of the string. An ordinal search, on the other hand, does not find a match in either case.

I've highlighted the relevant part, but I also remember a blog post about this exact problem a while back but my Google-Fu is failing me tonight.

The problem here is that IndexOf and Replace use different methods for locating the text.

Whereas IndexOf will consider the soft hyphen as "not really there", and thus discover the two spaces on each side of it as "two joined spaces", the Replace method won't, and thus won't remove either of them. Therefore the criteria is present for the loop to continue iterating, but since Replace doesn't remove the spaces that fit the criteria, it will never end. Undoubtedly there are other such characters in the Unicode symbol space that exhibit similar problems, but this is the most typical case I've seen.

There's at least two ways of handling this:

  1. You can use Regex.Replace, which seems to not have this problem:

    text = Regex.Replace(text, "  +", " ");
    

    Personally I would probably use the whitespace special character in the Regular Expression, which is s, but if you only want spaces, the above should do the trick.

  2. You can explicitly ask IndexOf to use an ordinal comparison, which won't get tripped up by text behaving like ... well ... text:

    index = text.IndexOf("  ", StringComparison.Ordinal);
    

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...