Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

c# - How would you get an array of Unicode code points from a .NET String?

I have a list of character range restrictions that I need to check a string against, but the char type in .NET is a UTF-16 code unit, so characters outside the Basic Multilingual Plane become wacky (surrogate) pairs instead. Thus, when enumerating all the chars in a string, I don't get the 32-bit Unicode code points, and comparisons against high values fail.

I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So ...

How would you convert a string to an array (int[]) of 32-bit Unicode code points?

1 Reply

You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:

  1. The character is from the Basic Multilingual Plane, and is encoded by a single code unit.
  2. The character is outside the BMP, and is encoded by a surrogate pair: a high surrogate code unit followed by a low surrogate code unit.
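
To make the second case concrete, here is a minimal sketch of how a surrogate pair maps back to a single code point. The class name `SurrogateDemo` and the helper `Combine` are made up for illustration; `char.ConvertToUtf32(char, char)` is the BCL call that does the same arithmetic.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: combines a high/low surrogate pair by hand.
    // Assumes 'high' is in D800..DBFF and 'low' is in DC00..DFFF.
    public static int Combine(char high, char low)
    {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void Main()
    {
        string s = "\U0001F300"; // one code point, two UTF-16 code units

        Console.WriteLine(s.Length);                                  // 2
        Console.WriteLine("{0:X4} {1:X4}", (int)s[0], (int)s[1]);     // D83C DF00
        Console.WriteLine("{0:X}", Combine(s[0], s[1]));              // 1F300
        Console.WriteLine("{0:X}", char.ConvertToUtf32(s[0], s[1]));  // 1F300
    }
}
```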

Therefore, assuming the string is valid, this returns an array of code points for a given string:

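// Requires: using System; using System.Collections.Generic;
public static int[] ToCodePoints(string str)
{
    if (str == null)
        throw new ArgumentNullException("str");

    // The string length is an upper bound on the number of code points.
    var codePoints = new List<int>(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        // Throws ArgumentException if str[i] is an unpaired surrogate.
        codePoints.Add(Char.ConvertToUtf32(str, i));
        // A surrogate pair spans two code units, so skip the low surrogate.
        if (Char.IsHighSurrogate(str[i]))
            i += 1;
    }

    return codePoints.ToArray();
}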
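
If the input cannot be assumed valid, one possible variant is sketched below. The name `ToCodePointsLenient` and the policy of passing an unpaired surrogate through as its raw code unit value are assumptions made for illustration, not part of the answer above; replacing lone surrogates with U+FFFD would be an equally reasonable choice.

```csharp
using System;
using System.Collections.Generic;

public static class LenientCodePoints
{
    // Hypothetical helper (not in the BCL): like ToCodePoints, but an
    // unpaired surrogate is emitted as-is instead of causing an exception.
    public static int[] ToCodePointsLenient(string str)
    {
        if (str == null)
            throw new ArgumentNullException("str");

        var codePoints = new List<int>(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            if (char.IsSurrogatePair(str, i))
            {
                codePoints.Add(char.ConvertToUtf32(str, i));
                i++; // skip the low surrogate of the pair
            }
            else
            {
                codePoints.Add(str[i]); // BMP character or unpaired surrogate
            }
        }

        return codePoints.ToArray();
    }
}
```

For example, `LenientCodePoints.ToCodePointsLenient("a\uD83Cb")` yields three values (0x61, 0xD83C, 0x62), whereas `Char.ConvertToUtf32(str, i)` would throw on the lone high surrogate.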

An example with a surrogate pair (🌀, U+1F300) and a composed character (ñ, written as n followed by U+0303):

ToCodePoints("\U0001F300 El Ni\u006E\u0303o");                        // 🌀 El Niño
// { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f } // 🌀 SP E l SP N i n U+0303 o

Here's another example. These two code points represent a thirty-second note with a combining accent-staccato; both are encoded as surrogate pairs:

ToCodePoints("\U0001D162\U0001D181");              // thirty-second note + accent-staccato
// { 0x1d162, 0x1d181 }                            // U+1D162, U+1D181

When normalized to Form C (the default for Normalize()), they are decomposed into a notehead, combining stem, combining flag and combining accent-staccato, all of them surrogate pairs:

ToCodePoints("\U0001D162\U0001D181".Normalize());  // notehead + stem + flag + accent-staccato
// { 0x1d158, 0x1d165, 0x1d170, 0x1d181 }          // U+1D158, U+1D165, U+1D170, U+1D181

Note that leppie's solution is not correct. The question is about code points, not text elements. A text element is a combination of code points that together form a single grapheme. For example, in the example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde (U+0303). Leppie's solution discards any combining characters that cannot be normalized into a single code point.
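
The difference between the two notions can be seen with a small sketch using `System.Globalization.StringInfo`, which counts text elements rather than code points. The class name `TextElementDemo` is made up for illustration.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    public static void Main()
    {
        string s = "Nin\u0303o"; // 'n' followed by U+0303 COMBINING TILDE

        // Text elements: the base letter and its combining mark count as one.
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 4

        // UTF-16 code units; every character here is in the BMP,
        // so this is also the number of code points.
        Console.WriteLine(s.Length); // 5
    }
}
```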

