c# - How would you get an array of Unicode code points from a .NET String?

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I have a list of character range restrictions that I need to check a string against, but the char type in .NET is a UTF-16 code unit, so some characters become wacky (surrogate) pairs instead. Thus, when enumerating all the chars in a string, I don't get the 32-bit Unicode code points, and some comparisons against high values fail.

I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So ...

How would you convert a string to an array (int[]) of 32-bit Unicode code points?

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:48:37+0000

You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:

The character is from the Basic Multilingual Plane (BMP) and is encoded by a single code unit.
The character is outside the BMP and is encoded by a surrogate pair: a high surrogate code unit followed by a low surrogate code unit.
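
To make the second case concrete, here is a minimal sketch of how a surrogate pair maps back to a single code point. The class and method names (SurrogateDemo, Combine) are made up for illustration; in practice the built-in Char.ConvertToUtf32(char, char) does the same job.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: combines a high/low surrogate pair into one code point.
    public static int Combine(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
            throw new ArgumentException("Not a valid surrogate pair.");

        // The high surrogate carries the top 10 bits, the low surrogate the
        // bottom 10 bits, of the value (codePoint - 0x10000).
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

For example, U+1F300 is stored as the two code units 0xD83C and 0xDF00, and Combine('\uD83C', '\uDF00') gives 0x1F300 again.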

Therefore, assuming the string is valid, this returns an array of code points for a given string:

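using System;
using System.Collections.Generic;

public static int[] ToCodePoints(string str)
{
    if (str == null)
        throw new ArgumentNullException(nameof(str));

    // str.Length is an upper bound: each code point takes one or two chars.
    var codePoints = new List<int>(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        // Throws ArgumentException if str[i] is an unpaired surrogate.
        codePoints.Add(Char.ConvertToUtf32(str, i));
        // A high surrogate means two chars were consumed; skip the low surrogate.
        if (Char.IsHighSurrogate(str[i]))
            i += 1;
    }

    return codePoints.ToArray();
}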
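
As an alternative sketch (not the method above, and not necessarily faster), the same array can be obtained by letting the BCL encode the string as UTF-32 and reading each 4-byte group as one code point. The class name CodePointsViaUtf32 is made up for illustration. The encoder is configured as little-endian, without a byte order mark, and to throw on invalid input such as unpaired surrogates.

```csharp
using System;
using System.Text;

public static class CodePointsViaUtf32
{
    // Hypothetical alternative: encode to UTF-32 and read back 32-bit values.
    public static int[] ToCodePoints(string str)
    {
        if (str == null)
            throw new ArgumentNullException(nameof(str));

        // Arguments: bigEndian = false, byteOrderMark = false,
        // throwOnInvalidCharacters = true.
        var utf32 = new UTF32Encoding(false, false, true);
        byte[] bytes = utf32.GetBytes(str);

        var codePoints = new int[bytes.Length / 4];
        for (int i = 0; i < codePoints.Length; i++)
        {
            // Assemble the little-endian bytes by hand so the result does not
            // depend on the machine's own byte order.
            int offset = 4 * i;
            codePoints[i] = bytes[offset]
                          | (bytes[offset + 1] << 8)
                          | (bytes[offset + 2] << 16)
                          | (bytes[offset + 3] << 24);
        }

        return codePoints;
    }
}
```

This trades the explicit surrogate check for a temporary byte array of four bytes per code point, so the loop-based version is likely the better fit for long strings.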

An example with a surrogate pair 🌀 and a composed character ñ:

ToCodePoints("\U0001F300 El Ni\u006E\u0303o");                        // 🌀 El Niño
// { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f } // 🌀 ' ' E l ' ' N i n ◌̃ o

Here's another example. These two code points represent a thirty-second note with an accent-staccato articulation, both encoded as surrogate pairs:

ToCodePoints("\U0001D162\U0001D181");              // thirty-second note + combining accent-staccato
// { 0x1d162, 0x1d181 }                            // note, accent-staccato

When normalized to Form C (the default for Normalize()), the note is decomposed into a notehead, combining stem and combining flag, followed by the combining accent-staccato, all surrogate pairs:

ToCodePoints("\U0001D162\U0001D181".Normalize());  // same text, now four code points
// { 0x1d158, 0x1d165, 0x1d170, 0x1d181 }          // notehead, stem, flag, accent-staccato

Note that leppie's solution is not correct. The question is about code points, not text elements. A text element is a combination of code points that together form a single grapheme. For example, in the example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde ◌̃. Leppie's solution discards any combining characters that cannot be normalized into a single code point.
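
To see the difference between code points and text elements, here is a small sketch that splits a string into text elements with StringInfo.GetTextElementEnumerator. The class and method names (TextElementDemo, ToTextElements) are made up for illustration, and this is not a reconstruction of the other answer's code.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper: returns each text element as its own string.
    public static List<string> ToTextElements(string str)
    {
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(str);
        while (e.MoveNext())
        {
            // A base character and its combining marks come back together.
            elements.Add(e.GetTextElement());
        }
        return elements;
    }
}
```

For "Nin\u0303o" this yields four text elements ("N", "i", "n\u0303", "o"), while the string holds five code points. That is why a text-element approach cannot answer a question about code points.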
