c# - Parsing through Arabic / RTL text from left to right

Question


posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)


Let's say I have a string in an RTL language such as Arabic with some English chucked in:

string s = "Test:?????;?????;a;b";

Notice there are semicolons in the string. When I call the Split method, as in string[] spl = s.Split(';');, some of the resulting strings come out in reverse order. This is what happens:

spl[0] = "Test:?????"
spl[1] = "?????"
spl[2] = "a"
spl[3] = "b"

The above is out of order compared to the original. Instead, I expect to get this:

spl[0] = "Test:?????"
spl[1] = "??????"
spl[2] = "a"
spl[3] = "b"

I'm prepared to write my own split function. However, the chars in the string also parse in reverse order, so I'm back to square one. I just want to go through each character as it's shown on the screen.


1 Reply

深蓝 · Answer 1 · 2021-10-17T00:04:20+0000

As your string currently stands, the word ????? is stored prior to the word ?????; the fact that ????? is displayed "first" (that is, further to the left), is just a (correct) result of the Unicode Bidirectional Algorithm in displaying the text.

That is: the string you start with ("Test:?????;?????;a;b") is the result of the user entering "Test:", then ?????, then ";", then ?????, and then ";a;b". Thus, the way C# is splitting it does in fact mirror the way that the string is created. It's just that the way it is created is not reflected in the display of the string, because the two consecutive Arabic words are treated as a single unit when they are displayed.
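
To see the stored (logical) order directly, one option is to dump each char with its index and code point, which sidesteps whatever reordering a bidi-aware display applies. The following is a minimal sketch under assumptions: the two Arabic words are placeholders written as \u escapes (they are not the words from the question), and LogicalOrder.Describe is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;

static class LogicalOrder
{
    // Returns one entry per UTF-16 code unit, in the order the string stores them.
    public static List<string> Describe(string s)
    {
        var lines = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            lines.Add($"{i}: U+{(int)s[i]:X4}");
        }
        return lines;
    }
}

static class Program
{
    static void Main()
    {
        // Placeholder Arabic words, assumed purely for illustration.
        string first = "\u0633\u0644\u0627\u0645";
        string second = "\u0639\u0631\u0628\u064A";
        string s = "Test:" + first + ";" + second + ";a;b";

        // The first word occupies indices 5..8 and the second 10..13,
        // regardless of which one a display renders further to the left.
        foreach (string line in LogicalOrder.Describe(s))
        {
            Console.WriteLine(line);
        }
    }
}
```

Because the indexer walks the string in storage order, the word that was typed first always appears at the lower indices, which is the same order Split sees.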

If you'd like a string to display Arabic words in left-to-right order with semicolons in between, while also storing the words in that same order, then you should put a Left-to-Right mark (U+200E) after the semicolon. This will effectively section off each Arabic word as its own unit, and the Bidirectional Algorithm will then treat each word separately.

For instance, the following code begins with a string identical to the one you use (with the addition of a single Left-to-Right mark), yet it will split it up according to the way that you are expecting it to (that is, spl[0] = "Test:?????", and spl[1] = "??????"):

static void Main(string[] args) {
    // \u200E is the Left-to-Right mark; after the split it remains as the first char of spl[1].
    string s = "Test:?????;\u200E?????;a;b";
    string[] spl = s.Split(';');
}
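
Note that the mark is an ordinary character in the string, so it survives Split and ends up at the start of the piece that follows it. A minimal sketch of one way to handle that, assuming hypothetical helper names (WithLrmAfterSemicolons, SplitWithoutMarks) and placeholder Arabic words written as \u escapes: add the mark only when building the display string, and strip it from each piece after splitting.

```csharp
using System;
using System.Linq;

static class BidiSplit
{
    private const char Lrm = '\u200E'; // Left-to-Right mark

    // Builds a display string with a Left-to-Right mark after every semicolon.
    public static string WithLrmAfterSemicolons(string s)
    {
        return s.Replace(";", ";" + Lrm);
    }

    // Splits on ';' and trims Left-to-Right marks from the edges of each piece.
    public static string[] SplitWithoutMarks(string s)
    {
        return s.Split(';').Select(part => part.Trim(Lrm)).ToArray();
    }
}

static class Program
{
    static void Main()
    {
        // Placeholder Arabic words, assumed purely for illustration.
        string first = "\u0633\u0644\u0627\u0645";
        string second = "\u0639\u0631\u0628\u064A";
        string s = "Test:" + first + ";" + second + ";a;b";

        string display = BidiSplit.WithLrmAfterSemicolons(s);
        string[] spl = BidiSplit.SplitWithoutMarks(display);

        Console.WriteLine(spl.Length);        // prints 4
        Console.WriteLine(spl[1] == second);  // prints True
    }
}
```

Keeping the mark out of the stored data and adding it only for display means later comparisons and lookups on the pieces are not thrown off by an invisible character.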
