I need to split a string and extract words separated by whitespace characters.The source may be in English or Japanese. English whitespace characters include tab and space, and Japanese text uses these too. (IIRC, all widely-used Japanese character sets are supersets of US-ASCII.)
So the set of characters I need to use to split my string includes normal ASCII space and tab.
But, in Japanese, there is another space character, commonly called a 'full-width space'. According to my Mac's Character Viewer utility, this is U+3000 "IDEOGRAPHIC SPACE". This is (usually) what results when a user presses the space bar while typing in Japanese input mode.
Are there any other characters that I need to consider?
I am processing textual data submitted by users who have been told to "separate entries with spaces". However, the users are using a wide variety of computer and mobile phone operating systems to submit these texts. We've already seen that users may not be aware of whether they are in Japanese or English input mode when entering this data.
Furthermore, the behavior of the space key differs across platforms and applications even in Japanese mode (e.g., Windows 7 will insert an ideographic space but iOS will insert an ASCII space).
So what I want is basically "the set of all characters that visually look like a space and might be generated when the user presses the space key, or the tab key since many users do not know the difference between a space and a tab, in Japanese and/or English".
Is there any authoritative answer to such a question?
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…