Check whether this works for you. I found this website, which seems to list all the Unicode characters that might be used in Japanese text.
The corresponding regex (for a single character) would be:
/[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]/
  -------------_____________-------------_____________-------------_____________
   Punctuation   Hiragana     Katakana    Full-width       CJK      CJK Ext. A
                                            Roman/      (Common &     (Rare)
                                          Half-width    Uncommon)
                                           Katakana
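A minimal sketch of how the character class above might be used in JavaScript. The helper names `containsJapanese` and `isAllJapanese` are hypothetical, chosen only for illustration, and the anchored `+` variant is an assumption about a common use case rather than part of the regex itself.

```javascript
// Single-character class, using the ranges listed in this answer.
const japaneseChar = /[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]/;

// Same class, anchored and repeated, to match a whole non-empty string.
const japaneseOnly = /^[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+$/;

// Hypothetical helper: true if at least one character is in the ranges.
function containsJapanese(text) {
  return japaneseChar.test(text);
}

// Hypothetical helper: true if every character is in the ranges.
function isAllJapanese(text) {
  return japaneseOnly.test(text);
}
```

Note that the CJK ideograph ranges are shared with Chinese, so Chinese text written in those characters will match as well.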
The ranges are (as quoted from the site):
3000 - 303f: Japanese-style punctuation
3040 - 309f: Hiragana
30a0 - 30ff: Katakana
ff00 - ff9f: Full-width Roman characters and half-width Katakana
4e00 - 9faf: CJK unified ideographs - Common and uncommon Kanji
3400 - 4dbf: CJK unified ideographs Extension A - Rare Kanji
I have changed the ranges a bit:
- I have changed ff00 - ffef to ff00 - ff9f for full-width Roman characters and half-width Katakana. The code points from ffa0 - ffdc contain half-width Hangul characters, which is not what you want. You may want to re-add the code points from ffe0 - ffef, but they are mostly half-width punctuation or full-width currency symbols.
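To illustrate the effect of stopping at ff9f, here is a small sketch that checks three sample code points against the narrowed range. The `withSymbols` variant is an assumption showing one way to re-add ffe0 - ffef; it is not part of the regex given above.

```javascript
// Range as used in this answer: stops at ff9f, so half-width Hangul is excluded.
const fullHalfWidth = /[\uff00-\uff9f]/;

// Assumed variant that re-adds ffe0 - ffef while still skipping ffa0 - ffdc.
const withSymbols = /[\uff00-\uff9f\uffe0-\uffef]/;

const halfWidthKatakanaA = "\uff71";    // HALFWIDTH KATAKANA LETTER A
const halfWidthHangulKiyeok = "\uffa1"; // HALFWIDTH HANGUL LETTER KIYEOK
const fullWidthYen = "\uffe5";          // FULLWIDTH YEN SIGN

console.log(fullHalfWidth.test(halfWidthKatakanaA));    // true
console.log(fullHalfWidth.test(halfWidthHangulKiyeok)); // false
console.log(fullHalfWidth.test(fullWidthYen));          // false
console.log(withSymbols.test(fullWidthYen));            // true
```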
You can check the site and remove any range you don't want, or any range you are sure will not appear in your input.
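One way to make ranges easy to drop is to build the character class from a table. The following is only a sketch: the `ranges` object and the `buildCharClass` helper are hypothetical names, and the keys are shorthand labels for the ranges listed above.

```javascript
// Code point ranges from this answer, keyed by illustrative (made-up) labels.
const ranges = {
  punctuation: ["3000", "303f"],
  hiragana: ["3040", "309f"],
  katakana: ["30a0", "30ff"],
  fullHalfWidth: ["ff00", "ff9f"],
  cjk: ["4e00", "9faf"],
  cjkExtA: ["3400", "4dbf"],
};

// Hypothetical helper: builds a single-character class from the chosen labels.
function buildCharClass(names) {
  const body = names
    .map((name) => {
      const [from, to] = ranges[name];
      // Emits a literal \uXXXX-\uXXXX pair into the pattern source.
      return `\\u${from}-\\u${to}`;
    })
    .join("");
  return new RegExp(`[${body}]`);
}

// Example: keep only the two kana ranges.
const kanaOnly = buildCharClass(["hiragana", "katakana"]);
```

With this layout, removing a range is a matter of leaving its label out of the list passed to `buildCharClass`.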