diacritics - How to remove accents and keep Chinese characters using a command?


1 Reply

深蓝 · Answer 1 · 2021-02-16T21:01:16+0000

There is no way to keep Chinese characters in a file whose encoding is ASCII. ASCII only encodes the code points from 0x00 (NUL) to 0x7F (DEL), which covers the basic control characters plus the unaccented English letters, digits, and punctuation. (See an ASCII chart for the full list.)
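To make the ASCII limit concrete, here is a small Python sketch. The helper name `is_ascii_char` is made up for illustration; it only checks whether a character's code point falls in the 0x00 to 0x7F range described above.

```python
def is_ascii_char(ch: str) -> bool:
    """Return True if the single character ch has a code point in 0x00..0x7F."""
    return ord(ch) <= 0x7F


# Same sample text as in the demo further down: U+4EFF, "Caf", U+00E9, U+9F00.
sample = "\u4effCaf\u00e9\u9f00"

for ch in sample:
    label = "ASCII" if is_ascii_char(ch) else "not ASCII"
    print(f"U+{ord(ch):04X} {label}")
```

Only `C`, `a`, and `f` are reported as ASCII; the accented letter and both Chinese characters are outside the range, so none of them can be stored in an ASCII file.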

What you appear to be asking is how to remove accents from European letters while keeping any Chinese characters intact in a file whose encoding is UTF-8. I believe there is no straightforward way to do this with iconv, but it should be easy to write a one-liner in a language with decent Unicode support, such as Perl.

bash$ python3 -c 'print("\u4effCaf\u00e9\u9f00")' >unizh.txt
bash$ cat unizh.txt
仿Café鼀
bash$ perl -CSD -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt
仿Cafe鼀

You can add the -i option to modify the file in place; this simple demo just writes the result to standard output.

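This has the potentially undesired side effect of normalizing each character to its NFKD form, so compatibility characters such as ligatures or full-width forms are replaced by their plain equivalents.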
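If the NFKD side effect is a concern, one possible variation is to use canonical decomposition (NFD) instead and recompose with NFC afterwards. The following is a minimal Python sketch of that idea, not the approach used in the Perl one-liner; the function name `strip_accents` is an assumption made for this example, and it relies only on the standard `unicodedata` module.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks while leaving other characters untouched.

    Uses NFD (canonical decomposition) rather than NFKD, so compatibility
    characters such as ligatures are not rewritten.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character in a Mark category (Mn, Mc, Me), like \p{M} in Perl.
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    # Recompose whatever is left so the output is in NFC.
    return unicodedata.normalize("NFC", kept)


print(strip_accents("\u4effCaf\u00e9\u9f00"))  # prints 仿Cafe鼀
```

The Chinese characters have no decomposition, so they pass through unchanged, while the accented letter is split into a base letter plus a combining mark and the mark is dropped.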

Code inspired by "Remove accents from accented characters"; the Chinese characters to test with were gleaned from "What's the complete range for Chinese characters in Unicode?" (the ones on the boundary of the range are not particularly good test cases, so I just guessed a bit).
