unicode - (grep) Regex to match non-ASCII characters?

Question

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

On Linux, I have a directory with lots of files. Some of them have non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many are affected. I was going to do this with find and then do a grep to print the non-ASCII characters, and then do a wc -l to find the number. It doesn't have to be grep; I can use any standard Unix regular expression, like Perl, sed, AWK, etc.

However, is there a regular expression for 'any character that's not an ASCII character'?

1 Reply

深蓝 · Answer 1 · 2021-10-16T22:20:52+0000

This will match a single non-ASCII character:

[^\x00-\x7F]

This is a valid PCRE (Perl-Compatible Regular Expression); with GNU grep, pass the -P option to enable PCRE syntax.

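You can also use the POSIX-style bracket classes (note that [:ascii:] is a Perl/PCRE extension, not one of the classes defined by POSIX itself):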

[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char

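[^[:print:]] will probably suffice for you, but note that it matches any non-printable character (control characters included), and what counts as printable depends on the locale.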
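
To tie this back to the question, here is a minimal sketch of counting the affected files. It assumes GNU find (for -printf) and a GNU grep built with PCRE support (for -P); the function name count_non_ascii_names is made up for illustration. Running grep under LC_ALL=C makes it work byte by byte, so any byte of a multi-byte UTF-8 sequence matches the class.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print how many regular files under a directory
# have at least one non-ASCII byte in their base name.
# Assumes GNU find (-printf) and GNU grep with PCRE support (-P).
count_non_ascii_names() {
    local dir="${1:-.}"
    # '%f\0' emits only the base name, NUL-terminated, so names that
    # contain newlines are still counted once each.
    # grep -z reads NUL-terminated records; -c prints the number of
    # records that match instead of the records themselves.
    find "$dir" -type f -printf '%f\0' |
        LC_ALL=C grep -z -c -P '[^\x00-\x7F]'
}
```

Calling count_non_ascii_names /path/to/dir prints the count. Note that grep exits with status 1 when the count is 0, so the function's return status is non-zero in that case even though the output is still valid.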
