linux - how to detect invalid utf8 unicode/binary in a text file

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I need to detect corrupted text files that contain invalid (non-ASCII) UTF-8, Unicode, or binary characters, such as this one:

???>t??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????w?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????o?????????????????????????????_????????????????????????????????????????????????????????????o???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????~??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????}???????????????????????????}w??????×???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????~????????????????????????????????????????????????????????????????????????????????????????????????????????????_??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????^?????????????????s???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????w??????????????????????????????????????????????????????????????????}????????????????????????????????????????????y??????????????????????????????????????????????????????????????????????????????????????????????
??????o???????????????????????????????????????????????????????????????????????????}??????

What I have tried:

iconv -f utf-8 -t utf-8 -c file.csv

This converts the file from UTF-8 to UTF-8, and -c is supposed to skip invalid UTF-8 characters. However, the illegal characters still end up in the output. Are there any other solutions, in bash on Linux or in another language?

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:28:14+0000

Assuming your locale is set to UTF-8 (check the output of the locale command), this works well for finding invalid UTF-8 sequences:

grep -axv '.*' file.txt

Explanation (from grep man page):

-a, --text: treats the file as text. This is essential, because otherwise grep stops handling the file as text once it finds an invalid (non-UTF-8) byte sequence.
-v, --invert-match: inverts the match, so only the lines that do not match are shown.
-x '.*' (--line-regexp): matches only complete lines that consist entirely of valid UTF-8 characters.

Hence the output consists of exactly the lines that contain an invalid (non-UTF-8) byte sequence, because -v inverts the match.
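If a plain yes/no answer is enough, a minimal sketch along the same lines could look like the following. The function names has_invalid_utf8 and show_invalid_utf8_lines are made up for illustration. The first relies on iconv exiting with a non-zero status when it hits an invalid sequence and -c is not given, so it does not depend on the current locale. The second wraps the grep command above, adds -n for line numbers, and assumes a locale named C.UTF-8 is installed (check with locale -a and substitute another UTF-8 locale if it is not).

```shell
# Hypothetical helper: returns 0 if the file contains invalid UTF-8,
# 1 if it is clean, 2 if it cannot be read.
has_invalid_utf8() {
    [ -r "$1" ] || return 2
    # Without -c, iconv stops at the first invalid sequence and fails.
    if iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1; then
        return 1    # converted cleanly: valid UTF-8
    fi
    return 0        # iconv failed: at least one invalid sequence
}

# Hypothetical helper: prints the offending lines, prefixed with their
# line numbers (-n). Assumes the C.UTF-8 locale exists on this system.
show_invalid_utf8_lines() {
    LC_ALL=C.UTF-8 grep -naxv '.*' "$1"
}
```

For example, has_invalid_utf8 file.csv && show_invalid_utf8_lines file.csv would list the bad lines only when the file actually has some.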
