Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


linux - How to detect invalid UTF-8/Unicode or binary data in a text file

I need to detect corrupted text files that contain invalid (non-ASCII) UTF-8, Unicode, or binary characters. For example:

???>t??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????w?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????o?????????????????????????????_????????????????????????????????????????????????????????????o???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????~??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????}???????????????????????????}w??????×???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????~????????????????????????????????????????????????????????????????????????????????????????????????????????????_??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????^?????????????????s???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????w??????????????????????????????????????????????????????????????????}????????????????????????????????????????????y??????????????????????????????????????????????????????????????????????????????????????????????
??????o???????????????????????????????????????????????????????????????????????????}??????

What I have tried:

iconv -f utf-8 -t utf-8 -c file.csv 

This converts the file from UTF-8 to UTF-8, and -c tells iconv to skip invalid UTF-8 characters. However, the illegal characters were still printed in the output. Are there any other solutions, in bash on Linux or in other languages?
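Since the question is also open to other languages, here is a minimal Python sketch of a whole-file check. It relies on the fact that -c discards what cannot be converted instead of reporting it, so a detector needs strict decoding instead. The function name first_invalid_utf8_offset and the sample bytes are made up for illustration.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return exc.start  # index of the first byte that could not be decoded
    return None


# Hypothetical sample: the third line contains a stray 0xFF byte.
sample = b"id,name\n1,caf\xc3\xa9\n2,bad\xffbyte\n"

print(first_invalid_utf8_offset(sample))

# errors="ignore" behaves like the skipping described for -c: the bad byte
# is dropped silently, so the result looks clean and nothing is reported.
print(sample.decode("utf-8", errors="ignore"))
```

For a real file, the bytes can be read with open(path, "rb").read(); for very large files a line-by-line check is likely the better fit.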


1 Reply


Assuming your locale is set to UTF-8 (check the output of the locale command), this works well for finding invalid UTF-8 sequences:

grep -axv '.*' file.txt

Explanation (from the grep man page):

  • -a, --text: treats the file as text. This is essential: it prevents grep from treating the file as binary and suppressing the matching lines once it finds an invalid (non-UTF-8) byte sequence.
  • -v, --invert-match: inverts the match, so that the lines that do not match are shown.
  • -x '.*' (--line-regexp): matches only complete lines that consist entirely of valid UTF-8 characters.

Hence, the output consists of exactly the lines that contain an invalid (non-UTF-8) byte sequence, because -v inverts the match.
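The same line-by-line logic can be sketched in Python, which does not depend on the locale setting. This is only a rough counterpart of the grep command, not a drop-in replacement: the helper name invalid_utf8_lines and the sample data are hypothetical, and it adds line numbers that the grep command above does not print.

```python
import io
from typing import Iterable, Iterator, Tuple


def invalid_utf8_lines(lines: Iterable[bytes]) -> Iterator[Tuple[int, bytes]]:
    """Yield (line number, raw line) for each line that is not valid UTF-8."""
    for lineno, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")  # strict decoding: raises on any invalid sequence
        except UnicodeDecodeError:
            # Report the raw bytes without the trailing line terminator.
            yield lineno, raw.rstrip(b"\r\n")


# Hypothetical in-memory sample; only the third line is malformed.
sample = io.BytesIO(b"ok line\ncaf\xc3\xa9\nbroken \xe2\x28\xa1 here\nlast\n")

for lineno, raw in invalid_utf8_lines(sample):
    print(lineno, raw)
```

To scan a file on disk, pass a handle opened in binary mode, for example open("file.txt", "rb"), since iterating over it yields one bytes object per line.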

