Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - Stripping hex bytes with sed - no match

I have a text file containing two non-ASCII bytes (0xFF and 0xFE):

??58832520.3,ABC
348384,DEF

The hex for this file is:

FF FE 35 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 33 34 38 33 38 34 2C 44 45 46
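
For anyone who wants to reproduce this, a file with the same bytes can be built roughly like this. It is a minimal sketch: octal escapes are used because `\x` escapes are not portable in `printf`, and `test.csv` is simply the file name used in the commands further down.

```shell
# Write 0xFF 0xFE followed by the two CSV lines
# (no trailing newline, matching the hex dump above).
printf '\377\37658832520.3,ABC\n348384,DEF' > test.csv

# Dump the file one byte at a time in hex to confirm its contents.
od -An -tx1 test.csv
```

The `od` output should list the same byte values as the dump above, in lowercase.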

It's coincidental that FF and FE happen to be the leading bytes (they exist throughout my file, although seemingly always at the beginning of a line).

I am trying to strip these bytes out with sed, but nothing I do seems to match them.

$ sed 's/[^a-zA-Z0-9,]//g' test.csv 
??588325203,ABC
348384,DEF

$ sed 's/[a-zA-Z0-9,]//g' test.csv 
??.

Main question: How do I strip these bytes?
Bonus question: The two regexes above are direct negations of each other, so one of them logically has to filter out these bytes, right? Why do both of them fail to match the 0xFF and 0xFE bytes?

Update: the direct approach of stripping out a range of hex bytes (suggested by two answers below) seems to strip out the first "legit" byte from each line and leave the bytes I'm trying to get rid of:

$ sed 's/[\x80-\xff]//' test.csv
??8832520.3,ABC
48384,DEF

FF FE 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 34 38 33 38 34 2C 44 45 46 0A

Notice the missing "5" and "3" from the beginning of each line, and the new 0A added to the end of the file.

Bigger Update: This problem seems to be system-specific. The problem was observed on OSX, but the suggestions (including my original sed statement above) work as I expect them to on NetBSD.

A solution: This same task seems easy enough via Perl:

$ perl -pe 's/^\xFF\xFE//' test.csv
58832520.3,ABC
348384,DEF

However, I'll leave this question open since this is only a workaround, and doesn't explain what the problem was with sed.



1 Reply

sed 's/[^ -~]//g'

or, as the other answer implies:

sed 's/[\x80-\xff]//g'

See section 3.9 of the sed info pages, the chapter entitled "Escapes".

Edit for OSX: the native LANG setting there is en_US.UTF-8, so try clearing it for the sed call:

LANG='' sed 's/[^ -~]//g' myfile

This works on an OSX machine here. I'm not entirely sure why it does not work in a UTF-8 locale.
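
A likely explanation, offered as an assumption rather than something confirmed in this thread: in a UTF-8 locale the bytes 0xFF and 0xFE are not valid characters, so sed cannot match them against any bracket expression, negated or not. Forcing the C locale makes sed treat the input as single bytes. The sketch below uses `LC_ALL=C` (which overrides `LANG` and the other `LC_*` variables) and also shows a `tr` alternative that deletes the two bytes by octal value. The output file names `clean_sed.csv` and `clean_tr.csv` are made up for illustration.

```shell
# Recreate the sample file: 0xFF 0xFE, then the two CSV lines.
printf '\377\37658832520.3,ABC\n348384,DEF\n' > test.csv

# In the C locale every byte is one character, so the negated
# printable-ASCII class matches 0xFF and 0xFE and removes them.
LC_ALL=C sed 's/[^ -~]//g' test.csv > clean_sed.csv

# Alternative: delete exactly those two bytes, given as octal escapes.
LC_ALL=C tr -d '\377\376' < test.csv > clean_tr.csv

cat clean_sed.csv
```

Note that the sed version removes every byte outside the printable ASCII range, while the `tr` version removes only 0xFF and 0xFE, so `tr` is the safer choice if the file may contain tabs or other bytes worth keeping.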

