regex - Stripping hex bytes with sed - no match

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a text file with two non-ASCII bytes (0xFF and 0xFE):

??58832520.3,ABC
348384,DEF

The hex for this file is:

FF FE 35 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 33 34 38 33 38 34 2C 44 45 46
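
For anyone who wants to reproduce the file, here is a minimal sketch, assuming a POSIX shell with printf, mktemp, and od. The scratch file stands in for test.csv, and the exact spacing and letter case of od's output may differ between systems.

```shell
# Scratch file standing in for test.csv (the name is arbitrary).
sample=$(mktemp)

# \377 and \376 are the octal spellings of 0xFF and 0xFE.
# No trailing newline, to match the hex dump above.
printf '\377\37658832520.3,ABC\n348384,DEF' > "$sample"

# Print one hex value per byte, without the offset column.
hexdump_out=$(od -An -tx1 "$sample")
printf '%s\n' "$hexdump_out"

rm -f "$sample"
```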

It's coincidental that FF and FE happen to be the leading bytes (they exist throughout my file, although seemingly always at the beginning of a line).

I am trying to strip these bytes out with sed, but nothing I do seems to match them.

$ sed 's/[^a-zA-Z0-9,]//g' test.csv 
??588325203,ABC
348384,DEF

$ sed 's/[a-zA-Z0-9,]//g' test.csv 
??.

Main question: How do I strip these bytes?
Bonus question: The two regexes above are direct negations of each other, so one of them logically has to filter out these bytes, right? Why does neither regex match the 0xFF and 0xFE bytes?

Update: the direct approach of stripping out a range of hex bytes (suggested by two answers below) seems to strip out the first "legit" byte from each line and leave the bytes I'm trying to get rid of:

$ sed 's/[\x80-\xff]//' test.csv
??8832520.3,ABC
48384,DEF

FF FE 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 34 38 33 38 34 2C 44 45 46 0A

Notice the missing "5" and "3" from the beginning of each line, and the new 0A added to the end of the file.

Bigger Update: This problem seems to be system-specific. The problem was observed on OSX, but the suggestions (including my original sed statement above) work as I expect them to on NetBSD.

A solution: This same task seems easy enough via Perl:

$ perl -pe 's/^\xFF\xFE//' test.csv
58832520.3,ABC
348384,DEF
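
Another byte-level workaround can be sketched with tr. This assumes the goal is to delete every 0xFF and 0xFE byte wherever it occurs, which is broader than the anchored Perl substitution above. LC_ALL=C is an assumption to keep tr working on raw bytes, and the scratch file stands in for test.csv.

```shell
# Scratch file standing in for test.csv (the name is arbitrary).
sample=$(mktemp)
printf '\377\37658832520.3,ABC\n348384,DEF\n' > "$sample"

# Delete every 0xFF (\377) and 0xFE (\376) byte, byte by byte.
tr_cleaned=$(LC_ALL=C tr -d '\377\376' < "$sample")
printf '%s\n' "$tr_cleaned"

rm -f "$sample"
```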

However, I'll leave this question open since this is only a workaround, and doesn't explain what the problem was with sed.

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:52:31+0000

sed 's/[^ -~]//g'

or, as the other answer implies:

sed 's/[\x80-\xff]//g'

See section 3.9 of the sed info pages, the chapter entitled "Escapes".

Edit for OSX: the native LANG setting there is en_US.UTF-8, so try:

LANG='' sed 's/[^ -~]//g' myfile

This works on an OSX machine here. I'm not entirely sure why it does not work when the locale is UTF-8.
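
A likely explanation: 0xFF and 0xFE can never occur in valid UTF-8, so in a UTF-8 locale sed cannot decode them into characters, and a bracket expression (negated or not) only matches characters. In the C locale every byte counts as a character. A minimal sketch of that idea, assuming LC_ALL=C in place of LANG='' (LC_ALL overrides LANG and the other LC_* variables) and a scratch file standing in for myfile:

```shell
# Scratch file standing in for myfile (the name is arbitrary).
sample=$(mktemp)
printf '\377\37658832520.3,ABC\n348384,DEF\n' > "$sample"

# In the C locale each byte is its own character, so the negated
# range can match and delete anything outside printable ASCII
# (space, 0x20, through tilde, 0x7E).
sed_cleaned=$(LC_ALL=C sed 's/[^ -~]//g' "$sample")
printf '%s\n' "$sed_cleaned"

rm -f "$sample"
```

Setting the variable on the command line limits the locale change to that single sed invocation, so the rest of the shell session keeps its UTF-8 settings.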
