Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
249 views
in Technique[技术] by (71.8m points)

python - Regular expression parsing a binary file?

I have a file which mixes binary data and text data. I want to parse it through a regular expression, but I get this error:

TypeError: can't use a string pattern on a bytes-like object

I'm guessing that message means that Python doesn't want to parse binary files. I'm opening the file with the "rb" flags.

How can I parse binary files with regular expressions in Python?

EDIT: I'm using Python 3.2.0

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I think you use Python 3 .

1.Opening a file in binary mode is simple but subtle. The only difference from opening it in text mode is that the mode parameter contains a 'b' character.

........

4.Here’s one difference, though: a binary stream object has no encoding attribute. That makes sense, right? You’re reading (or writing) bytes, not strings, so there’s no conversion for Python to do.

http://www.diveintopython3.net/files.html#read

Then, in Python 3, since a binary stream from a file is a stream of bytes, a regex to analyse a stream from a file must be defined with a sequence of bytes, not a sequence of characters.

In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths).

http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html

and

In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that.

http://www.diveintopython3.net/strings.html#boring-stuff

and

4.6. Strings vs. Bytes# Bytes are bytes; characters are an abstraction. An immutable sequence of Unicode characters is called a string. An immutable sequence of numbers-between-0-and-255 is called a bytes object.

....

1.To define a bytes object, use the b' ' “byte literal” syntax. Each byte within the byte literal can be an ASCII character or an encoded hexadecimal number from x00 to xff (0–255).

http://www.diveintopython3.net/strings.html#boring-stuff

So you will define your regex as follows

pat = re.compile(b'[a-f]+d+')

and not as

pat = re.compile('[a-f]+d+')

More explanations here:

15.6.4. Can’t use a string pattern on a bytes-like object


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...