Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

$ Windows newline symbol in Python bytes regex

$ matches at the end of a line, which is defined as either the end of the string, or any location followed by a newline character.

However, the Windows newline sequence consists of two characters, '\r\n'. How can I make '$' recognize '\r\n' as a newline in a bytes pattern?

Here is what I have:

# Python 3.4.2
import re

input = b'''
//today is a good day \r\n
//this is Windows newline style \r\n
//unix line style \n
...other binary data... 
'''

L = re.findall(rb'//.*?$', input, flags = re.DOTALL | re.MULTILINE)
for item in L : print(item)

Now the output is:

b'//today is a good day \r'
b'//this is Windows newline style \r'
b'//unix line style '

But the expected output is:

b'//today is a good day '
b'//this is Windows newline style '
b'//unix line style '

1 Reply

It is not possible to redefine anchor behavior.

To match a // followed by any number of characters other than CR and LF, use a negated character class [^\r\n] with a * quantifier:

L = re.findall(rb'//[^\r\n]*', input)

Note that this approach does not require using re.M and re.S flags.
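
As a quick check, the negated character class can be run against data shaped like the sample in the question. This is only a sketch: the name `data` and the placeholder binary tail are assumptions made for illustration, not part of the original post.

```python
import re

# Sample bytes mixing Windows (\r\n) and Unix (\n) line endings.
# The trailing bytes are a made-up stand-in for "other binary data".
data = (
    b"//today is a good day \r\n"
    b"//this is Windows newline style \r\n"
    b"//unix line style \n"
    b"\x00\x01 other binary data"
)

# [^\r\n]* stops at the first CR or LF, so no flags are needed.
comments = re.findall(rb"//[^\r\n]*", data)
for item in comments:
    print(item)
```

Because the class excludes both CR and LF, the match ends before either kind of line break and the trailing '\r' never ends up in the result.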

Or, you can add \r? before the $ and enclose this part in a positive look-ahead (you will also need the lazy quantifier *? with .):

rb'//.*?(?=\r?$)'

The point of using a look-ahead is that $ itself behaves like a look-ahead, since it does not consume any character. Thus, we can safely put it inside a look-ahead together with an optional \r.
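
Here is a minimal sketch of the look-ahead variant. The sample bytes and the names `data`, `pattern` and `matches` are assumptions for illustration. Note that re.MULTILINE is still required so that $ can match before each '\n' rather than only at the end of the input.

```python
import re

# Hypothetical sample: one CRLF line, one LF line, one line with no terminator.
data = b"//first \r\n//second \n//third"

# .*? is lazy, so it stops at the first position where an optional \r
# followed by an end-of-line position can be seen ahead.
pattern = re.compile(rb"//.*?(?=\r?$)", re.MULTILINE)
matches = pattern.findall(data)
for item in matches:
    print(item)
```

Since the look-ahead does not consume anything, the '\r' is checked but left out of the match.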

Maybe this is not that pertinent since it is from MSDN, but I think it is the same for Python:

Note that $ matches \n but does not match \r\n (the combination of carriage return and newline characters, or CR/LF). To match the CR/LF character combination, include \r?$ in the regular expression pattern.
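
The same behavior can be observed in Python with a small experiment. This is a sketch with made-up sample data; the variable names are assumptions for illustration.

```python
import re

# Hypothetical sample: "abc" ends with CRLF, "def" ends with LF only.
data = b"abc\r\ndef\n"

# Plain $ only matches right before \n, so the CRLF line is skipped:
# "abc" is followed by \r, which is not an end-of-line position.
plain = re.findall(rb"\w+$", data, re.MULTILINE)

# Consuming an optional \r lets the CRLF line match, but keeps the \r.
consumed = re.findall(rb"\w+\r?$", data, re.MULTILINE)

# Putting \r?$ inside a look-ahead matches both lines without the \r.
lookahead = re.findall(rb"\w+(?=\r?$)", data, re.MULTILINE)

print(plain, consumed, lookahead)
```

The first call finds only the LF-terminated line, which is the behavior the quoted note describes. The other two show the difference between consuming the \r and merely looking ahead for it.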

In PCRE, you can use (*ANYCRLF), (*CR) and (*ANY) to override the default behavior of the $ anchor, but not in Python.

