Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

html - Python regex look-behind requires fixed-width pattern

When trying to extract the title of an HTML page, I have always used the following regex:

(?<=<title.*>)([\s\S]*)(?=</title>)

This extracts everything between the tags in a document and ignores the tags themselves. However, when I try to use this regex in Python, it raises the following exception:

Traceback (most recent call last):
  File "test.py", line 21, in <module>
    pattern = re.compile('(?<=<title.*>)([\s\S]*)(?=</title>)')
  File "C:\Python31\lib\re.py", line 205, in compile
    return _compile(pattern, flags)
  File "C:\Python31\lib\re.py", line 273, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python31\lib\sre_compile.py", line 495, in compile
    code = _code(p, flags)
  File "C:\Python31\lib\sre_compile.py", line 480, in _code
    _compile(code, p.data, flags)
  File "C:\Python31\lib\sre_compile.py", line 115, in _compile
    raise error("look-behind requires fixed-width pattern")
sre_constants.error: look-behind requires fixed-width pattern

The code I am using is:

pattern = re.compile('(?<=<title.*>)([\s\S]*)(?=</title>)')
m = pattern.search(f)

If I make a minimal adjustment, it works:

pattern = re.compile('(?<=<title>)([\s\S]*)(?=</title>)')
m = pattern.search(f)

This will, however, not match title tags that for some reason carry attributes or similar.
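
The two behaviours described so far can be reproduced in a few lines. This is only a sketch: the sample strings `plain` and `with_attr` are made up for illustration and are not from the original page.

```python
import re

# Variable-width look-behind: Python's re module rejects it at compile time.
try:
    re.compile(r'(?<=<title.*>)([\s\S]*)(?=</title>)')
    compiled = True
    message = ''
except re.error as exc:
    compiled = False
    message = str(exc)

# Fixed-width look-behind compiles, but only a bare <title> tag satisfies it.
fixed = re.compile(r'(?<=<title>)([\s\S]*)(?=</title>)')

# Hypothetical sample documents, for illustration only.
plain = '<html><head><title>Hello</title></head></html>'
with_attr = '<html><head><title lang="en">Hello</title></head></html>'

plain_match = fixed.search(plain)      # matches, group(1) is the title text
attr_match = fixed.search(with_attr)   # no match: the literal "<title>" never occurs

print(compiled, message)
print(plain_match.group(1), attr_match)
```

The first pattern never gets as far as searching, because the `.*` inside the look-behind has no fixed length. The second compiles, but silently returns no match once the opening tag carries an attribute.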

Anyone know a good workaround for this issue? Any tips are appreciated.


1 Reply


Toss out the idea of parsing HTML with regular expressions and use an actual HTML parsing library instead. After a quick search I found this one. It's a much safer way to extract information from an HTML file.
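
The library linked above cannot be recovered from the text, so as a stand-in here is a rough sketch using `html.parser` from Python's standard library. The class name `TitleParser` and the helper `extract_title` are hypothetical names chosen for this example; a third-party parser would look different.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; any attributes on <title> are simply ignored.
        if tag == 'title' and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # None when no complete <title>...</title> pair was seen.
        return ''.join(self._parts).strip() if self._done else None


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


print(extract_title('<html><head><title lang="en">My Page</title></head></html>'))
```

Because the parser reports the tag name separately from its attributes, a title tag with attributes needs no special handling.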

Remember, HTML is not a regular language, so regular expressions are fundamentally the wrong tool for extracting information from it.
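
One concrete way this goes wrong, as a small sketch: a commented-out title fools a regex but not a parser. The sample `page`, the capture-group pattern, and the `TitleCollector` class are all assumptions made up for this illustration, with the standard library's `html.parser` standing in for whichever parser you pick.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: an old title left inside an HTML comment.
page = '''<html><head>
<!-- <title>Old draft</title> -->
<title>Real title</title>
</head></html>'''

# A capture-group regex (no look-behind) has no notion of comments,
# so it picks up the commented-out title first.
regex_title = re.search(r'<title[^>]*>(.*?)</title>', page,
                        re.DOTALL | re.IGNORECASE).group(1)


class TitleCollector(HTMLParser):
    """Records the text of every real <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
            self.titles.append('')

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data


collector = TitleCollector()
collector.feed(page)
collector.close()
parser_title = collector.titles[0]

print(regex_title)   # the commented-out title
print(parser_title)  # the title a browser would actually use
```

The parser hands comment contents to a separate callback, so the markup inside the comment is never treated as a tag.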

