When trying to extract the title of an HTML page, I have always used the following regex:
(?<=<title.*>)([\s\S]*)(?=</title>)
This extracts everything between the tags in a document and ignores the tags themselves. However, when I try to use this regex in Python, it raises the following exception:
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    pattern = re.compile('(?<=<title.*>)([\s\S]*)(?=</title>)')
  File "C:\Python31\lib\re.py", line 205, in compile
    return _compile(pattern, flags)
  File "C:\Python31\lib\re.py", line 273, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python31\lib\sre_compile.py", line 495, in compile
    code = _code(p, flags)
  File "C:\Python31\lib\sre_compile.py", line 480, in _code
    _compile(code, p.data, flags)
  File "C:\Python31\lib\sre_compile.py", line 115, in _compile
    raise error("look-behind requires fixed-width pattern")
sre_constants.error: look-behind requires fixed-width pattern
The code I am using is:
pattern = re.compile('(?<=<title.*>)([\s\S]*)(?=</title>)')
m = pattern.search(f)
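Note that the traceback points at re.compile, not at pattern.search, so the contents of f play no part in the failure. A minimal sketch to confirm this, assuming a hypothetical helper named compiles that only reports whether a pattern is accepted:

```python
import re

def compiles(pattern):
    """Return True if re.compile accepts `pattern`, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# `.*` inside the look-behind can match any number of characters (variable width).
variable_width = r'(?<=<title.*>)([\s\S]*)(?=</title>)'
# `<title>` inside the look-behind is always exactly 7 characters (fixed width).
fixed_width = r'(?<=<title>)([\s\S]*)(?=</title>)'

print(compiles(variable_width))
print(compiles(fixed_width))
```

The first pattern is rejected and the second is accepted, which matches the behaviour described here.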
If I make some minimal adjustments, it works:
pattern = re.compile('(?<=<title>)([\s\S]*)(?=</title>)')
m = pattern.search(f)
However, this will not match HTML title tags that, for some reason, have attributes or similar.
Does anyone know a good workaround for this issue? Any tips are appreciated.
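One possible workaround, offered as a sketch rather than a definitive answer: drop the look-behind entirely, match the opening tag as ordinary pattern text, and read the title from a capture group. The names TITLE_RE and extract_title are hypothetical and used only for illustration; the sketch also assumes the page has already been read into a string.

```python
import re

# Match the opening tag (with optional attributes) as normal pattern text,
# capture the content lazily, then match the closing tag.
# re.DOTALL lets `.` span newlines; re.IGNORECASE also accepts <TITLE>.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title\s*>', re.IGNORECASE | re.DOTALL)

def extract_title(html):
    """Return the stripped text of the first <title> element, or None if absent."""
    m = TITLE_RE.search(html)
    return m.group(1).strip() if m else None

page = '<html><head><title lang="en">\n  My Page  </title></head></html>'
print(extract_title(page))
```

Because the tags sit outside the capture group, m.group(1) holds only the text between them, which is what the look-around version was meant to achieve. A regex like this is still fragile on unusual markup; the standard library's html.parser module is a sturdier option if that matters.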