Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
616 views
in Technique[技术] by (71.8m points)

regex - Is ">" (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?

In other words may one use /<tag[^>]*>.*?</tag>/ regex to match the tag html element which does not contain nested tag elements?

For example (lt.html):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <title>greater than sign in attribute value</title>
  </head>
  <body>
    <div>1</div>
    <div title=">">2</div>
  </body>
</html>

Regex:

$ perl -nE"say $1 if m~<div[^>]*>(.*?)</div>~" lt.html

And screen-scraper:

#!/usr/bin/env python
import sys
import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(sys.stdin)
for div in soup.findAll('div'):
    print div.string


$ python lt.py <lt.html

Both give the same output:

1
">2

Expected output:

1
2

w3c says:

Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Yes, it is allowed (W3C Validator accepts it, only issues a warning).

Unescaped < and > are also allowed inside comments, so such simple regexp can be fooled.

If BeautifulSoup doesn't handle this, it could be a bug or perhaps a conscious design decision to make it more resilient to missing closing quotes in attributes.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...