Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
367 views
in Technique[技术] by (71.8m points)

python - How do I fix wrongly nested / unclosed HTML tags?

I need to sanitize HTML submitted by the user by closing any open tags with correct nesting order. I have been looking for an algorithm or Python code to do this but haven't found anything except some half-baked implementations in PHP, etc.

For example, something like

<p>
  <ul>
    <li>Foo

becomes

<p>
  <ul>
    <li>Foo</li>
  </ul>
</p>

Any help would be appreciated :)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

using BeautifulSoup:

from BeautifulSoup import BeautifulSoup
html = "<p><ul><li>Foo"
soup = BeautifulSoup(html)
print soup.prettify()

gets you

<p>
 <ul>
  <li>
   Foo
  </li>
 </ul>
</p>

As far as I know, you can't control putting the <li></li> tags on separate lines from Foo.

using Tidy:

import tidy
html = "<p><ul><li>Foo"
print tidy.parseString(html, show_body_only=True)

gets you

<ul>
<li>Foo</li>
</ul>

Unfortunately, I know of no way to keep the <p> tag in the example. Tidy interprets it as an empty paragraph rather than an unclosed one, so doing

print tidy.parseString(html, show_body_only=True, drop_empty_paras=False)

comes out as

<p></p>
<ul>
<li>Foo</li>
</ul>

Ultimately, of course, the <p> tag in your example is redundant, so you might be fine with losing it.

Finally, Tidy can also do indenting:

print tidy.parseString(html, show_body_only=True, indent=True)

becomes

<ul>
  <li>Foo
  </li>
</ul>

All of these have their ups and downs, but hopefully one of them is close enough.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...