Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
454 views
in Technique[技术] by (71.8m points)

python - How to get BeautifulSoup 4 to respect a self-closing tag?

This question is specific to BeautifulSoup4, which makes it different from the previous questions:

Why is BeautifulSoup modifying my self-closing elements?

selfClosingTags in BeautifulSoup

Since BeautifulStoneSoup is gone (the previous xml parser), how can I get bs4 to respect a new self-closing tag? For example:

import bs4   
S = '''<foo> <bar a="3"/> </foo>'''
soup = bs4.BeautifulSoup(S, selfClosingTags=['bar'])

print soup.prettify()

Does not self-close the bar tag, but gives a hint. What is this tree builder that bs4 is referring to and how to I self-close the tag?

/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:112: UserWarning: BS4 does not respect the selfClosingTags argument to the BeautifulSoup constructor. The tree builder is responsible for understanding self-closing tags.
  "BS4 does not respect the selfClosingTags argument to the "
<html>
 <body>
  <foo>
   <bar a="3">
   </bar>
  </foo>
 </body>
</html>
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

To parse XML you pass in “xml” as the second argument to the BeautifulSoup constructor.

soup = bs4.BeautifulSoup(S, 'xml')

You’ll need to have lxml installed.

You don't need to pass selfClosingTags anymore:

In [1]: import bs4
In [2]: S = '''<foo> <bar a="3"/> </foo>'''
In [3]: soup = bs4.BeautifulSoup(S, 'xml')
In [4]: print soup.prettify()
<?xml version="1.0" encoding="utf-8"?>
<foo>
 <bar a="3"/>
</foo>

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...