Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
657 views
in Technique[技术] by (71.8m points)

python - Can CDATA sections be preserved by BeautifulSoup?

I'm using BeautifulSoup to read, modify, and write an XML file. I'm having trouble with CDATA sections being stripped out. Here's a simplified example.

The culprit XML file:

<?xml version="1.0" ?>
<foo>
    <bar><![CDATA[
        !@#$%^&*()_+{}|:"<>?,./;'[]-=
    ]]></bar>
</foo>

And here's the Python script.

from bs4 import BeautifulSoup

xmlfile = open("cdata.xml", "r") 
soup = BeautifulSoup( xmlfile, "xml" )
print(soup)

Here's the output. Note the CDATA section tags are missing.

<?xml version="1.0" encoding="utf-8"?>
<foo>
<bar>
        !@#$%^&amp;*()_+{}|:"&lt;&gt;?,./;'[]-=
    </bar>
</foo>

I also tried printing soup.prettify(formatter="xml") and got the same result with slightly different whitespace. There isn't much in the docs about reading in CDATA sections, so maybe this is an lxml thing?

Is there a way to tell BeautifulSoup to preserve CDATA sections?

Update Yes, it's an lxml thing. http://lxml.de/api.html#cdata So, the question becomes, is it possible to tell BeautifulSoup to initialize lxml with strip_cdata=False?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

In my case if I use

soup = BeautifulSoup( xmlfile, "lxml-xml" )

then cdata is preserved and accesible.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...