Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others



python - Parse several XML declarations in a single file by means of lxml.etree.iterparse

I need to parse a file that contains several XML documents concatenated together, i.e., <xml></xml> <xml></xml> ... and so forth. When I use etree.iterparse, I get the following (correct) error:

lxml.etree.XMLSyntaxError: XML declaration allowed only at the start of the document

I could preprocess the input file and write each contained XML document to a separate file, which might be the easiest solution. But I wonder whether a proper solution to this 'problem' exists.

Thanks!



1 Reply


The sample data you've provided suggests one problem, while the question and the exception you've provided suggest another. Do you have multiple XML documents concatenated together, each with its own XML declaration, or do you have an XML fragment with multiple top-level elements?

If it's the former, then the solution's going to involve breaking the input stream up into multiple streams, and parsing each one individually. This doesn't necessarily mean, as one comment suggests, implementing an XML parser. You can search a string for XML declarations without having to parse anything else in it, so long as your input doesn't include CDATA sections that contain unescaped XML declarations. You can write a file-like object that returns characters from the underlying stream until it hits an XML declaration, and then wrap it in a generator function that keeps returning streams until EOF is reached. It's not trivial, but it's not hugely difficult either.
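A minimal sketch of that splitting approach, under a few assumptions: the input can be read as bytes, no CDATA section or comment contains a literal XML declaration, and each sub-document fits in memory. The helper name split_documents and the chunk_size parameter are made up for illustration. Rather than a custom file-like object, this version buffers the stream and yields each document as a bytes object. The import falls back to the standard library's xml.etree.ElementTree when lxml is not installed, since both provide fromstring and iterparse.

```python
import io
import re

try:
    from lxml import etree  # the library used in the question
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with the same calls

# Start of an XML declaration, e.g. <?xml version="1.0"?>.
# The trailing \s keeps <?xml-stylesheet ...?> from matching.
DECL = re.compile(rb"<\?xml\s")


def split_documents(stream, chunk_size=64 * 1024):
    """Yield each concatenated XML document in `stream` as a bytes object."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        buf += chunk
        # Search from offset 1 so the declaration that opens the current
        # document is not mistaken for the start of the next one.
        match = DECL.search(buf, 1)
        while match:
            doc, buf = buf[:match.start()], buf[match.start():]
            if doc.strip():
                yield doc
            match = DECL.search(buf, 1)
        if not chunk:  # EOF: whatever is left is the last document
            break
    if buf.strip():
        yield buf


data = io.BytesIO(
    b'<?xml version="1.0"?><a><x/></a>\n'
    b'<?xml version="1.0"?><b/>\n'
)
roots = [etree.fromstring(doc) for doc in split_documents(data)]
print([root.tag for root in roots])
```

To keep using iterparse, each chunk can be wrapped again, for example etree.iterparse(io.BytesIO(doc)). The buffer is rescanned after every read, which is simple but not efficient for very large documents; remembering the last search offset would avoid that.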

If you have an XML fragment with multiple top-level elements, you can just wrap them in a single XML element and parse the whole thing.
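The wrapping idea could look roughly like this. It assumes the fragment is a string with no XML declaration or DOCTYPE of its own; the function name parse_fragment and the wrapper tag "root" are arbitrary choices for the example.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with the same calls


def parse_fragment(fragment, wrapper="root"):
    """Parse a fragment with several top-level elements and return them as a list."""
    # Surround the fragment with one synthetic root so it becomes well-formed.
    wrapped = "<{0}>{1}</{0}>".format(wrapper, fragment)
    # Iterating an element yields its direct children, i.e. the original
    # top-level elements of the fragment.
    return list(etree.fromstring(wrapped))


elements = parse_fragment("<xml>first</xml> <xml>second</xml>")
print([el.text for el in elements])
```

This will not work on the concatenated-documents case, because an XML declaration inside the wrapper element is still a syntax error.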

Of course, as with most problems involving bad XML input, the easiest solution may just be to fix the thing that's producing the bad input.



...