You are parsing a namespaced document, so it contains no plain 'page'
tag; an unqualified name only matches tags that have no namespace.
You are instead looking for the '{http://www.mediawiki.org/xml/export-0.8/}page'
element, which contains a '{http://www.mediawiki.org/xml/export-0.8/}ns'
element.
Many lxml methods do let you specify a namespace map to make matching easier,
but the iterparse() method is not one of them, unfortunately.
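To illustrate what a namespace map buys you on the methods that do accept one, here is a minimal sketch. It assumes a tiny hand-written sample document rather than your real dump, the prefix mw is an arbitrary choice, and the import falls back to the standard library's ElementTree, whose find() takes the same namespaces argument:

```python
try:
    from lxml import etree
except ImportError:  # fallback: stdlib ElementTree shares this part of the API
    import xml.etree.ElementTree as etree

# 'mw' is an arbitrary prefix chosen for this sketch; only the URI matters.
NSMAP = {'mw': 'http://www.mediawiki.org/xml/export-0.8/'}

# Made-up sample input, standing in for a real MediaWiki export.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">'
    b'<page><title>MediaWiki:Category</title><ns>0</ns></page>'
    b'</mediawiki>'
)

root = etree.fromstring(SAMPLE)

# An unqualified name matches nothing, because every element is namespaced.
plain_match = root.find('page')

# With a prefix and a namespace map, the same lookup succeeds.
page = root.find('mw:page', namespaces=NSMAP)
title = page.find('mw:title', namespaces=NSMAP).text
```

Here plain_match ends up as None, while page is the namespaced element and title holds its text.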
The following .iterparse() call certainly processes the right page tags:
context = etree.iterparse('test.xml', events=('end',), tag='{http://www.mediawiki.org/xml/export-0.8/}page')
but you'll need to use .find() to get the ns and title tags on the page element,
or use xpath() calls to get the text directly:
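def process_element(elem):
    # The XPath comparison returns True when the page's ns child has the value 0
    if elem.xpath("./*[local-name()='ns']/text()=0"):
        # local-name() ignores the namespace, so no namespace map is needed
        print(elem.xpath("./*[local-name()='title']/text()")[0])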
which, for your input example, prints:
>>> fast_iter(context, process_element)
MediaWiki:Category
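The .find() route mentioned earlier could look something like the sketch below. Several things here are assumptions: the sample document is made up, matching_titles is a hypothetical helper name, and the loop filters on elem.tag itself (rather than passing tag= to iterparse()) so that it also runs on the standard library's ElementTree when lxml is not installed. It uses findtext(), the text-returning sibling of find(), with fully qualified tag names:

```python
import io

try:
    from lxml import etree
except ImportError:  # fallback: stdlib ElementTree, which has no tag= argument
    import xml.etree.ElementTree as etree

# Namespace in Clark notation, ready to be prepended to a local tag name.
NS = '{http://www.mediawiki.org/xml/export-0.8/}'

# Made-up sample input, standing in for a real MediaWiki export.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">'
    b'<page><title>MediaWiki:Category</title><ns>0</ns></page>'
    b'<page><title>Talk:Example</title><ns>1</ns></page>'
    b'</mediawiki>'
)


def matching_titles(source, wanted_ns='0'):
    """Return titles of pages whose ns text equals wanted_ns (hypothetical helper)."""
    titles = []
    for _event, elem in etree.iterparse(source, events=('end',)):
        if elem.tag != NS + 'page':
            continue
        # findtext() returns None when the child element is missing
        if elem.findtext(NS + 'ns') == wanted_ns:
            titles.append(elem.findtext(NS + 'title'))
        elem.clear()  # drop the finished page's children to limit memory use
    return titles


titles = matching_titles(io.BytesIO(SAMPLE))
print(titles)
```

With lxml you can keep the tag= argument from the .iterparse() call shown earlier and drop the elem.tag check inside the loop.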