You can do so in several steps.
- Parse the HTML with parse5. It is fast and W3C-compliant, but its output is a parse5 tree, not a DOM.
- Serialize it to XHTML with xmlserializer, which accepts parse5's DOM-like structures as input.
- Parse that XHTML again with xmldom. Now you finally have a DOM.
- The xpath library builds upon xmldom, allowing you to run XPath queries. Be aware that XHTML elements live in their own namespace, so unprefixed queries like //a won't match anything.
Finally you get something like this.
const fs = require('mz/fs');
const xpath = require('xpath');
const parse5 = require('parse5');
const xmlser = require('xmlserializer');
const dom = require('xmldom').DOMParser;

(async () => {
    // 1. Parse the HTML into a parse5 tree (not a DOM yet).
    const html = await fs.readFile('./test.htm');
    const document = parse5.parse(html.toString());
    // 2. Serialize the tree to XHTML.
    const xhtml = xmlser.serializeToString(document);
    // 3. Parse the XHTML into a real DOM.
    const doc = new dom().parseFromString(xhtml);
    // 4. Bind the x prefix to the XHTML namespace, then query.
    const select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
    const nodes = select("//x:a/@href", doc);
    console.log(nodes);
})().catch(console.error); // report read or parse failures
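The query above selects attribute nodes, not strings, so console.log prints node objects. Here is a minimal sketch of pulling out the plain values, assuming each returned node exposes the standard DOM value property. attrValues is a hypothetical helper name, and the sample array is only a stand-in for real query results.

```javascript
// Hypothetical helper: maps attribute nodes to their string values.
// Assumes every node has a `value` property, as DOM Attr nodes do.
function attrValues(nodes) {
  return nodes.map((node) => node.value);
}

// Stand-in objects shaped like DOM Attr nodes, for illustration only.
const sample = [
  { name: 'href', value: '/home' },
  { name: 'href', value: '/about' },
];

console.log(attrValues(sample)); // prints [ '/home', '/about' ]
```

In the script above you would call attrValues(nodes) in place of logging nodes directly.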
Note that you have to prefix every HTML element in a query with x: (attributes such as @href stay unprefixed). For example, to match an a directly inside a div you would need:
//x:div/x:a
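If you write many such queries, a small helper can add the prefix for you. This is only a sketch under stated assumptions: prefixPath is a hypothetical function, not part of the xpath library. It handles only simple paths given as a list of element names, and it assumes the x prefix has been bound with useNamespaces as shown earlier.

```javascript
// Hypothetical helper (not part of the xpath package): builds a query that
// starts with // and joins the given element names with /, adding `prefix:`
// to each name. Predicates, attributes and other axes are out of scope.
function prefixPath(names, prefix = 'x') {
  if (names.length === 0) {
    throw new Error('at least one element name is required');
  }
  return '//' + names.map((name) => `${prefix}:${name}`).join('/');
}

console.log(prefixPath(['div', 'a'])); // prints //x:div/x:a
```

The result can be passed straight to select, for example select(prefixPath(['div', 'a']), doc).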