I have a challenging problem to solve. I'm working on a script which takes a regex as an input. This script then finds all matches for this regex in a document and wraps each match in its own <span> element. The hard part is that the text is a formatted html document, so my script needs to navigate through the DOM and apply the regex across multiple text nodes at once, while figuring out where it has to split text nodes if needed.
For example, with a regex that captures full sentences starting with a capital letter and ending with a period, this document:
<p>
<b>HTML</b> is a language used to make <b>websites.</b>
It was developed by <i>CERN</i> employees in the early 90s.
<p>
Would be turned into this:
<p>
<span><b>HTML</b> is a language used to make <b>websites.</b></span>
<span>It was developed by <i>CERN</i> employees in the early 90s.</span>
<p>
The script then returns the list of all created spans.
I already have some code which finds all the text nodes and stores them in a list along with their position across the whole document and their depth. You don't really need to understand that code to help me and its recursive structure can be a bit confusing. The first part I'm not sure how to do is figure out which elements should be included within the span.
function SmartNode(node, depth, start) {
this.node = node;
this.depth = depth;
this.start = start;
}
function findTextNodes(node, depth, start) {
var list = [];
var start = start || 0;
depth = (typeof depth !== "undefined" ? depth : -1);
if(node.nodeType === Node.TEXT_NODE) {
list.push(new SmartNode(node, depth, start));
} else {
for(var i=0; i < node.childNodes.length; ++i) {
list = list.concat(findTextNodes(node.childNodes[i], depth+1, start));
if(list.length) start += list[list.length-1].node.nodeValue.length;
}
}
return list;
}
I figure I'll make a string out of all the document, run the regex through it and use the list to find which nodes correspond to witch regex matches and then split the text nodes accordingly.
But an issue arrives when I have a document like this:
<p>
This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a>
</p>
There's a sentence which starts outside of the <a>
tag but ends inside it. Now I don't want the script to split that link in two tags. In a more complex document, it could ruin the page if it did. The code could either wrap two sentences together:
<p>
<span>This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a></span>
</p>
Or just wrap each part in its own element:
<p>
<span>This program is </span>
<a href="beta.html">
<span>not stable yet.</span>
<span>Do not use this in production yet.</span>
</a>
</p>
There could be a parameter to specify what it should do. I'm just not sure how to figure out when an impossible cut is about to happen, and how to recover from it.
Another issue comes when I have whitespace inside a child element like this:
<p>This is a <b>sentence. </b></p>
Technically, the regex match would end right after the period, before the end of the <b>
tag. However, it would be much better to consider the space as part of the match and wrap it like this:
<p><span>This is a <b>sentence. </b></span></p>
Than this:
<p><span>This is a </span><b><span>sentence.</span> </b></p>
But that's a minor issue. After all, I could just allow extra white-space to be included within the regex.
I know this might sound like a "do it for me" question and its not the kind of quick question we see on SO on a daily basis, but I've been stuck on this for a while and it's for an open-source library I'm working on. Solving this problem is the last obstacle. If you think another SE site is best suited for this question, redirect me please.
See Question&Answers more detail:
os