javascript regex to extract anchor text and URL from anchor tags


posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)


I have a paragraph of text in a javascript variable called 'input_content' and that text contains multiple anchor tags/links. I would like to match all of the anchor tags and extract anchor text and URL, and put it into an array like (or similar to) this:

Array
(
    [0] => Array
        (
            [0] => <a href="http://yahoo.com">Yahoo</a>
            [1] => http://yahoo.com
            [2] => Yahoo
        )
    [1] => Array
        (
            [0] => <a href="http://google.com">Google</a>
            [1] => http://google.com
            [2] => Google
        )
)

I've taken a crack at it (http://pastie.org/339755), but I am stumped beyond this point. Thanks for the help!




深蓝 · Answer 1 · 2021-10-16T23:59:16+0000

var matches = [];

input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
    matches.push(Array.prototype.slice.call(arguments, 1, 4));
});

This assumes that your anchors will always be in the form <a href="...">...</a>, i.e. it won't work if there are any other attributes (for example, target), and it won't match anchors whose text contains nested tags, because [^<]+ stops at the first <. The regular expression can be improved to accommodate this.
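
One way the expression might be loosened is sketched below. It is only an illustration, not part of the original answer: the function name extractAnchors and the pattern itself are assumptions. It tolerates extra attributes before or after href and accepts single or double quotes, but it still expects plain text (no nested tags) inside the anchor.

```javascript
// Hypothetical helper (name and pattern are illustrative assumptions).
// Returns an array of [whole anchor, href value, anchor text] triples.
function extractAnchors(html) {
    // <a, optional attributes, whitespace, href = quoted value,
    // optional attributes, >, plain text, </a>
    var pattern = /<a\b[^>]*?\shref\s*=\s*(["'])(.*?)\1[^>]*>([^<]+)<\/a>/gi;
    var results = [];
    var m;
    // exec() with the g flag resumes after the previous match on each call
    while ((m = pattern.exec(html)) !== null) {
        // m[1] is the quote character, so the href is m[2] and the text is m[3]
        results.push([m[0], m[2], m[3]]);
    }
    return results;
}
```

Calling extractAnchors(input_content) would then return one three-item array per anchor, in the same order as the structure asked for in the question.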

To break down the regular expression:

/ -> start regular expression
  [^<]* -> skip all characters until the first <
  ( -> start capturing first token
    <a href=" -> capture first bit of anchor
    ( -> start capturing second token
        [^"]+ -> capture all characters until a "
    ) -> end capturing second token
    "> -> capture more of the anchor
    ( -> start capturing third token
        [^<]+ -> capture all characters until a <
    ) -> end capturing third token
    <\/a> -> capture last bit of anchor (the / is escaped so it does not end the regex literal)
  ) -> end capturing first token
/g -> end regular expression, add global flag to match all anchors in string

Each call to our anonymous function receives the full match as arguments[0], followed by the three tokens as the second, third and fourth arguments, namely arguments[1], arguments[2], arguments[3]:

arguments[1] is the entire anchor
arguments[2] is the href part
arguments[3] is the text inside

We'll use a hack to push these three arguments as a new array into our main matches array. The arguments built-in variable is not a true JavaScript Array, so we'll have to apply the slice Array method on it to extract the items we want:

Array.prototype.slice.call(arguments, 1, 4)

This will extract items from arguments starting at index 1 and ending (not inclusive) at index 4.
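
As a small sketch of what that slice call does, the snippet below runs the same regular expression twice: once with the slice hack and once with named callback parameters, which should give the same result without touching arguments. The variable names sample, viaSlice and viaNamed are made up for this illustration.

```javascript
var sample = 'see <a href="http://yahoo.com">Yahoo</a>';
var pattern = /[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g;

// Variant 1: copy items 1..3 out of the array-like arguments object
var viaSlice = [];
sample.replace(pattern, function () {
    viaSlice.push(Array.prototype.slice.call(arguments, 1, 4));
});

// Variant 2: name the callback parameters instead of slicing arguments
var viaNamed = [];
sample.replace(pattern, function (match, anchor, href, text) {
    viaNamed.push([anchor, href, text]);
});
```

Both arrays end up holding a single entry with the whole anchor, the URL and the link text.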

var input_content = 'blah \n' +
    '<a href="http://yahoo.com">Yahoo</a> \n' +
    'blah \n' +
    '<a href="http://google.com">Google</a> \n' +
    'blah';

var matches = [];

input_content.replace(/[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, function () {
    matches.push(Array.prototype.slice.call(arguments, 1, 4));
});

alert(matches.join("\n"));

Gives:

<a href="http://yahoo.com">Yahoo</a>,http://yahoo.com,Yahoo
<a href="http://google.com">Google</a>,http://google.com,Google
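
In environments that support String.prototype.matchAll (ES2020), the same result can probably be had without abusing replace. The following is a sketch under that assumption, not the answer's original method; the names linkPattern and found are invented here, and the pattern drops the leading [^<]* and the outer group because each match object already carries the whole anchor at index 0.

```javascript
var input_content = 'blah \n' +
    '<a href="http://yahoo.com">Yahoo</a> \n' +
    'blah \n' +
    '<a href="http://google.com">Google</a> \n' +
    'blah';

// matchAll requires the g flag and yields one match object per anchor
var linkPattern = /<a href="([^"]+)">([^<]+)<\/a>/g;

// Each match m: m[0] is the whole anchor, m[1] the href, m[2] the text
var found = Array.from(input_content.matchAll(linkPattern), function (m) {
    return [m[0], m[1], m[2]];
});
```

Here found has the same shape as matches above: one three-item array per anchor.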
