python - How to change specific link tags to text using re module?

posted Mar 6, 2021 in Technique by 深蓝 (71.8m points)

I have HTML text. For example:

<a href="https://google.com">Google</a> Lorem ipsum dolor sit amet, consectetur
adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua.<br />
<br />
#<a href="#something">somethin</a> #<a href="#somethingelse">somethinelse</a>

I want to change links preceded by "#" to normal text (e.g. wrapped in <b></b> tags). The other links should be left unchanged.

I tried to use the re module, but the result was not quite successful.

import re

cond = re.compile('#<.*?>')
output = re.sub(cond, "#", '#<a href="stuff1">stuff1</a>')
print(output)

Output:

#stuff1</a>

It still has </a> at the end.

question from:https://stackoverflow.com/questions/65854894/how-to-change-specific-link-tags-to-text-using-re-module

1 Reply

深蓝 · Answer 1 · 2021-03-06T05:14:31+0000

You're close! Your pattern, '#<.*?>', only matches the opening tag. Try this:

r'#<a href=".*?">(.*?)</a>'

This is also a little more specific, in that it will only match <a> tags. Also note that it's good practice to specify regular expressions as raw string literals (the r at the beginning). The parentheses, (.*?), are a capturing group. From the docs:

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below.

You can refer back to this group in your replacement argument as \g<#>, where # is the number of the group you want. We've only defined one group, so it's naturally the first one: \g<1>.
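As a minimal sketch of the group reference described above (the sample string is borrowed from the question, and the variable names are only illustrative), both \g<1> and the shorter \1 insert the first captured group into the replacement:

```python
import re

text = '#<a href="#something">somethin</a>'
pattern = r'#<a href=".*?">(.*?)</a>'

# \g<1> and \1 both refer to the first capturing group.
with_g = re.sub(pattern, r'#\g<1>', text)
with_backslash = re.sub(pattern, r'#\1', text)

# \g<1> stays unambiguous when a literal digit follows the reference;
# r'\10' would instead be read as a reference to group 10.
followed_by_digit = re.sub(pattern, r'\g<1>0', text)

print(with_g)
print(with_backslash)
print(followed_by_digit)
```

The \g<1> form is the safer habit because it still works when the replacement text continues with a digit.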

Additionally, once you've compiled a regular expression, you can call its own sub method:

pattern = re.compile(r'my pattern')
pattern.sub(r'replacement', 'text')

The module-level re.sub function is usually for when you haven't compiled a pattern:

re.sub(r'my pattern', r'replacement', 'text')

Performance difference is usually none or minimal, so use whichever makes your code clearer. (Personally I usually prefer compiling. Like any other variables, compiled expressions let me use clear, reusable names.)
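To illustrate that the two styles are interchangeable, here is a small sketch (the input string is made up for the example) that runs the same substitution through a compiled pattern and through the module-level function and compares the results:

```python
import re

# Hypothetical input with two "#"-prefixed links.
text = 'see #<a href="#x">x</a> and #<a href="#y">y</a>'

# Compiled once and given a descriptive, reusable name.
pound_links = re.compile(r'#<a href=".*?">(.*?)</a>')
compiled_result = pound_links.sub(r'#\g<1>', text)

# Same pattern passed directly to the module-level function.
module_result = re.sub(r'#<a href=".*?">(.*?)</a>', r'#\g<1>', text)

print(compiled_result)
print(compiled_result == module_result)
```

Both calls replace every match in the string, not just the first one.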

So your code would be:

import re

pound_links = re.compile(r'#<a href=".*?">(.*?)</a>')
output = pound_links.sub(r'#\g<1>', '#<a href="stuff1">stuff1</a>')

print(output)

Or:

import re

output = re.sub(r'#<a href=".*?">(.*?)</a>',
                r'#\g<1>',
                '#<a href="stuff1">stuff1</a>')

print(output)

Either one outputs:

#stuff1
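Applied to the HTML from the question, the same pattern might be used as follows. This is only a sketch: the Lorem ipsum text is shortened, and keeping the "#" outside the <b></b> tags is an assumption about the desired output, since the question does not say where it should go.

```python
import re

# Shortened version of the HTML from the question.
html = (
    '<a href="https://google.com">Google</a> Lorem ipsum dolor sit amet.<br />\n'
    '<br />\n'
    '#<a href="#something">somethin</a> #<a href="#somethingelse">somethinelse</a>'
)

pound_links = re.compile(r'#<a href=".*?">(.*?)</a>')

# Wrap the link text in <b></b>; the leading "#" is kept outside the tags (assumption).
result = pound_links.sub(r'#<b>\g<1></b>', html)

print(result)
```

The Google link is left alone because the pattern only matches an <a> tag that immediately follows a "#". A regex like this works for simple, predictable markup; attributes in a different order or nested tags inside the link would need a more careful pattern or an HTML parser.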
