Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How to change specific link tags to text using re module?

I have HTML text. For example:

<a href="https://google.com">Google</a> Lorem ipsum dolor sit amet, consectetur
adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua.<br />
<br />
#<a href="#something">somethin</a> #<a href="#somethingelse">somethinelse</a>

I want to change links preceded by "#" into normal text (e.g. wrapped in <b></b> tags). The other links should stay unchanged.

I tried to use the re module, but the result was not quite successful.

import re

cond = re.compile('#<.*?>')
output = re.sub(cond, "#", '#<a href="stuff1">stuff1</a>')
print(output)

Output:

#stuff1</a>

It still has </a> at the end.

Question from: https://stackoverflow.com/questions/65854894/how-to-change-specific-link-tags-to-text-using-re-module


1 Reply

You're close! Your pattern, '#<.*?>', only matches the "#" and the opening tag, so the closing </a> is left behind. Try this:

r'#<a href=".*?">(.*?)</a>'

This is also a little more specific, in that it will only match <a> tags. Also note that it's good practice to specify regular expressions as raw string literals (the r at the beginning). The parentheses, (.*?), are a capturing group. From the docs:

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below.

You can refer back to this group in your replacement argument as \g<#>, where # is the number of the group you want. We've only defined one group, so it's naturally the first one: \g<1>.
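As a minimal sketch of how the group behaves, the snippet below reuses the pattern from above on one of the links from the question. The variable names (pound_link, match, with_g, with_num) are illustrative only. It also shows that \1 is an equivalent way to write the backreference in the replacement; \g<1> is the less ambiguous form when the reference is followed directly by a digit.

```python
import re

# Same pattern as above; the single capturing group holds the link text.
pound_link = re.compile(r'#<a href=".*?">(.*?)</a>')

text = '#<a href="#something">somethin</a>'

match = pound_link.search(text)
print(match.group(0))  # the whole match, including "#" and both tags
print(match.group(1))  # only what the parentheses captured: somethin

# In a replacement string, \g<1> and \1 both refer to group 1.
with_g = pound_link.sub(r'#\g<1>', text)
with_num = pound_link.sub(r'#\1', text)
print(with_g)              # #somethin
print(with_g == with_num)  # True
```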

Additionally, once you've compiled a regular expression, you can call its own sub method:

pattern = re.compile(r'my pattern')
pattern.sub(r'replacement', 'text')

The module-level re.sub function is usually for when you haven't compiled the pattern:

re.sub(r'my pattern', r'replacement', 'text')

Performance difference is usually none or minimal, so use whichever makes your code clearer. (Personally I usually prefer compiling. Like any other variables, compiled expressions let me use clear, reusable names.)

So your code would be:

import re

pound_links = re.compile(r'#<a href=".*?">(.*?)</a>')
output = pound_links.sub(r'#\g<1>', '#<a href="stuff1">stuff1</a>')

print(output)

Or:

import re

output = re.sub(r'#<a href=".*?">(.*?)</a>',
                r'#\g<1>',
                '#<a href="stuff1">stuff1</a>')

print(output)

Either one outputs:

#stuff1
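To tie this back to the original question, here is a rough sketch that applies the same pattern to a shortened version of the sample HTML and wraps the link text in <b></b> tags, as the question suggested. The variable names html and result are made up for illustration, and the pattern assumes href is the first attribute and is double-quoted, as in the sample; for arbitrary HTML a real parser is the more robust choice.

```python
import re

# Shortened version of the sample from the question.
html = (
    '<a href="https://google.com">Google</a> Lorem ipsum dolor sit amet.<br />\n'
    '#<a href="#something">somethin</a> #<a href="#somethingelse">somethinelse</a>'
)

pound_links = re.compile(r'#<a href=".*?">(.*?)</a>')

# Keep the leading "#", drop the <a> tag, and wrap the link text in <b></b>.
result = pound_links.sub(r'#<b>\g<1></b>', html)
print(result)
```

The Google link is left alone because it is not preceded by "#", so the pattern never starts matching there.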
