Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - How do I extract HTML img sources with a regular expression?

I need to extract the src attribute from all image tags in an HTML document.

So, the input is an HTML page and the output would be a list of URLs pointing to images, e.g. http://www.google.com/intl/en_ALL/images/logo.gif

The following is what I came up with so far:

<img\s+src="(http://.*?)"

This does not work for tags where src isn't the first attribute after img, for example:

<img height="1px" src="spacer.gif">

Can someone help complete this regular expression? It's pretty easy, but I thought this may be a faster way to get an answer.


1 Reply


The following regexp snippet should work.

<img[^>]+src="([^">]+)"

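It matches text that starts with <img, followed by one or more characters that are not > (so the match stays inside the tag), then src=". The capture group then takes everything up to the closing ", and it cannot run past a > either. Note that it only handles double-quoted src values and is case-sensitive unless you turn on case-insensitive matching.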
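
As a rough illustration, here is how that pattern could be applied in Python. The question doesn't name a language, so Python is an assumption on my part, and the names IMG_SRC_RE, extract_img_sources and sample_html are made up for this sketch.

```python
import re
from typing import List

# The pattern from the answer: the capture group is the double-quoted src value.
IMG_SRC_RE = re.compile(r'<img[^>]+src="([^">]+)"')


def extract_img_sources(html: str) -> List[str]:
    """Return the src values of every <img> tag the pattern matches."""
    # With exactly one capture group, findall returns the captured strings.
    return IMG_SRC_RE.findall(html)


# Hypothetical input built from the two tags mentioned in the question.
sample_html = (
    '<p><img src="http://www.google.com/intl/en_ALL/images/logo.gif">'
    '<img height="1px" src="spacer.gif"></p>'
)

print(extract_img_sources(sample_html))
# ['http://www.google.com/intl/en_ALL/images/logo.gif', 'spacer.gif']
```

Passing re.IGNORECASE to re.compile would also catch <IMG SRC="...">, but single-quoted or unquoted values are still missed.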

But if at all possible, use a real HTML parser. It is more robust and handles edge cases, such as single-quoted or unquoted attributes and upper-case tag names, much better.
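
For comparison, a minimal parser-based sketch, again assuming Python and using only the standard library's html.parser module. ImgSrcCollector and extract_img_sources_with_parser are hypothetical names for this example, not anything from the question.

```python
from html.parser import HTMLParser
from typing import List


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                self.sources.append(value)


def extract_img_sources_with_parser(html: str) -> List[str]:
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


# Hypothetical input: an upper-case, single-quoted tag plus the tag from the question.
print(extract_img_sources_with_parser(
    "<IMG SRC='a.png'><img height=\"1px\" src=\"spacer.gif\">"
))
# ['a.png', 'spacer.gif']
```

The upper-case, single-quoted tag is one the regular expression above would skip, which is the kind of edge case the parser takes care of.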

