Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
243 views
in Technique[技术] by (71.8m points)

How can I get href links from HTML using Python?

import urllib2

website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = getwebsite.read()

print html

So far so good.

But I want only href links from the plain text HTML. How can I solve this problem?

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Try with Beautifulsoup:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

In case you just want links starting with http://, you should use:

soup.findAll('a', attrs={'href': re.compile("^http://")})

In Python 3 with BS4 it should be:

from bs4 import BeautifulSoup
import urllib.request

html_page = urllib.request.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page, "html.parser")
for link in soup.findAll('a'):
    print(link.get('href'))

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

56.9k users

...