I'm trying to get all the hrefs from an HTML page and store them in a list for future processing. For example:
Example URL: www.example-page-xl.com
<body>
<section>
<a href="/helloworld/index.php"> Hello World </a>
</section>
</body>
I'm using the following code to list the hrefs:
import bs4 as bs
import urllib.request

# Download the page and parse it (the 'lxml' parser must be installed)
sauce = urllib.request.urlopen('https://www.example-page-xl.com').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
section = soup.section
# Print the href of every link inside the first <section>
for url in section.find_all('a'):
    print(url.get('href'))
However, I would like to store each URL in its absolute form:
www.example-page-xl.com/helloworld/index.php and not just the relative path, which is /helloworld/index.php
Manually appending the relative path to the URL is not what I want, since the links are dynamic and the result may vary when I join the URL and the relative path myself.
In a nutshell, I would like to scrape the absolute URLs rather than the relative paths alone (and without joining them by hand).
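For context, the HTML above only contains the relative path, so the absolute URL has to be computed from the page URL in some way. A minimal sketch of one option, assuming the standard library's urllib.parse.urljoin is acceptable: it resolves a relative reference the way a browser does instead of concatenating strings. The https scheme, the extra hrefs, and the helper name to_absolute are assumptions made for illustration only.

```python
from urllib.parse import urljoin

# Assumed page URL; the scheme is a guess, the question only gives the host name.
BASE_URL = "https://www.example-page-xl.com/"


def to_absolute(base_url, hrefs):
    """Resolve each href against base_url; skip missing (None or empty) values."""
    return [urljoin(base_url, href) for href in hrefs if href]


# Hypothetical values as they might come back from url.get('href');
# None stands for an <a> tag that has no href attribute.
hrefs = [
    "/helloworld/index.php",        # root-relative path from the question
    "other/page.php",               # path relative to the current page
    "https://elsewhere.example/x",  # already absolute, left unchanged
    None,
]

absolute_urls = to_absolute(BASE_URL, hrefs)
print(absolute_urls)
```

In the loop from the question, the same call would be urljoin(page_url, url.get('href')), with the results appended to a list instead of printed.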