Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

html - I need help extracting embedded .xlsx link from a webpage using Python/BeautifulSoup

I'm trying to read an Excel table from this website into a pandas DataFrame. Here is what I have so far:

import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://tedb.ornl.gov/data/'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# Select all 'a' elements with href attributes containing URLs starting with https://
for link in soup.select('a[href^="https://"]'):
    href = link.get('href')
    print(href)

I'd like to grab Table 4.01, whose link, when inspected, is contained within the HTML element:

<a href="https://tedb.ornl.gov/wp-content/uploads/2020/06/Table4_01_06242020.xlsx">xlsx</a>

However, when I run my code, all I get back are the links below:

https://www.ornl.gov
https://tedb.ornl.gov/
https://tedb.ornl.gov/data/
https://tedb.ornl.gov/archive/
https://tedb.ornl.gov/citation/
https://tedb.ornl.gov/contact/
https://tedb.ornl.gov/wp-content/uploads/2020/02/TEDB_Ed_38.pdf
https://tedb.ornl.gov/wp-content/uploads/2020/08/TEDB_38.2_Spreadsheets_08312020.zip
https://tedb.ornl.gov/wp-content/uploads/2020/08/Updates_08312020.pdf
https://www.ornl.gov/ornl/contact-us/Security--Privacy-Notice
https://www.ornl.gov/content/accessibility
https://www.ornl.gov/content/notice-nondiscrimination-and-accessibility-requirements

Does anyone know why the Excel link I'm looking for does not show up?

Question from: https://stackoverflow.com/questions/66055635/i-need-help-extracting-embedded-xlsx-link-from-a-webpage-using-python-beautiful

1 Reply

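The table is generated dynamically by JavaScript after the page loads, so the HTML that urlopen returns does not contain its rows or their .xlsx links. There is, however, a back-end URL you can query directly for the table data.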

Here's how:

import requests
from bs4 import BeautifulSoup

# Ajax endpoint the page calls to fill the table; it returns the rows as JSON.
url = "https://tedb.ornl.gov/wp-admin/admin-ajax.php?action=wp_ajax_ninja_tables_public_action&table_id=3374&target_action=get-all-data&default_sorting=manual_sort"

rows = requests.get(url).json()

for item in rows:
    # Each row's "excel" field is an HTML snippet; pull the href out of its <a> tag.
    print(BeautifulSoup(item["value"]["excel"], "html.parser").find("a")["href"])

Output:

https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_01_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_02_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_03_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_04_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_01_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_02_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_03_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Table1_05_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_06_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_07_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_08_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_04_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_09_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_10_04302020.xlsx
and many more...
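
To get from that list to the single table asked about (Table 4.01), one option is to filter the links by file name. The following is only a sketch: pick_table_link is a hypothetical helper, not part of any library, and it assumes the file names keep the Table<chapter>_<number>_<date>.xlsx pattern seen in the output above.

```python
import re


def pick_table_link(hrefs, table_number):
    """Return the first .xlsx link for a table number such as "4.01", or None.

    Assumes file names look like Table4_01_06242020.xlsx (pattern inferred
    from the links printed above, not documented by the site).
    """
    chapter, _, index = table_number.partition(".")
    pattern = re.compile(
        rf"/Table{re.escape(chapter)}_{re.escape(index)}_[^/]*\.xlsx$"
    )
    for href in hrefs:
        if pattern.search(href):
            return href
    return None
```

If the hrefs from the loop above are collected into a list instead of printed, pick_table_link(hrefs, "4.01") should give the Table4_01 link, and pd.read_excel(link) can then load it into a DataFrame (reading .xlsx files needs the openpyxl package). The spreadsheet's header layout is not shown here, so arguments such as skiprows or header may need adjusting.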
