The initial HTML does not contain the data you want to scrape, which is why BeautifulSoup alone is not enough: the content is only rendered by JavaScript after the page loads. You can load the page with Selenium, wait for the data to appear, and then scrape it.
Code:
import json
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

html = None
url = 'http://demo-tableau.bitballoon.com/'
selector = '#dataTarget > div'
delay = 10  # seconds

browser = webdriver.Chrome()
browser.get(url)

try:
    # wait for the button to become enabled (clickable)
    button = WebDriverWait(browser, delay).until(
        EC.element_to_be_clickable((By.ID, 'getData'))
    )
    button.click()

    # wait for the data to be loaded into the container
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
except TimeoutException:
    print('Loading took too much time!')
else:
    html = browser.page_source
finally:
    browser.quit()

if html:
    soup = BeautifulSoup(html, 'lxml')
    # the container's text is a JSON string
    raw_data = soup.select_one(selector).text
    data = json.loads(raw_data)

    import pprint
    pprint.pprint(data)
Output:
[[{'formattedValue': 'Atlantic', 'value': 'Atlantic'},
{'formattedValue': '6/26/2010 3:00:00 AM', 'value': '2010-06-26 03:00:00'},
{'formattedValue': 'ALEX', 'value': 'ALEX'},
{'formattedValue': '16.70000', 'value': '16.7'},
{'formattedValue': '-84.40000', 'value': '-84.4'},
{'formattedValue': '30', 'value': '30'}],
...
]
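If you don't want a browser window to pop up while the script runs, you can start Chrome headless. A minimal sketch, assuming your Chrome/chromedriver build supports the headless flag:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')              # run without a visible window (assumes headless support)
options.add_argument('--window-size=1280,800')  # give the page a real viewport to render into

browser = webdriver.Chrome(options=options)

The rest of the script stays the same; only the browser construction changes.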
The code assumes that the button is initially disabled:

<button id="getData" onclick="getUnderlyingData()" disabled>Get Data</button>

and that the data is not loaded automatically but only when the button is clicked. For that to be the case, you need to delete this line from the page:

setTimeout(function(){ getUnderlyingData(); }, 3000);
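If you instead keep the setTimeout call so that the page loads the data by itself, the click is unnecessary and you can simply wait for the data container to appear. A minimal sketch, reusing the imports and the url, selector and delay variables from the code above:

browser = webdriver.Chrome()
browser.get(url)

try:
    # no click needed here: the page's own setTimeout calls getUnderlyingData(),
    # so just wait until the data shows up under the selector
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
    html = browser.page_source
except TimeoutException:
    print('Loading took too much time!')
finally:
    browser.quit()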
You can find a working demo of your example here: http://demo-tableau.bitballoon.com/.