I am using the example code below to scrape a website. The problem is that the website has code behind "dojo/domReady!", so the code below completes and grabs the HTML before the remaining site content has been adjusted and finalized.
Can anybody help me adjust the code so that it waits 10 seconds after the page loads before grabbing the HTML as it exists at that point? I want to wait an arbitrary amount of time so that any content rendered after the initial page load is included.
Example:
import bs4 as bs
import sys
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
import time


class Page(QWebEnginePage):
    def __init__(self, url):
        # The QApplication must exist before the page is constructed.
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        # Blocks here until self.app.quit() is called in Callable().
        self.app.exec_()

    def _on_load_finished(self):
        # toHtml() is asynchronous and returns None; the HTML is delivered
        # to the callback, so its return value is not assigned.
        self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        # Store the HTML and stop the event loop so __init__ can return.
        self.html = html_str
        self.app.quit()


def main():
    page = Page('some_website')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    print(soup)


main()
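One possible way to get the delay described above is to start a single-shot `QTimer` from the `loadFinished` handler and call `toHtml()` only when the timer fires. The following is a minimal sketch under that assumption, not a confirmed fix for this site: `fetch_rendered_html` and `to_milliseconds` are hypothetical helper names, and the 10-second default is simply the figure from the question.

```python
import sys


def to_milliseconds(seconds):
    """Convert a delay in seconds to the whole milliseconds QTimer expects."""
    if seconds < 0:
        raise ValueError("delay must be non-negative")
    return int(round(seconds * 1000))


def fetch_rendered_html(url, wait_seconds=10):
    """Load url, wait wait_seconds after loadFinished, then return the HTML.

    Hypothetical helper for illustration; it is not part of PyQt5.
    """
    # Imported here so to_milliseconds() stays usable without Qt installed.
    from PyQt5.QtCore import QTimer, QUrl
    from PyQt5.QtWebEngineWidgets import QWebEnginePage
    from PyQt5.QtWidgets import QApplication

    app = QApplication.instance() or QApplication(sys.argv)
    page = QWebEnginePage()
    result = {"html": ""}

    def store_html(html_str):
        # toHtml() is asynchronous; this callback receives the markup.
        result["html"] = html_str
        app.quit()

    def grab_html():
        page.toHtml(store_html)

    def on_load_finished(ok):
        # The event loop keeps running during the delay, so page scripts
        # can continue to modify the DOM before the HTML is grabbed.
        QTimer.singleShot(to_milliseconds(wait_seconds), grab_html)

    page.loadFinished.connect(on_load_finished)
    page.load(QUrl(url))
    app.exec_()
    return result["html"]
```

With this sketch, `main()` could pass `fetch_rendered_html('some_website')` to `bs.BeautifulSoup` in place of `page.html`. A timer is used rather than `time.sleep(10)` because sleeping inside the handler would block the Qt event loop, and the page's JavaScript would not run during the wait.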
question from:
https://stackoverflow.com/questions/65713943/how-can-i-keep-a-pyqt5-stream-open-to-catch-dojo-domready-js-execution