I need to scrape a very simple page on our company's intranet to automate one of our internal processes (reporting whether a function's output was successful or not).
I found the following example:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://sitescraper.net'
r = Render(url)
html = r.frame.toHtml()
The example comes from http://blog.sitescraper.net/2010/06/scraping-javascript-webpages-in-python.html, and it's almost perfect. I just need to be able to provide authentication to view the page.
I've been looking through the documentation for PyQt4 and I'll admit a lot of it is over my head. If anyone could help, I'd appreciate it.
Edit:
Unfortunately, gruszczy's method didn't work for me. When I did something similar with urllib2, the following code worked:
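import base64
import urllib2

username = 'user'
password = 'pass'
req = urllib2.Request(url)
# encodestring appends a trailing newline; [:-1] strips it
base64string = base64.encodestring('%s:%s' % (username, password))[:-1]
authheader = "Basic %s" % base64string
# Send the credentials up front as an HTTP Basic Authorization header
req.add_header("Authorization", authheader)
handle = urllib2.urlopen(req)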
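For comparison, here is a rough sketch of how the same Basic Authorization header might be carried over to the PyQt4 version, by loading a QNetworkRequest instead of a bare QUrl. The names basic_auth_header and render_with_auth are hypothetical, and the sketch assumes the server accepts credentials sent up front, as the urllib2 code above suggests. I have not confirmed that this works against the intranet page.

```python
import base64


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header as bytes."""
    credentials = ('%s:%s' % (username, password)).encode('utf-8')
    # b64encode adds no trailing newline, so no [:-1] slice is needed.
    return b'Basic ' + base64.b64encode(credentials)


def render_with_auth(url, username, password):
    """Load url in a QWebPage with the header attached and return the HTML.

    Hypothetical helper: PyQt4 is imported here so that the function
    above can be used (and checked) without PyQt4 installed.
    """
    import sys
    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtNetwork import QNetworkRequest
    from PyQt4.QtWebKit import QWebPage

    app = QApplication(sys.argv)
    page = QWebPage()
    # Stop the event loop once the page has finished loading.
    page.loadFinished.connect(lambda ok: app.quit())

    # Attach the header to the request that the main frame loads.
    request = QNetworkRequest(QUrl(url))
    request.setRawHeader(b'Authorization',
                         basic_auth_header(username, password))
    page.mainFrame().load(request)
    app.exec_()
    return page.mainFrame().toHtml()
```

Calling render_with_auth(url, 'user', 'pass') would stand in for Render(url) followed by r.frame.toHtml(). Note that the header is set only on the initial page request, so any scripts or other resources that the page fetches separately would not carry it.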