Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
240 views
in Technique[技术] by (71.8m points)

python - Scraping Javascript driven web pages with PyQt4 - how to access pages that need authentication?

I have to scrape a very, very simple page on our company's intranet in order to automate one of our internal processes (returning a function's output as successful or not).

I found the following example:

import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()

  def _loadFinished(self, result):
    self.frame = self.mainFrame()
    self.app.quit()

url = 'http://sitescraper.net'
r = Render(url)
html = r.frame.toHtml()

From http://blog.sitescraper.net/2010/06/scraping-javascript-webpages-in-python.html and it's almost perfect. I just need to be able to provide authentication to view the page.

I've been looking through the documentation for PyQt4 and I'll admit a lot of it is over my head. If anyone could help, I'd appreciate it.

Edit: Unfortunately gruszczy's method didn't work for me. When I had done something similar through urllib2, I used the following code and it worked...

username = 'user'
password = 'pass'

req = urllib2.Request(url)
base64string = base64.encodestring('%s:%s' % (username, password))[:-1]
authheader = "Basic %s" % base64string
req.add_header("Authorization", authheader)

handle = urllib2.urlopen(req)
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I figured it out. Here's what I ended up with in case it can help someone else.

#!/usr/bin/python
# -*- coding: latin-1 -*-
import sys
import base64
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from PyQt4 import QtNetwork

class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)

    username = 'username'
    password = 'password'

    base64string = base64.encodestring('%s:%s' % (username, password))[:-1]
    authheader = "Basic %s" % base64string

    headerKey = QByteArray("Authorization")
    headerValue = QByteArray(authheader)

    url = QUrl(url)
    req = QtNetwork.QNetworkRequest()
    req.setRawHeader(headerKey, headerValue)
    req.setUrl(url)

    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)


    self.mainFrame().load(req)
    self.app.exec_()

  def _loadFinished(self, result):
    self.frame = self.mainFrame()
    self.app.quit()

def main():
    url = 'http://www.google.com'
    r = Render(url)
    html = r.frame.toHtml()

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...