Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.5k views
in Technique[技术] by (71.8m points)

web scraping dynamic content with python

I'd like to use Python to scrape the contents of the "Were you looking for these authors:" box on web pages like this one: http://academic.research.microsoft.com/Search?query=lander

Unfortunately the contents of the box get loaded dynamically by JavaScript. Usually in this situation I can read the Javascript to figure out what's going on, or I can use an browser extension like Firebug to figure out where the dynamic content is coming from. No such luck this time...the Javascript is pretty convoluted and Firebug doesn't give many clues about how to get at the content.

Are there any tricks that will make this task easy?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Instead of trying to reverse engineer it, you can use ghost.py to directly interact with JavaScript on the page.

If you run the following query in a chrome console, you'll see it returns everything you want.

document.getElementsByClassName('inline-text-org');

Returns

[<div class=?"inline-text-org" title=?"University of Manchester">?University of Manchester?</div>, 
 <div class=?"inline-text-org" title=?"University of California Irvine">?University of California ...?</div>?
  etc...

You can run JavaScript through python in a real life DOM using ghost.py.

This is really cool:

from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://academic.research.microsoft.com/Search?query=lander')
result, resources = ghost.evaluate(
    "document.getElementsByClassName('inline-text-org');")

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...