Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
230 views
in Technique[技术] by (71.8m points)

javascript - Scraping multiple URLs by looping in PhantomJS

I am using PhantomJS to scrape some websites and therefore extract information with r. I am following this tutorial. Everything works fine for a single page, but I couldn't find any simple tutorial on how to automate for multiple pages. My experiments so far:

var countries = [ "Albania" ,"Afghanistan"];
var len = countries.length;
var name1 = ".html";
var add1 = "http://www.kluwerarbitration.com/CommonUI/BITs.aspx?country=";
var country ="";
var name ="";
var add="";


for (i=1; i <= len; i++){

    country = countries[i]
    name = country.concat(name1)
    add = add1.concat(name1)

    var webPage = require('webpage');
    var page = webPage.create();

    var fs = require('fs');
    var path = name

    page.open(add, function (status) {
        var content = page.content;
        fs.write(path,content,'w')
        phantom.exit();
    });

}

I don't seem to get any error when running the code, the script creates a html file only for the second country, which contains all information on the page exception made for the small table I am interested in.

I tried to gather some information from similar questions. However, also because I couldn't find a simple reproducible example, I don't understand what I am doing wrong.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The main problem seems to be that you're exiting too early. You're creating multiple page instances in a loop. Since PhantomJS is asynchronous, the call to page.open() immediately exists and the next for loop iteration is executed.

A for-loop is pretty fast, but web requests are slow. This means that your loop is fully executed before even the first page is loaded. This also means that the first page that is loaded will also exit PhantomJS, because you're calling phantom.exit() in each of those page.open() callbacks. I suspect the second URL is faster for some reason and is therefore always written.

var countFinished = 0, 
    maxFinished = len;
function checkFinish(){
    countFinished++;
    if (countFinished + 1 === maxFinished) {
        phantom.exit();
    }
}

for (i=1; i <= len; i++) {
    country = countries[i]
    name = country.concat(name1)
    add = add1.concat(country)

    var webPage = require('webpage');
    var page = webPage.create();

    var fs = require('fs');
    var path = name

    page.open(add, function (status) {
        var content = page.content;
        fs.write(path, content,'w')
        checkFinish();
    });
}

The problem is that you're creating a lot of page instances without cleaning up. You should close them when you're done with them:

for (i=1; i <= len; i++) {
    (function(i){
        country = countries[i]
        name = country.concat(name1)
        add = add1.concat(country)

        var webPage = require('webpage');
        var page = webPage.create();

        var fs = require('fs');
        var path = name

        page.open(add, function (status) {
            var content = page.content;
            fs.write(path, content,'w');
            page.close();
            checkFinish();
        });
    })(i);
}

Since JavaScript has function-level scope, you would need to use an IIFE to retain a reference to the correct page instance in the page.open() callback. See this question for more information about that: Q: JavaScript closure inside loops – simple practical example

If you don't want to clean up afterwards, then you should use the same page instance over all of those URLs. I already have an answer about doing that here: A: Looping over urls to do the same thing


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...