Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
114 views
in Technique by (71.8m points)

python - can we write a single script to scrape multiple sites?

I have written almost 30 different scraping scripts for 30 different websites. A friend of mine told me it is possible to have a single code file that scrapes all 30 websites and feeds the results into a dashboard for dynamic scraping (I didn't understand what he meant). I know every website has its own structure, and different data comes from different pages and elements. On top of that, some websites serve dynamic rather than static data, for which I used Selenium.

I am really not sure what he was thinking, or whether it is possible to write one single long script file and use it for scraping many websites.

I would appreciate it if anyone with knowledge of this could help me with ideas, tutorials, web content and...

question from:https://stackoverflow.com/questions/65867436/can-we-write-a-script-which-could-be-usable-to-scrape-multiple-sites


1 Reply

0 votes
by (71.8m points)

Yes, you can:

  1. Make a module for each scraper
  2. Make a main app
  3. Import your modules into the main app
  4. Scrape your target websites using multiprocessing or multithreading.

Conceptual code:

# This code will not run as-is!

from multiprocessing import Pool
from Scrapers import Scraper1, Scraper2, Scraper3, ...


def run_each_scraper(scraper_object):
    scraper_object.run()

def launcher():
    list_of_websites = []
    # fill the list with target URLs here
    scraper_objects = [Scraper1(list_of_websites[0]),
                       Scraper2(list_of_websites[1]),
                       ...]
    pool = Pool(20)  # pool size depends on your system resources
    pool.map(run_each_scraper, scraper_objects)

if __name__ == '__main__':
    launcher()
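To make the pattern above concrete, here is a minimal runnable sketch. It assumes each scraper class exposes the same `run()` interface so one launcher can drive them all; the class names, URLs, and return values are hypothetical placeholders, and a thread pool is used since scraping is I/O-bound.

```python
# Minimal runnable sketch: every scraper shares a run() interface,
# so a single launcher can drive all of them concurrently.
# Class names and URLs are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor


class BaseScraper:
    def __init__(self, url):
        self.url = url

    def run(self):
        raise NotImplementedError


class SiteAScraper(BaseScraper):
    def run(self):
        # site-specific request/parsing logic would go here
        return f"scraped {self.url} with SiteAScraper"


class SiteBScraper(BaseScraper):
    def run(self):
        return f"scraped {self.url} with SiteBScraper"


def launch(scrapers):
    # threads suit I/O-bound scraping; swap in processes for CPU-bound parsing
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda s: s.run(), scrapers))


results = launch([SiteAScraper("https://example.com/a"),
                  SiteBScraper("https://example.com/b")])
print(results)
```

The key design choice is the shared base class: the launcher never needs to know which site a scraper handles, so adding site number 31 means adding one subclass, not touching the launcher.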

If you need a more robust technology, I suggest switching to Scrapy spiders; you can also handle dynamic websites with Splash, which integrates with Scrapy. They are made for big/massive web crawler apps, even in production.
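For a sense of what a Scrapy spider looks like, here is a minimal sketch (it assumes Scrapy is installed via `pip install scrapy`; the spider name, target site, and CSS selectors are illustrative, not taken from the question):

```python
# Minimal Scrapy spider sketch; run with:
#   scrapy runspider quotes_spider.py -o quotes.json
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # CSS selectors are per-site; each spider encodes one site's structure
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

Scrapy gives you the scheduling, retries, and concurrency that the multiprocessing sketch above handles by hand; you would still write one spider per site, but run them all from one project.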

