Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others



terminology - crawler vs scraper

Can somebody distinguish between a crawler and a scraper in terms of scope and functionality?



1 Reply


A crawler fetches web pages: given a starting address (or a set of starting addresses) and some conditions (e.g., how many links deep to go, which file types to ignore), it downloads whatever is linked to from the starting point(s), following links until those conditions stop it.
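To make that concrete, here is a minimal sketch of a breadth-first crawler using only the Python standard library. The names `crawl`, `LinkParser`, `fetch_over_http`, `max_depth` and `ignore_suffixes` are made up for illustration, and the `fetch` parameter is an assumption added so the download step can be swapped out. A real crawler would also need politeness delays, robots.txt handling and better error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_over_http(url):
    """Default fetcher: download a page and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_urls, max_depth=2, ignore_suffixes=(".jpg", ".png", ".pdf"),
          fetch=fetch_over_http):
    """Breadth-first crawl; returns {url: page_text} for each page downloaded."""
    pages = {}
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        # Condition 1: skip file types we were told to ignore.
        if url.lower().endswith(tuple(ignore_suffixes)):
            continue
        try:
            pages[url] = fetch(url)
        except (OSError, ValueError):
            continue  # unreachable or malformed URL: skip it
        # Condition 2: do not follow links beyond max_depth.
        if depth >= max_depth:
            continue
        parser = LinkParser()
        parser.feed(pages[url])
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).scheme not in ("http", "https"):
                continue  # ignore mailto:, javascript:, etc.
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

Calling something like `crawl(["http://server/"], max_depth=1)` would download the start page plus every page it links to directly. Note that the crawler only collects pages; it does not interpret their content.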

A scraper takes pages that have already been downloaded (or, more generally, any data that's formatted for display) and attempts to extract data from them, so that the data can, for example, be stored in a database and manipulated as desired.
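As a rough illustration of the scraping side, the sketch below takes HTML that has already been downloaded, pulls out the page title and the link texts, and stores them in an SQLite database. Everything here is hypothetical: the class `TitleAndLinkScraper`, the functions `scrape` and `store`, and the table layout are invented for the example, and which fields are worth extracting depends entirely on the pages in question.

```python
import sqlite3
from html.parser import HTMLParser


class TitleAndLinkScraper(HTMLParser):
    """Pulls the <title> text and every (link text, href) pair out of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []        # list of (text, href) tuples
        self._in_title = False
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def scrape(html):
    """Return (title, [(link_text, href), ...]) extracted from one page."""
    parser = TitleAndLinkScraper()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


def store(connection, url, html):
    """Scrape one page and save the extracted data in two SQLite tables."""
    title, links = scrape(html)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
    connection.execute(
        "CREATE TABLE IF NOT EXISTS links (url TEXT, text TEXT, href TEXT)")
    connection.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, title))
    connection.executemany(
        "INSERT INTO links VALUES (?, ?, ?)",
        [(url, text, href) for text, href in links])
    connection.commit()
```

Once the data is in the database it can be queried like any other table, which is the point of scraping: turning display-oriented markup back into structured data.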

Depending on how you use the result, scraping may well violate the rights of the owner of the information and/or the site's user agreement (crawling violates the latter in some cases as well). Many sites include a file named robots.txt in their root (i.e., at the URL http://server/robots.txt) to specify how (and whether) crawlers should treat that site. In particular, it can list (partial) URLs that a crawler should not attempt to visit. These rules can be specified separately per crawler (user-agent) if desired.
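For illustration, Python's standard library ships `urllib.robotparser`, which can answer whether a given user-agent may visit a given URL. The robots.txt content, the crawler names `SomeCrawler` and `ExampleBot`, and the helper `is_allowed` below are all made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: everyone is kept out of /private/,
# and the (hypothetical) crawler "ExampleBot" is kept out entirely.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("SomeCrawler", "http://server/index.html"),
        ("SomeCrawler", "http://server/private/report.html"),
        ("ExampleBot", "http://server/index.html"),
    ]:
        print(agent, url, is_allowed(EXAMPLE_ROBOTS_TXT, agent, url))
```

Against a live site, the same parser can load the file itself via `set_url("http://server/robots.txt")` followed by `read()`. Keep in mind that robots.txt is only a request to crawlers; honoring it does not by itself settle the legal or user-agreement questions mentioned above.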

