Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share


How can a Python crawler correctly determine whether a page should be crawled?

0 votes

458 views

posted Feb 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

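I'm writing a crawler in Python 2.7 and need to decide whether each page is worth crawling. My first instinct was to check the HTTP status code. After running the crawler, though, I found that many target sites respond to a request for a nonexistent page with a 404 error page whose status code is 200, so I ended up fetching a lot of pages that don't actually exist. This is really a misconfiguration on the site's side, but it means I can't rely on the status code. Is there another way to reliably tell whether a page is a 404 and should be skipped?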


1 Reply

0 votes

replied Feb 17, 2021 by 深蓝 (71.8m points)

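First, a 200 status code only tells you that the HTTP request itself succeeded. It says nothing about whether the body is the page you actually wanted, so checking for 200 alone will not work for every site.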

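Second, since you're writing a crawler, you should look at how these particular sites actually behave. Check a few pages by hand first and look for patterns, for example whether the content returned for a missing page has some distinguishing feature.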
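One way to automate that kind of pattern check is sketched below. This is not from the answer above, just a minimal illustration under some assumptions: request a random path that almost certainly does not exist on the same host, treat the body that comes back as the site's "not found" fingerprint, and then skip any page whose body either contains a known marker string or is nearly identical to that fingerprint. The names `fetch`, `looks_like_soft_404`, `should_crawl`, the marker strings, and the 0.9 threshold are all hypothetical and would need tuning per site. The code targets Python 3; on Python 2.7 the rough equivalents are `urllib2.urlopen` and `urlparse.urljoin`.

```python
import difflib
import uuid
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from urllib.request import urlopen

# Hypothetical marker strings; collect real ones by inspecting each site by hand.
NOT_FOUND_MARKERS = ("page not found", "页面不存在")


def fetch(url, timeout=10):
    """Return (status_code, body_text); status is None if the request failed."""
    try:
        resp = urlopen(url, timeout=timeout)
        try:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.getcode(), body
        finally:
            resp.close()
    except HTTPError as err:
        # A genuine 4xx/5xx response: the status code is trustworthy here.
        return err.code, ""
    except URLError:
        return None, ""


def looks_like_soft_404(body, baseline_body,
                        markers=NOT_FOUND_MARKERS, threshold=0.9):
    """Guess whether `body` is an error page served with status 200.

    `baseline_body` is what the site returned for a deliberately bogus URL.
    """
    lowered = body.lower()
    if any(marker.lower() in lowered for marker in markers):
        return True
    if baseline_body:
        # autojunk=False keeps the ratio meaningful on long HTML documents,
        # at the cost of slower comparison on very large pages.
        matcher = difflib.SequenceMatcher(None, body, baseline_body,
                                          autojunk=False)
        if matcher.ratio() >= threshold:
            return True
    return False


def should_crawl(url):
    """Return True if `url` answers 200 and does not look like a soft 404."""
    status, body = fetch(url)
    if status != 200:
        return False
    # Probe a random path on the same host to learn the site's error page.
    probe_url = urljoin(url, "/" + uuid.uuid4().hex)
    _, baseline = fetch(probe_url)
    return not looks_like_soft_404(body, baseline)
```

In a real crawler you would probably fetch the baseline once per host and cache it rather than probing on every call. Sites that redirect missing pages to the home page would need an extra check on the final URL, which this sketch does not attempt.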

