Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


javascript - Parsing webpages to extract contents

I want to design a crawler in Java that crawls a webpage and extracts certain contents from it. How should I do this? I am new to this and need guidance on how to start designing crawlers.

For example, I want to extract the text "red is my favorite color" from a webpage where it is embedded like this:

<div>red is my favorite color</div>



1 Reply


Suggested readings

Static pages:

Mind you, many pages create their content dynamically with JavaScript after loading. In that case the 'static page' approach won't help, and you will need to look for tools in the "web automation" category.
Selenium is one such toolset. It lets you drive a regular browser to open and navigate pages, and you may even be able to use a 'headless browser' (one with no UI) such as PhantomJS.

Good luck, there's lots of reading and coding ahead of you.

[edited for examples]

This technique is called web scraping; search for that term on Google to find examples. The following turned up in my own searches and are offered as examples only; I offer no warranties or endorsements for them.

For "static webpage scraping" - here's an example using jsoup
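To make the static-page idea concrete, here is a minimal sketch that pulls the text out of the div from the question. It deliberately does not use jsoup: to stay dependency-free it swaps in the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator). The class name DivTextExtractor and the method extractDivText are hypothetical names chosen for illustration, and the HTML is passed in as a string, so fetching the page over the network is left out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of any library.
public class DivTextExtractor {

    // Returns the text inside each outermost <div> of the given HTML.
    // Text of nested divs is folded into their outermost parent.
    public static List<String> extractDivText(String html) throws IOException {
        List<String> results = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0;                       // how many <div>s we are inside
            private final StringBuilder current = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.DIV) {
                    depth++;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    current.append(data);                // collect text only inside a div
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.DIV && depth > 0) {
                    depth--;
                    if (depth == 0) {                    // outermost div just closed
                        results.add(current.toString().trim());
                        current.setLength(0);
                    }
                }
            }
        };

        // Third argument 'true' tells the parser to ignore charset declarations.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return results;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><div>red is my favorite color</div></body></html>";
        for (String text : extractDivText(html)) {
            System.out.println(text);
        }
    }
}
```

The JDK parser is old and only understands fairly simple HTML, so for real pages a dedicated library such as jsoup is likely the more practical choice; the overall shape (parse the document, find the element, read its text) stays the same.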

For "dynamic pages" - here's an example using Selenium

