Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

web crawler - Nutch regex-urlfilter syntax

I am running Nutch 1.6. It crawls specific sites correctly, but I can't get the syntax right in NUTCH_ROOT/conf/regex-urlfilter.txt.

The site I want to crawl has a URL similar to this:

http://www.example.com/foo.cfm

On that page there are numerous links that match the following pattern:

http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976

I want to crawl links that match the second example above as well. In my regex-urlfilter.txt I have the following:

+^http://www.example.com/foo.cfm$
+^http://www.example.com/foo.cfm/(.+)*$

Nutch matches the first rule and crawls that page correctly, but it does not seem to pick up links with the second rule. How can I get Nutch to crawl URLs like the second one above?

I have tried the following with no luck:

+^http://www.example.com/foo.cfm/(.+)*$
+^http://www.example.com/foo.cfm/(.)*$
+^http://www.example.com/foo.cfm/.+$
+^http://www.example.com/foo.cfm/(.*)*$

In my NUTCH_ROOT/urls/nutch I have:

http://www.example.com/foo.cfm/

1 Reply


According to http://wiki.apache.org/nutch/FAQ#What_happens_if_I_inject_urls_several_times.3F you can't have multiple URLs (the extras will be ignored). What about keeping only this rule:

+^http://www.example.com/foo.cfm/(.+)*$

which should also cover your seed URL http://www.example.com/foo.cfm/ (though not http://www.example.com/foo.cfm without the trailing slash, since the rule requires the /). If there are problems with the /, try:

+^http://www.example.com/foo.cfm//?(.+)*$

Here //? matches a single / optionally followed by a second /, that is, either / or //.
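To check rules like these outside Nutch, here is a minimal Python sketch. It is not Nutch's code. It assumes the filter tries rules top to bottom, lets the first matching rule decide, uses partial-match (search) semantics, and rejects URLs that match no rule. The helper names parse_rules and accepts are made up for illustration, and the -[?*!@=] line is only an assumed example of an earlier "skip" rule that might sit above your own rules; look at your actual file before drawing conclusions.

```python
import re

def parse_rules(text):
    """Turn regex-urlfilter-style lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # The leading "+" or "-" is filter syntax, not part of the regex.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Assumed semantics: first matching rule decides; no match means reject."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False

SEED = "http://www.example.com/foo.cfm/"
DEEP = "http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976"

# Only the rule suggested in the answer.
answer_only = parse_rules("+^http://www.example.com/foo.cfm/(.+)*$")

# Same rule, but preceded by a hypothetical skip rule for "query-like" characters.
with_skip = parse_rules("""
# assumed earlier rule, not taken from the post
-[?*!@=]
+^http://www.example.com/foo.cfm/(.+)*$
""")

print(accepts(answer_only, SEED))  # True
print(accepts(answer_only, DEEP))  # True
print(accepts(with_skip, SEED))    # True
print(accepts(with_skip, DEEP))    # False: "ID=6976" contains "=", so the skip rule wins
```

So the pattern itself does match the deep URL. If such links are still skipped, it may be worth checking whether a rule higher up in regex-urlfilter.txt rejects them first.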

