I'm trying to crawl multiple sites using Nutch. My seed.txt looks like this:
http://1.a.b/
http://2.a.b/
and my regex-urlfilter.txt looks like this:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
#+.
+^http://1.a.b/*
+^http://2.a.b/*
I also tried the following in place of the last two rules:
+^http://([a-z0-9]*\.)*a.b/*
The only site crawled is the first one. All other configuration is default.
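To check the filter rules outside Nutch, here is a minimal Python sketch of how I understand regex-urlfilter.txt to be evaluated: rules are tried top to bottom, the first pattern found anywhere in the URL decides (+ accepts, - rejects), and a URL that matches no rule is rejected. The accepts helper and the test URLs are made up for illustration, the patterns are written with the backslash escapes used in the stock file, and Python's re module is only an approximation of the Java regex engine Nutch uses.

```python
import re

# Rules from regex-urlfilter.txt in the question, in file order.
# Assumption: backslash escapes as in the stock Nutch file.
RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
    r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
    r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$",
    r"-[?*!@=]",
    r"-.*(/[^/]+)/[^/]+\1/[^/]+\1/",
    r"+^http://1.a.b/*",
    r"+^http://2.a.b/*",
]


def accepts(url, rules=RULES):
    """Hypothetical helper: first matching rule wins, no match means reject."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # Partial match anywhere in the URL, not a full-string match.
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in ("http://1.a.b/", "http://2.a.b/", "http://2.a.b/logo.png"):
        print(url, accepts(url))
```

If both seed URLs come out as accepted here, the filter rules are probably not what keeps the second site out of the crawl.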
I run the following command:
bin/nutch crawl urls -solr http://localhost:8984/solr/ -dir crawl -depth 10 -topN 10
Any ideas?
Thank you!