Nutch stores all the metadata for each URL in a CrawlDatum object, which is kept in the crawldb at the /crawldb/*/part-*/data
location.
As per the source code of CrawlDatum:
/** Page was not fetched yet. */
db_unfetched --> public static final byte STATUS_DB_UNFETCHED = 0x01;
/** Page was successfully fetched. */
db_fetched --> public static final byte STATUS_DB_FETCHED = 0x02;
/** Page no longer exists. */
db_gone --> public static final byte STATUS_DB_GONE = 0x03;
/** Page temporarily redirects to other page. */
db_redir_temp --> public static final byte STATUS_DB_REDIR_TEMP = 0x04;
/** Page permanently redirects to other page. */
db_redir_perm --> public static final byte STATUS_DB_REDIR_PERM = 0x05;
/** Page was successfully fetched and found not modified. */
db_notmodified --> public static final byte STATUS_DB_NOTMODIFIED = 0x06;
/** Page was marked as being a duplicate of another page */
db_duplicate --> public static final byte STATUS_DB_DUPLICATE = 0x07;
In CrawlDatum, the field
private byte status;
takes one of the values listed above, depending on the state of the URL. (There are many other status flags that I'm not discussing here.)
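To make the byte values easier to read, here is a minimal standalone sketch that maps each db_* value listed above to its name. The class DbStatusNames and the method nameOf are hypothetical names used only for illustration; they are not part of Nutch, and the constants are simply copied from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone sketch: mirrors the db_* constants quoted above.
// DbStatusNames is a hypothetical helper, not a Nutch class.
class DbStatusNames {
    static final byte STATUS_DB_UNFETCHED = 0x01;
    static final byte STATUS_DB_FETCHED = 0x02;
    static final byte STATUS_DB_GONE = 0x03;
    static final byte STATUS_DB_REDIR_TEMP = 0x04;
    static final byte STATUS_DB_REDIR_PERM = 0x05;
    static final byte STATUS_DB_NOTMODIFIED = 0x06;
    static final byte STATUS_DB_DUPLICATE = 0x07;

    private static final Map<Byte, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put(STATUS_DB_UNFETCHED, "db_unfetched");
        NAMES.put(STATUS_DB_FETCHED, "db_fetched");
        NAMES.put(STATUS_DB_GONE, "db_gone");
        NAMES.put(STATUS_DB_REDIR_TEMP, "db_redir_temp");
        NAMES.put(STATUS_DB_REDIR_PERM, "db_redir_perm");
        NAMES.put(STATUS_DB_NOTMODIFIED, "db_notmodified");
        NAMES.put(STATUS_DB_DUPLICATE, "db_duplicate");
    }

    /** Returns the readable name of a db status byte, or "unknown" for any other value. */
    static String nameOf(byte status) {
        return NAMES.getOrDefault(status, "unknown");
    }

    public static void main(String[] args) {
        // Print every known db status as "0xNN -> name".
        for (Map.Entry<Byte, String> e : NAMES.entrySet()) {
            System.out.printf("0x%02x -> %s%n", e.getKey(), e.getValue());
        }
    }
}
```

As far as I know, CrawlDatum itself exposes a similar lookup through its static getStatusName(byte) method, so in real Nutch code you would normally use that instead of a hand-written table.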
When does the status value of a CrawlDatum object change?
There are many flows in which it can take one of the states mentioned above. I will explain a few that I'm familiar with.
- When we inject URLs into Nutch, the crawldb folder is created with a CrawlDatum object for each URL, with its state set to db_unfetched. See the code below from the
InjectReducer.reduce method of the Injector class:
    for (CrawlDatum val : values) {
      if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
        injected.set(val);
        injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
        injectedSet = true;
      } else {
        old.set(val);
        oldSet = true;
      }
    }
Setting this status helps the Generator phase pick only the unfetched URLs.
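The inject loop can be tried out without Nutch using a minimal sketch like the one below. The Status enum and the Entry class are hypothetical stand-ins for CrawlDatum, and keeping the existing record when a URL is both injected and already present in the crawldb is an assumption of this sketch, not something the quoted loop shows.

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch of the inject step. Status and Entry are hypothetical
// stand-ins for Nutch's CrawlDatum, used only for illustration.
class InjectSketch {
    enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

    static final class Entry {
        final String url;
        final Status status;

        Entry(String url, Status status) {
            this.url = url;
            this.status = status;
        }
    }

    /**
     * Mirrors the quoted loop for a single URL: a freshly injected value is
     * turned into DB_UNFETCHED, any other value is treated as the record that
     * already exists in the crawldb.
     * ASSUMPTION: when both are present, the existing record is kept.
     */
    static Entry reduce(String url, List<Status> values) {
        Entry injected = null;
        Entry old = null;
        for (Status val : values) {
            if (val == Status.INJECTED) {
                injected = new Entry(url, Status.DB_UNFETCHED);
            } else {
                old = new Entry(url, val);
            }
        }
        return old != null ? old : injected;
    }

    public static void main(String[] args) {
        Entry fresh = reduce("http://example.com/a", Arrays.asList(Status.INJECTED));
        Entry known = reduce("http://example.com/b",
                Arrays.asList(Status.DB_FETCHED, Status.INJECTED));
        System.out.println(fresh.url + " -> " + fresh.status);
        System.out.println(known.url + " -> " + known.status);
    }
}
```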
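- In the Fetcher phase, if you open the FetcherThread source code, you will see that the CrawlDatum status is changed based on the HTTP status code returned for the URL. (Refer to the list of HTTP status codes for a better understanding.) Note that the fetcher sets STATUS_FETCH_* values, which are turned into the corresponding db_* statuses when the crawldb is updated.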
    case ProtocolStatus.MOVED: // redirect
    case ProtocolStatus.TEMP_MOVED:
      int code;
      boolean temp;
      if (status.getCode() == ProtocolStatus.MOVED) {
        code = CrawlDatum.STATUS_FETCH_REDIR_PERM;
        temp = false;
      } else {
        code = CrawlDatum.STATUS_FETCH_REDIR_TEMP;
        temp = true;
      }
      output(fit.url, fit.datum, content, status, code);
      String newUrl = status.getMessage();
      Text redirUrl = handleRedirect(fit, newUrl, temp,
          Fetcher.PROTOCOL_REDIR);
      if (redirUrl != null) {
        fit = queueRedirect(redirUrl, fit);
      } else {
        // stop redirecting
        redirecting = false;
      }
      break;
    case ProtocolStatus.EXCEPTION:
      logError(fit.url, status.getMessage());
      int killedURLs = ((FetchItemQueues) fetchQueues).checkExceptionThreshold(fit
          .getQueueID());
      if (killedURLs != 0)
        context.getCounter("FetcherStatus",
            "AboveExceptionThresholdInQueue").increment(killedURLs);
      /* FALLTHROUGH */
    case ProtocolStatus.RETRY: // retry
    case ProtocolStatus.BLOCKED:
      output(fit.url, fit.datum, null, status,
          CrawlDatum.STATUS_FETCH_RETRY);
      break;
    case ProtocolStatus.GONE: // gone
    case ProtocolStatus.NOTFOUND:
    case ProtocolStatus.ACCESS_DENIED:
    case ProtocolStatus.ROBOTS_DENIED:
      output(fit.url, fit.datum, null, status,
          CrawlDatum.STATUS_FETCH_GONE);
      break;
    case ProtocolStatus.NOTMODIFIED:
      output(fit.url, fit.datum, null, status,
          CrawlDatum.STATUS_FETCH_NOTMODIFIED);
      break;
    default:
      if (LOG.isWarnEnabled()) {
        LOG.warn("{} {} Unknown ProtocolStatus: {}", getName(),
            Thread.currentThread().getId(), status.getCode());
      }
      output(fit.url, fit.datum, null, status,
          CrawlDatum.STATUS_FETCH_RETRY);
    } // closes the switch on status.getCode(); this excerpt starts inside it

    if (redirecting && redirectCount > maxRedirect) {
      ((FetchItemQueues) fetchQueues).finishFetchItem(fit);
      if (LOG.isInfoEnabled()) {
        LOG.info("{} {} - redirect count exceeded {}", getName(),
            Thread.currentThread().getId(), fit.url);
      }
      output(fit.url, fit.datum, null,
          ProtocolStatus.STATUS_REDIR_EXCEEDED,
          CrawlDatum.STATUS_FETCH_GONE);
    }
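The fetcher excerpt boils down to a mapping from protocol status to fetch status. A minimal standalone sketch of that mapping follows; the Protocol and Fetch enums are hypothetical stand-ins for the ProtocolStatus codes and the CrawlDatum.STATUS_FETCH_* constants, and only the cases visible in the excerpt are covered (the success case is not part of it, so it is left out).

```java
// Standalone sketch: both enums are hypothetical stand-ins for the
// ProtocolStatus codes and CrawlDatum.STATUS_FETCH_* constants in the excerpt.
class FetchStatusSketch {
    enum Protocol {
        MOVED, TEMP_MOVED, EXCEPTION, RETRY, BLOCKED,
        GONE, NOTFOUND, ACCESS_DENIED, ROBOTS_DENIED, NOTMODIFIED, UNKNOWN
    }

    enum Fetch {
        FETCH_REDIR_PERM, FETCH_REDIR_TEMP, FETCH_RETRY, FETCH_GONE, FETCH_NOTMODIFIED
    }

    /** Maps a protocol status to the fetch status the excerpt writes for it. */
    static Fetch toFetchStatus(Protocol code) {
        switch (code) {
            case MOVED:
                return Fetch.FETCH_REDIR_PERM;   // permanent redirect
            case TEMP_MOVED:
                return Fetch.FETCH_REDIR_TEMP;   // temporary redirect
            case EXCEPTION:                      // falls through to retry in the excerpt
            case RETRY:
            case BLOCKED:
                return Fetch.FETCH_RETRY;
            case GONE:
            case NOTFOUND:
            case ACCESS_DENIED:
            case ROBOTS_DENIED:
                return Fetch.FETCH_GONE;
            case NOTMODIFIED:
                return Fetch.FETCH_NOTMODIFIED;
            default:
                return Fetch.FETCH_RETRY;        // unknown codes are retried
        }
    }

    public static void main(String[] args) {
        // Print the full mapping, one protocol status per line.
        for (Protocol p : Protocol.values()) {
            System.out.println(p + " -> " + toFetchStatus(p));
        }
    }
}
```

The sketch leaves out the redirect-limit check at the end of the excerpt, which writes STATUS_FETCH_GONE once redirectCount exceeds maxRedirect.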
- In the deduplication phase, if a URL is found to be a duplicate based on its MD5 hash, its status is marked as STATUS_DB_DUPLICATE, and the Generator will not pick it in the next iteration.
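The deduplication idea can be illustrated with a small standalone sketch: hash each page's content with MD5 and mark every URL after the first one with the same hash as a duplicate. The class DedupSketch and its methods are hypothetical, and the "first URL seen wins" rule is an assumption made to keep the example short; Nutch's own deduplication job applies its own rules for choosing which copy to keep.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Standalone sketch of hash-based deduplication; not Nutch's actual job.
class DedupSketch {
    /** Returns the MD5 digest of the content as a lowercase hex string. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Takes url -> page content and returns url -> status.
     * ASSUMPTION: the first URL seen for a given hash stays db_fetched,
     * every later URL with the same hash becomes db_duplicate.
     */
    static Map<String, String> markDuplicates(Map<String, String> pages) {
        Set<String> seenHashes = new HashSet<>();
        Map<String, String> status = new LinkedHashMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            boolean firstWithThisHash = seenHashes.add(md5Hex(page.getValue()));
            status.put(page.getKey(), firstWithThisHash ? "db_fetched" : "db_duplicate");
        }
        return status;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.com/a", "same body");
        pages.put("http://example.com/a?ref=1", "same body");
        pages.put("http://example.com/b", "different body");
        markDuplicates(pages).forEach((url, st) -> System.out.println(url + " -> " + st));
    }
}
```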