Apache Nutch is an open-source, scalable web crawler written in Java that relies on Lucene/Solr for the indexing and search part. It has a highly modular architecture that lets developers create plug-ins for media-type parsing, data retrieval, querying, and clustering.
Motivation
- highly scalable and relatively feature-rich crawler
- politeness: it obeys robots.txt rules
- robustness and scalability: you can run Nutch on a cluster of 100 machines
- quality: you can bias the crawl to fetch "important" pages first
Basics about Nutch
First, you need to know that Nutch data is composed of:
- The crawl database, or crawldb. This contains information about every url known to Nutch, including whether it was fetched, and, if so, when.
- The link database, or linkdb. This contains the list of known links to each url, including both the source url and anchor text of the link.
- A set of segments. Each segment is a set of urls that are fetched as a unit. Segments are directories with the following subdirectories:
- crawl_generate names a set of urls to be fetched
- crawl_fetch contains the status of fetching each url
- content contains the raw content retrieved from each url
- parse_text contains the parsed text of each url
- parse_data contains outlinks and metadata parsed from each url
- crawl_parse contains the outlink urls, used to update the crawldb
Nutch and Hadoop
As of the official Nutch 1.3 release, the source code architecture has been greatly simplified to allow Nutch to run in one of two modes: local and deploy. By default, Nutch no longer ships with a Hadoop distribution; when run in local mode, i.e. as a single process on one machine, it uses Hadoop only as a library dependency. This may suit you fine if you have a small site to crawl and index, but most people choose Nutch for its ability to run in deploy mode, within a Hadoop cluster. This gives you the benefits of a distributed file system (HDFS) and the MapReduce processing model. If you are interested in deploy mode, see the Nutch tutorial linked in the references below.
Getting your hands dirty with Nutch
Setup Nutch from binary distribution
- Unzip your binary Nutch package to $HOME/nutch-1.3
- cd $HOME/nutch-1.3/runtime/local
- From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current directory.
Verify your Nutch installation
- run "bin/nutch"
- You can confirm a correct installation if you see the following: Usage: nutch [-core] COMMAND
Run your first crawl
- $ cd ${NUTCH_RUNTIME_HOME}
- $ echo "http://en.wikipedia.org/wiki/Collective_intelligence" > urls
- add `+^http://([a-z0-9]*\.)*wikipedia.org/` to conf/regex-urlfilter.txt (and comment out the catch-all `+.` rule at the bottom of that file, so the crawl stays within wikipedia.org)
- $ bin/nutch crawl urls -dir crawl-wiki-ci -depth 2 (here -depth 2 means two fetch rounds starting from the seed list)
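The URL filter line is an ordinary regular expression with a leading `+` (accept) marker. As a quick sanity check you can try the pattern itself with grep; this only illustrates the regex, it is not how Nutch evaluates filters internally:

```shell
# Test the filter pattern (without Nutch's leading '+' accept marker) against
# a URL we want crawled and one we want skipped.
pattern='^http://([a-z0-9]*\.)*wikipedia.org/'
echo "http://en.wikipedia.org/wiki/Collective_intelligence" | grep -E "$pattern" >/dev/null \
  && echo "wikipedia URL: accepted"
echo "http://example.com/page" | grep -E "$pattern" >/dev/null \
  || echo "example.com URL: rejected"
```

If the first URL is rejected or the second accepted, fix the pattern before starting a long crawl.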
Statistics associated with the crawldb

$ nutch readdb crawl-wiki-ci/crawldb/ -stats

CrawlDb statistics start: crawl-wiki-ci/crawldb/
Statistics for CrawlDb: crawl-wiki-ci/crawldb/
TOTAL urls: 2727
retry 0: 2727
min score: 0.0
avg score: 8.107811E-4
max score: 1.341
status 1 (db_unfetched): 2665
status 2 (db_fetched): 61
status 3 (db_gone): 1
CrawlDb statistics: done
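If you want to pull individual numbers out of the -stats report, for example to track progress between crawl rounds, a little awk over the captured output does the job. This is just a sketch: the here-doc below recreates the figures from the report above, whereas in practice you would redirect the `readdb -stats` output to a file first (the name `stats.txt` is only an example):

```shell
# Recreate the relevant lines of the -stats report shown above.
cat > stats.txt <<'EOF'
TOTAL urls:	2727
status 1 (db_unfetched):	2665
status 2 (db_fetched):	61
EOF

# Extract the total and the fetched count, then print the fetched percentage.
total=$(awk -F':' '/TOTAL urls/ {gsub(/[^0-9]/,"",$2); print $2}' stats.txt)
fetched=$(awk -F':' '/db_fetched/ {gsub(/[^0-9]/,"",$NF); print $NF}' stats.txt)
awk -v t="$total" -v f="$fetched" \
  'BEGIN {printf "fetched %d of %d urls (%.1f%%)\n", f, t, 100*f/t}'
```

With the numbers above this reports that only about 2% of the known URLs were fetched, which is expected after a depth-2 crawl: most URLs were merely discovered as outlinks.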
Dump of the URLs from the crawldb

$ nutch readdb crawl-wiki-ci/crawldb/ -dump crawl-wiki-ci/stats

http://en.wikipedia.org/wiki/Special:RecentChangesLinked/MIT_Center_for_Collective_Intelligence
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Feb 04 00:50:50 EST 2012
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.9607843E-4
Signature: null
Metadata:
….
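The dump is written as plain text files under the output directory, one record per URL, so ordinary Unix tools work on it. For instance, a per-status tally could look like the sketch below; the printf lines just stand in for piping in the actual dump files, and the sample counts are made up for illustration:

```shell
# Count dump records by status. The three sample lines mimic the "Status:"
# lines of real dump records; feed in the real dump output instead.
printf '%s\n' \
  "Status: 1 (db_unfetched)" \
  "Status: 2 (db_fetched)" \
  "Status: 1 (db_unfetched)" \
  | sort | uniq -c | sort -rn
```

On a real dump this gives the same breakdown as `readdb -stats`, but the dump additionally lets you grep out the concrete URLs behind each status.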
Top 10 highest-scored links

$ nutch readdb crawl-wiki-ci/crawldb/ -topN 10 crawl-wiki-ci/stats/top10/

1.3416613 http://en.wikipedia.org/wiki/Collective_intelligence
0.030499997 http://en.wikipedia.org/wiki/Howard_Bloom
0.02763889 http://en.wikipedia.org/wiki/Groupthink
0.02591739 http://en.wikipedia.org/wiki/Wikipedia
0.024347823 http://en.wikipedia.org/wiki/Pierre_L%C3%A9vy_(philosopher)
0.023733648 http://en.wikipedia.org/wiki/Wikipedia:Citation_needed
0.017142152 http://en.wikipedia.org/w/opensearch_desc.php
0.016599996 http://en.wikipedia.org/wiki/Artificial_intelligence
0.016499996 http://en.wikipedia.org/wiki/Consensus_decision_making
0.015199998 http://en.wikipedia.org/wiki/Group_selection
Dump of a Nutch segment

$ nutch readseg -dump crawl-wiki-ci/segments/20120204004509/ crawl-wiki-ci/stats/segments

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Feb 04 00:45:03 EST 2012
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1328334307529

Content::
Version: -1
url: http://en.wikipedia.org/wiki/Collective_intelligence
base: http://en.wikipedia.org/wiki/Collective_intelligence
contentType: application/xhtml+xml
metadata: Content-Language=en Age=52614 Content-Length=29341 Last-Modified=Sat, 28 Jan 2012 17:27:22 GMT _fst_=33 nutch.segment.name=20120204004509 Connection=close X-Cache-Lookup=MISS from sq72.wikimedia.org:80 Server=Apache X-Cache=MISS from sq72.wikimedia.org X-Content-Type-Options=nosniff Cache-Control=private, s-maxage=0, max-age=0, must-revalidate Vary=Accept-Encoding,Cookie Date=Fri, 03 Feb 2012 15:08:18 GMT Content-Encoding=gzip nutch.crawl.score=1.0 Content-Type=text/html; charset=UTF-8
Content:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" dir="ltr" class="client-nojs" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Collective intelligence - Wikipedia, the free encyclopedia</title>
<meta ….
References:
- http://wiki.apache.org/nutch/NutchTutorial
- http://en.wikipedia.org/wiki/Nutch
Now, your turn!
Thanks for reading this far. Here are some things you can do next:
- Found a typo? Edit this post.
- Got questions? Comment below.
- Was it useful? Show your support and share it.