Tuesday, June 7, 2016

Building a search engine using Nutch and Solr


Requirements

  • Ubuntu Server 14.04
  • Apache Solr (distro package) 3.6.2+dfsg-2
  • Apache Nutch 1.11

Installation
Follow the steps in the section "Installing Solr using apt-get" of [1] to install Solr. Download and extract the binary package of Nutch. Then, in [2], follow the sections "Verify your Nutch installation" and "Create a URL seed list". The configuration below indexes PDFs only.
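As a rough sketch of the seed list step from [2], the helper below writes one URL per line to a seed.txt file. The function name make_seed_list, the urls directory under $NUTCH_HOME, and the example.com URL are placeholders chosen for illustration, not values from the original setup.

```shell
#!/bin/sh
# make_seed_list DIR URL...: write one URL per line to DIR/seed.txt.
# (Hypothetical helper; the Nutch tutorial does the same thing by hand.)
make_seed_list() {
    dir="$1"; shift
    mkdir -p "$dir" || return 1
    printf '%s\n' "$@" > "$dir/seed.txt"
}

# Placeholder URL: replace it with the site that hosts your PDFs.
make_seed_list "${NUTCH_HOME:-.}/urls" "http://example.com/docs/"
```

Each later crawl round starts from the URLs listed in this file, so it should point at pages that link to the PDFs you want indexed.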

After the installation, copy $NUTCH_HOME/conf/schema.xml to /etc/solr/conf/schema.xml, then restart Tomcat:

$ sudo service tomcat6 restart
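The schema copy above overwrites the schema shipped with the Solr package, so it may be worth keeping a backup first. The following is a minimal sketch of that idea; install_schema is a hypothetical helper name, and the .bak suffix is an arbitrary choice.

```shell
#!/bin/sh
# install_schema SRC DEST: back up DEST (if it exists) and copy SRC over it.
# (Hypothetical helper, not part of Nutch or Solr.)
install_schema() {
    src="$1"; dest="$2"
    # Refuse to continue if the Nutch schema is not where we expect it.
    [ -f "$src" ] || { echo "missing: $src" >&2; return 1; }
    # Keep the previous schema next to the new one as DEST.bak.
    if [ -f "$dest" ]; then
        cp -p "$dest" "$dest.bak"
    fi
    cp "$src" "$dest"
}
```

Run from a root shell, the call would be install_schema "$NUTCH_HOME/conf/schema.xml" /etc/solr/conf/schema.xml, followed by the Tomcat restart shown above.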

Download the nutch-site.xml below and replace the one in $NUTCH_HOME/conf with it.

nutch-site.xml


The script below recrawls the URLs. Make sure to change the SOLR_URL variable.
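A minimal sketch of such a recrawl script is given here, assuming the bin/crawl invocation described in [2] (-i to index, -D solr.server.url=... to point at Solr). This is not the original script: the default paths, the round count, and the Solr URL http://localhost:8080/solr (a guess for a Tomcat-hosted Solr package) are all assumptions to adjust.

```shell
#!/bin/sh
# Hypothetical recrawl sketch; every default below is an assumption.
NUTCH_HOME="${NUTCH_HOME:-/opt/apache-nutch-1.11}"   # assumed install path
SOLR_URL="${SOLR_URL:-http://localhost:8080/solr}"    # change to your Solr URL
SEED_DIR="${SEED_DIR:-$NUTCH_HOME/urls}"              # directory holding seed.txt
CRAWL_DIR="${CRAWL_DIR:-$NUTCH_HOME/crawl}"           # crawl data is kept here
ROUNDS="${ROUNDS:-2}"                                 # generate/fetch/update rounds

# Print the bin/crawl command line that the script would run.
recrawl_cmd() {
    printf '%s -i -D solr.server.url=%s %s %s %s\n' \
        "$NUTCH_HOME/bin/crawl" "$SOLR_URL" "$SEED_DIR" "$CRAWL_DIR" "$ROUNDS"
}

# Run the crawl only when Nutch is present; otherwise just show the command.
if [ -x "$NUTCH_HOME/bin/crawl" ]; then
    "$NUTCH_HOME/bin/crawl" -i -D solr.server.url="$SOLR_URL" \
        "$SEED_DIR" "$CRAWL_DIR" "$ROUNDS"
else
    echo "Nutch not found; would run: $(recrawl_cmd)"
fi
```

Because the crawl directory is reused between runs, each run continues from the existing crawl data instead of starting over, so the script can be scheduled with cron to recrawl periodically.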

References
  1. https://www.digitalocean.com/community/tutorials/how-to-install-solr-on-ubuntu-14-04
  2. https://wiki.apache.org/nutch/NutchTutorial

