Thursday, 5 July 2012

WEB SEARCH MECHANISM USING SPHINX SEARCH ENGINE


General steps to achieve this:

1. Web crawling (spider)
2. Indexing
3. Processing
4. Calculating relevancy
5. Retrieving

* Sphinx handles indexing, processing, and relevancy calculation itself (steps 2, 3 & 4).


* Sphinx does not help with web crawling at all. We would just need a
spider that stores the output of the crawl somewhere Sphinx can read it.

* We can use an existing crawl such as http://commoncrawl.org or build our
own crawler.



Example:

1. Crawl the whole website and store wget's log output in a file:

wget -r "http://www.promedik.com" -o log


2. Build a map between each document and its original URL. wget doesn't generate such a map, so we extract one from the log:

grep "oldstr" log | awk '{print $2}' | sed 's/oldstr/newstr/g' > map

for example:
http://www.promedik.com/index.php => /home/pavan.joshi/www.promedik.com/index.php
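The grep/awk/sed pipeline above depends on the exact format of the wget log. As an alternative, the map can be built directly from the directory tree that wget -r leaves behind, since wget names the top-level directory after the host. A rough sketch (the sample file and site are only stand-ins for a real crawl):

```shell
# Simulate the result of a crawl so this sketch is self-contained;
# in practice these files come from "wget -r http://www.promedik.com".
mkdir -p www.promedik.com
echo '<html>demo page</html>' > www.promedik.com/index.php

# Build a "url => local path" map from wget's mirror layout,
# where the relative path mirrors the original URL.
find www.promedik.com -type f | while read -r p; do
    printf 'http://%s => %s/%s\n' "$p" "$PWD" "$p"
done > map

cat map
```

This avoids parsing the log at all, at the cost of assuming the default wget mirror layout.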

3. Indexing:
Insert the map data into a database table, e.g.:

   id | url                      | path
   ---+--------------------------+-------------------------------------
   1  | http://www.promedik.com/ | /home/pavan.joshi/www.promedik.com/
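A possible schema for that table, written out to a SQL file. The table name (data) and columns (id, url, path) match the sql_query used in sphinx.conf; the column types and lengths are assumptions. Note that for Sphinx's sql_file_field to index file contents, path should point at the actual crawled file rather than just a directory:

```shell
# Write the schema and a sample row to a file; load it later with
# something like: mysql -u root databaseName < schema.sql
cat > schema.sql <<'SQL'
CREATE TABLE data (
    id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url  VARCHAR(255) NOT NULL,
    path VARCHAR(255) NOT NULL   -- on-disk file that Sphinx will read
);

INSERT INTO data (url, path) VALUES
    ('http://www.promedik.com/',
     '/home/pavan.joshi/www.promedik.com/index.php');
SQL

cat schema.sql
```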


4. Configure sphinx.conf:

source se
{
    type           = mysql
    sql_host       = localhost
    sql_user       = root
    sql_pass       =
    sql_db         = databaseName
    sql_query      = SELECT id, url, path FROM data
    # treat the path column as a filename and index that file's contents
    sql_file_field = path
    sql_query_info = SELECT * FROM data WHERE id=$id
}

index se
{
    path       = idx
    source     = se
    html_strip = 1
}
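If the index will later be queried from application code (e.g. via the SphinxAPI or SphinxQL) rather than the command-line search tool, sphinx.conf also needs a searchd section. A minimal sketch; the ports and file names below are conventional defaults, not taken from the original setup:

```
searchd
{
    listen   = 9312           # native SphinxAPI protocol
    log      = searchd.log
    pid_file = searchd.pid
}
```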

5. Run the indexer to create a full-text index from your data:

$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/indexer --all

6. Search:
$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/search promedik





                                                                                       -PAVANKUMAR JOSHI



