General steps to build a search engine:
1. Web crawling/spidering
2. Indexing
3. Processing
4. Calculating relevancy
5. Retrieving
* Sphinx does indexing, processing, and calculating relevancy itself (steps 2, 3 & 4).
* Sphinx does not help with web crawling at all; we just need a spider that stores
the output of the crawl somewhere Sphinx can read it.
* We can use an existing crawl such as http://commoncrawl.org or build our own
crawler.
Example:
1. Crawl the whole website and store the crawl log in a file:
$ wget -r "http://www.promedik.com" -o log
2. Build a map between each saved document and its original URL; wget does not
generate such a map by itself. With oldstr/newstr as placeholder patterns:
$ grep "oldstr" log | awk '{print $2}' | sed 's/oldstr/newstr/g' > map
Map entries look like:
http://www.promedik.com/index.php => /home/pavan.joshi/www.promedik.com/index.php
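As one concrete instance of the placeholder pipeline above, the fetched URLs can be pulled out of the log and the local paths derived by prefix substitution. This sketch assumes wget -r saved the mirror under /home/pavan.joshi/ and that log lines begin with "--timestamp--  URL", wget's usual log format:
$ grep '^--' log | awk '{print $3}' | sort -u > urls
$ sed 's#^http://#/home/pavan.joshi/#' urls > paths
$ paste urls paths > map   # tab-separated: url <TAB> local path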
3. Indexing: insert the map data into a database table, e.g.:
id | url                      | path
1  | http://www.promedik.com/ | /home/pavan.joshi/www.promedik.com/
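A minimal MySQL sketch of that table, assuming the tab-separated map file from step 2; the table and column names (data, id, url, path) match the sphinx.conf below:
CREATE TABLE data (
    id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url  VARCHAR(255) NOT NULL,
    path VARCHAR(255) NOT NULL
);
-- bulk-load the map produced in step 2
LOAD DATA LOCAL INFILE 'map' INTO TABLE data
    FIELDS TERMINATED BY '\t' (url, path);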
4. Configure sphinx.conf:
source se
{
    type           = mysql
    sql_host       = localhost
    sql_user       = root
    sql_pass       =
    sql_db         = databaseName
    sql_query      = SELECT id, url, path FROM data
    # sql_file_field makes Sphinx index the contents of the file named in `path`
    sql_file_field = path
    # used by the search CLI to show the matching row for each document id
    sql_query_info = SELECT * FROM data WHERE id=$id
}
index se
{
    path       = idx
    # must reference the source block defined above (se), not the database name
    source     = se
    html_strip = 1
}
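If queries will eventually come from an application instead of the command-line search tool used in step 6, sphinx.conf also needs a searchd section. A minimal sketch; the port and file locations are illustrative:
searchd
{
    listen   = 9312
    log      = searchd.log
    pid_file = searchd.pid
}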
5. Run the indexer to create a full-text index from your data:
$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/indexer --all
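After each fresh crawl the same command rebuilds the index; if searchd is already running and serving queries, add indexer's --rotate flag so the new index is swapped in without downtime:
$ /usr/local/sphinx/bin/indexer --all --rotate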
6. Search:
$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/search promedik
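The search utility can also be pointed at a specific index with -i; "se" is the index defined in sphinx.conf above, and sql_query_info is what lets search print the matching database row for each hit:
$ /usr/local/sphinx/bin/search -i se promedik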
-PAVANKUMAR JOSHI