General steps to build a search engine:
1. Web crawling/spidering
2. Indexing
3. Processing
4. Calculating relevancy
5. Retrieving
* Sphinx does indexing, processing, and calculating relevancy itself (steps 2, 3 & 4).
* Sphinx does not help with web crawling at all; we just need a spider that stores
the output of the crawl somewhere Sphinx can read it.
* We can use an existing crawl such as http://commoncrawl.org or build our own
crawler.
Example:
1. Crawl the whole website and store the crawl log in a file:
$ wget -r "http://www.promedik.com" -o log
2. Build a map between each saved document and its original URL; wget does not
generate such a map by itself. With oldstr/newstr as placeholder patterns:
$ grep "oldstr" log | awk '{print $2}' | sed 's/oldstr/newstr/g' > map
Map entries look like:
http://www.promedik.com/index.php => /home/pavan.joshi/www.promedik.com/index.php
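As one concrete instance of the placeholder pipeline above, the fetched URLs can be pulled out of the log and the local paths derived by prefix substitution. This sketch assumes wget -r saved the mirror under /home/pavan.joshi/ and that log lines begin with "--timestamp--  URL", wget's usual log format:
$ grep '^--' log | awk '{print $3}' | sort -u > urls
$ sed 's#^http://#/home/pavan.joshi/#' urls > paths
$ paste urls paths > map   # tab-separated: url <TAB> local path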
3. Indexing: insert the map data into a database table, e.g.:
id | url                      | path
1  | http://www.promedik.com/ | /home/pavan.joshi/www.promedik.com/
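A minimal MySQL sketch of that table, assuming the tab-separated map file from step 2; the table and column names (data, id, url, path) match the sphinx.conf below:
CREATE TABLE data (
    id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url  VARCHAR(255) NOT NULL,
    path VARCHAR(255) NOT NULL
);
-- bulk-load the map produced in step 2
LOAD DATA LOCAL INFILE 'map' INTO TABLE data
    FIELDS TERMINATED BY '\t' (url, path);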
4. Configure sphinx.conf:
source se
{
    type           = mysql
    sql_host       = localhost
    sql_user       = root
    sql_pass       =
    sql_db         = databaseName
    sql_query      = SELECT id, url, path FROM data
    # sql_file_field makes Sphinx index the contents of the file named in `path`
    sql_file_field = path
    # used by the search CLI to show the matching row for each document id
    sql_query_info = SELECT * FROM data WHERE id=$id
}
index se
{
    path       = idx
    # must reference the source block defined above (se), not the database name
    source     = se
    html_strip = 1
}
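If queries will eventually come from an application instead of the command-line search tool used in step 6, sphinx.conf also needs a searchd section. A minimal sketch; the port and file locations are illustrative:
searchd
{
    listen   = 9312
    log      = searchd.log
    pid_file = searchd.pid
}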
5. Run the indexer to create a full-text index from your data:
$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/indexer --all
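After each fresh crawl the same command rebuilds the index; if searchd is already running and serving queries, add indexer's --rotate flag so the new index is swapped in without downtime:
$ /usr/local/sphinx/bin/indexer --all --rotate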
6. Search:
$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/search promedik
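The search utility can also be pointed at a specific index with -i; "se" is the index defined in sphinx.conf above, and sql_query_info is what lets search print the matching database row for each hit:
$ /usr/local/sphinx/bin/search -i se promedik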
-PAVANKUMAR JOSHI