Thursday, 5 July 2012

DOCUMENT SEARCH USING SPHINX SEARCH ENGINE



   Documents supported by Sphinx: Text/HTML/XML.
  1. Document search (Text/PPT/XML/PDF/XLS/DOC/Images).
  2. XML search using the Sphinx data sources xmlpipe/xmlpipe2.

     Text/DOC/XML/PPT/XLS/PDF/Images file search example

For Text/DOC/XML: no other tools are needed. Sphinx supports these directly.

PPT/PDF/XLS: Sphinx does not support these directly. They must first be converted
with third-party conversion tools:
            1. PDF to text/HTML/XML.
            2. PPT to text/HTML/XML.
            3. XLS to text/HTML/XML.

Required Tools:

For PDF: pdftohtml
Usage (Linux):
$ pdftohtml /path/to/PDFfile.pdf
Usage (Windows):
"C:\path\to\pdftohtml.exe" "C:\path\to\PDFfile.pdf"


For PPT: ppthtml
Usage (Linux):
$ ppthtml /path/to/PPTfile.ppt
Usage (Windows):
"C:\path\to\ppthtml.exe" "C:\path\to\PPTfile.ppt"


For XLS: xlhtml
Usage (Linux):
$ xlhtml /path/to/XLSfile.xls
Usage (Windows):
"C:\path\to\xlhtml.exe" "C:\path\to\XLSfile.xls"


Here is a complete example:

  1. Create the table sphinx_data:
    CREATE TABLE IF NOT EXISTS `sphinx_data` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `file_name` varchar(50) NOT NULL,
    `path` varchar(100) NOT NULL,
    `convertion_path` varchar(100) NOT NULL,
    PRIMARY KEY (`id`)
    );
file_name : document name.
path : path where the original document is saved.
convertion_path : path where the converted document (PPT/PDF/XLS only) is saved.

2. Format conversion.
Text/DOC/XML: no conversion is needed.

PDF: convert to HTML using the pdftohtml tool.

PPT: convert to HTML using the ppthtml tool.

XLS: convert to HTML using the xlhtml tool.
The paths of the converted files are stored in the convertion_path column.
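
The conversion step can be sketched in Python as below. This is only an illustration under a few assumptions: the three tools listed above are installed and on the PATH, pdftohtml accepts the output file name as a second argument, and ppthtml and xlhtml print their HTML to standard output. The function names conversion_command and convert are hypothetical helpers, not part of Sphinx or of the tools.

```python
import subprocess
from pathlib import Path

# Converter per file extension; the tool names come from the list above.
CONVERTERS = {".pdf": "pdftohtml", ".ppt": "ppthtml", ".xls": "xlhtml"}


def conversion_command(source):
    """Return the converter command for `source`, or None if no conversion is needed."""
    tool = CONVERTERS.get(Path(source).suffix.lower())
    if tool is None:
        return None
    return [tool, str(source)]


def convert(source, target):
    """Convert `source` to HTML at `target`; return the value for convertion_path."""
    command = conversion_command(source)
    if command is None:
        # Text/DOC/XML: the original file is indexed as-is.
        return str(source)
    if command[0] == "pdftohtml":
        # Assumption: pdftohtml writes its output itself, named by the second argument.
        subprocess.run(command + [str(target)], check=True)
    else:
        # Assumption: ppthtml and xlhtml print HTML to standard output.
        with open(target, "wb") as out:
            subprocess.run(command, stdout=out, check=True)
    return str(target)
```

For a text file, convert returns the original path unchanged, which matches the rule that Text/DOC/XML files need no conversion.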


3. Insert the document data into the sphinx_data table, for example:

id   file_name   path (file location)       convertion_path
1    test.txt    /<file_path>/test.txt      /<file_path>/test.txt
2    test.doc    /<file_path>/test.doc      /<file_path>/test.doc
3    test.xml    /<file_path>/test.xml      /<file_path>/test.xml
4    test.ppt    /<file_path>/test.ppt      /<file_path>/test.html
5    test.xls    /<file_path>/test.xls      /<file_path>/test.html
6    test.pdf    /<file_path>/test.pdf      /<file_path>/test.html
7    images      /<file_path>/test.image    /<file_path>/test.images
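
Rows like the ones above could be inserted with a small script. The following is a minimal sketch, not a definitive implementation: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, it assumes the table is named sphinx_data as in sphinx.conf, and build_row and insert_documents are hypothetical helper names. It also appends ".html" to the full original file name, which is a deviation from the table above, so that test.ppt, test.xls and test.pdf do not all map to the same test.html.

```python
import sqlite3
from pathlib import Path

# SQLite version of the table from step 1 (stand-in for the MySQL table).
SCHEMA = """
CREATE TABLE IF NOT EXISTS sphinx_data (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_name VARCHAR(50) NOT NULL,
    path VARCHAR(100) NOT NULL,
    convertion_path VARCHAR(100) NOT NULL
)
"""

NEEDS_CONVERSION = {".pdf", ".ppt", ".xls"}


def build_row(path, converted_dir):
    """Return (file_name, path, convertion_path) for one document."""
    source = Path(path)
    if source.suffix.lower() in NEEDS_CONVERSION:
        # Keep the original extension in the name to avoid collisions.
        converted = Path(converted_dir) / (source.name + ".html")
    else:
        # Text/DOC/XML: convertion_path equals the original path.
        converted = source
    return (source.name, source.as_posix(), converted.as_posix())


def insert_documents(conn, paths, converted_dir):
    """Insert one row per document and return the number of rows inserted."""
    rows = [build_row(p, converted_dir) for p in paths]
    # sqlite3 uses "?" placeholders; MySQL drivers typically use "%s".
    conn.executemany(
        "INSERT INTO sphinx_data (file_name, path, convertion_path) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)
```

A typical call would be insert_documents(conn, ["/docs/test.txt", "/docs/test.pdf"], "/converted") after running conn.execute(SCHEMA) on a fresh connection.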


For searching, use the convertion_path column.
For viewing or downloading documents, use the path column (the original file path).

4. Configure sphinx.conf:
source se
{
type = mysql
# MySQL socket path, for example /etc/var/mysql/mysql.sock
sql_sock = <sql socket path>
sql_host = localhost
# MySQL user name and password, as defined in your database
sql_user = username
sql_pass = password
sql_db = databaseName
# convertion_path must be selected so that sql_file_field can use it
sql_query = select id, file_name, path, convertion_path from sphinx_data
# search column: the file this column points to is loaded and indexed
sql_file_field = convertion_path
sql_query_info = select * from sphinx_data where id=$id
}
index se
{
path = idx
# must match the name of the source block above
source = se
html_strip = 1
}


5. Run the indexer to create a full-text index from your data:

$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/indexer --all

6. Search for a keyword (here, promedik):
$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/search promedik

7. The search returns the IDs of the matching documents.


8. Display the search results using any language (PHP, Java, Python, Perl).
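
As an illustration of the last two steps, the document IDs returned by Sphinx can be turned back into file names and original paths with one lookup. This is a sketch under assumptions: sqlite3 again stands in for MySQL, the table is assumed to be named sphinx_data, and fetch_documents is a hypothetical helper. How the IDs are obtained from Sphinx (command line, API client, and so on) is left out.

```python
import sqlite3


def fetch_documents(conn, ids):
    """Return (id, file_name, path) rows for `ids`, in the order the IDs were given."""
    if not ids:
        return []
    # One placeholder per ID; MySQL drivers typically use "%s" instead of "?".
    placeholders = ", ".join("?" for _ in ids)
    cursor = conn.execute(
        f"SELECT id, file_name, path FROM sphinx_data WHERE id IN ({placeholders})",
        list(ids),
    )
    by_id = {row[0]: row for row in cursor}
    # Keep the ranking order of the search results and skip IDs that are
    # no longer present in the table.
    return [by_id[i] for i in ids if i in by_id]
```

The path column is returned rather than convertion_path because, as noted earlier, the original file is what the user views or downloads.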



                                                                                    -PAVANKUMAR JOSHI
