RiSearch Pro v.3.2 Manual © S. Tarasov
You may use external programs to parse file content. This feature makes it possible to index other file formats: PDF, DOC, PS, and some types of archives.
To do this, list in the config file the extensions that should be parsed using external programs.
If a parser produces HTML (not plain text), you also need to add these extensions to a second parameter.
In this case the script will strip the HTML tags from the parser output.
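To illustrate what stripping HTML tags from parser output involves, here is a minimal sketch in Python. It is not RiSearch's own code: the class and function names are hypothetical, and it assumes that every tag may be treated as a word boundary.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and replaces every tag with a space."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(" ")

    def handle_endtag(self, tag):
        self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    # Hypothetical helper: returns the visible text of an HTML string.
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by removed tags.
    return " ".join("".join(parser.parts).split())
```

Treating each tag as a space keeps the words of adjacent elements from running together, at the cost of splitting a word that has markup in the middle of it.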
In the following lines, specify for each extension the program and the command line parameters that will be used to extract the text. The input file name should be written as "%file%".
The command line parameters %file%, %out_file% and %temp_dir% will be filled in by the script during indexing.
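The substitution of these parameters can be sketched as follows. This is only an illustration in Python under stated assumptions: the function name fill_command and the program name "extract" are hypothetical, and the real script may fill the values differently.

```python
import re

# The three placeholders named in this manual.
PLACEHOLDER = re.compile(r"%(file|out_file|temp_dir)%")


def fill_command(template, **values):
    """Replace %file%, %out_file% and %temp_dir% in a single pass.

    Placeholders with no supplied value are left unchanged.
    """
    return PLACEHOLDER.sub(
        lambda match: values.get(match.group(1), match.group(0)), template
    )


# "extract" is a made-up program name used only for this example.
command = fill_command(
    'extract "%file%" > %out_file%',
    file="/docs/a.pdf",
    out_file="/tmp/a.txt",
)
```

Replacing all placeholders in one pass avoids re-scanning text that was itself inserted, for example a file name that happens to contain "%temp_dir%".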
You may specify any program available on your computer. The only requirement is that the external program sends the text of the indexed file to standard output (STDOUT) or to a file. Several examples are shown below.
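The STDOUT requirement can be illustrated with a short Python sketch that runs a program and collects whatever it prints. The function name run_parser is hypothetical, and the "parser" here is just the Python interpreter printing a fixed string, standing in for a real converter.

```python
import subprocess
import sys


def run_parser(argv):
    """Run an external program and return the text it wrote to STDOUT."""
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # diagnostics are not part of the text
        check=True,                 # a failing parser raises CalledProcessError
    )
    return result.stdout.decode("utf-8", errors="replace")


# Stand-in parser: the Python interpreter itself, printing a fixed line.
demo_argv = [sys.executable, "-c", "print('text of the indexed file')"]
text = run_parser(demo_argv)
```

Any program that behaves like the stand-in, writing the document text to STDOUT and exiting with status 0, fits the requirement described above.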
PDF (Portable Document Format) - There are several ways to convert a PDF file to text. You may use the pdftotext utility from the xpdf package (http://www.foolabs.com/xpdf/, (c) by Derek B. Noonburg). This utility is available in binary form for several platforms. The command should extract the text and send it to STDOUT.
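As a rough sketch of such a command, the Python snippet below builds a pdftotext call that writes to STDOUT. It assumes a pdftotext version that accepts "-" as the output file name to mean STDOUT; the function names are hypothetical, and the exact line to put in the RiSearch config file is not reproduced here.

```python
import shutil
import subprocess


def pdftotext_argv(pdf_path):
    # "-" in place of the output file name asks pdftotext to write to STDOUT.
    return ["pdftotext", pdf_path, "-"]


def pdf_to_text(pdf_path):
    """Return the text of a PDF, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_argv(pdf_path), stdout=subprocess.PIPE, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

In command line form with the manual's placeholder this corresponds to something like: pdftotext "%file%" -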
You may also use the tools distributed with GhostScript.
DOC (MS Office Word) - To extract text from DOC files, the antiword utility can be used (http://antiword.cjb.net/, (c) 1998-2001 by Adri van Os).
There are several other tools for this task: catdoc (http://www.ice.ru/~vitus/catdoc/), word2x (http://word2x.alcom.co.uk/).
Some kinds of archives can also be indexed in a similar way. Please note, however, that ALL the content of an archive will be treated as one text file. If there are binary files inside, they will be indexed like ordinary text.
Don't forget to put quotes around %file% if you want to index files with spaces in their names.
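The effect of the quotes can be seen by splitting a filled-in command line the way a POSIX shell would. The Python sketch below uses a hypothetical helper, split_args, and a pdftotext command purely as an example.

```python
import shlex


def split_args(template, file_name):
    """Fill in %file% and split the line into arguments, shell style."""
    return shlex.split(template.replace("%file%", file_name))


name = "/docs/annual report.pdf"

# Without quotes the file name falls apart into two arguments.
unquoted = split_args("pdftotext %file% -", name)

# With quotes the program receives the file name as one argument.
quoted = split_args('pdftotext "%file%" -', name)
```

In the unquoted case the program would be asked to open "/docs/annual" and write to "report.pdf", which is not what was intended.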
Instead of indexing the whole content of an archive as one text file, you may index each file inside the archive separately. Add the desired extensions to the "arch_ext" variable in the configuration file and specify a command line for each extension.
The unpacker should take the archive (%file%) and extract all files into a temporary directory (%temp_dir%). The script will then scan this directory and index all files that match the filter rules. In the search results, the filename is shown after a "#" sign (http://www.server.com/dir/archive.zip#file.htm). An archive inside an archive will also be unpacked into another directory and indexed.
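The unpack-then-scan flow can be sketched in Python as follows. This is an illustration only: the function name, the extension filter, and the way members in subdirectories are displayed are assumptions, Python's zipfile module stands in for the external unpacker, and nested archives are not handled.

```python
import os
import tempfile
import zipfile


def index_archive(archive_path, archive_url, wanted_ext=(".htm", ".html", ".txt")):
    """Unpack a ZIP into a temporary directory and list names as URL#member."""
    names = []
    with tempfile.TemporaryDirectory() as temp_dir:     # role of %temp_dir%
        with zipfile.ZipFile(archive_path) as archive:  # role of the unpacker
            archive.extractall(temp_dir)
        # Scan the directory and keep only files that pass the filter.
        for root, _dirs, files in os.walk(temp_dir):
            for name in files:
                if name.lower().endswith(wanted_ext):
                    rel = os.path.relpath(os.path.join(root, name), temp_dir)
                    names.append(archive_url + "#" + rel.replace(os.sep, "/"))
    return sorted(names)
```

For an archive containing file.htm and a binary file, only the first passes the filter here and is reported as http://www.server.com/dir/archive.zip#file.htm.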
http://risearch.org | S. Tarasov, © 2000-2003