RiSearch Pro v.3.2 Manual

© S. Tarasov

External parsers

      You may use external programs to parse file content. This feature makes it possible to index other file formats, such as PDF, DOC, PS and some types of archives.

      To do this, list in the config file the extensions that should be parsed by external programs:

 ext_parser_ext => 'pdf doc ps zip gz', 

      If a parser produces HTML rather than plain text, you also need to add its extensions to another parameter:

 ext_parser_ext_html => 'pdf doc', 

      In this case the script will strip HTML tags from the parser output.

      Then, for each extension, specify the program and the command line parameters that will be used to extract the text. The input file name should be written as "%file%".

 ext_parser_conf => { 
 'ext1' => 'command1 param %file%', 
 'ext2' => 'command2 param %file% %out_file%', 
 'ext3' => 'command3 param %file% %out_file% %temp_dir%', 
 }, 

      The command line parameters %file%, %out_file% and %temp_dir% will be filled in by the script during indexing.

  • %file% - path to the input file. If you are using the spider, the script will store the data in a temporary file and delete it after the parser finishes its job.
  • %out_file% - if your parser cannot send text to STDOUT, you can use this parameter to specify the path to a temporary file.
  • %temp_dir% - some parsers store images in a separate directory. You can use this parameter to specify the directory for images; the script will delete all images and the directory after the parser finishes its job. Please refer to the parser's documentation to see which command line parameters it accepts.
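
      The placeholder filling described above can be pictured with a small Perl sketch. It is not taken from RiSearch: the helper name fill_command and the example paths are made up for illustration, and the real script may work differently.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of RiSearch: replaces each known
# placeholder in a command template with the supplied value.
# Placeholders without a supplied value become an empty string.
sub fill_command {
    my ($template, %value) = @_;
    my $cmd = $template;
    $cmd =~ s/%(file|out_file|temp_dir)%/exists $value{$1} ? $value{$1} : ''/ge;
    return $cmd;
}

# Example values are made up for illustration.
print fill_command(
    'command2 param %file% %out_file%',
    file     => '/tmp/input.pdf',
    out_file => '/tmp/output.txt',
), "\n";
# prints: command2 param /tmp/input.pdf /tmp/output.txt
```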

      You may specify any program available on your computer. The only requirement is that the external program sends the text of the indexed file to standard output (STDOUT) or to a file. Several examples are shown below.

      PDF (Portable Document Format) - There are several ways to convert a PDF file to text. You may use the pdftotext utility from the xpdf package (http://www.foolabs.com/xpdf/, (c) by Derek B. Noonburg). This utility is available in binary form for several platforms. The command that extracts the text and sends it to STDOUT should look like this:

 'pdf' => '/path/pdftotext %file% -', 
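
      If writing to a temporary file is preferred over STDOUT, pdftotext also accepts an output file name as its second argument. The entry below is only a sketch: /path/ is the same placeholder path as above, and it assumes the script reads the extracted text back from %out_file%.

```perl
# Variant that writes the text to the temporary file instead of STDOUT
'pdf' => '/path/pdftotext %file% %out_file%',
```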

      You may also use the tools distributed with GhostScript.

      DOC (MS Office Word) - the antiword utility can be used to extract text from DOC files (http://antiword.cjb.net/, (c) 1998-2001 by Adri van Os).

 'doc' => '/antiword/antiword %file%', 

      There are several other tools for this task: catdoc (http://www.ice.ru/~vitus/catdoc/), word2x (http://word2x.alcom.co.uk/).

      Some kinds of archives can also be indexed in a similar way. Please note, however, that ALL content of the archive will be treated as one text file. If there are binary files inside, they will be indexed like ordinary text. Some examples are given below:

 'zip' => 'pkunzip -c %file%', 
 'gz' => 'gzip -cd %file%', 
 'rar' => 'rar p %file%', 

      Don't forget to put quotes around %file% if you want to index files with spaces in their names.
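
      Putting the pieces together, a configuration with quoted file names might look like the fragment below. It reuses the commands shown earlier; the paths are placeholders, and the use of double quotes assumes a shell that accepts them around file names.

```perl
# Extensions handled by external parsers
ext_parser_ext => 'pdf doc gz',

# One command per extension; "%file%" is quoted so that
# file names containing spaces are passed as a single argument
ext_parser_conf => {
    'pdf' => '/path/pdftotext "%file%" -',
    'doc' => '/antiword/antiword "%file%"',
    'gz'  => 'gzip -cd "%file%"',
},
```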

Archives

      Instead of indexing the whole content of an archive as one text file, you may index each file inside the archive separately. Add the desired extensions to the "arch_ext" variable in the configuration file and specify a command line for each extension. A few examples are shown below:

 'zip' => 'pkunzip -e -d %file% %temp_dir%', 
 'rar' => 'rar x -r %file% %temp_dir%', 
 'arj' => 'arj x -t %file% %temp_dir%', 

      The unpacker should take the file (%file%) and extract all files into the temporary directory (%temp_dir%). The script will then scan this directory and index all files that match the filter rules. In search results the file name will be shown after the "#" sign ( http://www.server.com/dir/archive.zip#file.htm ). An archive inside an archive will also be unpacked into another directory and indexed.
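
      As a sketch of how such entries could look with quoting added, the fragment below assumes that "arch_ext" takes a space-separated list in the same way as ext_parser_ext, which this page does not state explicitly. The zip entry uses the Info-ZIP unzip program as a substitute for pkunzip. The name of the setting that holds the per-extension commands is not given here, so only the entries themselves are shown.

```perl
# Assumed format: space-separated list, as for ext_parser_ext
arch_ext => 'zip rar',

# Unpacker entries; each extracts %file% into %temp_dir%
# unzip: -o overwrites without prompting, -q is quiet, -d sets the target directory
'zip' => 'unzip -o -q "%file%" -d "%temp_dir%"',
'rar' => 'rar x -r "%file%" "%temp_dir%"',
```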



http://risearch.org S.Tarasov, © 2000-2003