
RiSearch Pro v.3.2 Manual

© S. Tarasov

Indexing

      RiSearch Pro is an index-based search script: before you can search, it reads all your files and stores the information in a special format for faster searching. RiSearch Pro uses two types of index. The first (static) index is created when you index your site for the first time. It has a compact format that allows the script to handle many thousands of documents, but updating it is an expensive operation. Therefore the script uses a second (dynamic) index when you add new documents to the site. This index has a block structure and can accept new files in real time, but it has its own disadvantages: a bigger size and slower search. Periodically (after you add about 1000-2000 new documents) you have to merge the two indexes.

      To start indexing, run the script "index.pl". You can do this from a Unix shell (if your provider allows it), via the admin panel, or directly in a browser window (the script will ask for a password, which can be set in the admin panel). During indexing the script creates several files with information about your site (0_hash, 0_wordind and others) and stores them in a "db_N" directory, where "N" is a number.

      Another way to index your site is over HTTP. Run "spider.pl" and it will crawl your pages and parse out all the links (spider.pl requires the LWP module). This is useful for indexing dynamic sites (such as web boards).

      When the script requests a page from the server, it identifies itself as "RiSpider/1.0". You can change the user-agent name in the file "lib/common_lib.pm", in the line:

$ua->agent("RiSpider/1.0");

      You may pass several parameters to the scripts. For example:

 perl index.pl -base_dir=../ -base_url=http://www.server.com/ -rules=filename 

If no parameters are passed, the script uses the values from the configuration file.

  1.  -base_dir=path/to/dir  - path to the directory where your HTML files are located. Please note that in all cases the path should be either relative or absolute, starting from the file system root (not from the webserver root directory).

  2.  -base_url=http://www.server.com/  - URL of your site.

  3.  -rules=filter_filename  - file with filter rules (if no file is specified, the default rules are used).

  4.  -login=login  - login for access to closed sections of your site (used only with spider.pl).

  5.  -password=password  - password for access to closed sections of your site (used only with spider.pl).
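      Combining these options, a spider run against a site with password-protected sections might look like the following (the host name, rules file name, and credentials below are placeholder values, not defaults):

```shell
# Crawl the site over HTTP, logging in to closed sections;
# all values are examples -- substitute your own.
perl spider.pl -base_url=http://www.server.com/ -rules=myrules \
     -login=mylogin -password=mypassword
```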

      Indexing requires a lot of system resources, so it is often better to index a local copy of your site and then copy the created database files to the server (please use binary ("BIN") transfer mode). The amount of RAM required for indexing depends on the "temp_db_size" variable in the configuration file and on the size of the documents you want to index. The new version of the script has much smaller memory requirements, but it may still need 100-200 MB of memory during indexing if your documents are bigger than 1 MB.

      Please note that most webservers will not allow a script to run for very long: after 30-60 seconds the webserver will kill the script if it has not finished indexing by then. Therefore you will not be able to index more than a few megabytes by running "index.pl" as a CGI script. To index a large site, run the script from a Unix shell, use incremental indexing, or index a local copy of your site.
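      One way to run the indexer outside the webserver's time limit is from the shell, detached from the terminal so it can finish in the background (the directory and URL below are example values):

```shell
# Index a local copy of the site from the shell, logging all output;
# -base_dir and -base_url are example values -- substitute your own.
nohup perl index.pl -base_dir=/home/user/site_copy \
      -base_url=http://www.server.com/ > index.log 2>&1 &
```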

Incremental indexing

      Both scripts can be stopped and restarted. Press "Ctrl-C" and the script will save its current state to disk. Later you can restart the script with the parameter "-action=restart". This can be helpful when indexing a very large site: if indexing takes too much time or memory, stop the script and restart it later.

      As stated above, most webservers stop scripts after some time, which prevents indexing big sites via the browser. There is now a solution: set two additional parameters in the configuration file:

server_timeout - the amount of time the script is allowed to run;

restart_delay - the delay between script restarts.

The script will start indexing, but after some time (defined by the server_timeout parameter) it will save its current state to disk and stop. The browser will then restart the script automatically, and the indexing process will continue.
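      For example, if the configuration file uses plain Perl variable assignments (as the user-agent line in "lib/common_lib.pm" suggests), the two parameters might be set like this; the exact syntax and suitable values depend on your installation and on how strict your webserver's time limit is:

```perl
$server_timeout = 25;  # save state and stop after about 25 seconds of work
$restart_delay  = 5;   # wait 5 seconds before the browser restarts the script
```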

Auxiliary index

      The script can use auxiliary indexes for specific kinds of searches; at this time substring search and fuzzy search are available. To create a substring index, use the command:

 perl index.pl -action=substring 

      The substring index can also be created automatically every time you reindex your site if you set the "create_substring_index" parameter in the configuration file. Remember that this happens only when you reindex the whole site; if you just add new pages, the substring index must be created manually as described above.
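      Assuming the same plain Perl assignment style for the configuration file (the exact variable syntax may differ in your version of the script), enabling automatic creation would look like:

```perl
$create_substring_index = 1;  # rebuild the substring index on every full reindex
```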

      The substring index can also be created via the browser, using the following query:

 http://www.server.com/cgi-bin/search/index.pl?action=substring 



http://risearch.org S.Tarasov, © 2000-2003