RiSearch Pro v.3.2 Manual
© S. Tarasov
Configuration
Edit the file riconfig.pm to set the parameters. Most of them are self-documenting
and do not require explanation.
base_dir => "../../",
- path to the directory where your HTML files are located. If index.pl is located
in the same directory, leave this variable as is. Please note that in all cases
you should use either a relative path or an absolute path starting from the file
system root (not from the web server root directory).
base_url => "http://www.server.com/",
- URL of your site.
site_size => 2,
- this variable controls database size and searching speed.
compact_index => 1,
- in compact mode the index takes less space, but it is limited to 65535 documents.
indexing_speed => 1,
- This parameter defines indexing speed and memory usage: 0 - slow indexing, but less
memory required; 1 - fast indexing, more memory required.
non_parse_ext => 'txt',
- list of extensions for which the script should not remove HTML tags.
ext_parser_ext => 'pdf doc',
- files with these extensions will be indexed using external programs.
Please read more about external parser usage.
arch_ext => 'zip rar arj',
- extensions of archive files.
Please read more about archives indexing.
bin_ext => 'ppt xls',
- the content of files with these extensions will not be indexed, but their URLs will be.
$zone[1] = 'dir1';
- description of site zones. Any number of zones may be used. Every zone has a unique number.
The search form should send the script an additional parameter "z" whose value is
the number of the chosen zone. You may use checkboxes, radio buttons or menus.
There is an example in the file "template.htm". When using checkboxes or menus
with the attribute multiple, you may choose several zones simultaneously.
In that case the search will be performed in all chosen zones. To search the whole
site, "z" should be equal to zero or not sent to the script at all.
If one zone is located in several directories, separate them with a vertical bar
without spaces ( $zone[1] = 'dir2|dir3'; ).
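As an illustration, a two-zone setup might look like the fragment below. The directory names are hypothetical; only the $zone[N] syntax and the vertical-bar separator come from the description above.

```perl
# Hypothetical directory names, for illustration only.
$zone[1] = 'docs';           # zone 1: a single directory
$zone[2] = 'news|archive';   # zone 2: two directories, separated by "|" without spaces
```

With this setup, a form that submits z=2 would search only the second zone, a form with two checked checkboxes would submit z=1&z=2 and search both, and z=0 (or no "z" at all) would search the whole site.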
numbers => '0-9',
- during indexing the script removes all non-alphabetic characters from the page
and indexes what is left. The script treats Latin characters and the characters
of the regional alphabet (discussed later) as alphabetic.
Here you may add other characters that should be indexed (such as digits,
the underscore and so on).
use_selective_indexing => "NO",
- this option is useful for big sites with complex navigation, news postings
and other elements that appear on every page and probably should not be
indexed. It lets you tell the script which parts of a page should be cut
before indexing. Turn this option on ("YES") and uncomment the next lines in the file "config.pl".
no_index_strings => {
q[<!-- No index start 1 -->] => q[<!-- No index end 1 -->],
q[<!-- No index start 2 -->] => q[<!-- No index end 2 -->],
},
Inside each pair of square brackets write one string: the first marks the start and the
second the end of a region to cut. Everything placed between them will be cut (note that
if these strings occur several times in a file, each occurrence will be processed).
For this purpose you may use special marks that separate the different elements of the design.
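The cutting step can be pictured with the following minimal sketch. It is not the script's actual code: the helper name cut_no_index is hypothetical, and removing the markers themselves along with the enclosed text is an assumption.

```perl
use strict;
use warnings;

# Marker pairs in the same shape as no_index_strings: start => end.
my %no_index_strings = (
    q[<!-- No index start 1 -->] => q[<!-- No index end 1 -->],
    q[<!-- No index start 2 -->] => q[<!-- No index end 2 -->],
);

# Hypothetical helper: remove every region enclosed by a marker pair.
# Assumption: the markers themselves are removed together with the text.
sub cut_no_index {
    my ($html) = @_;
    for my $start (sort keys %no_index_strings) {
        my $end = $no_index_strings{$start};
        # \Q...\E makes the markers match literally; /s lets "." span
        # newlines; /g processes every occurrence in the file.
        $html =~ s/\Q$start\E.*?\Q$end\E//gs;
    }
    return $html;
}

my $page = '<p>Intro</p><!-- No index start 1 --><ul>menu</ul>'
         . '<!-- No index end 1 --><p>Body</p>';
print cut_no_index($page), "\n";   # prints <p>Intro</p><p>Body</p>
```

The non-greedy ".*?" matters here: with several marked regions on one page, each region is cut separately instead of everything between the first start mark and the last end mark.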
cut_default_filenames => 'YES',
- this variable removes default filenames (such as index.html) from URLs in the search results.
INDEXING_SCHEME => 2,
- word indexing scheme. If the indexing scheme is "1", the index is built on whole words.
This is the fastest method, but the script will find only words equal to the keyword.
When the indexing scheme is "2", the index is based on the beginning of each word.
The script will find all words that begin with the given keyword. For example, for the query
"port*" the words "portrait" and "portion" will also be found.
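The difference between the two schemes can be shown with a small sketch. The function word_matches is hypothetical and says nothing about how the index is stored; it only mirrors the matching behaviour described above, with the keyword "port" standing for the query "port*".

```perl
use strict;
use warnings;

# Hypothetical matcher that mirrors the two schemes described above.
sub word_matches {
    my ($scheme, $keyword, $word) = @_;
    return $word eq $keyword ? 1 : 0 if $scheme == 1;   # whole word only
    return index($word, $keyword) == 0 ? 1 : 0;         # word begins with keyword
}

for my $word (qw(port portrait portion airport)) {
    printf "%-8s scheme 1: %d  scheme 2: %d\n",
        $word, word_matches(1, 'port', $word), word_matches(2, 'port', $word);
}
```

In this sketch "airport" is not found under either scheme, because the keyword is matched only at the beginning of a word.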
use_stop_words => "YES",
- use the list of common words (stop words), which should not be indexed.
verbose_output => 1,
- during indexing the script prints information about every indexed file.
Change the value to "0" to print information about every 100th file only.
min_length => 3,
- minimal word length for indexing.
max_length => 32,
- maximal word length for indexing (longer words will be truncated).
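The character and length settings ( numbers, min_length, max_length ) can be pictured together as one tokenizing step. The sketch below is only an illustration under stated assumptions: the function index_words is hypothetical, regional letters are left out, and skipping words shorter than min_length is an assumption about what "minimal word length" means.

```perl
use strict;
use warnings;

my %cfg = (
    numbers    => '0-9',
    min_length => 3,
    max_length => 32,
);

# Hypothetical tokenizer: keep runs of Latin letters plus the extra
# characters listed in "numbers", then apply the length limits.
sub index_words {
    my ($text) = @_;
    my $class = 'A-Za-z' . $cfg{numbers};   # character class body, e.g. A-Za-z0-9
    my @words;
    for my $w ($text =~ /([$class]+)/g) {
        next if length($w) < $cfg{min_length};            # assumption: short words are skipped
        push @words, substr($w, 0, $cfg{max_length});     # longer words are truncated
    }
    return @words;
}

print join(' ', index_words('Go to port 8080 of RiSearch_Pro v3')), "\n";
# prints: port 8080 RiSearch Pro
```

Because the underscore is not listed in "numbers" here, "RiSearch_Pro" is split into two words; adding "_" to the setting would keep it as one.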
max_doc_size => 1000000,
- maximal document size (bigger files will be truncated).
res_num => 10,
- number of results per page.
max_res_found => 0,
- maximal number of found documents (0 - no limit).
del_descr_chars => "",
- characters listed here will be removed from the document description.
url_length_limit => 0,
- URL length limit in results output (0 - no limit).
CAP_LETTERS => '\xC0-\xDF\xA8',
- Put here the list of capital letters of your language (those that differ from Latin).
Do the same for small letters.
def_search_type => 1,
- Default search type. Possible values: 0 - substring search (can be used only with
INDEXING_SCHEME => 2), 1 - exact word search.
def_search_mode => "AND",
- Default search mode. Possible values: "AND" or "OR".
Results caching
Search results can be cached to minimize response time.
Results will be cached only if a large number of documents was found
or the search took a long time.
Cache files will be stored in a separate directory, "cache".
The script can erase old results itself, or you can do it manually.
If the "check_cache" option is set, the script will use the cache
for every query, provided this query was asked recently.
Otherwise the cache will be used only when displaying
the next pages of search results.
enable_cache => "YES",
- turn caching on or off.
check_cache => "YES",
- use cache for every query.
min_doc_found => 1000,
- use cache if the number of found documents is larger than specified here.
min_search_time => 0.2,
- use cache if the search time (in seconds) is longer than specified here.
delete_cache => "YES",
- delete old results automatically.
delete_cache_delay => 3600,
- delete old results after NNN seconds.
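How these thresholds interact can be sketched as follows. The functions should_store and is_expired are hypothetical names for this illustration and are not part of the script; the logic only restates the rules above (cache when many documents were found or the search was slow, drop entries older than the delay).

```perl
use strict;
use warnings;

my %cfg = (
    enable_cache       => "YES",
    min_doc_found      => 1000,
    min_search_time    => 0.2,
    delete_cache       => "YES",
    delete_cache_delay => 3600,
);

# Hypothetical check: should the results of a finished search be cached?
sub should_store {
    my ($docs_found, $search_seconds) = @_;
    return 0 unless $cfg{enable_cache} eq "YES";
    return ($docs_found > $cfg{min_doc_found}
         || $search_seconds > $cfg{min_search_time}) ? 1 : 0;
}

# Hypothetical check: is a cache file of the given age (in seconds) stale?
sub is_expired {
    my ($age_seconds) = @_;
    return ($cfg{delete_cache} eq "YES"
         && $age_seconds > $cfg{delete_cache_delay}) ? 1 : 0;
}

print should_store(1500, 0.05), "\n";   # many documents: 1
print should_store(10, 0.05), "\n";     # small and fast: 0
print is_expired(4000), "\n";           # older than one hour: 1
```

Either condition alone is enough to cache a result, so a fast search with many hits and a slow search with few hits are both stored.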
Spidering
The spidering script uses all parameters described above (except base_dir and base_url).
You have to set several additional variables.
start_url
- List of starting URLs.
spider_delay => 0,
- delay in seconds between requests.
max_depth => 20,
- maximal spidering depth (number of "clicks" from start page to current page).
login => "",
- login for access to closed sections of your site (used only with spider.pl).
password => "",
- password for access to closed sections of your site (used only with spider.pl).
proxy => "http://user:password@server.com:port/",
- proxy settings for spider.
use_robots_txt_rules => 0,
- whether or not to follow ROBOTS.TXT rules during indexing.
e_mail => 'foo@bar.com',
- webmaster's e-mail.
URL filter rules
Filter rules define which URLs should be indexed by the spider.
A rule consists of a command (Index, NoIndex, Follow, NoFollow, Allow, Disallow), optional modifiers
(Match, NoMatch, NoCase, Case, String, Regex) and a string (or regular expression),
which will be matched against the URL. Two actions are possible for each URL:
indexing and link extraction. Index Follow means that this URL will be indexed
and all links from the file will be extracted for further indexing.
NoIndex NoFollow means that both actions are forbidden for this type of URL.
Use Follow NoIndex to allow link extraction without indexing the file.
The command Allow is a synonym for Index Follow, and the command Disallow
is a synonym for NoIndex NoFollow.
Allow [Match|NoMatch] [NoCase|Case] [String|Regex] [ ... ]
Use this to allow URLs that match (or, with NoMatch, do not match) the given arguments.
The first three optional parameters describe the type of comparison.
The default values are Match, NoCase, String.
Use "NoCase" or "Case" to choose case-insensitive or case-sensitive
comparison.
Use "Regex" to choose regular expression comparison.
Use "String" to choose string comparison with wildcards.
One wildcard can be used, "*",
which stands for any number of any characters.
Disallow [Match|NoMatch] [NoCase|Case] [String|Regex] [ ... ]
Use this to disallow URLs that match (or, with NoMatch, do not match) the given arguments.
The meaning of the first three optional parameters is exactly the same
as for the "Allow" command.
The indexer compares each URL against all these command arguments in the
order of their appearance in the config file.
The last command whose arguments match the URL takes effect.
Some examples are presented below:
Disallow *
Allow http://risearch.org/*
Disallow */cgi-bin/* */img/* */temp/*
Disallow NoMatch *.htm *.html *.txt */
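The way these four rules combine can be traced with the sketch below. It is an illustration of the "last matching command wins" principle, not the spider's actual code. The function names are hypothetical, and it assumes that a wildcard string must match the whole URL, that NoMatch fires when none of the arguments match, and that a URL matched by no rule is allowed.

```perl
use strict;
use warnings;

# Turn a wildcard string into a regex: "*" stands for any number of any
# characters. Assumptions: whole-URL match, NoCase (the stated default).
sub wildcard_to_regex {
    my ($pattern) = @_;
    my $re = join '.*', map { quotemeta($_) } split /\*/, $pattern, -1;
    return qr/^$re\z/i;
}

# The four example rules, in config-file order.
my @rules = (
    [ 'Disallow', 'Match',   '*' ],
    [ 'Allow',    'Match',   'http://risearch.org/*' ],
    [ 'Disallow', 'Match',   '*/cgi-bin/*', '*/img/*', '*/temp/*' ],
    [ 'Disallow', 'NoMatch', '*.htm', '*.html', '*.txt', '*/' ],
);

# Hypothetical evaluator: later rules override earlier ones.
sub url_allowed {
    my ($url) = @_;
    my $verdict = 'Allow';   # assumed default when no rule applies
    for my $rule (@rules) {
        my ($command, $mode, @patterns) = @$rule;
        my $hit = grep { $url =~ wildcard_to_regex($_) } @patterns;
        $hit = !$hit if $mode eq 'NoMatch';   # NoMatch: fires when nothing matched
        $verdict = $command if $hit;
    }
    return $verdict eq 'Allow' ? 1 : 0;
}

for my $url ('http://risearch.org/docs/index.html',
             'http://risearch.org/cgi-bin/search.html',
             'http://risearch.org/logo.gif',
             'http://example.com/page.html') {
    printf "%-42s %s\n", $url, url_allowed($url) ? 'allowed' : 'disallowed';
}
```

Under these assumptions only the first URL is allowed: the second is stopped by the cgi-bin rule, the third by the NoMatch rule (it is not an .htm, .html or .txt file or a directory), and the fourth never gets past the initial Disallow *.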
File content filter
The file content filter allows you to exclude a file from indexing using specific
commands in the file body. Turn this function on ( use_command_tag => 1 )
and specify the beginning and end of the command tag ("start_command_tag" and "end_command_tag" in the config file).
Then use codes or words inside this tag; the "index_command_tag" and "noindex_command_tag"
parameters define which codes allow or forbid indexing of the file.
Please note that if the content filter is on and no command tag is found in a page,
the page will NOT be indexed (this can be changed easily).
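The decision described above can be sketched as follows. The parameter names come from the text, but every value below is a hypothetical placeholder chosen for this sketch (the real tag format and codes are whatever you set in the config file), and the function should_index is not part of the script.

```perl
use strict;
use warnings;

# Parameter names are from the manual; all values are hypothetical
# placeholders for this sketch, not RiSearch defaults.
my %cfg = (
    use_command_tag     => 1,
    start_command_tag   => '<!-- search:',
    end_command_tag     => '-->',
    index_command_tag   => 'index',
    noindex_command_tag => 'noindex',
);

# Hypothetical check: should this page be indexed?
sub should_index {
    my ($html) = @_;
    return 1 unless $cfg{use_command_tag};   # filter off: index everything
    my ($code) = $html =~
        /\Q$cfg{start_command_tag}\E\s*(.*?)\s*\Q$cfg{end_command_tag}\E/s;
    return 0 unless defined $code;           # no command tag: NOT indexed
    return 0 if $code eq $cfg{noindex_command_tag};
    return $code eq $cfg{index_command_tag} ? 1 : 0;
}

print should_index('<html><!-- search: index --><p>text</p></html>'), "\n";
print should_index('<html><!-- search: noindex --><p>text</p></html>'), "\n";
print should_index('<html><p>no tag here</p></html>'), "\n";
```

With these placeholder values the three calls print 1, 0 and 0; the last case reflects the note above that a page without any command tag is skipped while the filter is on.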