freeWAIS-sf Basics


Table of Contents:

  1. Introduction & Credits
  2. Copyright
  3. How freeWAIS-sf Indexes are Built
  4. Limits of freeWAIS-sf - SFgate System
  5. Formulating a Query

1. Introduction & Credits

Most of this document is identical to part of the freeWAIS-sf Manual, which you can find at the University of Dortmund, where the package was developed by Ulrich Pfeifer. At present, freeWAIS-sf is maintained by the WSC Group.
However, the original documentation is too detailed for the end user, so only the relevant parts of the original document are reproduced here.
The original copyright notice is reproduced as well.


2. Copyright

Copyright (C) 1995 Ulrich Pfeifer.

Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.

Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the Free Software Foundation.


3. How freeWAIS-sf Indexes are Built

In order to understand how to query a freeWAIS-sf database, you need to know a little about how it is built.
Only the information relevant to the following examples is reported.
Suppose we indexed a phonebook in which the information about each person (a record) is divided into fields. With a traditional WAIS we can perform only a global query, i.e. we can search for a term in ALL the fields but not in a specific field.
freeWAIS-sf, by contrast, can index terms from a database structured in fields.
In particular, freeWAIS-sf creates three different kinds of index:

  LOCAL   Terms indexed in this way can be found only by searching inside a specific field.
  GLOBAL  Terms indexed in this way can be found by a global search, similar to that made with the old WAIS.
  BOTH    Terms indexed in this way can be found both by searching inside a specific field and by a global search.

It is up to the System Manager to decide whether the content of a field is indexed as LOCAL, GLOBAL or BOTH. Moreover, some fields can be left unindexed to save space. For example, in a phonebook it makes no sense to index the e-mail or the address.
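
The difference between the three kinds of index can be illustrated with a toy model. The sketch below is not freeWAIS-sf code: the class FieldedIndex, its methods and the phonebook field names (name, city, phone, email) are assumptions made only for this illustration, and terms are simply split on whitespace.

```python
from collections import defaultdict


class FieldedIndex:
    """Toy model of LOCAL / GLOBAL / BOTH indexing (illustration only)."""

    def __init__(self, field_modes):
        # field_modes maps a field name to "LOCAL", "GLOBAL", "BOTH",
        # or None when the field is not indexed at all.
        self.field_modes = field_modes
        self.local = defaultdict(set)    # (field, term) -> record ids
        self.global_ = defaultdict(set)  # term -> record ids

    def add(self, record_id, record):
        for field, text in record.items():
            mode = self.field_modes.get(field)
            if mode is None:
                continue  # field left unindexed to save space
            for term in text.lower().split():
                if mode in ("LOCAL", "BOTH"):
                    self.local[(field, term)].add(record_id)
                if mode in ("GLOBAL", "BOTH"):
                    self.global_[term].add(record_id)

    def search_field(self, field, term):
        """Search inside one specific field."""
        return set(self.local.get((field, term.lower()), set()))

    def search_global(self, term):
        """Global search, as with the old WAIS."""
        return set(self.global_.get(term.lower(), set()))


# Hypothetical phonebook layout: the e-mail field is not indexed.
index = FieldedIndex({"name": "BOTH", "city": "LOCAL", "phone": "GLOBAL", "email": None})
index.add(1, {"name": "Mario Rossi", "city": "Padova",
              "phone": "555-0100", "email": "mario@example.org"})
```

In this model a term from a BOTH field is found by either kind of search, a term from a LOCAL field only by a field search, and a term from a GLOBAL field only by a global search.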


4. Limits of freeWAIS-sf - SFgate System

There are some limitations concerning indexing databases with freeWAIS-sf that must be known to avoid unexpected results.

Characteristics and their description:

Term length
    Minimum: Terms made of only one character are NOT indexed. For example, K (potassium) is NOT indexed, while K+ (potassium ion) is indexed.
    Maximum: Terms longer than 20 characters are NOT indexed.

Non-indexed terms
    Stopword list: There is a pre-built stopword list of common English terms that are not indexed. As a consequence, problems may arise when indexing databases prepared in other languages.
    Too frequent terms: freeWAIS-sf treats terms that occur too frequently (more than 20,000 entries per database) as stopwords, so it is better to keep indexes from growing too big. One way around this limit is to split a large database into two or more smaller ones. The drawback is that the weight of single terms is altered.

Indexed characters
    Default: The ASCII characters indexed by freeWAIS-sf are A-Z, a-z and 0-9.
    Other characters: freeWAIS-sf allows you to index additional characters of your own (for example German characters). At BioPD we offer the indexing of these four characters: + (plus), - (minus, hyphen), ' (apostrophe, ASCII 39) and _ (underscore). Managing ' (apostrophe) is not straightforward: you always have to put terms containing apostrophes inside quotes when searching. See also the document How to Perform a Search of Terms Containing Apostrophes.
    Non-indexable characters: The following characters cannot be indexed in any case: < > = ( ) , { } / and "space".
    P. Lindner patch: This patch is included in my hacked distribution of freeWAIS-sf. It allows the indexing of characters that are inside a term. At BioPD we offer the indexing of terms that have a . (dot) inside them. In the words of P. Lindner: "You can extend it to other characters if you wish. (Though not * I bet...)"

Maximum number of hits
    The maximum number of hits one can retrieve is NOT clearly defined. In the words of Norbert Goevert: "I don't know the limits. They seem to be deep in the freeWAIS-sf stuff:-(". From a practical point of view the maximum seems to be around 250 hits.

Languages
    SFgate supports boolean operators and messages in the following languages: English, French, German, Italian, Spanish, Dutch, Swedish, and Portuguese. So, some words of these eight languages are reserved. English and Italian are supported at BioPD.

Lowercase/Uppercase
    For reasons explained in the document Boolean Searching in freeWAIS-sf, it is always better to write queries using lowercase terms.
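
The length and character limits above can be summarised as a small check on a single, already isolated term. This is only a sketch under stated assumptions: the function is_indexable and its constants are hypothetical names, the extra characters are the ones offered at BioPD, and the stopword list and the "too frequent terms" rule are not modelled. The real indexer may well split text at non-indexable characters rather than reject a whole term.

```python
import string

MIN_LEN = 2    # terms made of only one character are not indexed
MAX_LEN = 20   # terms longer than 20 characters are not indexed
DEFAULT_CHARS = set(string.ascii_letters + string.digits)  # A-Z, a-z, 0-9
BIOPD_EXTRA = set("+-'_")  # extra characters indexed at BioPD
INNER_ONLY = set(".")      # accepted only inside a term (P. Lindner patch)


def is_indexable(term, extra=BIOPD_EXTRA, inner=INNER_ONLY):
    """Return True if a single term satisfies the limits listed above."""
    if not (MIN_LEN <= len(term) <= MAX_LEN):
        return False
    allowed = DEFAULT_CHARS | set(extra)
    for pos, ch in enumerate(term):
        if ch in allowed:
            continue
        # A dot is accepted only when it is neither first nor last.
        if ch in inner and 0 < pos < len(term) - 1:
            continue
        return False
    return True
```

With these assumptions, is_indexable("K") is False and is_indexable("K+") is True, which matches the potassium example in the table. Passing extra=set() models a site that indexes only the default characters.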


5. Formulating a Query

freeWAIS-sf uses the original WAIS protocol. This protocol was designed just for transporting a free-text query, that is, a list of search terms separated by spaces. U. Pfeifer, the author of freeWAIS-sf, deliberately decided to use this old protocol so that all existing clients could use the new features.

He therefore had to encode a new and richer query semantics in the query string. This means that the query has to obey a certain syntax and, consequently, a user may get a syntax error when submitting a query. Pfeifer's goal was to make the syntax as easy as possible and, especially, to leave simple free-text queries valid.

It had to be possible to select, for each term in the query, the category to be searched. To leave the original queries valid (and to support casual users), Pfeifer provided a default category, which is used if no category is specified in the query.
An outline of the query language is given here.

  1. ATOMIC SEARCH EXPRESSIONS
    The atomic search expressions of the language are:

  2. STEMMING
    Stemming is handled transparently for the client. Terms searched in a stemmed category are automatically searched using their word stem. For a wildcard (only tail truncation is implemented), all matching words from the dictionary are used as search terms. Phrase searches (also called literal searches) look up the words in the string; at least one of them must be an index term. The server then scans the documents containing this word for a string matching the complete phrase. This means that a string search can only work if the server has access to the documents. For type URL this is not the case.
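
Tail truncation can be pictured as a prefix match against the dictionary of indexed words. The following is a minimal sketch, not the server's actual code: the function expand_wildcard and the sample word list are assumptions made for illustration.

```python
def expand_wildcard(pattern, dictionary):
    """Return the dictionary words matched by a tail-truncated pattern such as 'mol*'."""
    if not pattern.endswith("*") or "*" in pattern[:-1]:
        raise ValueError("only tail truncation (one trailing *) is supported")
    prefix = pattern[:-1].lower()
    # Every word that starts with the prefix becomes a search term.
    return sorted(word for word in dictionary if word.lower().startswith(prefix))


# Hypothetical dictionary of indexed words.
words = ["molecular", "molecule", "biology", "mole", "model"]
expanded = expand_wildcard("mol*", words)
```

Here expanded contains mole, molecular and molecule, while model and biology are left out because they do not begin with mol.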

  3. SOUNDEX & PHONIX
    The prefix operators soundex and phonix are allowed for converting the query term into its Soundex/Phonix code. This is very useful for example when searching in phonebooks if the exact spelling of a name is not known.
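
To give an idea of what a Soundex code is, here is a sketch of the common American Soundex rules (first letter kept, following consonants mapped to digits, result padded to four characters). It is an assumption that freeWAIS-sf codes names in exactly this way; its implementation may differ in details, and Phonix is not shown.

```python
# American Soundex digit groups: letters in the same group sound alike.
_CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for letter in letters:
        _CODES[letter] = str(digit)


def soundex(name):
    """Return the four-character American Soundex code of a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = _CODES.get(ch, "")  # vowels and Y have no digit
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "000")[:4]
```

Under these rules soundex("salatan") and soundex("Salton") both give S435, which is why a Soundex search for one spelling can find the other.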

  4. BOOLEAN OPERATORS
    Arbitrary Boolean combinations of these atomic expressions with the binary operators "and", "or" and "not" (where "not" means "and not") are allowed. If you query a freeWAIS-sf database via SFgate, the French, German, Italian, Spanish, Dutch, Swedish and Portuguese equivalents are also allowed in the HTML forms where this is specified. In the case of Italian, these boolean operators are:
    Parentheses can be used for grouping boolean operators.
    With classic clients (waissearch, waisq, etc.), "OR" may be omitted, which maintains compatibility with the original syntax.
    However, when using SFgate 4.0.30 or higher to query freeWAIS-sf databases, it is also possible to omit "AND" instead of "OR".

  5. CATEGORIES
    For each expression, a semantic category (field) can be defined using the "category pred" operator, where pred can be:
    1. Text categories
      • = (equal to)
    2. Numeric categories
      • == (equal to)
      • <= (equal to or less than)
      • < (less than)
      • > (greater than)
      • >= (equal to or greater than)
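
The "category pred" form can be assembled mechanically. The helper below is a hypothetical sketch (field_query is not part of freeWAIS-sf or of any client); it only builds the query string, using the predicates listed above. The field names ti and py are assumed names for a title and a publication-year field. Phrase searches in quotes are not covered.

```python
TEXT_PREDICATES = {"="}
NUMERIC_PREDICATES = {"==", "<=", "<", ">", ">="}


def field_query(category, pred, expression):
    """Build a 'category pred expression' query string."""
    if pred in TEXT_PREDICATES:
        expression = str(expression).strip()
        # Parentheses keep every term of a multi-term expression inside the category.
        if " " in expression:
            expression = f"({expression})"
        return f"{category}={expression}"
    if pred in NUMERIC_PREDICATES:
        return f"{category}{pred}{int(expression)}"
    raise ValueError(f"unknown predicate: {pred!r}")


title_query = field_query("ti", "=", "molecular and biology")
year_query = field_query("py", "==", 1990)
```

The two results are ti=(molecular and biology) and py==1990.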

Examples

Here are some examples of possible queries, each followed by an explanation of its syntax.
We assume a database that has at least three fields; the examples below use the field names ti (TITLE), py (PUBLICATION YEAR), ed (EDITION DATE) and au.

Please note that these examples refer to classic clients (waissearch, waisq, etc.), NOT to SFgate ones.
You can find LIVE examples for SFgate by choosing documents in the HOWTO section.

molecular biology
    Free text query in the global index. Find all documents containing the terms molecular OR biology in ALL the fields indexed as GLOBAL or BOTH.

molecular or biology
    Same as above.

ti=molecular biology
    Find all the documents which have the term molecular in the field TITLE OR the term biology in a field indexed as GLOBAL or BOTH.

ti=(molecular biology)
    Find all the documents which have the term molecular OR the term biology in the field TITLE.

ti=(molecular or biology)
    Same as above.

ti=(molecular and biology)
    Find all the documents which have the term molecular AND the term biology in the field TITLE.

ti=(molecular not biology)
    Find all the documents which have the term molecular, but NOT the term biology, in the field TITLE. The term biology may still be present in other fields indexed as GLOBAL or BOTH.

py==1990
    Find all the documents which have the field PUBLICATION YEAR numerically equal to 1990.

py<=1990
    Find all the documents which have the field PUBLICATION YEAR numerically equal to or less than 1990.

py>1990
    Find all the documents which have the field PUBLICATION YEAR numerically greater than 1990.

ed<19930101
    Find all the documents which have the field EDITION DATE older than January 1, 1993. This is in fact a Date Search whose format is yyyymmdd, where:
      • yyyy stands for the year.
      • mm stands for the month; valid values are 01-12.
      • dd stands for the day; valid values are 01-31.

au=(soundex salatan)
    This is a soundex search. Match terms which sound like salatan, e.g. "Salton".

ti="molecular biology"
    This is a phrasal (literal) search. Find all the documents containing the term molecular immediately followed by the term biology in the field TITLE. Please note the use of quotes to delimit a literal search.

mol*
    This is a global wildcard search. All the documents containing terms beginning with mol, in all the fields indexed as GLOBAL or BOTH, will be found.

(molecular w/10 biology)
    This is a proximity search. With this feature, you can search for all the documents that have the terms molecular and biology within 10 terms of each other. The order of the terms is not important, i.e. molecular can precede or follow the term biology.
    It is NOT implemented at BioPD.

(molecular pre/10 biology)
    This is a proximity search too. In this case the order of the terms is important: this search will find all the documents that have molecular up to 10 terms before biology.
    It is NOT implemented at BioPD.

ti=(molecular w/2 biology)
    Another case of proximity search. In this case proximity works within the field TITLE. It will find all the documents which have in the TITLE the term molecular within 2 terms of biology. Please note that you must use parentheses around the terms you want to look for in the field.
    It is NOT implemented at BioPD.

(atleast/10 biology)
    This is an "at least" search, a particular case of proximity search. It finds every document that has at least 10 occurrences of biology. The atleast keyword has to be all lower-case, and there cannot be any spaces between "at", "least" and the number, i.e. at least 10 biology will not work.
    It is NOT implemented at BioPD.
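
For the Date Search shown in the examples, the yyyymmdd value can be produced from a calendar date. This is a small sketch only: date_query is a hypothetical helper, and ed is the assumed name of the edition-date field used in the example.

```python
from datetime import date

NUMERIC_PREDICATES = {"==", "<=", "<", ">", ">="}


def date_query(category, pred, day):
    """Build a date search such as ed<19930101 from a datetime.date."""
    if pred not in NUMERIC_PREDICATES:
        raise ValueError(f"unknown predicate: {pred!r}")
    # yyyymmdd: four-digit year, two-digit month (01-12), two-digit day (01-31)
    stamp = f"{day.year:04d}{day.month:02d}{day.day:02d}"
    return f"{category}{pred}{stamp}"


older_than_1993 = date_query("ed", "<", date(1993, 1, 1))
```

The result is ed<19930101, the same query as in the example. Because datetime.date rejects impossible dates, a value such as month 13 cannot be encoded by mistake.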



© 1996-97 BioPD - University of Padova - Author: Leopoldo Saggin
Mail to: lsaggin@civ.bio.unipd.it - Last Revision: September 2, 1997
Tested on Netscape 1.22 and higher