From owner-freebsd-questions  Sat May 12 19:44:18 2001
Delivered-To: freebsd-questions@freebsd.org
Received: from nisser.com (c0039.upc-c.chello.nl [212.187.0.39])
	by hub.freebsd.org (Postfix) with ESMTP id B477837B423
	for <questions@FreeBSD.ORG>; Sat, 12 May 2001 19:44:14 -0700 (PDT)
	(envelope-from roelof@nisser.com)
Received: from nisser.com (roelof [10.0.0.2])
	by nisser.com (8.9.3/8.9.2) with ESMTP id EAA55305;
	Sun, 13 May 2001 04:43:43 +0200 (CEST)
	(envelope-from roelof@nisser.com)
Message-ID: <3AFDF4DE.1196C383@nisser.com>
Date: Sun, 13 May 2001 04:43:42 +0200
From: Roelof Osinga <roelof@nisser.com>
Organization: eBOA - Programming the Web
X-Mailer: Mozilla 4.77 [en] (Windows NT 5.0; U)
X-Accept-Language: en,pdf
MIME-Version: 1.0
To: Mike Meyer <mwm@mired.org>
Cc: Nathan@Vidican.com, questions@FreeBSD.ORG,
	Ted Mittelstaedt <tedm@toybox.placo.com>
Subject: Re: email to SQL
References: <15100.38970.996390.52851@guru.mired.org>
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Sender: owner-freebsd-questions@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Mike Meyer wrote:
> 
> ...
> > >it will be much faster than trying to search a couple hundred
> > >thousand lines
> > >of a text file.
> 
> I think you've misfigured. The amount of time it takes to search a
> text is pretty much determined by the search algorithm, not whether
> the text is stored in an SQL server or a flat file. In fact, assuming
> the same search algorithm is being used, the flat text file should be
> faster. mmap it in and you've got it all to search. Since your text is
> be scattered across multiple database rows, it will take more than
> that for the SQL server to load it before it can start searching.

That's for regular searching.

> The best text search algorithm is to prepare an index of the stuff
> before you need to search it. It's possible to store index information
> in a database and search those efficiently, but I'm not sure that's
> the most efficient tack to take.  Datablades - if mysql has those,
> *please* let me know! - might be useful here, but I've not had a
> chance to play with them.  Someone who's more current on the issue may
> suggest something else.  Unless your requirements are strange, your
> best bet is probably using a text search tool of some kind, preferably
> one that text that's structured like mail messages.  The best sucess
> I've had is with WAIS (there are two versions in the ports), and your
> database seems to be small enough for it to handle.

I don't know about datablades - Informix I believe - but the traditional
answer to this is to use inverted files. I used that approach once
on a TurboDOS system, works fine. Basically you stuff each reference
to non-filler words (those being words like 'the', 'for', 'in', etc)
at the end of the list which is indexed by word. Variable length if
possible, although I mimicked that by adding a new row every 10 or so
references were added to a word.

If you want to get fancy you can add all sorts of options like going
to the root of a verb or noun and using a list of synonyms. There is
by now extensive literature on the topic available.

You'll also need a mechanism to handle queries. The Z39.50 (can't be far
off ;) used bij WAIS is a very extensive one. I don't remember what
I had written for that. But basically you expect some words with some
boolean operators, then you hit the inverted files to retrieve a
list of references which you feed to an expression evaluator. Like
if 'A and (B or C)' then keep the reference, when done display. By using
the inverted lists this is a simple matter of checking whether or
not each reference is in those lists according to the pattern.

Basically you aren't searching for word patterns but for reference
number patterns. No Boyer-Moore needed. Very fast.

Roelof

-- 
_______________________________________________________________________
eBOAŽ                                               est. 1982
http://eBOA.com/                                    tel. +31-58-2123014
mailto:info@eBOA.com?subject=Information_request    fax. +31-58-2160293

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-questions" in the body of the message