From: Roelof Osinga
Date: Sun, 13 May 2001 04:43:42 +0200
To: Mike Meyer
Cc: Nathan@Vidican.com, questions@FreeBSD.ORG, Ted Mittelstaedt
Subject: Re: email to SQL

Mike Meyer wrote:
>
> ...
> > >it will be much faster than trying to search a couple hundred
> > >thousand lines
> > >of a text file.
>
> I think you've misfigured. The amount of time it takes to search a
> text is pretty much determined by the search algorithm, not whether
> the text is stored in an SQL server or a flat file. In fact, assuming
> the same search algorithm is being used, the flat text file should be
> faster. mmap it in and you've got it all to search. Since your text is
> scattered across multiple database rows, it will take more than
> that for the SQL server to load it before it can start searching.

That's for regular searching.

> The best text search algorithm is to prepare an index of the stuff
> before you need to search it. It's possible to store index information
> in a database and search those efficiently, but I'm not sure that's
> the most efficient tack to take. Datablades - if mysql has those,
> *please* let me know! - might be useful here, but I've not had a
> chance to play with them. Someone who's more current on the issue may
> suggest something else. Unless your requirements are strange, your
> best bet is probably using a text search tool of some kind, preferably
> one that handles text that's structured like mail messages. The best
> success I've had is with WAIS (there are two versions in the ports),
> and your database seems to be small enough for it to handle.

I don't know about datablades - Informix I believe - but the
traditional answer to this is to use inverted files. I used that
approach once on a TurboDOS system, works fine.

Basically you append each reference to a non-filler word (filler words
being ones like 'the', 'for', 'in', etc.) to the end of that word's
list; the lists are indexed by word. Variable-length lists if possible,
although I mimicked that by adding a new row each time another 10 or so
references had been added to a word.

If you want to get fancy you can add all sorts of options, like
reducing a verb or noun to its root and using a list of synonyms. There
is by now extensive literature available on the topic.

You'll also need a mechanism to handle queries. The Z39.50 (can't be
far off ;) used by WAIS is a very extensive one. I don't remember what
I had written for that. But basically you expect some words with some
boolean operators, then you hit the inverted files to retrieve a list
of references which you feed to an expression evaluator. Like if
'A and (B or C)' then keep the reference, when done display.

By using the inverted lists this is a simple matter of checking whether
or not each reference is in those lists according to the pattern.
Basically you aren't searching for word patterns but for reference
number patterns. No Boyer-Moore needed. Very fast.

Roelof

--
_______________________________________________________________________
eBOA est. 1982
http://eBOA.com/                                    tel. +31-58-2123014
mailto:info@eBOA.com?subject=Information_request    fax. +31-58-2160293
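
P.S. For anyone who wants to see the inverted-file idea in concrete form, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the TurboDOS code mentioned above: the filler-word list, the nested-tuple query format, and the names build_index and evaluate are all made up for the example, and the index lives in memory rather than in database rows.

```python
import re
from collections import defaultdict

# Hypothetical filler-word list; a real one would be much longer.
STOP_WORDS = {"the", "for", "in", "a", "an", "and", "or", "of", "to", "is"}


def build_index(messages):
    """Map each non-filler word to the set of message numbers containing it."""
    index = defaultdict(set)
    for ref, text in messages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(ref)
    return index


def evaluate(query, index):
    """Evaluate a query against the index and return matching references.

    A query is either a single word (str) or a tuple
    ("and" | "or", left, right), where left and right are queries.
    This tuple format is an assumption made for the sketch.
    """
    if isinstance(query, str):
        return index.get(query.lower(), set())
    op, left, right = query
    lhs, rhs = evaluate(left, index), evaluate(right, index)
    if op == "and":
        return lhs & rhs  # references present in both lists
    if op == "or":
        return lhs | rhs  # references present in either list
    raise ValueError(f"unknown operator: {op!r}")


# Made-up sample data: message number -> message text.
messages = {
    1: "Searching a flat text file with grep",
    2: "Storing mail in an SQL database",
    3: "Indexing mail for fast text search",
}
index = build_index(messages)

# 'mail and (sql or search)'
hits = evaluate(("and", "mail", ("or", "sql", "search")), index)
print(sorted(hits))
```

The query never touches the message text itself: it only intersects and unions sets of reference numbers, which is why no string-search algorithm is involved at query time. Stemming and synonym lists would slot in where the words are normalised, before they are added to or looked up in the index.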