Date: Sat, 12 May 2001 22:59:47 -0500 From: Mike Meyer <mwm@mired.org> To: Roelof Osinga <roelof@nisser.com> Cc: Nathan@Vidican.com, questions@FreeBSD.ORG Subject: Re: email to SQL Message-ID: <15102.1715.514686.357518@guru.mired.org> In-Reply-To: <3AFDF4DE.1196C383@nisser.com> References: <15100.38970.996390.52851@guru.mired.org> <3AFDF4DE.1196C383@nisser.com>
next in thread | previous in thread | raw e-mail | index | archive | help
Roelof Osinga <roelof@nisser.com> types: > Mike Meyer wrote: > > > >it will be much faster than trying to search a couple hundred > > > >thousand lines > > > >of a text file. > > I think you've misfigured. The amount of time it takes to search a > > text is pretty much determined by the search algorithm, not whether > > the text is stored in an SQL server or a flat file. In fact, assuming > > the same search algorithm is being used, the flat text file should be > > faster. mmap it in and you've got it all to search. Since your text is > > be scattered across multiple database rows, it will take more than > > that for the SQL server to load it before it can start searching. > That's for regular searching. Yup. > > The best text search algorithm is to prepare an index of the stuff > > before you need to search it. It's possible to store index information > > in a database and search those efficiently, but I'm not sure that's > > the most efficient tack to take. Datablades - if mysql has those, > > *please* let me know! - might be useful here, but I've not had a > > chance to play with them. Someone who's more current on the issue may > > suggest something else. Unless your requirements are strange, your > > best bet is probably using a text search tool of some kind, preferably > > one that text that's structured like mail messages. The best sucess > > I've had is with WAIS (there are two versions in the ports), and your > > database seems to be small enough for it to handle. > I don't know about datablades - Informix I believe - but the traditional > answer to this is to use inverted files. I used that approach once > on a TurboDOS system, works fine. Basically you stuff each reference > to non-filler words (those being words like 'the', 'for', 'in', etc) > at the end of the list which is indexed by word. Variable length if > possible, although I mimicked that by adding a new row every 10 or so > references were added to a word. That's basically what the second sentence in the paragraph of mine you quoted says. I added weight information to the index table, so that I could generate a weight for each page and search phrase, without having to ever look at the text. > You'll also need a mechanism to handle queries. The Z39.50 (can't be far > off ;) used bij WAIS is a very extensive one. I don't remember what > I had written for that. But basically you expect some words with some > boolean operators, then you hit the inverted files to retrieve a > list of references which you feed to an expression evaluator. Like > if 'A and (B or C)' then keep the reference, when done display. By using > the inverted lists this is a simple matter of checking whether or > not each reference is in those lists according to the pattern. z39.50 is a protocol for talking to structured text search engines. You've just barely touched on the kind of things it will allow you to do. For instance, combining the above on the body with searches that are restricted to headers, proximity relationships between words, common spelling errors, etc. As you indicated, all this is in the literature if you look for it. > Basically you aren't searching for word patterns but for reference > number patterns. No Boyer-Moore needed. Very fast. Yup. An SQL server can do this very fast if you structure the database properly. But a search engine designed specifically for text will also do it very fast, without requiring either a DBA to set it up, or code outside the server to turn the text into indices. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/home/mwm/ Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-questions" in the body of the message
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?15102.1715.514686.357518>