Date: Mon, 30 Mar 1998 13:36:52 -0800 (PST)
From: Simon Shapiro <shimon@simon-shapiro.org>
To: John Fieber <jfieber@indiana.edu>
Cc: freebsd-database@FreeBSD.ORG
Subject: RE: Mail indexing infrastructure
Message-ID: <XFMail.980330133652.shimon@simon-shapiro.org>
In-Reply-To: <Pine.BSF.3.96.980330154855.8177B-100000@fallout.campusview.indiana.edu>
On 30-Mar-98 John Fieber wrote:
> On Mon, 30 Mar 1998, Simon Shapiro wrote:
>
>> > The FreeBSD mailing list archive is 620MB large. There are currently
>> > 270,000 messages. The archive grows by 100,000 messages/year.
>>
>> Excellent. How many years back do we want to keep?
>
> The current indexed archive goes back to 1994.

This is not an answer to my question :-) Currently we are keeping 4 years.
Do we want to keep 40? 10? 5? Some (theoretical) limit has to be set.

>> Also, if the current engine is so great, how come all these people are
>> excited about replacing it?
>
> Thread retrieval and date scoping. However, most proposed
> solutions involve a wholesale replacement rather than augmenting
> what we have, which works pretty well, all told.

If thread retrieval is based on the Subject: line, an RDBMS is a trivially
good solution. One can even apply a regex to the subject, limit dates, etc.

I admit to having an interest in this which goes beyond mail archive search.
In this context, though, my RDBMS tilt can be viewed as intellectually
satisfying. If the current system is good and should only be augmented,
rather than replaced, that is fine by me.

> Basically, the vector-space ranked retrieval we already have,
> possibly scoped by date, is the best way to start a search,
> followed by thread retrieval once a promising message has been
> found. Wolfram's home-brew solution for threads is more along the
> lines of what we need.

Don't confuse solutions and problems. You currently have a text searching
system, which you are happy with. Aside from that, I am not replacing it or
augmenting it, just pondering the problems that exist and whether an RDBMS
solution can be applied to them.

My take on it is that until we actually build the core RDBMS schema for it,
load it and run some tests, we will really not know if it is worth it in the
performance department. For other instances there are some other
considerations, of course.
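The Subject-based thread retrieval mentioned above, with a regex on the subject and a date limit, can be sketched against a small in-memory SQLite table. This is only an illustration, not anything proposed in the thread: the `messages` table, the `thread_key` column, the helper names and the sample rows are all assumptions, and SQLite stands in for whatever RDBMS would actually be used.

```python
import re
import sqlite3


def normalize_subject(subject):
    """Strip leading Re:/Fwd: prefixes so replies share one thread key (assumed rule)."""
    stripped = re.sub(r"^\s*((re|fwd?)\s*:\s*)+", "", subject, flags=re.IGNORECASE)
    return stripped.strip().lower()


def _regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" by calling regexp(Y, X), so the pattern comes first.
    return 1 if re.search(pattern, value or "", re.IGNORECASE) else 0


def build_archive(rows):
    """rows: iterable of (msg_id, sent_date as ISO 'YYYY-MM-DD', subject)."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("REGEXP", 2, _regexp)
    conn.execute(
        "CREATE TABLE messages ("
        " msg_id TEXT PRIMARY KEY,"
        " sent_date TEXT NOT NULL,"
        " subject TEXT NOT NULL,"
        " thread_key TEXT NOT NULL)"
    )
    # Index chosen so a thread lookup within a date range needs no full scan.
    conn.execute("CREATE INDEX idx_thread ON messages (thread_key, sent_date)")
    conn.executemany(
        "INSERT INTO messages VALUES (?, ?, ?, ?)",
        [(m, d, s, normalize_subject(s)) for m, d, s in rows],
    )
    return conn


def thread_of(conn, subject, since, until):
    """Message ids sharing the subject's thread key, limited to a date range."""
    cur = conn.execute(
        "SELECT msg_id FROM messages"
        " WHERE thread_key = ? AND sent_date BETWEEN ? AND ?"
        " ORDER BY sent_date",
        (normalize_subject(subject), since, until),
    )
    return [row[0] for row in cur]


def subjects_matching(conn, pattern, since, until):
    """Message ids whose subject matches a regex, limited to a date range."""
    cur = conn.execute(
        "SELECT msg_id FROM messages"
        " WHERE subject REGEXP ? AND sent_date BETWEEN ? AND ?"
        " ORDER BY sent_date",
        (pattern, since, until),
    )
    return [row[0] for row in cur]


# Hypothetical sample rows, for illustration only.
sample = [
    ("a@example", "1998-03-29", "Mail indexing infrastructure"),
    ("b@example", "1998-03-30", "RE: Mail indexing infrastructure"),
    ("c@example", "1998-03-30", "Re: SCSI driver"),
    ("d@example", "1997-01-01", "Re: Mail indexing infrastructure"),
]
archive = build_archive(sample)
thread = thread_of(archive, "Re: mail indexing infrastructure", "1998-01-01", "1998-12-31")
matches = subjects_matching(archive, r"index(ing)?", "1998-01-01", "1998-12-31")
```

Grouping purely by normalized subject is the simplest possible rule and will merge unrelated threads that happen to share a subject; whether it performs acceptably on an archive of this size is exactly the kind of thing that would need the load-and-test step described above.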
> I have working date scoping in prototype, but there are
> performance problems--freeWAIS really doesn't handle that sort of
> thing very well and I'm a bit concerned about killing
> www.freebsd.org with it because I know it will be a popular
> feature.

Of course. You will be doing a ``full table scan'' for date scoping.

> I also have half a mind to provide relevance feedback (a "find
> more like this..." link) but my free time is much smaller than
> the things I have to fill it with. :(

This is where an RDBMS can help. You do not arrange the data for a query.
You ``normalize'' the data. Queries come later, in an unplanned-for manner,
and are serviced with reasonable efficiency.

----------

Sincerely Yours,

Simon Shapiro
Shimon@Simon-Shapiro.ORG
Voice: 503.799.2313

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message
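The ``full table scan'' remark about date scoping can be illustrated by asking a database for its query plan before and after an index on the date column exists. This is a sketch under assumptions: it uses SQLite rather than freeWAIS or any RDBMS named in the thread, and the `messages` table, the `idx_messages_date` index and the generated rows are made up for the example.

```python
import sqlite3

# Date-scoped query; dates are stored as ISO 'YYYY-MM-DD' text so they sort correctly.
QUERY = "SELECT subject FROM messages WHERE sent_date BETWEEN ? AND ?"


def make_archive():
    """Build a small in-memory table of synthetic messages dated 1994 to 1998."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        " id INTEGER PRIMARY KEY,"
        " sent_date TEXT NOT NULL,"
        " subject TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO messages (sent_date, subject) VALUES (?, ?)",
        [
            ("199%d-%02d-01" % (4 + i % 5, 1 + i % 12), "message %d" % i)
            for i in range(1000)
        ],
    )
    return conn


def query_plan(conn, since, until):
    """Return the human-readable detail of each step in SQLite's plan for QUERY."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (since, until)).fetchall()
    return [row[3] for row in rows]  # the fourth column is the detail text


conn = make_archive()

# No index on sent_date yet: every row has to be examined.
plan_without_index = query_plan(conn, "1998-03-01", "1998-03-31")

# With an index on sent_date, the range can be located directly.
conn.execute("CREATE INDEX idx_messages_date ON messages (sent_date)")
plan_with_index = query_plan(conn, "1998-03-01", "1998-03-31")

print(plan_without_index)
print(plan_with_index)
```

The first plan should describe a scan of the whole table and the second a search using `idx_messages_date`; the exact wording of the plan text differs between SQLite versions. The table itself is not rearranged for this query, which is the point made above: the index is added later, when the query turns out to be wanted.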
