FreeBSD Mail Archives

Date:      Mon, 30 Mar 1998 13:36:52 -0800 (PST)
From:      Simon Shapiro <shimon@simon-shapiro.org>
To:        John Fieber <jfieber@indiana.edu>
Cc:        freebsd-database@FreeBSD.ORG
Subject:   RE: Mail indexing infrastructure
Message-ID:  <XFMail.980330133652.shimon@simon-shapiro.org>
In-Reply-To: <Pine.BSF.3.96.980330154855.8177B-100000@fallout.campusview.indiana.edu>

On 30-Mar-98 John Fieber wrote:
> On Mon, 30 Mar 1998, Simon Shapiro wrote:
> 
>> > The FreeBSD mailing list archive is 620MB. There are currently
>> > 270,000 messages. The archive grows by 100,000 messages/year.
>> 
>> Excellent.  How many years back do we want to keep?
> 
> The current indexed archive goes back to 1994.

This is not an answer to my question :-)  Currently we are keeping 4 years.
Do we want to keep 40? 10? 5? Some (theoretical) limit has to be set.
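
To put rough numbers on the retention question, here is a back-of-envelope
sketch. It assumes the figures quoted above (620MB, 270,000 messages,
100,000 new messages/year) and, as a simplification of mine, that the
average message size and the growth rate both stay constant.

```python
# Figures quoted earlier in the thread.
ARCHIVE_MB = 620.0
MESSAGES = 270_000
GROWTH_PER_YEAR = 100_000  # messages added per year


def projected_size_mb(years_from_now):
    """Estimate archive size, assuming constant message size and growth."""
    avg_message_mb = ARCHIVE_MB / MESSAGES
    return ARCHIVE_MB + avg_message_mb * GROWTH_PER_YEAR * years_from_now


# 1, 6 and 36 more years on top of the 4 already kept (5, 10, 40 in total).
for years in (1, 6, 36):
    print(f"+{years:2d} years: about {projected_size_mb(years):.0f} MB")
```

Under those assumptions the archive grows by roughly 230MB per year, so
the limit is mostly a policy decision rather than a storage one.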

>> Also, if the current engine is so great, how come all these people are
>> excited about replacing it?
> 
> Thread retrieval and date scoping.  However, most proposed
> solutions involve a wholesale replacement rather than augmenting
> what we have, which works pretty well, all told.

If thread retrieval is based on Subject: line, an RDBMS is a trivially good
solution.  One can even apply regex to the subject, limit dates, etc.
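
A minimal sketch of that idea, using SQLite through Python only because
it is easy to try. The table layout, the thread_key column and the helper
names are hypothetical, not anything the archive uses today. SQLite has
no built-in REGEXP implementation, so one is registered by hand.

```python
import re
import sqlite3


def normalize_subject(subject):
    """Strip leading 'Re:' prefixes so replies share one thread key."""
    stripped = re.sub(r"^(\s*re\s*:\s*)+", "", subject, flags=re.IGNORECASE)
    return stripped.strip().lower()


def regexp(pattern, value):
    """Backs the SQL 'value REGEXP pattern' operator (case-insensitive)."""
    if value is None:
        return 0
    return 1 if re.search(pattern, value, re.IGNORECASE) else 0


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute(
    """CREATE TABLE messages (
           id         INTEGER PRIMARY KEY,
           sent       TEXT NOT NULL,   -- ISO date, sorts chronologically
           subject    TEXT NOT NULL,
           thread_key TEXT NOT NULL    -- normalized subject
       )"""
)

# Made-up sample rows for illustration.
sample = [
    ("1998-03-28", "Mail indexing infrastructure"),
    ("1998-03-30", "RE: Mail indexing infrastructure"),
    ("1998-03-30", "Re: RE: Mail indexing infrastructure"),
    ("1997-11-02", "SCSI tape timeouts"),
]
conn.executemany(
    "INSERT INTO messages (sent, subject, thread_key) VALUES (?, ?, ?)",
    [(sent, subj, normalize_subject(subj)) for sent, subj in sample],
)


def thread_of(conn, subject):
    """Return every message whose normalized subject matches."""
    return conn.execute(
        "SELECT sent, subject FROM messages "
        "WHERE thread_key = ? ORDER BY sent, id",
        (normalize_subject(subject),),
    ).fetchall()


def search_subjects(conn, pattern, start, end):
    """Regex match on the subject, limited to a date range."""
    return conn.execute(
        "SELECT sent, subject FROM messages "
        "WHERE subject REGEXP ? AND sent BETWEEN ? AND ? "
        "ORDER BY sent, id",
        (pattern, start, end),
    ).fetchall()
```

Calling thread_of(conn, "Re: Mail indexing infrastructure") returns the
three messages of that thread.  Threading on Subject: alone will merge
unrelated threads that happen to share a subject, which is the price of
the simplicity.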

I admit to having an interest in this which goes beyond mail archive search.
In this context, though, my RDBMS tilt can be viewed as intellectually
satisfying.  If the current system is good and should only be augmented,
rather than replaced, this is fine by me.

> Basically, the vector-space ranked retrieval we already have,
> possibly scoped by date, is the best way to start a search,
> followed by thread retrieval once a promising message has been
> found. Wolfram's home-brew solution for threads is more along the
> lines of what we need.

Don't confuse solutions and problems.  You currently have a text searching
system, which you are happy with.  Aside from that, I am not replacing it
or augmenting it, just pondering the problems that exist and whether an
RDBMS solution can be applied to them.

My take on it is that until we actually build the core RDBMS schema for
it, load it and run some tests, we will not really know if it is worth it
in the performance department.  For other instances there are other
considerations, of course.

> I have working date scoping in prototype, but there are
> performance problems--freeWAIS really doesn't handle that sort of
> thing very well and I'm a bit concerned about killing
> www.freebsd.org with it because I know it will be a popular
> feature.

Of course.  You will be doing a ``full table scan'' for date scoping.
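
For contrast, here is a sketch of how an RDBMS avoids the scan once the
date column is indexed.  It uses SQLite's EXPLAIN QUERY PLAN; the table
and index names are hypothetical, and the exact plan wording varies
between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, sent TEXT NOT NULL, subject TEXT NOT NULL)"
)

# A date-scoped query of the kind discussed above.
QUERY = "SELECT id FROM messages WHERE sent BETWEEN ? AND ?"


def query_plan(conn):
    """Return the planner's description of how QUERY would be executed."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN " + QUERY, ("1998-01-01", "1998-03-31")
    ).fetchall()
    # The human-readable detail is the last column of each row.
    return " ".join(row[-1] for row in rows)


plan_without_index = query_plan(conn)   # a SCAN of the whole table
conn.execute("CREATE INDEX messages_sent ON messages (sent)")
plan_with_index = query_plan(conn)      # a SEARCH using messages_sent

print(plan_without_index)
print(plan_with_index)
```

Whether the index pays off on the real archive is exactly the kind of
thing that needs to be loaded and measured.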

> I also have half a mind to provide relevance feedback (a "find
> more like this..." link) but my free time is much smaller than
> the things I have to fill it with.  :(

This is where an RDBMS can help.  You do not arrange the data for a query.
You ``normalize'' the data.  Queries come later, in an unplanned-for manner,
and are serviced with reasonable efficiency.
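
As an illustration only, here is a small normalized layout and a query
nobody planned for when the tables were designed.  The schema, the
addresses and the sample rows are all made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Each fact is stored once; messages refer to authors and lists by id.
    CREATE TABLE authors (id INTEGER PRIMARY KEY, address TEXT NOT NULL UNIQUE);
    CREATE TABLE lists   (id INTEGER PRIMARY KEY, name    TEXT NOT NULL UNIQUE);
    CREATE TABLE messages (
        id        INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        list_id   INTEGER NOT NULL REFERENCES lists(id),
        sent      TEXT NOT NULL          -- ISO date
    );
    INSERT INTO authors (id, address) VALUES
        (1, 'alice@example.org'), (2, 'bob@example.org');
    INSERT INTO lists (id, name) VALUES
        (1, 'freebsd-database'), (2, 'freebsd-hackers');
    INSERT INTO messages (author_id, list_id, sent) VALUES
        (1, 1, '1998-03-28'),
        (2, 1, '1998-03-30'),
        (1, 1, '1998-03-30'),
        (1, 2, '1997-11-02');
    """
)


def posts_per_author_list_year(conn):
    """An ad hoc question: who posted how much, where, and in which year?"""
    return conn.execute(
        """
        SELECT a.address, l.name, substr(m.sent, 1, 4) AS year, COUNT(*)
        FROM messages AS m
        JOIN authors  AS a ON a.id = m.author_id
        JOIN lists    AS l ON l.id = m.list_id
        GROUP BY a.address, l.name, year
        ORDER BY a.address, l.name, year
        """
    ).fetchall()


for row in posts_per_author_list_year(conn):
    print(row)
```

Nothing in the layout anticipates this particular question; the join and
GROUP BY assemble the answer from the normalized tables.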

----------

Sincerely Yours, 

Simon Shapiro
Shimon@Simon-Shapiro.ORG                      Voice:   503.799.2313

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message

Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?XFMail.980330133652.shimon>
