From owner-freebsd-database  Mon Mar 30 06:49:01 1998
Date: Mon, 30 Mar 1998 09:48:45 -0500 (EST)
From: John Fieber <jfieber@indiana.edu>
To: nik@iii.co.uk
cc: shimon@simon-shapiro.org, Wolfram Schneider <wosch@cs.tu-berlin.de>,
        freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org,
        Satoshi Asami <asami@FreeBSD.ORG>,
        Amancio Hasty <hasty@rah.star-gate.com>
Subject: Re: Mailing list search interface
In-Reply-To: <19980330110200.17368@iii.co.uk>
Message-ID: <Pine.BSF.3.96.980330091604.485T-100000@fallout.campusview.indiana.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-database@FreeBSD.ORG
Precedence: bulk

On Mon, 30 Mar 1998 nik@iii.co.uk wrote:

> I mentioned MHonArc to Jordan, and his first response was 
> 
> > Eeek!  The evil MHonArc resurfaces! ;-)
> >
> > It doesn't scale at all well - just try MHonArc'ing a really big mailing
> > list archive.  You soon get a set of monster html files that are
> > essentially unusable - I know, I did the short-lived "FreeBSD Docs"
> > CD for awhile using MHonArc.

Listen to the man!  He knows what he is talking about...well, in
this case at least.  :)

> I think he's been using an older version of MHonArc. I did some tests
> late last week, archiving and indexing the archives for -hackers from
> the beginning of 1998. That's 11,265K or thereabouts.
> 
> At the end of the conversion (which consisted of running MHonArc 2.2.0
> over the files, and then using Glimpse 4.1 to index them) I had a total
> of 32,910K HTML and index files.
> 
> The output of 'time -l' on the conversion process was:
>
>       626.11 real       438.83 user        93.13 sys

On what sort of hardware?

By quick back-of-an-envelope calculations, this is slower than
the current indexing scheme on hub by at least a factor of 10.
Indexing anything large is typically an I/O bound operation and
when you start indexing much more than can fit in RAM, your
performance will degrade dramatically, so it is probably slower
by much more than a factor of 10.

It currently takes about 45 minutes to index all 620+ megabytes
of mail from scratch on hub and most of that is waiting for disk
I/O, since the disks on hub are pretty busy even without the
indexing activity.
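
As a rough check on the factor-of-10 estimate, the figures quoted
in this thread can be turned into throughput numbers.  This is
only a sketch under stated assumptions: 1K is taken as 1024
bytes, "620+ megabytes" as exactly 620, and "about 45 minutes" as
exactly 45.  The MHonArc run also includes the HTML conversion
and ran on unknown hardware, so the comparison is approximate.

```python
# Rough throughput comparison using only the figures quoted in this thread.
# Assumptions (not stated in the thread): 1 MB = 1024K, "620+ megabytes"
# is taken as exactly 620, and "about 45 minutes" as exactly 45 minutes.

KB_PER_MB = 1024


def throughput_kb_per_s(size_kb: float, seconds: float) -> float:
    """Return kilobytes processed per wall-clock second."""
    return size_kb / seconds


# MHonArc 2.2.0 + Glimpse 4.1 test: 11,265K of -hackers mail, 626.11 s real.
mhonarc = throughput_kb_per_s(11_265, 626.11)

# Current indexing on hub: 620 MB from scratch in about 45 minutes.
hub = throughput_kb_per_s(620 * KB_PER_MB, 45 * 60)

# How many times faster the hub indexing is, by wall-clock throughput.
ratio = hub / mhonarc

print(f"MHonArc + Glimpse: {mhonarc:6.1f} K/s")
print(f"hub indexing:      {hub:6.1f} K/s")
print(f"ratio:             {ratio:6.1f}x")
```

With these assumptions the ratio comes out at roughly 13, which
is consistent with "at least a factor of 10" before any slowdown
from exceeding RAM is taken into account.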

> At the end of the conversion process I had a threaded copy of the -hackers
> mail archives going back almost three months.

Three months of -hackers != 5 years of all the mailing lists.
I am confident that you will find that this scheme becomes a big
hairy hassle when you throw the whole thing at it.  

It is space inefficient because you have the original archive,
plus the HTML versions (most of which will *never* be viewed I
might add), the index, and the filesystem overhead of one file
per message. 
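
To put a number on the space cost, the test figures quoted above
can be extrapolated to the full archive.  This is a sketch, not a
measurement: it assumes the 32,910K of HTML and index files does
not include the original mail, that the expansion ratio stays
constant as the archive grows, and that "620+ megabytes" is
exactly 620.  Per-file filesystem overhead is ignored, so the
real figure would be higher.

```python
# Extrapolate the disk space needed if the whole archive were converted.
# Assumptions: the generated-to-original ratio seen in the -hackers test
# holds for all lists, and filesystem overhead per file is not counted.

ORIGINAL_KB = 11_265    # -hackers mail used in the test
GENERATED_KB = 32_910   # HTML plus index files produced from it
ARCHIVE_MB = 620        # "620+ megabytes", taken as exactly 620


def expansion(generated_kb: float, original_kb: float) -> float:
    """Return generated bytes per byte of original mail."""
    return generated_kb / original_kb


factor = expansion(GENERATED_KB, ORIGINAL_KB)
extra_mb = ARCHIVE_MB * factor      # HTML and index files only
total_mb = ARCHIVE_MB + extra_mb    # originals kept alongside them

print(f"expansion factor: {factor:.2f}")
print(f"generated files:  {extra_mb:.0f} MB")
print(f"total on disk:    {total_mb:.0f} MB")
```

Under these assumptions the generated files alone are close to
three times the size of the mail they were made from.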

Because the threading is done in batch mode, it is awkward to make
enhancements to the threading algorithm. 

It is a hassle to retroactively change the conversion to HTML.

Though I have no first-hand proof, knowing how Glimpse works, I
suspect searches will generate quite a bit more disk I/O on the
server than freeWAIS.

The ranking algorithm that Glimpse uses (or used last I checked)
is primitive. (In a survey of what people liked, hated and most
wanted in the mailing list archives, people wanted thread
searching and date sorting, but only second and third *after* the
currently implemented ranking algorithm, which most people found
to work very well most of the time.)

And on and on...  I think it is time to add an FAQ entry on why
we don't use hypermail or MHonArc for the mailing list archives. 


It isn't that things like MHonArc are not valiant efforts, but
they are merely refinements of what is fundamentally a
quick-and-dirty, non-scalable solution.  As I hinted in another
message, a proper solution would be based on a hybrid full
text/RDBMS.  Whether a true hybrid system is built, or just the
illusion of one using some crafty CGI scripts, is a detail to be
worked out.
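
One way to picture the hybrid idea is the minimal sketch below.
It is not a design anyone in this thread proposed in detail.  It
assumes SQLite (via Python's sqlite3 module) as a stand-in RDBMS,
a hand-built word index in place of a real full-text engine, and
raw term frequency as a crude substitute for a proper ranking
algorithm.  The table names, the helper functions build_db and
search, and the sample messages are all hypothetical.

```python
import re
import sqlite3

# Hypothetical sample data: (msgid, in_reply_to, list, date, subject, body).
SAMPLE = [
    ("<1@example>", None, "hackers", "1998-01-05",
     "Glimpse indexing", "glimpse index glimpse speed"),
    ("<2@example>", "<1@example>", "hackers", "1998-02-10",
     "Re: Glimpse indexing", "glimpse is slow"),
    ("<3@example>", None, "database", "1998-03-30",
     "RDBMS archive", "store mail in a database"),
]


def build_db(messages):
    """Load messages into a metadata table plus a word-postings table."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE message (
            id          INTEGER PRIMARY KEY,
            msgid       TEXT UNIQUE,
            in_reply_to TEXT,   -- kept so threads can be built at query time
            list        TEXT,
            sent        TEXT,   -- ISO 8601, so string order is date order
            subject     TEXT
        );
        CREATE TABLE posting (
            term       TEXT,
            message_id INTEGER REFERENCES message(id),
            freq       INTEGER,
            PRIMARY KEY (term, message_id)
        );
    """)
    for msgid, parent, mlist, sent, subject, body in messages:
        cur = db.execute(
            "INSERT INTO message (msgid, in_reply_to, list, sent, subject) "
            "VALUES (?, ?, ?, ?, ?)",
            (msgid, parent, mlist, sent, subject))
        row_id = cur.lastrowid
        # Count each lowercase word in the subject and body.
        counts = {}
        for term in re.findall(r"[a-z0-9]+", f"{subject} {body}".lower()):
            counts[term] = counts.get(term, 0) + 1
        db.executemany(
            "INSERT INTO posting VALUES (?, ?, ?)",
            [(term, row_id, n) for term, n in counts.items()])
    return db


def search(db, term, order="rank"):
    """Find messages containing term, ordered by frequency or by date."""
    # The ORDER BY clause comes from this fixed table, never from user input.
    order_by = {"rank": "p.freq DESC, m.sent DESC",
                "date": "m.sent DESC"}[order]
    rows = db.execute(
        "SELECT m.msgid, m.sent, m.subject, p.freq "
        "FROM posting p JOIN message m ON m.id = p.message_id "
        f"WHERE p.term = ? ORDER BY {order_by}",
        (term.lower(),))
    return rows.fetchall()


db = build_db(SAMPLE)
for row in search(db, "glimpse", order="date"):
    print(row)
```

Because the message metadata lives in ordinary tables, date
sorting or following in_reply_to links is just a different query
over the same data, with no need to regenerate any stored HTML.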

-john

