From owner-freebsd-database Mon Mar 30 06:49:01 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id GAA12735 for freebsd-database-outgoing; Mon, 30 Mar 1998 06:49:01 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG)
Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id GAA12723; Mon, 30 Mar 1998 06:48:58 -0800 (PST) (envelope-from jfieber@indiana.edu)
Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.8/8.8.7) with SMTP id JAA06977; Mon, 30 Mar 1998 09:48:45 -0500 (EST)
Date: Mon, 30 Mar 1998 09:48:45 -0500 (EST)
From: John Fieber
To: nik@iii.co.uk
cc: shimon@simon-shapiro.org, Wolfram Schneider, freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami, Amancio Hasty
Subject: Re: Mailing list search interface
In-Reply-To: <19980330110200.17368@iii.co.uk>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-database@FreeBSD.ORG
Precedence: bulk

On Mon, 30 Mar 1998 nik@iii.co.uk wrote:

> I mentioned MHonArc to Jordan, and his first response was
>
> > Eeek! The evil MHonArc resurfaces! ;-)
> >
> > It doesn't scale at all well - just try MHonArc'ing a really big mailing
> > list archive. You soon get a set of monster html files that are
> > essentially unusable - I know, I did the short-lived "FreeBSD Docs"
> > CD for awhile using MHonArc.

Listen to the man! He knows what he is talking about...well, in this
case at least. :)

> I think he's been using an older version of MHonArc. I did some tests
> late last week, archiving and indexing the archives for -hackers from
> the beginning of 1998. That's 11,265K or thereabouts.
>
> At the end of the conversion (which consisted of running MHonArc 2.2.0
> over the files, and then using Glimpse 4.1 to index them) I had a total
> of 32,910K HTML and index files.
>
> The output of 'time -l' on the conversion process was:
>
>       626.11 real       438.83 user        93.13 sys

On what sort of hardware? By quick back-of-an-envelope calculations,
this is slower than the current indexing scheme on hub by at least a
factor of 10. Indexing anything large is typically an I/O bound
operation, and when you start indexing much more than can fit in RAM,
your performance will degrade dramatically, so it is probably slower by
much more than a factor of 10. It currently takes about 45 minutes to
index all 620+ megabytes of mail from scratch on hub, and most of that
is waiting for disk I/O, since the disks on hub are pretty busy even
without the indexing.

> At the end of the conversion process I had a threaded copy of the -hackers
> mail archives going back almost three months.

Three months of -hackers != 5 years of all the mailing lists.

I am confident that you will find that this scheme becomes a big hairy
hassle when you throw the whole thing at it. It is space inefficient
because you have the original archive, plus the HTML versions (most of
which will *never* be viewed, I might add), the index, and the
filesystem overhead of one file per message. Because the threading is
done in batch mode, it is awkward to make enhancements to the threading
algorithm. It is a hassle to retroactively change the conversion to
HTML. Though I have no first-hand proof, knowing how Glimpse works, I
suspect searches will generate quite a bit more disk I/O on the server
than freeWAIS. The ranking algorithm that Glimpse uses (or used last I
checked) is primitive. (In a survey of what people liked, hated and
most wanted in the mailing list archives, people wanted thread
searching and date sorting, but only second and third *after* the
currently implemented ranking algorithm, which most people found to
work very well most of the time.)

And on and on...

I think it is time to add an FAQ entry on why we don't use hypermail or
MHonArc for the mailing list archives.
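The back-of-an-envelope comparison above can be reproduced from the figures quoted in this message. The sketch below is only illustrative: it assumes "K" means kilobytes and that "620+ megabytes" is 620 * 1024 KB, and it ignores that the two runs were on different (and, for the MHonArc test, unstated) hardware and that the hub figure covers indexing only. The function and variable names are made up for the example.

```python
def throughput_kb_per_s(size_kb, seconds):
    """Average processing rate in KB per second."""
    return size_kb / seconds

# Figures quoted in the message (unit conversions are assumptions).
mhonarc_input_kb = 11_265        # -hackers archive since the start of 1998
mhonarc_output_kb = 32_910       # HTML plus Glimpse index files
mhonarc_seconds = 626.11         # 'real' time reported by time -l

hub_input_kb = 620 * 1024        # "620+ megabytes" of mail
hub_seconds = 45 * 60            # "about 45 minutes"

mhonarc_rate = throughput_kb_per_s(mhonarc_input_kb, mhonarc_seconds)
hub_rate = throughput_kb_per_s(hub_input_kb, hub_seconds)

# How many times faster the current scheme is, by this crude measure.
ratio = hub_rate / mhonarc_rate

# Time to push the whole archive through at the MHonArc + Glimpse rate,
# optimistically assuming the rate does not degrade with size.
full_archive_hours = hub_input_kb / mhonarc_rate / 3600

# Disk space used by the converted output relative to the input.
space_blowup = mhonarc_output_kb / mhonarc_input_kb

print(f"MHonArc + Glimpse: {mhonarc_rate:.1f} KB/s")
print(f"current scheme:    {hub_rate:.1f} KB/s")
print(f"speed ratio:       {ratio:.1f}x")
print(f"full archive:      {full_archive_hours:.1f} hours at the slower rate")
print(f"space blow-up:     {space_blowup:.1f}x")
```

With these assumptions the ratio comes out a little above 10, consistent with "at least a factor of 10", and the output is close to three times the size of the input before counting per-file filesystem overhead.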
It isn't that things like MHonArc are not valiant efforts, but they are
merely refinements of what is fundamentally a quick-and-dirty,
non-scalable solution.

As I hinted in another message, a proper solution would be based on a
hybrid full text/RDBMS. Whether a true hybrid system is built, or just
the illusion of one is built using some crafty CGI scripts, is a detail
to be worked out.

-john

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message
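As a rough illustration of the "hybrid full text/RDBMS" idea mentioned in the message, the sketch below keeps message metadata and a simple term index in the same relational store, so one query can combine a text match with list and date constraints and then sort by either rank or date. Everything here is an assumption made for the example: the table layout, the function names, and the toy term-frequency score (which is not the freeWAIS ranking). It uses Python's standard sqlite3 module only because it is self-contained.

```python
import re
import sqlite3

def open_archive():
    """Create an in-memory store with a metadata table and a term index."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE messages (
            id INTEGER PRIMARY KEY, list TEXT, sent TEXT,
            subject TEXT, body TEXT);
        CREATE TABLE postings (
            term TEXT, msg_id INTEGER REFERENCES messages(id), tf INTEGER);
        CREATE INDEX postings_term ON postings(term);
    """)
    return db

def tokenize(text):
    """Lowercase alphanumeric terms; deliberately crude."""
    return re.findall(r"[a-z0-9]+", text.lower())

def add_message(db, list_name, sent, subject, body):
    """Store one message and index its terms with per-message counts."""
    cur = db.execute(
        "INSERT INTO messages (list, sent, subject, body) VALUES (?, ?, ?, ?)",
        (list_name, sent, subject, body))
    msg_id = cur.lastrowid
    counts = {}
    for term in tokenize(subject + " " + body):
        counts[term] = counts.get(term, 0) + 1
    db.executemany(
        "INSERT INTO postings (term, msg_id, tf) VALUES (?, ?, ?)",
        [(term, msg_id, n) for term, n in counts.items()])
    return msg_id

def search(db, query, list_name=None, since=None, order="rank"):
    """Match any query term, filter on metadata, sort by rank or date."""
    terms = tokenize(query)
    if not terms:
        return []
    placeholders = ",".join("?" * len(terms))
    sql = ("SELECT m.id, m.sent, m.subject, SUM(p.tf) AS score "
           "FROM postings p JOIN messages m ON m.id = p.msg_id "
           f"WHERE p.term IN ({placeholders})")
    params = list(terms)
    if list_name is not None:
        sql += " AND m.list = ?"
        params.append(list_name)
    if since is not None:
        sql += " AND m.sent >= ?"   # ISO dates compare correctly as text
        params.append(since)
    sql += " GROUP BY m.id ORDER BY "
    sql += "m.sent DESC" if order == "date" else "score DESC, m.sent DESC"
    return db.execute(sql, params).fetchall()
```

A call such as `search(db, "mhonarc", list_name="hackers", since="1998-01-01", order="date")` would return matching rows newest first. Because the stored form is the raw message plus an index, the HTML rendering and any thread view could be produced at request time rather than in a batch conversion.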