Date: Mon, 30 Mar 1998 11:02:00 +0100
From: nik@iii.co.uk
To: shimon@simon-shapiro.org
Cc: Wolfram Schneider <wosch@cs.tu-berlin.de>, freebsd-database@FreeBSD.ORG, andreas@klemm.gtn.com, scrappy@hub.org, Satoshi Asami <asami@FreeBSD.ORG>, Amancio Hasty <hasty@rah.star-gate.com>
Subject: Mailing list search interface
Message-ID: <19980330110200.17368@iii.co.uk>
In-Reply-To: <XFMail.980329135730.shimon@simon-shapiro.org>; from Simon Shapiro on Sun, Mar 29, 1998 at 01:57:30PM -0800
References: <p1i3eg5jdbb.fsf@panke.panke.de> <XFMail.980329135730.shimon@simon-shapiro.org>
Gents,
On Sun, Mar 29, 1998 at 01:57:30PM -0800, Simon Shapiro wrote:
> On 26-Mar-98 Wolfram Schneider wrote:
> > The FreeBSD mailing list search interface support threads. The
> > thread database will be updated hourly. Of course there are
> > many things to do to make the threads more user friendly.
>
> We have been playing with the idea of normalizing the archive into an
> RDBMS. Some of the benefits are:
<snip>
Could we coordinate on some of this? I've been working on a system (at
work) for making some of our mailing list archives visible and searchable
on our internal site. I'm using MHonArc, Glimpse (both of which are in
the ports tree) and a customised version of Wilma
<URL:http://www.hpc.uh.edu/majordomo/#wilma>
and it's almost at the point where this would be useful for the project.
I mentioned MHonArc to Jordan, and his first response was
> Eeek! The evil MHonArc resurfaces! ;-)
>
> It doesn't scale at all well - just try MHonArc'ing a really big mailing
> list archive. You soon get a set of monster html files that are
> essentially unusable - I know, I did the short-lived "FreeBSD Docs"
> CD for awhile using MHonArc.
I think he's been using an older version of MHonArc. I did some tests
late last week, archiving and indexing the archives for -hackers from
the beginning of 1998. That's 11,265K or thereabouts.
At the end of the conversion (which consisted of running MHonArc 2.2.0
over the files, and then using Glimpse 4.1 to index them) I had a total
of 32,910K HTML and index files.
The output of 'time -l' on the conversion process was:
626.11 real 438.83 user 93.13 sys
8572 maximum resident set size
390 average shared memory size
4311 average unshared data size
128 average unshared stack size
1054806 page reclaims
68 page faults
0 swaps
9725 block input operations
6115 block output operations
0 messages sent
0 messages received
0 signals received
18065 voluntary context switches
26547 involuntary context switches
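To put those figures in proportion, here is a small Python sketch of the arithmetic. The sizes and times are copied from this message; the function name `summarize` and the dictionary keys are made up for illustration. It suggests roughly a threefold growth from raw mail to HTML plus index, and a throughput of about 18K of mail per second of wall-clock time.

```python
# Figures copied from the message; everything derived is plain arithmetic.
MBOX_KB = 11265      # raw -hackers archive for 1998 so far, in K
OUTPUT_KB = 32910    # HTML plus Glimpse index files, in K
REAL_S, USER_S, SYS_S = 626.11, 438.83, 93.13   # from 'time -l'


def summarize(mbox_kb, output_kb, real_s, user_s, sys_s):
    """Return expansion factor, CPU share of wall-clock time, and K/s throughput."""
    return {
        "expansion": output_kb / mbox_kb,        # output size relative to input
        "cpu_share": (user_s + sys_s) / real_s,  # fraction of real time spent on CPU
        "kb_per_s": mbox_kb / real_s,            # raw mail processed per second
    }


stats = summarize(MBOX_KB, OUTPUT_KB, REAL_S, USER_S, SYS_S)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```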
That's an exceptionally long run, because it had to build the archive
for the year to date, and you only take this hit once. Once the archive
is up and running, you're only building HTML files for new messages since
the last update, which is (or should be) considerably faster.
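The idea of an incremental update can be sketched in Python as below. This is not how MHonArc does it internally (I have not checked); it only illustrates the principle of remembering which Message-IDs have been converted and skipping them next time. The names `new_messages`, `update_archive` and the `convert` callback are hypothetical.

```python
def new_messages(messages, seen_ids):
    """Return the messages whose Message-ID has not been converted yet."""
    return [m for m in messages if m["id"] not in seen_ids]


def update_archive(messages, seen_ids, convert):
    """Convert only unseen messages, record their IDs, return how many were done."""
    pending = new_messages(messages, seen_ids)
    for msg in pending:
        convert(msg)              # e.g. write one HTML file (hypothetical callback)
        seen_ids.add(msg["id"])   # remember it so the next run skips it
    return len(pending)


# Usage: the first run converts everything, the second only the newcomer.
seen = set()
converted = []
mail = [{"id": "<1@example>"}, {"id": "<2@example>"}]
first = update_archive(mail, seen, converted.append)
mail.append({"id": "<3@example>"})
second = update_archive(mail, seen, converted.append)
```

The expensive full build corresponds to the first call; later calls only pay for what has arrived since.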
Regrettably at the moment, there's a bug in Glimpse 4.1, which means that
you need to reindex the entire archive, rather than just those bits that
change. Fortunately, there are command line switches to tell the
glimpseindex program how much memory to use.
That 8572 max. resident size figure is from MHonArc rather than glimpse,
since it reads in (as far as I can tell) the whole of the mail archive
file before processing it.
While the conversion was happening the load on my machine hovered around
the .9-1.1 mark, with X, Netscape, XEmacs and a bunch of xterms open.
At the end of the conversion process I had a threaded copy of the -hackers
mail archives going back almost three months.
Each month has two indices -- a date index where you see all the messages
in the order they came in, and a threaded index.
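For anyone unfamiliar with the difference between the two indices, here is a minimal Python sketch. It assumes each message is a dict with hypothetical `id`, `date` and `in_reply_to` keys, and it threads purely on In-Reply-To; MHonArc's real threading logic may well be more involved.

```python
from collections import defaultdict


def date_index(messages):
    """Messages in arrival order (dates are assumed to sort correctly as given)."""
    return sorted(messages, key=lambda m: m["date"])


def thread_index(messages):
    """Return (depth, message) pairs with replies nested under their parents."""
    by_id = {m["id"]: m for m in messages}
    children = defaultdict(list)
    roots = []
    for m in date_index(messages):
        parent = m.get("in_reply_to")
        if parent in by_id:
            children[parent].append(m)   # reply to a message we hold
        else:
            roots.append(m)              # thread starter, or parent not archived
    ordered = []

    def walk(msg, depth):
        ordered.append((depth, msg))
        for child in children[msg["id"]]:
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return ordered


msgs = [
    {"id": "a", "date": "1998-03-01", "in_reply_to": None},
    {"id": "b", "date": "1998-03-03", "in_reply_to": "a"},
    {"id": "c", "date": "1998-03-02", "in_reply_to": None},
    {"id": "d", "date": "1998-03-04", "in_reply_to": "b"},
]
by_date = [m["id"] for m in date_index(msgs)]
by_thread = [(depth, m["id"]) for depth, m in thread_index(msgs)]
```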
Each index shows (at most) 200 messages (that's a configurable number).
This is so the size of the index files doesn't grow without end. Each
index shows a "This is page x of y of the threaded index" comment, with
navigation text to go backwards and forwards in the index.
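The paging amounts to something like the following Python sketch. The 200-per-page default and the banner wording come from the description above; the function names `paginate` and `page_banner` are invented for illustration and are not MHonArc options.

```python
def paginate(entries, per_page=200):
    """Split index entries into pages of at most per_page items."""
    pages = [entries[i:i + per_page] for i in range(0, len(entries), per_page)]
    return pages or [[]]   # an empty index still gets one (empty) page


def page_banner(page_no, total_pages, kind="threaded"):
    """Build the 'page x of y' comment shown on each index page."""
    return f"This is page {page_no} of {total_pages} of the {kind} index"


pages = paginate(list(range(450)))
banners = [page_banner(n, len(pages)) for n in range(1, len(pages) + 1)]
```

With 450 entries this gives three pages, the last one holding the remaining 50.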
This whole thing is searchable, allowing searches by combination of
keywords. You can specify the number of misspellings to allow, the
number of hits to return, case sensitivity, and which months to restrict
your search to.
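To illustrate what "number of misspellings to allow" means in practice, here is a rough Python sketch using plain edit distance. This is not Glimpse's algorithm, only a stand-in for the idea; the function names are hypothetical, and treating a keyword combination as "all keywords must match" is my assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def matches(text, keywords, max_errors=0, case_sensitive=False):
    """True if every keyword is within max_errors of some word in text."""
    if not case_sensitive:
        text = text.lower()
        keywords = [k.lower() for k in keywords]
    words = text.split()
    return all(any(edit_distance(k, w) <= max_errors for w in words)
               for k in keywords)


subject = "MHonArc scales badly"
fuzzy_hit = matches(subject, ["mhonark"], max_errors=1)
exact_miss = matches(subject, ["mhonark"], max_errors=0)
```

Restricting by month and capping the number of hits would sit outside this function, as filters on which messages get passed in and how many results come back.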
The only thing you can't do (at the moment) is search across more than one
mailing list. It shouldn't be too hard to add. Right now, I don't have a
URL I can give to show you the results, since I ran out of time last night
(I must be getting old; I used to be able to do 72-hour coding runs and not
really feel it <sigh>). I should be able to get something demonstrable
up on my freefall account by the middle of next week.
In light of all that, do you think this is worth pursuing further?
Thoughts?
N
--
Work: nik@iii.co.uk | FreeBSD + Perl + Apache
Rest: nik@nothing-going-on.demon.co.uk | Remind me again why we need
Play: nik@freebsd.org | Microsoft?
To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-database" in the body of the message
