From owner-freebsd-doc  Sun Dec 29 11:00:18 1996
Return-Path: <owner-doc>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.4/8.8.4) id LAA07912
          for doc-outgoing; Sun, 29 Dec 1996 11:00:18 -0800 (PST)
Received: from tolstoy.mpd.ca (wlloyd.HIP.CAM.ORG [199.84.42.209])
          by freefall.freebsd.org (8.8.4/8.8.4) with ESMTP id LAA07903
          for <doc@freebsd.org>; Sun, 29 Dec 1996 11:00:11 -0800 (PST)
Received: from plato (plato.mpd.ca [206.123.11.34]) by tolstoy.mpd.ca (8.7.5/8.7.3) with SMTP id OAA04310; Sun, 29 Dec 1996 14:01:28 -0500 (EST)
Message-ID: <32C6BF88.4411@mpd.ca>
Date: Sun, 29 Dec 1996 13:59:20 -0500
From: Bill Lloyd <wlloyd@mpd.ca>
X-Mailer: Mozilla 3.0Gold (X11; I; SunOS 5.4 sun4c)
MIME-Version: 1.0
To: Francisco Reyes <francisco@natserv.com>
CC: FreeBSD doc Mailing list <doc@freebsd.org>,
        John Fieber <jfieber@indiana.edu>
Subject: Re: mailing list archives
References: <199612282245.RAA13615@revelstone.jvm.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-doc@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Francisco Reyes wrote:
> 
> On Fri, 27 Dec 1996 21:06:34 -0500 (EST), John Fieber wrote:
> 
> >To get the answers, we need thread retrieval.  For this, I don't
> >think we need new indexing software, we just need to figure out
> >how to take an existing message and *automatically* formulate
> >appropriate queries to build the thread from it.
> 
> John,
> Do you think any of the existing tools can do it? I am about to start
> working on this project by doing a home-grown system. Should I proceed?
> Given my free time it will be a while before I have the system
> (anywhere from 1 to 3 months), but I already started thinking about the
> basic design.
> 
> The features I was thinking to have are:
> -Index any word.
> -Logical operators in searches: "and", "or", "not" . Later on "near"
> and lexical searches for selected words (doing the lexical matching by
> means of a table).
> -Capable of storing an expiration date for articles and re-use their
> allocated space after they have been expired.
> -Give answers in a threaded form.
> 
> In the initial phase I have considered indexing the existing files and
> later on developing the system so it handles the storage of the messages.
> Managing the messages would allow for compression and file expiration.
> 
> My initial considerations are:
> -- Use a Red-Black-Tree to index all words. For each word in the tree
> have a linked list. Basically use the tree to search for the start of a
> linked list for each word. This will save space since I won't store the
> word key/value for all the elements in the linked list.
> -- Keep a file with the last physical location of each message file
> processed. When the program is run it will only index what has been
> added to each of the message files.
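
A minimal sketch of the indexing scheme quoted above, under stated
assumptions: a Python dict stands in for the Red-Black-Tree, a plain
list stands in for each word's linked list, and the archive is assumed
to be mbox-style with "From " separator lines.  The names WordIndex,
index_archive and lookup are hypothetical.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


class WordIndex:
    def __init__(self):
        # word -> list of (file name, offset of message start).
        # The word itself is stored once, as the key, not per entry.
        self.postings = defaultdict(list)
        # file name -> how much of that file has already been indexed
        self.offsets = {}

    def add_message(self, location, text):
        # Index every distinct word of one message.
        for word in set(WORD.findall(text.lower())):
            self.postings[word].append(location)

    def index_archive(self, name, data):
        # Only look at what was appended since the last run.
        start = self.offsets.get(name, 0)
        new = data[start:]
        # Assumption: each message begins with an mbox "From " line.
        starts = [m.start() for m in re.finditer(r"^From ", new, re.M)]
        ends = starts[1:] + [len(new)]
        for s, e in zip(starts, ends):
            self.add_message((name, start + s), new[s:e])
        self.offsets[name] = len(data)

    def lookup(self, word):
        # Return a set so callers can combine results with &, | and -.
        return set(self.postings.get(word.lower(), ()))
```

With sets as results, "and", "or" and "not" map onto the &, | and -
operators, for example idx.lookup("scsi") & idx.lookup("help").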

I don't know how the current archives are stored, compressed, or
indexed, but I thought I'd throw in some of my thoughts.

One thing that I think would be very helpful is if the links to the
messages from the search page were absolute, instead of relative to the
current search terms, i.e. a one-to-one mapping between message and URL.

For example, from the search page (the comment markers are mine):
<!--
<A
HREF="/cgi/search.cgi?words=help+scsi&source=freebsd-questions&max=25&docnum=1">"Rodney
C. Re: SCSI woes.</A>
<br>Score: <em>982</em>; Lines: <em>38</em>; 07-Mar-1996; Archive:
<em>freebsd-questions</em><p></p></li>
-->

I often find myself re-reading the same messages over and over if I
back up and change the number of "results" to 50, since every link then
gets a different URL.

General comments on implementing threading, etc.:

One way to implement this and provide for threading would be to borrow
from gnats.  I have gnats running locally, and I have done some hacking
of the www interface for it, for a local application.  I think that it
provides a good model to implement the mailing list archives.

For example, a mailing list discussion that follows the gnats model
would simply exist as one long message.  You would be able to read the
entire discussion with all headers etc. removed.  A search would simply
pull up the one message, start to finish.  This removes all threading
problems at search time.
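
A rough sketch of how separate messages might be collapsed into one
long discussion, assuming every message carries a Message-ID header,
that replies carry In-Reply-To or References, and that the messages
arrive in chronological order.  This is not how gnats itself stores
things; build_discussions and parent_id are hypothetical names.

```python
import re
from email import message_from_string

MSG_ID = re.compile(r"<[^>]+>")


def parent_id(msg):
    # The direct parent is In-Reply-To if present, otherwise the last
    # entry of References.  Returns None for a message that starts a
    # discussion.
    in_reply = MSG_ID.findall(msg.get("In-Reply-To", ""))
    if in_reply:
        return in_reply[0]
    refs = MSG_ID.findall(msg.get("References", ""))
    return refs[-1] if refs else None


def build_discussions(raw_messages):
    msgs = [message_from_string(raw) for raw in raw_messages]
    parent = {m["Message-ID"].strip(): parent_id(m) for m in msgs}

    def root(mid):
        # Walk up the reply chain; 'seen' guards against loops.
        seen = set()
        while parent.get(mid) and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    discussions = {}
    for m in msgs:
        # Keep only the sender and the body; all other headers go.
        part = f"{m['From']} wrote:\n{m.get_payload().strip()}"
        discussions.setdefault(root(m["Message-ID"].strip()), []).append(part)
    # One long text per discussion, keyed by the Message-ID of its root.
    return {r: "\n\n".join(parts) for r, parts in discussions.items()}
```

Because the grouping happens once, when a message is filed, a search
only ever has to return whole discussions.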

All discussions would be numbered, and could be referred to by an
absolute URL, for example:
<!--
<a href="/cgi/query-majordomo.cgi?pr=666">questions/666 SCSI woes.</A>
<br>Score: <em>982</em>; Lines: <em>38</em>; 07-Mar-1996; Archive:
<em>freebsd-questions</em><p></p></li>
-->
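
To make the difference between the two link styles concrete, here is a
small sketch.  The script names and query parameters are the ones from
the two HTML snippets above; the function names are hypothetical.

```python
from urllib.parse import urlencode


def search_result_url(words, source, max_hits, docnum):
    # Current style: the link depends on the query that produced it.
    query = {"words": words, "source": source,
             "max": max_hits, "docnum": docnum}
    return "/cgi/search.cgi?" + urlencode(query)


def discussion_url(number):
    # Proposed style: the link depends only on the discussion number.
    return "/cgi/query-majordomo.cgi?" + urlencode({"pr": number})


# Same message, different result count: the current link changes...
old_25 = search_result_url("help scsi", "freebsd-questions", 25, 1)
old_50 = search_result_url("help scsi", "freebsd-questions", 50, 1)
# ...while the numbered link stays the same whatever the search was.
new = discussion_url(666)
```

A browser that marks visited links by URL would treat old_25 and old_50
as two unread pages, but would recognise the numbered link every time.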

The gnats distribution includes a www interface that provides for full
regex searching of the database.
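
As an illustration only (this is not the gnats implementation, and
regex_search is a hypothetical name), a regex search over numbered
discussions could be as small as this, assuming the discussions are
available as a mapping from number to text:

```python
import re


def regex_search(discussions, pattern):
    # Return the numbers of all discussions whose text matches.
    rx = re.compile(pattern, re.IGNORECASE)
    return sorted(n for n, text in discussions.items() if rx.search(text))


# Made-up sample data, just to exercise the function.
sample = {
    666: "SCSI woes: the bus hangs",
    667: "ppp setup",
    668: "Adaptec scsi-2 timeout",
}
hits = regex_search(sample, r"scsi(-\d)?\s+(woes|timeout)")
```

Each number in hits can then be turned directly into an absolute link.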

The major disadvantage is that the current archives would be
orphaned, or would have to be converted.

I'm not suggesting that the current gnats problem database be expanded
to include mailing list information, just that the model has some
advantages that would be useful.

gnats on its own would not be a good way to handle the majordomo
stuff.  Would it be possible to have majordomo call a modified version
of send-pr to file a new message in gnats form for the archive?

It would be possible to maintain separate, independent searches of the
old and new mailing list archives.  This would be useful during a
one-year switchover period, for example.

gnats would need to be hacked to remove a lot of stuff.  It is far too
verbose for a mailing list archive.

-bill

-- 
William Lloyd (wlloyd@mpd.ca)			|    <http://www.mpd.ca>