From owner-freebsd-doc Sun Dec 29 11:00:18 1996
Return-Path:
Received: (from root@localhost) by freefall.freebsd.org (8.8.4/8.8.4) id LAA07912 for doc-outgoing; Sun, 29 Dec 1996 11:00:18 -0800 (PST)
Received: from tolstoy.mpd.ca (wlloyd.HIP.CAM.ORG [199.84.42.209]) by freefall.freebsd.org (8.8.4/8.8.4) with ESMTP id LAA07903 for ; Sun, 29 Dec 1996 11:00:11 -0800 (PST)
Received: from plato (plato.mpd.ca [206.123.11.34]) by tolstoy.mpd.ca (8.7.5/8.7.3) with SMTP id OAA04310; Sun, 29 Dec 1996 14:01:28 -0500 (EST)
Message-ID: <32C6BF88.4411@mpd.ca>
Date: Sun, 29 Dec 1996 13:59:20 -0500
From: Bill Lloyd
X-Mailer: Mozilla 3.0Gold (X11; I; SunOS 5.4 sun4c)
MIME-Version: 1.0
To: Francisco Reyes
CC: FreeBSD doc Mailing list, John Fieber
Subject: Re: mailing list archives
References: <199612282245.RAA13615@revelstone.jvm.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-doc@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Francisco Reyes wrote:
>
> On Fri, 27 Dec 1996 21:06:34 -0500 (EST), John Fieber wrote:
>
> >To get the answers, we need thread retrieval. For this, I don't
> >think we need new indexing software, we just need to figure out
> >how to take an existing message and *automatically* formulate
> >appropriate queries to build the thread from it.
>
> John,
> Do you think any of the existing tools can do it? I am about to start
> working on this project by doing a home-grown system. Should I proceed?
> Given my free time it will be a while before I have the system
> (anywhere from 1 to 3 months), but I already started thinking about the
> basics design.
>
> The features I was thinking to have are:
> -Index any word.
> -Logical operators in searches: "and", "or", "not" . Later on "near"
> and lexical searches for selected words (doing the lexical matching by
> means of a table).
> -Capable of storing an expiration date for articles and re-use their
> allocated space after they have been expired.
> -Give answers in a threaded form.
>
> In the initial phase I have considered indexing the exiting files and
> later on develop the system to it handles the storage of the messages.
> Managing the messages would allow for compression and file expiration.
>
> My initial considerations are:
> -- Use a Red-Black-Tree to index all words. For each word in the tree
> have a linked list. Basically use the tree to search for the start of a
> linked list for each word; This will save space since I won't store the
> word key/value for all the elements in the linked list..
> -- Keep a file with the last physical location of each message file
> processed. When the program is run it will only index what has been
> added to each of the message files.

I don't know how the current archives are stored, compressed, or
indexed, but I thought I'd throw in some of my thoughts.

One thing that I think would be very helpful is if the links to messages
on the search page were absolute, instead of relative to the current
search terms, i.e. a one-to-one mapping between link and message.
For example from the search page..

The comment stuff is mine.

As it is, I often find myself re-reading the same messages over and over
if I back up and change the number of "results" to 50.

General comments to implement threading etc...

One way to implement this and provide for threading would be to borrow
from gnats. I have gnats running locally, and I have done some hacking
of its www interface for a local application. I think it provides a
good model for implementing the mailing list archives.

For example, a mailing list discussion that follows the gnats model
would simply exist as one long message. You would be able to read the
entire discussion with all headers etc. removed. A search would simply
pull up that one message, start to finish. This removes all threading
problems at search time.

All discussions would be numbered, and could be referred to by an
absolute URL, example..
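The "one discussion, one record" idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how gnats or the FreeBSD archive actually works: it assumes every message is plain text and carries a Message-ID header, that replies carry a References (or In-Reply-To) header whose first entry names the thread root, and the function names (thread_root, group_discussions, render) are made up for this sketch.

```python
from email import message_from_string


def thread_root(msg):
    """Guess the thread root: first ID in References/In-Reply-To, else own ID."""
    refs = (msg.get("References") or msg.get("In-Reply-To") or "").split()
    return refs[0] if refs else (msg.get("Message-ID") or "").strip()


def group_discussions(raw_messages):
    """Return {discussion number: [message bodies in arrival order]}.

    Numbers are handed out in order of first appearance, so they stay
    stable as long as the messages are always processed in the same order.
    """
    numbers = {}      # root Message-ID -> discussion number
    discussions = {}  # discussion number -> list of bodies, headers dropped
    for raw in raw_messages:
        msg = message_from_string(raw)
        root = thread_root(msg)
        if root not in numbers:
            numbers[root] = len(numbers) + 1
        # Assumes single-part text messages; multipart mail needs more care.
        body = msg.get_payload().strip()
        discussions.setdefault(numbers[root], []).append(body)
    return discussions


def render(bodies):
    """Join one discussion into a single long text, start to finish."""
    return "\n\n".join(bodies)
```

Because each discussion gets a fixed number, that number could serve as the key for an absolute link, and a search only ever has to return whole discussions. Messages whose References header is missing or mangled would end up as separate discussions in this sketch.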
The gnats distribution includes a www interface that provides full
regex searching of the database.

The major disadvantage is that the current archives would be orphaned,
or would have to be converted.

I'm not suggesting that the current gnats problem database be expanded
to include mailing list information, just that the model has some
advantages that would be useful. gnats on its own would not be a good
way to handle the majordomo stuff. Would it be possible to have
majordomo call a modified version of send-pr to file each new message
in gnats form for the archive?

It would be possible to maintain separate, independent searches of the
mailing lists. This would be useful during a switch-over period of, for
example, a year.

gnats would need to be hacked to remove a lot of stuff. It is far too
verbose for a mailing list archive.

-bill

--
William Lloyd (wlloyd@mpd.ca)