Date: Tue, 19 Nov 1996 00:09:37 -0500 (EST)
From: John Fieber <jfieber@indiana.edu>
To: Mark Mayo <mark@quickweb.com>
Cc: hackers@freebsd.org
Subject: Re: Announce: Alternative Mail Archive
Message-ID: <Pine.BSI.3.95.961118225255.28546P-100000@fallout.campusview.indiana.edu>
In-Reply-To: <Pine.BSF.3.94.961118103016.10044A-100000@vinyl.quickweb.com>
On Mon, 18 Nov 1996, Mark Mayo wrote:

> Hi all, I've been playing with setting up my own mail archive of the
> freebsd discussion lists! To start with, I've been archiving the
> -questions list for several days, and i was wondering if someone would
> take a look and tell me what you think?

I'm glad to see someone take a crack at this. I hastily threw together the existing mail archives just over two years ago and they have needed an overhaul ever since. Please don't take any of the comments/criticisms below as an attempt at defending the existing archives. :)

Since I've been down the hypermail path before, a couple of things:

First, browsing, as hypermail sets it up, is of very limited utility for finding anything in list archives of FreeBSD scale (currently about 300 megabytes and growing fast). Browsing is much better suited as a second step after an initial search has identified a few key messages. Using those keys, it is then useful to retrieve the thread context. Being able to re-sort a chunk of messages by date, subject, or author is useful, but only if the searcher has control over what is in the chunk. Hypermail just blindly chops things up into time segments and the chunk composition is static. The proper place for chunk sorting is on a set of retrieved messages.

Second, indexing the messages after they have been processed by hypermail is a Bad Idea. This is because you lose the potential of selectively searching header fields, and there is a lot of extra cruft that mucks up searches. Just as an example, because the word "thread" appears in almost every message generated by hypermail, it effectively becomes a stopword. Now that is a bummer if you want to look up something on threads in the programming sense.

Third, hypermail is going to be a pain when you try to throw a useful chunk of the archives at it.
Considering that a majority of the messages in the database will probably never be retrieved in full, it is probably a lot more efficient in the long run to store and index them in their native mailbox format and generate HTML on the fly. Threads can be reconstructed with reasonable reliability with creative behind-the-scenes queries on message ID and subject fields. This also avoids hypermail's annoying trait of breaking threads on arbitrary (month|week) boundaries.

Fourth, the choices "All", "Some", "Boolean" are in fact all boolean. I have no problem with "All". "Some" is not really correct and I think "Any" would be a better choice. "Boolean" implies that the first two are not, which is false.

Fifth, the long format for the search summary has too much garbage while the short doesn't have enough. The essential stuff to have is author, subject and date. I gather from a quick look at the ht://Dig docs that this is tweakable.

Sixth, (another hypermail slam) using a proportional font for the message body is a Bad Idea, particularly for technically oriented lists where people are prone to including ASCII diagrams.

Seventh, does the search even work? I tried "ASUS" and it turned up nothing, while the browse list clearly has messages with that word in them. Hm.....

Essentially, I'm glad someone is doing this, but I don't think the architecture is right. The problem is that good IR systems are proprietary, and free IR systems are crap. Of course, I've spent quite a lot of time reading and writing about IR theory, so I'm pretty cynical about the whole field. (Since this is the direction of my Ph.D. research, maybe it isn't such a good thing?)

An article some time back in WIRED about web indexing mentioned that the field hasn't had any great developments in the last 20 years. Absolutely true. Despite claims to the contrary, the computer science hot-shots with their whiz-bang web search engines haven't changed things a bit.
Once you peel away the web glitz, it's still 1960s boolean search technology.

Oh well..... :)

-john

== jfieber@indiana.edu ===========================================
== http://fallout.campusview.indiana.edu/~jfieber ================
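
Editor's illustration of the point about reconstructing threads from message ID and subject fields: the following is only a minimal sketch in Python, not anything the existing archives or hypermail actually do. The names `normalize_subject` and `build_threads` are hypothetical, and the sketch assumes messages arrive in date order and carry ordinary `Message-ID`, `In-Reply-To`, `References` and `Subject` headers.

```python
import re
from collections import defaultdict

# Matches any run of leading "Re:" prefixes, in any letter case.
_RE_PREFIX = re.compile(r"^(\s*re\s*:\s*)+", re.IGNORECASE)
# Matches one angle-bracketed message ID such as <abc@host>.
_MSG_ID = re.compile(r"<[^>]+>")


def normalize_subject(subject):
    """Lowercase the subject and drop leading "Re:" prefixes."""
    return _RE_PREFIX.sub("", subject or "").strip().lower()


def build_threads(messages):
    """Group messages into threads (hypothetical helper).

    A message joins an existing thread when In-Reply-To or References
    names a Message-ID already seen; failing that, it falls back to a
    match on the normalized subject. No time-based chunking is involved,
    so threads are not split at month or week boundaries.
    """
    thread_of_id = {}        # Message-ID -> thread key
    thread_of_subject = {}   # normalized subject -> thread key
    threads = defaultdict(list)

    for msg in messages:
        ids = _MSG_ID.findall(msg.get("Message-ID") or "")
        msg_id = ids[0] if ids else ""
        parents = _MSG_ID.findall(msg.get("In-Reply-To") or "")
        parents += _MSG_ID.findall(msg.get("References") or "")
        subject = normalize_subject(msg.get("Subject"))

        # Prefer an explicit parent reference over a subject match.
        key = next((thread_of_id[p] for p in parents if p in thread_of_id), None)
        if key is None and subject:
            key = thread_of_subject.get(subject)
        if key is None:
            key = msg_id or subject  # start a new thread

        if msg_id:
            thread_of_id[msg_id] = key
        if subject:
            thread_of_subject.setdefault(subject, key)
        threads[key].append(msg)

    return dict(threads)
```

Because the function only calls `.get()` on each message, it should work on objects produced by the standard `email` parser or yielded by iterating a `mailbox.mbox`, which keeps the messages in their native mailbox format. The subject fallback will occasionally merge unrelated threads that happen to share a subject, which is the "reasonable reliability" trade-off mentioned above.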
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?Pine.BSI.3.95.961118225255.28546P-100000>