Date:      Tue, 19 Nov 1996 00:09:37 -0500 (EST)
From:      John Fieber <jfieber@indiana.edu>
To:        Mark Mayo <mark@quickweb.com>
Cc:        hackers@freebsd.org
Subject:   Re: Announce: Alternative Mail Archive
Message-ID:  <Pine.BSI.3.95.961118225255.28546P-100000@fallout.campusview.indiana.edu>
In-Reply-To: <Pine.BSF.3.94.961118103016.10044A-100000@vinyl.quickweb.com>

On Mon, 18 Nov 1996, Mark Mayo wrote:

> Hi all, I've been playing with setting up my own mail archive of the
> freebsd discussion lists! To start with, I've been archiving the
> -questions list for several days, and i was wondering if someone would
> take a look and tell me what you think?

I'm glad to see someone take a crack at this.  I hastily threw
together the existing mail archives just over two years ago and
they have needed an overhaul ever since.  Please don't take any
of the comments/criticisms below as an attempt at defending the
existing archives.  :)

Since I've been down the hypermail path before, a couple things:

First, browsing, as hypermail sets it up, is of very limited
utility for finding anything in list archives of FreeBSD scale
(currently about 300 megabytes and growing fast). Browsing is
much better suited as a second step after an initial search has
identified a few key messages.  Using those keys, it is then
useful to retrieve the thread context.  Being able to re-sort a
chunk of messages by date, subject, or author is useful, but only if
the searcher has control over what is in the chunk.  Hypermail
just blindly chops things up into time segments and the chunk
composition is static.  The proper place for chunk sorting is on
a set of retrieved messages.
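
To make the point about sorting concrete, here is a rough Python sketch
of re-sorting a retrieved set rather than a fixed time slice.  The names
sort_hits and SORT_KEYS are invented for illustration, and it assumes
every hit is a raw RFC 822 message with a Date header that carries a
numeric timezone offset.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical sort keys; each takes a parsed message.
SORT_KEYS = {
    "date": lambda m: parsedate_to_datetime(m["Date"]),
    "subject": lambda m: m.get("Subject", "").lower(),
    "author": lambda m: m.get("From", "").lower(),
}

def sort_hits(raw_hits, by="date"):
    """Sort the messages a search returned, not a static chunk."""
    messages = [message_from_string(raw) for raw in raw_hits]
    return sorted(messages, key=SORT_KEYS[by])
```

The input is whatever the search returned, so the searcher decides
what is in the chunk before it gets sorted.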

Second, indexing the messages after they have been processed by
hypermail is a Bad Idea.  This is because you lose the potential of
selectively searching header fields, and there is a lot of extra
cruft that mucks up searches.  Just as an example, because the
word "thread" appears in almost every message generated by
hypermail, it effectively becomes a stopword.  Now that is a
bummer if you want to look up something on threads in the
programming sense. 
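
A minimal Python sketch of what indexing the raw messages buys you: one
inverted index keyed by (field, term), so a query can be limited to a
header field and no generated navigation text ever gets indexed.  The
function names and the choice of fields are assumptions made for
illustration, and multipart bodies are simply skipped.

```python
import re
from collections import defaultdict
from email import message_from_string

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_message(index, msg_id, raw):
    """Add one raw RFC 822 message to a per-field inverted index."""
    msg = message_from_string(raw)
    fields = {
        "from": msg.get("From", ""),
        "subject": msg.get("Subject", ""),
        # Illustration only: multipart bodies are ignored.
        "body": "" if msg.is_multipart() else msg.get_payload(),
    }
    for field, text in fields.items():
        for term in tokenize(text):
            index[(field, term)].add(msg_id)

def search(index, field, term):
    """Return the ids of messages whose given field contains the term."""
    return set(index.get((field, term.lower()), set()))

index = defaultdict(set)
```

Because only text the author wrote is indexed, a word like "thread"
keeps its discriminating power instead of matching every page.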

Third, hypermail is going to be a pain when you try and throw a
useful chunk of the archives at it.  Considering that a majority
of the messages in the database will probably never be retrieved
in full, it is probably a lot more efficient in the long run to
store and index them in their native mailbox format and generate
HTML on the fly.  Threads can be re-constructed with reasonable
reliability with creative behind-the-scenes queries on message ID
and subject fields. This also avoids hypermail's annoying trait
of breaking threads on arbitrary (month|week) boundaries. 
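
One way the thread reconstruction could look, sketched in Python.  This
is a guess at an approach, not a description of any existing archive
code: messages are grouped when they are linked through Message-ID,
In-Reply-To, or References, and a normalized subject serves as a
fallback for mailers that drop those headers.  build_threads and
normalize_subject are hypothetical names.

```python
import re
from email import message_from_string

def normalize_subject(subject):
    """Strip leading 'Re:' prefixes; collapse case and whitespace."""
    subject = re.sub(r"^(\s*re\s*:\s*)+", "", subject, flags=re.IGNORECASE)
    return " ".join(subject.lower().split())

def build_threads(raw_messages):
    """Group raw messages into threads; returns lists of Message-IDs."""
    parent = {}  # union-find forest over ids and subject keys

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    known = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        mid = msg.get("Message-ID", "").strip()
        if not mid:
            continue  # cannot place a message that has no id
        known.append(mid)
        find(mid)
        refs = msg.get("In-Reply-To", "") + " " + msg.get("References", "")
        for ref in re.findall(r"<[^>]+>", refs):
            union(mid, ref)
        subject = normalize_subject(msg.get("Subject", ""))
        if subject:
            # Fallback link; ids are in <...>, so the keys cannot clash.
            union(mid, "subject:" + subject)

    threads = {}
    for mid in known:
        threads.setdefault(find(mid), []).append(mid)
    return list(threads.values())
```

Nothing here looks at dates, so a thread that straddles a month
boundary stays in one piece.  The subject fallback will also merge
unrelated threads that happen to share a subject line, which is the
"reasonable reliability" trade-off.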

Fourth, the choices "All", "Some", "Boolean" are in fact all
boolean.  I have no problem with "All".  "Some" is not really
correct and I think "Any" would be a better choice.  "Boolean"
implies that the first two are not, which is false. 
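
To spell out why both are boolean: over an inverted index, "All" is an
AND (set intersection) of the terms' posting sets and "Any" is an OR
(set union).  A small Python sketch, where the index is assumed to be a
plain dict from term to a set of message ids:

```python
def search_all(index, terms):
    """'All' is boolean AND: intersect every term's posting set."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_any(index, terms):
    """'Any' is boolean OR: union of every term's posting set."""
    return set().union(*(index.get(t, set()) for t in terms))
```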

Fifth, the long format for the search summary has too much
garbage while the short doesn't have enough.  The essential stuff
to have is author, subject and date.  I gather from a quick look
at the ht://Dig docs that this is tweakable.

Sixth, (another hypermail slam) using a proportional font for the
message body is a Bad Idea, particularly for technically oriented
lists where people are prone to including ASCII diagrams.

Seventh, does the search even work?  I tried "ASUS" and it turned
up nothing, while the browse list clearly has messages with that
word in it.  Hm.....


Essentially, I'm glad someone is doing this, but I don't think
the architecture is right.

The problem is that good IR systems are proprietary, and free IR
systems are crap.  Of course, I've spent quite a lot of time
reading and writing about IR theory, so I'm pretty cynical about
the whole field.  (Since this is the direction of my Ph.D.
research, maybe it isn't such a good thing?)

An article some time back in WIRED about web indexing mentioned
that the field hasn't had any great developments in the last 20
years.  Absolutely true.  Despite claims to the contrary, the
computer science hot-shots with their whiz-bang web search
engines haven't changed things a bit.  Once you peel away the web
glitz, it's still 1960's boolean search technology.

Oh well.....  :)

-john

== jfieber@indiana.edu ===========================================
== http://fallout.campusview.indiana.edu/~jfieber ================



