Date:      Mon, 1 Jan 2007 16:49:46 -0500 (EST)
From:      John L <johnl@iecc.com>
To:        freebsd-questions@freebsd.org
Subject:   Indexing a largish collection of mail and usenet messages?
Message-ID:  <20070101164839.G69971@simone.iecc.com>

I have a collection of archives of mailing list and news messages. The 
largest collection is pretty big, about 150,000 messages, which means about 
200 megabytes of text, shortly to be migrated to a FreeBSD server.  The 
lists are all active so archives typically add a few messages each day. 
I want to provide a full text search of each archive.  What software 
should I use?  I have been using the sturdy but ancient lqtext package. 
It's OK, but it has a few bugs I have yet to track down, and I'm wondering if 
something better is available.

First, I am NOT, repeat NOT, asking about web spiders.  The messages are 
directly available to indexing software as files on my server, so there's 
no advantage to running them through Apache on the way to the indexer. 
Also, the messages in the archive never change and I know what files are 
new each day, so it would be pointless for a package to re-spider the 
whole archive to look for the new messages.  I am not unalterably opposed 
to something that spiders if it is otherwise wonderful, but that approach 
hasn't been fruitful in the past.
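
To make the "index only what's new" idea concrete, here is a minimal 
sketch in Python of an incremental inverted index that is handed an 
explicit list of new files and never rescans the archive.  The class and 
method names (IncrementalIndex, add_text, add_files, search) are made up 
for illustration and do not come from lqtext or any other package, and 
the in-memory dictionary is an assumption that keeps the example short; 
a real indexer would keep its postings on disk.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: lowercase runs of letters and digits.
TOKEN = re.compile(r"[a-z0-9]+")


class IncrementalIndex:
    """Toy inverted index that only ever touches files it is told about."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of message paths
        self.indexed = set()              # paths already in the index

    def add_text(self, path, text):
        """Index one message; return False if it was already indexed."""
        if path in self.indexed:
            return False
        for term in TOKEN.findall(text.lower()):
            self.postings[term].add(path)
        self.indexed.add(path)
        return True

    def add_files(self, paths):
        """Index only the listed files (e.g. today's new messages)."""
        added = 0
        for path in paths:
            if path in self.indexed:
                continue  # archived messages never change, so skip them
            with open(path, encoding="utf-8", errors="replace") as fh:
                if self.add_text(path, fh.read()):
                    added += 1
        return added

    def search(self, query):
        """Return sorted paths of messages containing every query term."""
        terms = TOKEN.findall(query.lower())
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)
```

A daily job would call add_files() with just that day's file names, so 
the cost of an update depends on the number of new messages rather than 
on the size of the archive.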

What I want ideally is something that knows enough about the structure of 
mail messages to deal intelligently with headers vs. body, that can do 
something reasonable with MIME and HTML bodies (not urgent, I can always 
run them through demime on the way to the index), and most importantly 
that actually works with 150,000 messages.  I've seen lots of packages 
that look promising but that fall over dead once they get past 10,000 
messages or so.
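
As a rough illustration of what "headers vs. body" might mean to an 
indexer, the sketch below uses Python's standard email package to split 
a raw message into separately indexable fields.  The function name 
extract_fields and the choice of fields are assumptions for the example 
only.  It prefers a text/plain part and falls back to text/html; it does 
not strip HTML tags, and it does not handle messages with unknown 
character sets, which a real archive would certainly contain.

```python
from email import message_from_string, policy


def extract_fields(raw):
    """Split a raw RFC 822 message into header and body fields.

    Hypothetical helper: returns a dict with 'subject', 'from' and 'body'.
    """
    # policy.default gives an EmailMessage with get_body()/get_content().
    msg = message_from_string(raw, policy=policy.default)
    fields = {
        "subject": str(msg["subject"] or ""),
        "from": str(msg["from"] or ""),
    }
    # Pick the best text part: plain text first, HTML as a fallback.
    part = msg.get_body(preferencelist=("plain", "html"))
    fields["body"] = part.get_content() if part is not None else ""
    return fields
```

With fields separated like this, a search front end could offer queries 
restricted to the subject or sender, while the body is indexed as 
ordinary text.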

The user interface isn't particularly important; I can plug it into my 
existing stuff so long as it has the basic functions of taking search 
terms and giving back the locations of the matches.  To see the current 
version, bugs and all, see http://compilers.iecc.com/compsearch.phtml

The comp.compilers archive is also indexed in Google and other public 
search engines, which works splendidly, but most of the other lists are 
private so Google is out.

Any suggestions?  Tnx.

R's,
John
