From owner-freebsd-ports Mon Jan 8 11:22:31 1996
Return-Path: owner-ports
Received: (from root@localhost)
        by freefall.freebsd.org (8.7.3/8.7.3) id LAA29076
        for ports-outgoing; Mon, 8 Jan 1996 11:22:31 -0800 (PST)
Received: from sivka.carrier.kiev.ua (root@sivka.carrier.kiev.ua [193.125.68.130])
        by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id LAA29065
        for ; Mon, 8 Jan 1996 11:22:11 -0800 (PST)
Received: from elvisti.kiev.ua (uucp@localhost)
        by sivka.carrier.kiev.ua (Sendmail 8.who.cares/5) with UUCP id VAA05552
        for ports@freebsd.org; Mon, 8 Jan 1996 21:13:24 +0200
Received: from office.elvisti.kiev.ua (office.elvisti.kiev.ua [193.125.28.33])
        by spider2.elvisti.kiev.ua (8.6.12/8.ElVisti) with ESMTP id UAA11729
        for ; Mon, 8 Jan 1996 20:05:42 +0200
Received: (from stesin@localhost)
        by office.elvisti.kiev.ua (8.6.12/8.ElVisti) id UAA12117;
        Mon, 8 Jan 1996 20:05:41 +0200
From: "Andrew V. Stesin"
Message-Id: <199601081805.UAA12117@office.elvisti.kiev.ua>
Subject: Re: making ports
To: chuckr@glue.umd.edu (Chuck Robey)
Date: Mon, 8 Jan 1996 20:05:41 +0200 (EET)
Cc: ports@freebsd.org
In-Reply-To: from "Chuck Robey" at Jan 7, 96 10:29:44 am
X-Mailer: ELM [version 2.4 PL24alpha5]
Content-Type: text
Sender: owner-ports@freebsd.org
Precedence: bulk

Hello again,

# > Isn't Glimpse only a part of a whole lot bigger Harvest-1.4pl1
# > distribution now?
# > I'm just compiling Harvest in order to learn this pretty complex
# > thing. It's companion, cached-1.4pl0 (proxy HTTP caching daemon)
# > is waiting here, too.
# >
# > (ftp://ftp.cs.colorado.edu/pub/distribs/harvest)
# >
# > --
#
# It may be, but that's not immediately obvious from the glimpse side, at
# least to me. If you're doing harvest, you'll probably find that out for
# us. Glimpse seems to be a general purpose text search engine ... is
# Harvest something that has been specialized for web stuff?

Not only for web -- just for everything!
Harvest as a whole is pretty complex, but (in my opinion) it has a really powerful high-level design, based on good ideas. Right now I'm printing and reading a couple of tech reports on Harvest and its User Manual (all in .ps, that's sloooow on a 9-pin Epson :) in order to catch more details.

Here is what I've figured out so far, very briefly -- if anyone is interested, all this info is available electronically. The design is 2-level:

1. "Gatherer"-like tools. Their purpose is to extract relevant information from different sources with a tunable degree of detail. "Different" here means that the files to be processed may be

   a) accessed locally or remotely from a set of servers, public or private ones, via FTP, HTTP, NNTP and whatever else;

   b) of different formats, including .ps, SGML, HTML as an SGML subset, RCS/CVS, .o, netnews, e-mail archives, even .gif, and _many_ others; one can add new custom file types. E.g. a tool to convert WordPerfect files to something like RTF, then to SGML, and then make "juice" from it is available, too.

In brief: a Gatherer makes X litres of juice from Y tons of oranges :-) Y is much greater than X.

The Gatherer which comes with Harvest is based on a tool named Essence, developed by Hardy and Schwartz at cs.colorado.edu. It is capable of making "juice" even from "nested" things, like .tgz files containing any of the above formats! How high is the percentage of juice? It's tunable; you may set it to "full" mode, where nothing is lost, or to some less detailed mode where it drops some words. The tech report on Essence has comparisons with WAIS's content-capturing efficiency and index sizes; Essence is better (I'm inclined to believe this without precise testing -- I tried WAIS and wasn't happy with it).

All this makes the idea of installing a Gatherer on a _big_ FTP server (like ftp.cdrom.com :) pretty attractive; see below.
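
To make the "juice" idea a bit more concrete, here is a minimal Python sketch of a Gatherer-like extractor. It is only an illustration under my own assumptions, not how Essence is actually implemented: the names squeeze() and make_tgz() are hypothetical, "juice" is reduced to a sorted list of distinct words, archives are recognized by file name suffix only, and the "less detailed" mode is approximated by dropping short words.

```python
import io
import re
import tarfile

# Assumption: a "word" is a letter followed by letters, digits or underscores.
WORD = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
# Assumption: archives are recognized by suffix only (a real tool would
# also look at the file contents).
ARCHIVE_SUFFIXES = (".tgz", ".tar.gz", ".tar")


def squeeze(name, data, full=True, min_len=4):
    """Return {object name: sorted list of distinct lowercase words}.

    full=True keeps every distinct word; full=False drops words shorter
    than min_len. Tar archives are unpacked in memory and each regular
    member is squeezed recursively, so nested archives work too.
    """
    if name.endswith(ARCHIVE_SUFFIXES):
        juice = {}
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
            for member in tar.getmembers():
                if member.isfile():
                    inner = tar.extractfile(member).read()
                    juice.update(
                        squeeze(f"{name}/{member.name}", inner, full, min_len)
                    )
        return juice
    # latin-1 never fails to decode, so binary files yield whatever
    # printable words happen to be embedded in them.
    words = {w.lower() for w in WORD.findall(data.decode("latin-1"))}
    if not full:
        words = {w for w in words if len(w) >= min_len}
    return {name: sorted(words)}


def make_tgz(files):
    """Build an in-memory .tgz from {name: bytes}; demo helper only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for fname, payload in files.items():
            info = tarfile.TarInfo(fname)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


if __name__ == "__main__":
    archive = make_tgz({"README": b"Glimpse is a text search engine"})
    print(squeeze("demo.tgz", archive, full=True))
    print(squeeze("demo.tgz", archive, full=False))
```

The two calls at the end squeeze the same archive in "full" and in reduced mode; the second result is smaller, which is the X-litres-from-Y-tons trade-off in miniature.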
I'm going to look deeper into Gatherer/Essence soon and try to figure out why they use the GNU dbm library (and package it with Harvest) and GNU malloc. I'm not too happy with this idea; I suspect that using Berkeley DB and PHK malloc (thanks, Poul! that's the very best malloc() I've used!) will give some performance benefit.

2. "Brokers". A Broker can collect "juice" from one or more local or remote Gatherers; information is retrieved from them via some conventional protocol (FTP?). It then makes an index of what it got, and provides an out-of-the-box WWW binding for searches. Damn cool!

Harvest's Broker can use different search engines. Glimpse and WAIS bindings are already present, and there are hooks for others. There is some stuff for using the commercial Topic search engine from Verity Inc. (see http://www.verity.com). I've already asked them about the details and hope they will send me an answer.

(!) My impression so far is that there aren't too many search engines around, either free or commercial. If someone points me at something other than Glimpse, WAIS and Topic, I'd be very grateful, and I'll test every one which is freely available.

Glimpse is the "default" engine for Harvest. It was written recently by the author(s) of the 'agrep' tool; it's agrep-based and includes the agrep distribution. The Harvest distribution includes a lemon-fresh Glimpse distribution in it.

So, as for ftp.cdrom.com: once you've got a Gatherer working, you may launch a Broker, hook it to www.cdrom.com and voila! One can even find the name of the author of /pub/msdos/virus.exe, hidden somewhere in the executable! :-)

Another great tool (especially for those who have a LAN with a slow link to the world) is Cached -- I'm planning to put it on our firewall-gateway, which I'm building now step by step.

So people, I believe it's cool! And let's make FreeBSD Harvest's platform of choice! :-) I'll let people know about my experience with Harvest later, once I get it running.

Yes, I'm considering making a port, too; but only after I become a Harvest expert and have answers to all of my current questions.

BTW: what is the preferred way to handle GNU autoconf when making a port for FreeBSD?

--

With best regards -- Andrew Stesin.

+380 (44) 2760188 +380 (44) 2713457 +380 (44) 2713560

An undocumented feature is a coding error.