From owner-freebsd-database Mon Mar 30 11:53:38 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id LAA05906 for freebsd-database-outgoing; Mon, 30 Mar 1998 11:53:38 -0800 (PST) (envelope-from owner-freebsd-database@FreeBSD.ORG) Received: from sendero.simon-shapiro.org (sendero-fddi.Simon-Shapiro.ORG [206.190.148.2]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id LAA05893 for ; Mon, 30 Mar 1998 11:53:32 -0800 (PST) (envelope-from shimon@simon-shapiro.org) Received: (qmail 2922 invoked from network); 30 Mar 1998 20:02:45 -0000 Received: from localhost.simon-shapiro.org (HELO sendero-fxp0.simon-shapiro.org) (@127.0.0.1) by localhost.simon-shapiro.org with SMTP; 30 Mar 1998 20:02:45 -0000 Message-ID: X-Mailer: XFMail 1.3-alpha-032398 [p0] on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: Date: Mon, 30 Mar 1998 12:02:45 -0800 (PST) Reply-To: shimon@simon-shapiro.org Organization: The Simon Shapiro Foundation From: Simon Shapiro To: John Fieber Subject: Re: [PORTS] Pgaccess doesn't run on -current anymore, Update Cc: Amancio Hasty , Satoshi Asami , scrappy@hub.org, andreas@klemm.gtn.com, freebsd-database@FreeBSD.ORG, Wolfram Schneider Sender: owner-freebsd-database@FreeBSD.ORG Precedence: bulk On 30-Mar-98 John Fieber wrote: ... > It has been well established for many years by professionals in > database R&D that traditional a RDBMS are utterly and completely > the wrong tool for free text searching. This turns out to be > true even for some relatively structured data types like > bibliographic records. I made a descent carreer building systems that `established professionals'' said could not be built. We can discuss some of these privately :-) Having said that, you are probably right, to a degree. The way around it is NOT to search free text in the database. > There *are* some tasks in a real-world applications that are > RDBMS type things--a message-id based thread index is simple to > implement for instance--so I'm all for hybrid systems. The big > RDBMS vendors usually have some optional module optimized for > free-text searching module and some SQL extensions to access it. > I've pondered writing such a module for postgres, but don't > really know enough about extending postgres to know how well it > would work. This is what I had in mind exactly. To normalize what can be normalized, and leave the rest of it as text. The problem, in a UFS, is that when the number of files in the filesystem grows, directory searches become very costly. The mail archive (as secondary as it may appear) is an opportunity to investigate these issues. With million messages split across several dozens directories (unless you has the message IDs into the lists, etc.), you should be seeing some performance dgradation in open(2), which does directory scans. How about putting the message body as TEXT datatype into the RDBMS. At least you can query it by some integer index. This means you can use a B-Tree to find the message, rather than dirscan. If the message is in a blob, applying regex to it, from within the database can be optimized. Another option you mention, and Postgres is IDEAL for that, is a new, native data type. Search logic can then be applied, and even integrated with the system. Something to think about. Simon To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-database" in the body of the message