From: Terry Lambert
Date: Wed, 22 Oct 2003 01:43:10 -0700
To: "Robert J. Adams (jason)"
Cc: freebsd-fs@freebsd.org
Subject: Re: >1 systems 1 FS
Message-ID: <3F96431E.A30656E3@mindspring.com>
References: <3F95B946.8010309@newshosting.com> <20031021233414.GJ99943@elvis.mu.org> <3F95C6F3.8030005@siscom.net>

"Robert J. Adams (jason)" wrote:
> Alfred Perlstein wrote:
> >>Hello,
> >>
> >>I'm working on a new cluster design and had a quick question. If I have
> >>a few boxes mounting the same FS (over a SAN) all read-only will it
> >>work? Will I have any trouble? Has anyone tried this with UFS/UFS2 ..
> >
> > You shouldn't.
>
> I shouldn't do this or I shouldn't have trouble? :)
>
> >>Lets take it one step further..
> >>lets say I have 1 box that mounts it
> >>RW.. and it updates the contents .. will the other systems that have it
> >>mounted RO puke?
> >
> >
> > Likely.
>
> Well shit.. I need this.

Then you need a new FS.

The issue is that you effectively need block-level or range-of-blocks
locking on the device, over the shared interface wire, to be able to do
this, since a device that is the target of multiple master devices has to
know whom to permit onto the blocks and whom to keep off them.

Firewire was supposed to fix this, and so was SCSI 3. The parts of the
SCSI 3 standard that deal with this particular issue have not been
finalized, because each device vendor is jockeying to get its own
implementation standardized to get a jump on all the other vendors,
instead of cooperating on establishing an open standard. This is one of
the main reasons that the SCSI 3 standard is not yet final (the other
main reason is that a number of the participants also sell IDE disks,
and whatever's bad for SCSI is good for IDE, so they are being
obstructionist jerks because they can).

There are a number of FS implementations that can deal with this,
however, and the way they deal with it is by implementing an
out-of-device-control-band block-level or range-of-blocks locking
protocol, usually over ethernet, to ensure that they can get exclusive
access to the blocks. Usually, this is implemented as multiple reader,
single writer locking, with the ability to go exclusive ("SIX locking"
-- "Shared Intention eXclusive"; look for it in your favorite search
engine).

Obviously, doing this in-band with explicit enforcement, and with no
need for inter-node failure recovery because the locks are stored in the
physical device (i.e. the SCSI 3 approach), would have significant
performance benefits over an external lock manager that relies on the
machines voluntarily participating and not going down.
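To make the multiple reader, single writer range locking concrete, here
is a minimal sketch of what such an external lock manager has to decide.
It is a toy, not the protocol of GFS or of any other product: the class
name, the node names, and the in-memory list are all assumptions made
for illustration, and a real manager would sit behind a network service
and would have to cope with nodes dying while they hold locks.

```python
from dataclasses import dataclass

SHARED = "shared"        # reader lock: any number of nodes may hold it
EXCLUSIVE = "exclusive"  # writer lock: only one node may hold it


@dataclass(frozen=True)
class RangeLock:
    node: str   # hypothetical node identifier
    start: int  # first block in the range (inclusive)
    end: int    # one past the last block (exclusive)
    mode: str   # SHARED or EXCLUSIVE


class BlockRangeLockManager:
    """Toy in-memory lock manager (hypothetical, for illustration only)."""

    def __init__(self):
        self._locks = []

    def _conflicts(self, held, node, start, end, mode):
        if held.node == node:
            # A node never conflicts with itself, so a reader can
            # "go exclusive" when no other node overlaps its range.
            return False
        overlaps = held.start < end and start < held.end
        # Overlapping ranges conflict unless both sides are readers.
        return overlaps and EXCLUSIVE in (held.mode, mode)

    def acquire(self, node, start, end, mode):
        """Grant the lock and return True, or return False on conflict."""
        if start >= end:
            raise ValueError("empty block range")
        if any(self._conflicts(h, node, start, end, mode)
               for h in self._locks):
            return False
        self._locks.append(RangeLock(node, start, end, mode))
        return True

    def release_all(self, node):
        """Drop every lock a node holds, e.g. once it is declared dead."""
        self._locks = [h for h in self._locks if h.node != node]


mgr = BlockRangeLockManager()
mgr.acquire("node-a", 0, 100, SHARED)       # granted
mgr.acquire("node-b", 50, 150, SHARED)      # granted: readers may overlap
mgr.acquire("node-c", 90, 110, EXCLUSIVE)   # refused: readers hold 90-110
```

The release_all() call is where the voluntary-participation problem
shows up: something outside the device has to notice that a node went
down and decide it is safe to throw its locks away.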

One example of an FS that can do this is GFS, from Sistina; they used to
have an open-source version (under the GPL), but appear to have since
come to their senses. I ported all the user space tools for GFS to
FreeBSD in about 4 hours of work one night, when it was still available
under the GPL. See their propaganda at:

http://www.sistina.com/products_gfs.htm

IBM also has two FSs that can do this, but they don't even run on Linux,
let alone FreeBSD. In theory, SGI CXFS will also do this (I haven't
gotten enough information from non-proprietary channels to be able to
disclose much here and be on sound legal footing).

Another company that had a product in this space was Zambeel; they were
a Fremont startup, and, among other people, they had hired Mohit Aron
from Rice University (he did the ResCon LRP implementation and was
associated with the SCALA Server project and Peter Druschel's group).
The company showed a lot of promise, but apparently burnt all its
first-round money, to the tune of $65M at the rate of $1M/month, with
only 90 people in headcount a little more than a year ago.
Unfortunately, they croaked last April:

http://www.byteandswitch.com/document.asp?doc_id=31886&site=byteandswitch

and it's not likely that anyone will be jumping into the space very
soon, since it hasn't been very profitable for the companies trying to
stake out territory there.

Anyway, the normal way this is handled for SAN/NAS devices is to carve
out a logical volume region on a per-machine basis and forget the
locking altogether (giving a management node "ownership" of the "as yet
unallocated regions"), which avoids contention by separating the
contention domains entirely. Not a very satisfying way of doing it, if
you ask me.

-- Terry