From: Paul Allen
To: Pawel Jakub Dawidek
Cc: freebsd-fs@freebsd.org, Craig Boston, freebsd-geom@freebsd.org,
    freebsd-arch@freebsd.org
Date: Thu, 10 Aug 2006 19:49:21 -0700
Subject: Re: GJournal (hopefully) final patches.
Message-ID: <20060811024921.GF308@mark.ugcs.caltech.edu>
In-Reply-To: <20060810192841.GA1345@garage.freebsd.pl>

It's a bit disturbing that a geom class quite far away from the storage
drivers presumes that the proper action here is a cache flush.

The underlying hardware may support tagged command queuing (i.e., SCSI's
ability not only to receive transaction completion notifications but also
to permit partial orderings to be dictated to the controller) or native
command queuing (command completion).

It's true that this functionality may not always work as advertised, but
that's a problem to be solved with dev.
sysctls, not by taking an LCD (lowest-common-denominator) approach in a
high-level geom class. This really needs broader architectural
consideration, not just whatever it takes to make it work.

    Paul

From Pawel Jakub Dawidek, Thu, Aug 10, 2006 at 09:28:41PM +0200:
> On Thu, Aug 10, 2006 at 01:47:23PM -0500, Craig Boston wrote:
> > Hi,
> >
> > It's great to see this project so close to completion!  I'm trying it
> > out on a couple machines to see how it goes.
> >
> > A few comments and questions:
> >
> > * It took me a little by surprise that it carves 1G out of the device
> >   for the journal.  Depending on the size of the device that can be a
> >   pretty hefty price to pay (and I didn't see any mention of it in the
> >   setup notes).  For a couple of my smaller filesystems I reduced it to
> >   512MB.  Perhaps some algorithm for auto-sizing the journal based on
> >   the size / expected workload of the device would be in order?
>
> It will be pointed out in the documentation when I finally prepare it.
> I don't currently have plans for autosizing.
>
> > * Attached is a quick patch for geom_eli to allow it to pass BIO_FLUSH
> >   down to its backing device.  It seems like the right thing to do and
> >   fixes the "BIO_FLUSH not supported" warning on my laptop that uses a
> >   geli encrypted disk.
>
> I already have this in my Perforce tree. I also implemented BIO_FLUSH
> passing in gmirror and graid3.
>
> I also added a flag for gmirror and graid3 which says "don't
> resynchronize components after a power failure - trust they are
> consistent". And they are always consistent when placed below gjournal.
>
> > * On a different system, however, it complains about it even on a raw
> >   ATA slice:
> >
> >   atapci1: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 31.1 on pci0
> >   ata0: on atapci1
> >   ad0: 114473MB at ata0-master UDMA100
> >   GEOM_JOURNAL: BIO_FLUSH not supported by ad0s1e.
> >
> >   It seems like a reasonably modern controller and disk, at least it
> >   should be capable of issuing a cache flush command.  Not sure why it
> >   doesn't like it :/
>
> We would probably need to add some printfs to diagnose this - you can
> try adding some lines to ad_init() to get this:
>
>     if (atadev->param.support.command1 & ATA_SUPPORT_WRITECACHE) {
>         if (ata_wc)
>             ata_controlcmd(dev, ATA_SETFEATURES, ATA_SF_ENAB_WCACHE, 0, 0);
>         else
>             ata_controlcmd(dev, ATA_SETFEATURES, ATA_SF_DIS_WCACHE, 0, 0);
>     } else {
>         printf("ad_init: WRITE CACHE not supported by ad%d.\n",
>             device_get_unit(dev));
>     }
>
> > * How "close" does the filesystem need to be to the gjournal device in
> >   order for the UFS hooks to work?  Directly on it?
> >
> >   The geom stack on my laptop currently looks something like this:
> >
> >   [geom_disk] ad0 <- [geom_eli] ad0.eli <- [geom_gpt] ad0.elip6 <-
> >   [geom_label] gjtest <- [geom_journal] gjtest.journal <- UFS
> >
> >   I was wondering if an arrangement like this would work:
> >
> >   [geom_journal] ad0p6.journal <- [geom_eli] ad0p6.journaleli <- UFS
> >
> >   and if it would be any more efficient (journal the encrypted data
> >   rather than encrypt the journal).  Or even gjournal the whole disk at
> >   once?
>
> When you mount a file system, it sends BIO_GETATTR "GJOURNAL::provider"
> requests. So as long as the classes between the file system and the
> gjournal provider pass BIO_GETATTR down, it will work.
>
> On my home machine I have the following configuration:
>
>     raid3/DATA1.elid.journal
>
> So it's UFS over gjournal over bsdlabel over geli over raid3 over ata.
>
> I prefer to put gjournal on top, because it gives consistency to the
> layers below it. For example, I can use geli with a bigger sector size
> (a sector size greater than the disk sector size in encryption-only mode
> can be unreliable on power failures, which is not the case when gjournal
> is above geli), I can turn off synchronization of gmirror/graid3 after a
> power failure, etc.
>
> On the other hand, configuring geli on top of gjournal can be more
> efficient for large files - geli will not encrypt the data twice.
>
> Fortunately, with GEOM you can freely mix your puzzles.
>
> > Haven't been brave enough to try gjournal on root yet, but my /usr and
> > /compile (src, obj, ports) partitions are already on it so I'm sure I'll
> > try it soon ;)
>
> Markus Trippelsdorf reported that it doesn't work out of the box, but he
> managed to make it work with some small changes to fsck_ffs(8).
>
> -- 
> Pawel Jakub Dawidek                       http://www.wheel.pl
> pjd@FreeBSD.org                           http://www.FreeBSD.org
> FreeBSD committer                         Am I Evil? Yes, I Am!
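
The BIO_GETATTR rule quoted above (the mount-time "GJOURNAL::provider"
query has to be passed down by every class sitting between the file
system and the gjournal provider) can be illustrated with a small
user-space model. This is only a sketch: struct layer, find_gjournal()
and craig_stack_finds_journal() are hypothetical names made up for
illustration and are not GEOM kernel APIs, and whether a given class
(geom_eli here) really forwards BIO_GETATTR is left as an input
assumption rather than asserted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space model of one layer in a GEOM stack.
 * This is NOT the kernel's struct g_geom or struct bio.
 */
struct layer {
    const char *name;           /* class name, e.g. "geom_eli" */
    bool passes_getattr;        /* assumed: forwards BIO_GETATTR downwards */
    bool is_gjournal;           /* answers the "GJOURNAL::provider" query */
    const struct layer *below;  /* next layer toward the disk, or NULL */
};

/*
 * Walk down from the layer the file system is mounted on.  Return the
 * name of the layer that answers the query, or NULL if the request is
 * dropped by a layer that does not forward it, or if it reaches the
 * bottom of the stack unanswered.
 */
const char *
find_gjournal(const struct layer *top)
{
    for (const struct layer *l = top; l != NULL; l = l->below) {
        if (l->is_gjournal)
            return l->name;
        if (!l->passes_getattr)
            return NULL;
    }
    return NULL;
}

/*
 * Craig's proposed arrangement: UFS on geom_eli on geom_journal on the
 * disk.  eli_forwards states the assumption about geom_eli's behaviour.
 */
bool
craig_stack_finds_journal(bool eli_forwards)
{
    struct layer disk = { "geom_disk", true, false, NULL };
    struct layer jrnl = { "geom_journal", true, true, &disk };
    struct layer eli  = { "geom_eli", eli_forwards, false, &jrnl };
    const char *hit = find_gjournal(&eli);

    return hit != NULL && strcmp(hit, "geom_journal") == 0;
}
```

In this model, craig_stack_finds_journal(true) succeeds and
craig_stack_finds_journal(false) fails, which mirrors the condition
stated in the reply: the arrangement works only if the intermediate
class passes the request down. With gjournal on top of the stack, as in
the raid3/DATA1.elid.journal configuration, the query is answered by the
first layer and nothing below it needs to forward anything.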