From owner-freebsd-arch@FreeBSD.ORG Thu Mar 18 13:47:43 2010 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 01E031065673 for ; Thu, 18 Mar 2010 13:47:43 +0000 (UTC) (envelope-from avg@freebsd.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id 574168FC1C for ; Thu, 18 Mar 2010 13:47:41 +0000 (UTC) Received: from odyssey.starpoint.kiev.ua (alpha-e.starpoint.kiev.ua [212.40.38.101]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id PAA06489 for ; Thu, 18 Mar 2010 15:47:40 +0200 (EET) (envelope-from avg@freebsd.org) Message-ID: <4BA22EFC.8020902@freebsd.org> Date: Thu, 18 Mar 2010 15:47:40 +0200 From: Andriy Gapon User-Agent: Thunderbird 2.0.0.23 (X11/20100211) MIME-Version: 1.0 To: freebsd-arch@freebsd.org X-Enigmail-Version: 0.95.7 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Subject: g_vfs_open and bread(devvp, ...) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 Mar 2010 13:47:43 -0000 Forwarding here in an attempt to increase exposure. -------- Original Message -------- Subject: g_vfs_open and bread(devvp, ...) Date: Wed, 17 Mar 2010 11:52:32 +0200 From: Andriy Gapon To: freebsd-fs@FreeBSD.org, freebsd-geom@freebsd.org CC: Poul-Henning Kamp , Bruce Evans I've given a fresh look to the issue of g_vfs_open and bread(devvp, ...) in filesystem code. This time I hope to present my reasoning more clearly than I did in my previous attempts. For this reason I am omitting historical references and dramatic examples/demonstrations but they are still available upon request (and in archives). I hope that my shortened notation in references to the code and data structures will not be confusing. bread() and the API family it belongs to is an interface to buffer cache system. Buffer cache system can be roughly divided into two parts: cache part and I/O path part. I think that it is understood that both parts must have the same notion of a block size when translating a block number to a byte offset. (If this point is not clear, I can follow up with another essay). In the case of cache code the translation is explicit and simple: A) offset = blkno * bo_bsize In the case of I/O code the translation is not that straightforward, because it can be altered/overridden by bop_strategy which can in turn hook vop_strategy, etc. Let's consider a simple case of a filesystem that: a) connects to a geom provider via g_vfs_open b) doesn't modify anything in devvp->v_bufobj (in particular bo_ops and bo_bsize) c) uses bread(devvp) (e.g. to access an equivalent of superblock, etc) Short overview of geom_vfs glue: 1) g_vfs_open sets devvp->v_bufobj.bo_ops to g_vfs_bufops, where bop_strategy is g_vfs_strategy 2) bo_bsize is set to pp->sectorsize 3) g_vfs_strategy doesn't perform any block-to-offset translation of its own, it expects b_iooffset to be correctly set and passes its value to bio_offset When a filesystem issues bread(devvp) the following happens in the I/O path: I) bread() calls breadn() II) in breadn(): bp->b_iooffset = dbtob(bp->b_blkno), that is b_iooffset is set to blkno * DEV_BSIZE (where DEV_BSIZE is 512) III) breadn() then calls bstrategy() which is a simple wrapper around BO_STRATEGY IV) g_vfs_strategy gets called and, as described in (3) above, it simply passes on b_iooffset value to bio_offset V) thus, a block size used for I/O operation is 512 (DEV_BSIZE) VI) on the other hand, as stated in (A) above, block size used in caching code is bo_bsize Thus, if bo_bsize != DEV_BSIZE, or alternatively said, pp->sectorsize != 512, we have a trouble of data getting cached with incorrect offsets. Additionally, from (V) above we must conclude that a filesystem must specify block numbers in DEV_BSIZE units to bread(devvp, blkno, ...) if the conditions (a), (b), (c) are met. In fact, all such filesystems already do that, because otherwise they would read incorrect data from the media. So, the problem is only with the caching of the data. As such, this issue has little practical effects, because only a small number of reads is done via devvp and only for sufficiently small chunks of data (I hope). fs/udf used to be greatly affected by this issue when it was reading directory nodes via devvp, but that was in the past (prior to 189082). Still I think that we should fix this issue for general code correctness/quality reasons. And also to avoid possible future bugs. As demonstrated by (V) and (VI) above, obvious and easiest fix is to (always) set bo_bsize to DEV_BSIZE in g_vfs_open(): --- a/sys/geom/geom_vfs.c +++ b/sys/geom/geom_vfs.c @@ -179,7 +179,7 @@ g_vfs_open(struct vnode *vp, struct g_consumer **cpp, const char *fsname, int wr bo = &vp->v_bufobj; bo->bo_ops = g_vfs_bufops; bo->bo_private = cp; - bo->bo_bsize = pp->sectorsize; + bo->bo_bsize = DEV_BSIZE; gp->softc = bo; return (error); I tested this change with ufs, udf and cd9660 and haven't observed any regressions. P.S. There might something to changing bread(devvp) convention, so that blkno is expected in sectorsize units. But setting bo_bsize to sectorsize is only a tiny portion of what needs to be changed to make it actually work. -- Andriy Gapon -- Andriy Gapon