Date: Fri, 26 Mar 2010 15:17:41 +0200 From: Andriy Gapon <avg@freebsd.org> To: freebsd-fs@freebsd.org, freebsd-geom@freebsd.org Subject: Re: g_vfs_open and bread(devvp, ...) Message-ID: <4BACB3F5.7010905@freebsd.org> In-Reply-To: <4BA0A660.3000902@freebsd.org> References: <4BA0A660.3000902@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
Will an offer of a beer help reviewing this change? :-) on 17/03/2010 11:52 Andriy Gapon said the following: > I've given a fresh look to the issue of g_vfs_open and bread(devvp, ...) in > filesystem code. This time I hope to present my reasoning more clearly than I did > in my previous attempts. > For this reason I am omitting historical references and dramatic > examples/demonstrations but they are still available upon request (and in archives). > I hope that my shortened notation in references to the code and data structures > will not be confusing. > > bread() and the API family it belongs to is an interface to buffer cache system. > Buffer cache system can be roughly divided into two parts: cache part and I/O path > part. > I think that it is understood that both parts must have the same notion of a block > size when translating a block number to a byte offset. (If this point is not > clear, I can follow up with another essay). > > In the case of cache code the translation is explicit and simple: > A) offset = blkno * bo_bsize > > In the case of I/O code the translation is not that straightforward, because it > can be altered/overridden by bop_strategy which can in turn hook vop_strategy, etc. > > Let's consider a simple case of a filesystem that: > a) connects to a geom provider via g_vfs_open > b) doesn't modify anything in devvp->v_bufobj (in particular bo_ops and bo_bsize) > c) uses bread(devvp) (e.g. to access an equivalent of superblock, etc) > > Short overview of geom_vfs glue: > 1) g_vfs_open sets devvp->v_bufobj.bo_ops to g_vfs_bufops, where bop_strategy is > g_vfs_strategy > 2) bo_bsize is set to pp->sectorsize > 3) g_vfs_strategy doesn't perform any block-to-offset translation of its own, it > expects b_iooffset to be correctly set and passes its value to bio_offset > > When a filesystem issues bread(devvp) the following happens in the I/O path: > I) bread() calls breadn() > II) in breadn(): bp->b_iooffset = dbtob(bp->b_blkno), that is b_iooffset is set to > blkno * DEV_BSIZE (where DEV_BSIZE is 512) > III) breadn() then calls bstrategy() which is a simple wrapper around BO_STRATEGY > IV) g_vfs_strategy gets called and, as described in (3) above, it simply passes on > b_iooffset value to bio_offset > V) thus, a block size used for I/O operation is 512 (DEV_BSIZE) > VI) on the other hand, as stated in (A) above, block size used in caching code is > bo_bsize > > Thus, if bo_bsize != DEV_BSIZE, or alternatively said, pp->sectorsize != 512, we > have a trouble of data getting cached with incorrect offsets. > > Additionally, from (V) above we must conclude that a filesystem must specify block > numbers in DEV_BSIZE units to bread(devvp, blkno, ...) if the conditions (a), (b), > (c) are met. In fact, all such filesystems already do that, because otherwise > they would read incorrect data from the media. > > So, the problem is only with the caching of the data. > As such, this issue has little practical effects, because only a small number of > reads is done via devvp and only for sufficiently small chunks of data (I hope). > fs/udf used to be greatly affected by this issue when it was reading directory > nodes via devvp, but that was in the past (prior to 189082). > > Still I think that we should fix this issue for general code correctness/quality > reasons. And also to avoid possible future bugs. > > As demonstrated by (V) and (VI) above, obvious and easiest fix is to (always) set > bo_bsize to DEV_BSIZE in g_vfs_open(): > --- a/sys/geom/geom_vfs.c > +++ b/sys/geom/geom_vfs.c > @@ -179,7 +179,7 @@ g_vfs_open(struct vnode *vp, struct g_consumer **cpp, const > char *fsname, int wr > bo = &vp->v_bufobj; > bo->bo_ops = g_vfs_bufops; > bo->bo_private = cp; > - bo->bo_bsize = pp->sectorsize; > + bo->bo_bsize = DEV_BSIZE; > gp->softc = bo; > > return (error); > > I tested this change with ufs, udf and cd9660 and haven't observed any regressions. > > P.S. > There might something to changing bread(devvp) convention, so that blkno is > expected in sectorsize units. But setting bo_bsize to sectorsize is only a tiny > portion of what needs to be changed to make it actually work. > -- Andriy Gapon
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4BACB3F5.7010905>