Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 18 Mar 2010 15:47:40 +0200
From:      Andriy Gapon <avg@freebsd.org>
To:        freebsd-arch@freebsd.org
Subject:   g_vfs_open and bread(devvp, ...)
Message-ID:  <4BA22EFC.8020902@freebsd.org>

next in thread | raw e-mail | index | archive | help

Forwarding here in an attempt to increase exposure.

-------- Original Message --------
Subject: g_vfs_open and bread(devvp, ...)
Date: Wed, 17 Mar 2010 11:52:32 +0200
From: Andriy Gapon <avg@freebsd.org>
To: freebsd-fs@FreeBSD.org, freebsd-geom@freebsd.org
CC: Poul-Henning Kamp <phk@freebsd.org>, Bruce Evans <bde@zeta.org.au>


I've given a fresh look to the issue of g_vfs_open and bread(devvp, ...) in
filesystem code.  This time I hope to present my reasoning more clearly than I did
in my previous attempts.
For this reason I am omitting historical references and dramatic
examples/demonstrations but they are still available upon request (and in archives).
I hope that my shortened notation in references to the code and data structures
will not be confusing.

bread() and the API family it belongs to is an interface to buffer cache system.
Buffer cache system can be roughly divided into two parts: cache part and I/O path
part.
I think that it is understood that both parts must have the same notion of a block
size when translating a block number to a byte offset.  (If this point is not
clear, I can follow up with another essay).

In the case of cache code the translation is explicit and simple:
A) offset = blkno * bo_bsize

In the case of I/O code the translation is not that straightforward, because it
can be altered/overridden by bop_strategy which can in turn hook vop_strategy, etc.

Let's consider a simple case of a filesystem that:
a) connects to a geom provider via g_vfs_open
b) doesn't modify anything in devvp->v_bufobj (in particular bo_ops and bo_bsize)
c) uses bread(devvp) (e.g. to access an equivalent of superblock, etc)

Short overview of geom_vfs glue:
1) g_vfs_open sets devvp->v_bufobj.bo_ops to g_vfs_bufops, where bop_strategy is
g_vfs_strategy
2) bo_bsize is set to pp->sectorsize
3) g_vfs_strategy doesn't perform any block-to-offset translation of its own, it
expects b_iooffset to be correctly set and passes its value to bio_offset

When a filesystem issues bread(devvp) the following happens in the I/O path:
I) bread() calls breadn()
II) in breadn(): bp->b_iooffset = dbtob(bp->b_blkno), that is b_iooffset is set to
blkno * DEV_BSIZE (where DEV_BSIZE is 512)
III) breadn() then calls bstrategy() which is a simple wrapper around BO_STRATEGY
IV) g_vfs_strategy gets called and, as described in (3) above, it simply passes on
b_iooffset value to bio_offset
V) thus, a block size used for I/O operation is 512 (DEV_BSIZE)
VI) on the other hand, as stated in (A) above, block size used in caching code is
bo_bsize

Thus, if bo_bsize != DEV_BSIZE, or alternatively said, pp->sectorsize != 512, we
have a trouble of data getting cached with incorrect offsets.

Additionally, from (V) above we must conclude that a filesystem must specify block
numbers in DEV_BSIZE units to bread(devvp, blkno, ...) if the conditions (a), (b),
(c) are met.  In fact, all such filesystems already do that, because otherwise
they would read incorrect data from the media.

So, the problem is only with the caching of the data.
As such, this issue has little practical effects, because only a small number of
reads is done via devvp and only for sufficiently small chunks of data (I hope).
fs/udf used to be greatly affected by this issue when it was reading directory
nodes via devvp, but that was in the past (prior to 189082).

Still I think that we should fix this issue for general code correctness/quality
reasons.  And also to avoid possible future bugs.

As demonstrated by (V) and (VI) above, obvious and easiest fix is to (always) set
bo_bsize to DEV_BSIZE in g_vfs_open():
--- a/sys/geom/geom_vfs.c
+++ b/sys/geom/geom_vfs.c
@@ -179,7 +179,7 @@ g_vfs_open(struct vnode *vp, struct g_consumer **cpp, const
char *fsname, int wr
 	bo = &vp->v_bufobj;
 	bo->bo_ops = g_vfs_bufops;
 	bo->bo_private = cp;
-	bo->bo_bsize = pp->sectorsize;
+	bo->bo_bsize = DEV_BSIZE;
 	gp->softc = bo;

 	return (error);

I tested this change with ufs, udf and cd9660 and haven't observed any regressions.

P.S.
There might something to changing bread(devvp) convention, so that blkno is
expected in sectorsize units.  But setting bo_bsize to sectorsize is only a tiny
portion of what needs to be changed to make it actually work.

-- 
Andriy Gapon


-- 
Andriy Gapon



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4BA22EFC.8020902>