From owner-freebsd-hackers@FreeBSD.ORG Tue Feb 12 15:58:21 2008
Return-Path:
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A56C116A41B;
	Tue, 12 Feb 2008 15:58:21 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail16.syd.optusnet.com.au (mail16.syd.optusnet.com.au [211.29.132.197])
	by mx1.freebsd.org (Postfix) with ESMTP id 34D7613C4DB;
	Tue, 12 Feb 2008 15:58:21 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c211-30-219-213.carlnfd3.nsw.optusnet.com.au [211.30.219.213])
	by mail16.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id m1CFwC0G014582
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Wed, 13 Feb 2008 02:58:18 +1100
Date: Wed, 13 Feb 2008 02:58:12 +1100 (EST)
From: Bruce Evans
X-X-Sender: bde@besplex.bde.org
To: Andriy Gapon
In-Reply-To: <47B1A631.1000504@icyb.net.ua>
Message-ID: <20080213015738.L24074@besplex.bde.org>
References: <4299.1202816854@critter.freebsd.dk>
	<20080212234018.O92415@delplex.bde.org> <47B1A631.1000504@icyb.net.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Mailman-Approved-At: Wed, 13 Feb 2008 02:15:00 +0000
Cc: Bruce Evans, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org,
	Poul-Henning Kamp, Pav Lucistnik, Bruce Evans
Subject: Re: fs/udf: vm pages "overlap" while reading large dir
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
X-List-Received-Date: Tue, 12 Feb 2008 15:58:21 -0000

On Tue, 12 Feb 2008, Andriy Gapon wrote:

> on 12/02/2008 15:11 Bruce Evans said the following:
>> On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:
>>
>>> In message <47B181F2.2070808@icyb.net.ua>, Andriy Gapon writes:
>>>> 3.1.
>>>> for a fresh buf getblk would assign the following:
>>>>     bsize = bo->bo_bsize;
>>>>     offset = blkno * bsize;
>>>
>>> That's clearly wrong.
>>
>> If units were always 512-blocks, then anything except bsize = DEV_BSIZE
>> would be clearly wrong.  Things aren't that simple (but probably should
>> be).  Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks.
>
> Oh, I missed this obvious thing!

Me too.

> Bruce, thank you for putting me back on the ground :-)
>
> Even in the UDF case, when we bread() via a file or directory vnode we
> pass a logical 2048-byte block number (within the file) to bread().  In
> this case the code in getblk() does the right thing when it calculates
> the offset as blkno * 2048.

Calculating it assuming 512-byte units would be a disaster.  I think the
size is mnt_stat.f_iosize in general (but not for device vnodes).

> And the actual reading works correctly because udf_strategy is called
> for such vnodes and it translates block numbers from physical to logical
> and also correctly re-calculates b_iooffset for the actual reading.  So
> the b_iooffset value set in breadn() (using bdtob) is completely ignored.

Is this setting (and the corresponding setting for writing) ever used?

> I remember that this is why g_vfs_* was my primary target.
> It seems that I am reverting to my earlier opinion: either g_vfs_open
> should be smart about setting bo_bsize, or g_vfs_strategy should be
> smart about calculating bio_offset, or individual filesystems should
> not depend on g_vfs_* always doing the best thing for them.

In fact, ffs already overrides the setting of bo_bsize for the device
vnode to a different wrong setting -- g_vfs_open() sets the sector size,
and ffs_mount() changes the setting to fs_bsize, but ffs actually needs
the setting to be DEV_BSIZE (I think).  Other bugs from this:
- ffs_rawread wants the sector size, and it assumes that this is in
  bo_bsize for the device vnode, but ffs_mount() has changed this to
  fs_bsize.
- multiple r/o mounts are supposed to work, but don't, since there is
  only one device vnode with a shared bufobj, but the bufobj needs to
  be per-file-system since all mounts write to it.  Various bad things
  including panics result.  There is a well-known panic from bo_private
  becoming garbage on unmount.  I just noticed more possibilities for
  panics.  bo_ops points to static storage, so it never becomes complete
  garbage.  However, at least ffs sets it blindly early on in
  ffs_mountfs(), before looking at the file system to see if ffs can
  mount it.  Thus if the file system is already mounted by another ffs,
  then ffs clobbers the other fs's bo_ops.  The ffs mount will
  presumably fail, leaving bo_ops clobbered.  Also, a successful
  g_vfs_open() has clobbered bo_ops, bo_private and bo_bsize a little
  earlier.  g_vfs_open() is fundamental to looking at the file system,
  since I/O is not set up until it completes.  Clobbering the pointers
  is most dangerous, but just clobbering bo_bsize breaks blkno
  calculations for any code that uses bo_bsize.

Apart from these bugs, the fix for the blkno calculations for device
vp's may be as simple as setting bo_bsize to DEV_BSIZE for the device
vp of all disk file systems (since all disk file systems expect this
size).  Currently, ffs seems to be the only file system that overrides
g_vfs_open()'s default of the sector size.  Nothing in any of fs/,
gnu/fs/ and contrib/ references bo_bsize.

> In any case, it seems that it is g_vfs_* that is currently
> inconsistent: it sets bo_bsize to a somewhat arbitrary value, but
> expects that it will always be provided with a correct b_iooffset (the
> latter being rigidly calculated via bdtob() in the buf cache code).
> So this leaves filesystems dead in the water when reading via a device
> vnode: provide blkno in 512-byte units -- screw up the VM cache;
> provide blkno in bo_bsize units -- screw up reading from disk.
I think I/O on the disk doesn't use bo_bsize, but only the sector size
(to scale the byte count to a sector count).  Otherwise, ffs's override
would break I/O.  geom is another place that has few references to
bo_bsize -- it just clobbers it in g_vfs_open().

> Not sure if the FS-s should have the wits to set bo_bsize properly
> and/or provide proper bo_ops -- we are talking about a device vnode
> here.  Should filesystems be involved in the business of setting its
> properties?  Not sure.

Yes.  They can set it more easily, since they can tell g_vfs_open()
what to set it to.  Except, until the bugs are fixed properly,
g_vfs_open() can more easily do tests to prevent clobbering.  For
bo_bsize and bo_ops, sharing a common value is safe, and unsafe changes
can be detected by setting to a special value on last unmount.  For
bo_private, a safety check would probably disallow multiple mounts
(since cp is dynamically allocated (?)).

Bruce