From owner-freebsd-hackers@FreeBSD.ORG Tue Feb 12 15:58:21 2008
Return-Path:
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A56C116A41B;
	Tue, 12 Feb 2008 15:58:21 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail16.syd.optusnet.com.au (mail16.syd.optusnet.com.au [211.29.132.197])
	by mx1.freebsd.org (Postfix) with ESMTP id 34D7613C4DB;
	Tue, 12 Feb 2008 15:58:21 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c211-30-219-213.carlnfd3.nsw.optusnet.com.au [211.30.219.213])
	by mail16.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id m1CFwC0G014582
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Wed, 13 Feb 2008 02:58:18 +1100
Date: Wed, 13 Feb 2008 02:58:12 +1100 (EST)
From: Bruce Evans
X-X-Sender: bde@besplex.bde.org
To: Andriy Gapon
In-Reply-To: <47B1A631.1000504@icyb.net.ua>
Message-ID: <20080213015738.L24074@besplex.bde.org>
References: <4299.1202816854@critter.freebsd.dk>
	<20080212234018.O92415@delplex.bde.org> <47B1A631.1000504@icyb.net.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Mailman-Approved-At: Wed, 13 Feb 2008 02:15:00 +0000
Cc: Bruce Evans, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org,
	Poul-Henning Kamp, Pav Lucistnik, Bruce Evans
Subject: Re: fs/udf: vm pages "overlap" while reading large dir
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
X-List-Received-Date: Tue, 12 Feb 2008 15:58:21 -0000

On Tue, 12 Feb 2008, Andriy Gapon wrote:

> on 12/02/2008 15:11 Bruce Evans said the following:
>> On Tue, 12 Feb 2008, Poul-Henning Kamp wrote:
>>
>>> In message <47B181F2.2070808@icyb.net.ua>, Andriy Gapon writes:
>>>> 3.1.
>>>> for a fresh buf getblk would assign the following:
>>>>     bsize = bo->bo_bsize;
>>>>     offset = blkno * bsize;
>>>
>>> That's clearly wrong.
>>
>> If units were always 512-blocks, then anything except bsize = DEV_BSIZE
>> would be clearly wrong.  Things aren't that simple (but probably should
>> be).  Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks.
>
> Oh, I missed this obvious thing!

Me too.

> Bruce, thank you for putting me back on the ground :-)
>
> Even in the UDF case, when we bread() via a file or directory vnode we
> pass a logical 2048-byte block number (within the file) to bread().  In
> this case the code in getblk() does the right thing when it calculates
> the offset as blkno * 2048.

Calculating it assuming 512-byte units would be a disaster.  I think the
size is mnt_stat.f_iosize in general (but not for device vnodes).

> And the actual reading works correctly because udf_strategy is called
> for such vnodes and it translates block numbers from physical to logical
> and also correctly re-calculates b_iooffset for the actual reading.  So
> the b_iooffset value set in breadn() (using bdtob) is completely ignored.

Is this setting (and the corresponding setting for writing) ever used?

> I remember that this is why g_vfs_* was my primary target.
> It seems that I am reverting to my earlier opinion: either g_vfs_open
> should be smart about setting bo_bsize, or g_vfs_strategy should be
> smart about calculating bio_offset, or individual filesystems should
> not depend on g_vfs_* always doing the best thing for them.

In fact, ffs already overrides the setting of bo_bsize for the device
vnode to a different wrong setting -- g_vfs_open() sets the sector size,
and ffs_mount() changes the setting to fs_bsize, but ffs actually needs
the setting to be DEV_BSIZE (I think).  Other bugs from this:
- ffs_rawread wants the sector size, and it assumes that this is in
  bo_bsize for the device vnode, but ffs_mount() has changed this to
  fs_bsize.
- multiple r/o mounts are supposed to work, but don't, since there is
  only one device vnode with a shared bufobj, but the bufobj needs to
  be per-file-system since all mounts write to it.  Various bad things
  including panics result.  There is a well-known panic from bo_private
  becoming garbage on unmount.  I just noticed more possibilities for
  panics.  bo_ops points to static storage, so it never becomes complete
  garbage.  However, at least ffs sets it blindly early on in
  ffs_mountfs(), before looking at the file system to see if ffs can
  mount it.  Thus if the file system is already mounted by another ffs,
  then ffs clobbers the other fs's bo_ops.  The ffs mount will
  presumably fail, leaving bo_ops clobbered.  Also, a successful
  g_vfs_open() has clobbered bo_ops, bo_private and bo_bsize a little
  earlier.  g_vfs_open() is fundamental to looking at the file system,
  since I/O is not set up until it completes.  Clobbering the pointers
  is most dangerous, but just clobbering bo_bsize breaks blkno
  calculations for any code that uses bo_bsize.

Apart from these bugs, the fix for the blkno calculations for device
vp's may be as simple as setting bo_bsize to DEV_BSIZE for the device
vp of all disk file systems (since all disk file systems expect this
size).  Currently, ffs seems to be the only file system that overrides
g_vfs_open()'s default of the sector size.  Nothing in any of fs/,
gnu/fs/ and contrib/ references bo_bsize.

> In any case, it seems that it is g_vfs_* that is currently
> inconsistent: it sets bo_bsize to a somewhat arbitrary value, but
> expects that it will always be provided with a correct b_iooffset (the
> latter being rigidly calculated via bdtob() in the buf cache code).
> So this leaves filesystems dead in the water when reading via a device
> vnode: provide blkno in 512-byte units -- screw up the VM cache;
> provide blkno in bo_bsize units -- screw up reading from disk.
I think I/O on the disk doesn't use bo_bsize, but only the sector size
(to scale the byte count to a sector count).  Otherwise, ffs's override
would break I/O.  geom is another place that has few references to
bo_bsize -- it just clobbers it in g_vfs_open().

> Not sure if the FS-s should have the wits to set bo_bsize properly
> and/or provide proper bo_ops -- we are talking about a device vnode
> here.  Should filesystems be involved in the business of setting its
> properties?  Not sure.

Yes.  They can set it more easily, since they can tell g_vfs_open()
what to set it to.  Except, until the bugs are fixed properly,
g_vfs_open() can more easily do tests to prevent clobbering.  For
bo_bsize and bo_ops, sharing a common value is safe, and unsafe changes
can be detected by setting to a special value on last unmount.  For
bo_private, a safety check would probably disallow multiple mounts
(since cp is dynamically allocated (?)).

Bruce