From owner-freebsd-fs@FreeBSD.ORG Tue Feb 12 11:24:40 2008
Message-ID: <47B181F2.2070808@icyb.net.ua>
Date: Tue, 12 Feb 2008 13:24:34 +0200
From: Andriy Gapon <avg@icyb.net.ua>
To: Poul-Henning Kamp
Cc: freebsd-fs@FreeBSD.org, freebsd-hackers@FreeBSD.org, Pav Lucistnik,
 Bruce Evans
References: <3027.1202806417@critter.freebsd.dk>
In-Reply-To: <3027.1202806417@critter.freebsd.dk>
Subject: Re: fs/udf: vm pages "overlap" while reading large dir

on 12/02/2008 10:53 Poul-Henning Kamp said the
following:
> In message <47B0CE54.6090204@icyb.net.ua>, Andriy Gapon writes:
>> on 06/02/2008 18:29 Andriy Gapon said the following:
>>> Small summary of the above long description.
>>> For directory reading fs/udf performs bread() on a (underlying) device
>>> vnode. It passes block number as if block size was 512 bytes (i.e.
>>> byte_offset_within_dev/512).
>
> We have three sizes of relevance here: the sector size of the provider,
> the block size of the filesystem, and the page size of the system.
>
> In general it is adventurous to have any of them be anything but
> powers of two, and it is at best ill-advised and more likely asking
> for trouble to do requests that are not multiples of, and aligned to,
> the sector size of the provider.
>
> So up front, I would say that it is a UDF bug to do 512-byte reads off
> a 2k-sector provider.
>
> Making the buffer cache handle this is possible, but it is not the
> direction we have planned for the buffer cache (i.e. that it should
> become a wrapper for using the VM system, rather than being a
> separately allocated cache).
>
> So if the objective is to make UDF work in the short term, your
> change might work; if the objective is to move FreeBSD's kernel
> architecture forward, the right thing to do is to fix UDF to not
> access subsectors.

Poul,

I agree with what you say, but I think that I didn't properly explain
what the UDF code is doing and why it might be important in general.
Let me try to do it step by step (sorry if I say something too
obvious). And please correct me if I misunderstand something in the
fundamental steps.

1.1. UDF is typically used with CD/DVD media, so the provider's sector
     size is 2048.
1.2. The udf VFS mount method calls g_vfs_open().
1.3. g_vfs_open() creates a vobject for the device vnode.
1.4. g_vfs_open() sets bo_bsize = pp->sectorsize in the device vnode's
     bufobj.
1.5. g_vfs_open() also overrides bo_ops for the bufobj.

2.1. The UDF directory-reading code performs bread() via the device
     vnode. [*]
2.2.
this code passes to bread() a size that is a multiple of 2048.
2.3. this code passes to bread() a blkno that is calculated as
     4*sector, where sector is the number of a physical 2048-byte
     sector. [**]

[*]  - this seems to be an uncommon thing among filesystems.
[**] - I think that this is a requirement of the buffer cache system,
because internally it performs many calculations that seem to assume
that the block size is always 512. E.g. breadn() has the following:

	bp->b_iooffset = dbtob(bp->b_blkno);	<-- effectively multiplies by 512
	bstrategy(bp);

And g_vfs_strategy() has the following:

	bip->bio_offset = bp->b_iooffset;

So, if the udf code passed blkno == sector, then bio_offset would be
incorrect. Or maybe g_vfs_strategy() should do some translation here
from b_iooffset to bio_offset, taking bo_bsize into account, so that
the actual, non-adjusted sector number could be passed to bread()?

3.1. For a fresh buf, getblk() would assign the following:

	bsize = bo->bo_bsize;
	offset = blkno * bsize;
	...
	bp->b_blkno = bp->b_lblkno = blkno;
	bp->b_offset = offset;	<--- this is where this discussion started

     so b_offset of the buffer is 4*sector*2048. This is the source of
     the trouble.
3.2. getblk() would set bp->b_flags |= B_VMIO if the vnode has a
     vobject, and our device vnode has one (step 1.3).
3.3. Consequently, allocbuf() will execute the code for the B_VMIO case.
3.4. allocbuf() will look up/allocate pages by index, calculated from a
     "base index" of OFF_TO_IDX(bp->b_offset), which is, in essence,
     bp->b_offset divided by the page size, 4096. So our "base index"
     is 2*sector.

4.1. Let's assume we bread() (via the device vnode, as described above)
     starting from physical sector N, and we read 6 physical sectors
     (6*2048 bytes).
4.2. The sectors will get "mapped"/tied/bound to VM pages as follows
     (according to the above calculations):

	sectors[N, N+1]     -> page[2*N],
	sectors[N+2, N+3]   -> page[2*N + 1],	/* the next page */
	sectors[N+4, N+5]   -> page[2*N + 2]	/* the next page */

4.3. Now let's consider "the next read" of X sectors, but this time
     starting from sector N+1; repeating the calculations, we get the
     following mapping:

	sectors[(N+1), (N+1)+1] -> page[2*(N+1)] = page[2*N + 2]

     But this page already has cached data from sectors[N+4, N+5].
     Problem! Theoretical calculations show it, and practical testing
     confirms it.

So this is a proof that bread()-ing via a device vnode is broken if:

C1) the vnode was "set up" by g_vfs_open();
C2) the sector size of the underlying GEOM provider is not 512, but
    some non-trivial multiple of it;
C3) the bread() size is sufficiently big.

The current UDF code for directory reading is the only place known to
me that meets all three of the above conditions (for directories large
enough to meet condition C3).

So I have stated this; now let the wise speak. I already have a patch
that makes UDF read directories via the directory vnodes. But the
problem in general would remain. Maybe g_vfs_open() is the correct
place to change, maybe g_vfs_strategy() is the place, maybe something
else, maybe "don't bother". I don't know.

-- 
Andriy Gapon