Date:      Tue, 12 Feb 2008 13:24:34 +0200
From:      Andriy Gapon <avg@icyb.net.ua>
To:        Poul-Henning Kamp <phk@phk.freebsd.dk>
Cc:        freebsd-fs@FreeBSD.org, freebsd-hackers@FreeBSD.org, Scott Long <scottl@samsco.org>, Pav Lucistnik <pav@FreeBSD.org>, Bruce Evans <bde@zeta.org.au>
Subject:   Re: fs/udf: vm pages "overlap" while reading large dir
Message-ID:  <47B181F2.2070808@icyb.net.ua>
In-Reply-To: <3027.1202806417@critter.freebsd.dk>
References:  <3027.1202806417@critter.freebsd.dk>

on 12/02/2008 10:53 Poul-Henning Kamp said the following:
> In message <47B0CE54.6090204@icyb.net.ua>, Andriy Gapon writes:
>> on 06/02/2008 18:29 Andriy Gapon said the following:
>>> Small summary of the above long description.
>>> For directory reading fs/udf performs bread() on a (underlying) device
>>> vnode. It passes block number as if block size was 512 bytes (i.e.
>>> byte_offset_within_dev/512).
> 
> We have three sizes of relevance here, the sectorsize of the provider,
> the blocksize of the filesystem and the page size of the system.
> 
> In general it is adventurous to have any of them be anything but
> powers of two, and it is at best ill-adviced and more likely asking
> for trouble to do requests that are not multiple of and aligned to
> the sectorsize of the provider.
> 
> So up front, I would say that it is an UDF bug to do 512 byte reads off
> a 2k sector provider.
> 
> Making the buffer cache handle this is possible, but it is not the
> direction we have planned for the buffercache (ie: that it should
> become a wrapper for using the VM system, rather than being a
> separately allocated cache).
> 
> So if the objective is to make UDF work in the short term, your
> change might work, if the objective is to move FreeBSD's kernel
> architecture forward, the right thing to do is to fix UDF to not
> access subsectors.
> 

Poul,

I agree with what you say, but I think I didn't properly explain what
the UDF code is doing and why it might be important in general.
Let me try to do it step by step (sorry if I say something too
obvious), and please correct me if I misunderstand something in the
fundamental steps.

1.1. UDF is typically used with CD/DVD media, so the provider's sector
size is 2048.
1.2. The udf VFS mount method calls g_vfs_open().
1.3. g_vfs_open() creates a vobject for the device vnode.
1.4. g_vfs_open() sets bo_bsize = pp->sectorsize in the device vnode's bufobj.
1.5. g_vfs_open() also overrides bo_ops for the bufobj (a rough sketch
of these assignments follows this list).
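
For reference, the g_vfs_open() assignments relevant to 1.3-1.5 look
roughly like this (paraphrased from memory of sys/geom/geom_vfs.c, so
the exact code may differ):

/* inside g_vfs_open(vp, &cp, fsname, wr), after the consumer cp is opened */
vnode_create_vobject(vp, pp->mediasize, curthread);  /* 1.3: vobject for devvp */
bo = &vp->v_bufobj;
bo->bo_ops = g_vfs_bufops;                           /* 1.5: bo_ops override */
bo->bo_private = cp;
bo->bo_bsize = pp->sectorsize;                       /* 1.4: 2048 for CD/DVD */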

2.1. The UDF directory-reading code performs bread() via the device vnode. [*]
2.2. This code passes to bread() a size that is a multiple of 2048.
2.3. This code passes to bread() a blkno that is calculated as 4*sector,
where sector is the number of a physical 2048-byte sector. [**]
(An illustrative fragment follows the footnotes below.)

[*]  - this seems to be uncommon among filesystems.
[**] - I think this is a requirement of the buffer cache, because
internally it performs many calculations that seem to assume that the
block size is always 512 (DEV_BSIZE).
E.g. breadn() has the following:
bp->b_iooffset = dbtob(bp->b_blkno); <-- effectively multiplies by 512
bstrategy(bp);
And g_vfs_strategy() has the following:
bip->bio_offset = bp->b_iooffset;
So, if the udf code passed blkno == sector, bio_offset would be
incorrect.
Or maybe g_vfs_strategy() should do some translation here from b_iooffset
to bio_offset, taking bo_bsize into account, so that the actual,
non-adjusted sector number could be passed to bread()?
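
Coming back to step 2.3: the UDF directory code in effect does
something like the following (the variable names are illustrative
rather than a verbatim quote of udf_vnops.c, but the arithmetic is the
point):

/* sector is a physical 2048-byte sector number */
blkno = sector * (udfmp->bsize / DEV_BSIZE);   /* 2048 / 512, i.e. 4 * sector */
error = bread(udfmp->im_devvp, blkno, size, NOCRED, &bp);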

3.1. For a fresh buf, getblk() would assign the following:
bsize = bo->bo_bsize;
offset = blkno * bsize;
...
bp->b_blkno = bp->b_lblkno = blkno;
bp->b_offset = offset; <--- this is where this discussion started
So b_offset of the buffer is 4*sector*2048.
This is the source of the trouble.
3.2. getblk() would set bp->b_flags |= B_VMIO; if the vnode has a vobject,
and our device vnode has one (step 1.3).
3.3. Consequently, allocbuf() will execute the code for the B_VMIO case.
3.4. allocbuf() will look up/allocate pages by index, calculated from the
"base index" of OFF_TO_IDX(bp->b_offset), which is, in essence,
bp->b_offset divided by the page size, 4096. So our "base index" is
2*sector (the substitution is spelled out below).
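
Substituting the numbers from step 2.3 (OFF_TO_IDX is essentially a
division by PAGE_SIZE):

bsize  = bo->bo_bsize;        /* 2048, set by g_vfs_open() in step 1.4 */
offset = blkno * bsize;       /* (4 * sector) * 2048 = 8192 * sector */
base   = OFF_TO_IDX(offset);  /* 8192 * sector / 4096 = 2 * sector */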

4.1. Let's assume we bread() (via the device vnode, as described above)
from physical sector N and we read 6 physical sectors (6*2048 bytes).
4.2. The sectors will get "mapped"/tied/bound to VM pages as follows
(according to the above calculations):
sectors[N, N+1]   -> page[2*N],
sectors[N+2, N+3] -> page[2*N + 1], /* the next page */
sectors[N+4, N+5] -> page[2*N + 2]  /* the next page */
4.3. Now let's consider "the next read" of X sectors, but this time
starting from sector N+1; repeating the calculations we get the
following mapping:
sectors[N+1, N+2] -> page[2*(N+1)] = page[2*N + 2]
But this page already has cached data from sectors[N+4, N+5].
Problem!! (A small userland program demonstrating the collision
follows.)
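
Here is a small userland program that merely repeats the above
arithmetic and prints the colliding page index; it is only a
demonstration of the index calculation, not kernel code:

#include <stdio.h>

#define DEV_BSIZE	512
#define SECTOR_SIZE	2048	/* CD/DVD provider sector size */
#define PAGE_SIZE	4096

/*
 * Page index that a given byte of the read falls into, following the
 * getblk()/allocbuf() arithmetic from section 3.
 */
static long
page_index(long sector, long byte_in_read)
{
	long blkno = sector * (SECTOR_SIZE / DEV_BSIZE);	/* 4 * sector */
	long b_offset = blkno * SECTOR_SIZE;			/* bo_bsize == 2048 */
	return ((b_offset + byte_in_read) / PAGE_SIZE);		/* OFF_TO_IDX */
}

int
main(void)
{
	long N = 1000;	/* arbitrary physical starting sector */

	/* Read 1: 6 sectors starting at N; its last page holds N+4, N+5. */
	printf("read 1: sectors %ld,%ld cached in page %ld\n",
	    N + 4, N + 5, page_index(N, 5 * SECTOR_SIZE));

	/* Read 2: starts at N+1; its first page has the same index. */
	printf("read 2: sectors %ld,%ld wanted in page %ld\n",
	    N + 1, N + 2, page_index(N + 1, 0));
	return (0);
}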

The calculations above show this, and practical testing confirms it.
So this is proof that bread()-ing via a device vnode is broken if:
C1) the vnode was "set up" by g_vfs_open();
C2) the sector size of the underlying GEOM provider is not 512 but some
non-trivial multiple of it;
C3) the bread() size is sufficiently big.

The current UDF directory-reading code is the only place known to me
that meets all three of the above conditions (for sufficiently large
directories, to satisfy condition C3).

So I have stated this; now let the wise speak.

I already have a patch that makes UDF read directories via the directory
vnodes, but the problem in general would remain.
Maybe g_vfs_open() is the correct place to change, maybe g_vfs_strategy()
is the place, maybe something else, maybe "don't bother". I don't know.
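
Just to illustrate the g_vfs_strategy() option I mean (only a sketch of
the idea, not a tested patch), the translation would boil down to
deriving bio_offset from b_blkno and bo_bsize instead of taking
b_iooffset verbatim:

/* hypothetical: interpret b_blkno in bo_bsize units rather than DEV_BSIZE units */
bip->bio_offset = (off_t)bp->b_blkno * bo->bo_bsize;
/* instead of the current: bip->bio_offset = bp->b_iooffset; */

Of course that would change the contract for every filesystem that
already passes blkno in 512-byte units, which is why I am not sure it
is the right direction.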

-- 
Andriy Gapon


