Date: Tue, 12 Feb 2008 13:24:34 +0200
From: Andriy Gapon <avg@icyb.net.ua>
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
Cc: freebsd-fs@FreeBSD.org, freebsd-hackers@FreeBSD.org,
    Scott Long <scottl@samsco.org>, Pav Lucistnik <pav@FreeBSD.org>,
    Bruce Evans <bde@zeta.org.au>
Subject: Re: fs/udf: vm pages "overlap" while reading large dir
Message-ID: <47B181F2.2070808@icyb.net.ua>
In-Reply-To: <3027.1202806417@critter.freebsd.dk>
References: <3027.1202806417@critter.freebsd.dk>
on 12/02/2008 10:53 Poul-Henning Kamp said the following:
> In message <47B0CE54.6090204@icyb.net.ua>, Andriy Gapon writes:
>> on 06/02/2008 18:29 Andriy Gapon said the following:
>>> Small summary of the above long description.
>>> For directory reading fs/udf performs bread() on a (underlying) device
>>> vnode. It passes block number as if block size was 512 bytes (i.e.
>>> byte_offset_within_dev/512).
>
> We have three sizes of relevance here, the sectorsize of the provider,
> the blocksize of the filesystem and the page size of the system.
>
> In general it is adventurous to have any of them be anything but
> powers of two, and it is at best ill-adviced and more likely asking
> for trouble to do requests that are not multiple of and aligned to
> the sectorsize of the provider.
>
> So up front, I would say that it is an UDF bug to do 512 byte reads off
> a 2k sector provider.
>
> Making the buffer cache handle this is possible, but it is not the
> direction we have planned for the buffercache (ie: that it should
> become a wrapper for using the VM system, rather than being a
> separately allocated cache).
>
> So if the objective is to make UDF work in the short term, your
> change might work, if the objective is to move FreeBSD's kernel
> architecture forward, the right thing to do is to fix UDF to not
> access subsectors.

Poul,

I agree with what you say, but I don't think I properly explained what
the UDF code is doing and why it might be important in general. Let me
try to do it step by step (sorry if I state something too obvious), and
please correct me if I misunderstand any of the fundamental steps.

1.1. UDF is typically used with CD/DVD media, so the provider's sector
size is 2048.
1.2. The udf vfs mount method calls g_vfs_open.
1.3. g_vfs_open creates a vobject for the device vnode.
1.4. g_vfs_open sets bo_bsize = pp->sectorsize in the device vnode's bufobj.
1.5. g_vfs_open also overrides bo_ops for the bufobj.

2.1. The UDF directory reading code performs bread() via the device vnode. [*]
2.2. This code passes to bread a size that is a multiple of 2048.
2.3. This code passes to bread a blkno that is calculated as 4*sector,
where sector is the number of a physical 2048-byte sector. [**]

[*] - this seems to be an uncommon thing among filesystems.
[**] - I think that this is a requirement of the buffer cache system,
because internally it performs many calculations that seem to assume
that the block size is always 512. E.g. breadn() has the following:
	bp->b_iooffset = dbtob(bp->b_blkno); <-- effectively multiplies by 512
	bstrategy(bp);
And g_vfs_strategy has the following:
	bip->bio_offset = bp->b_iooffset;
So if the udf code passed blkno == sector, bio_offset would be
incorrect. Or maybe g_vfs_strategy should do some translation here from
b_iooffset to bio_offset, taking bo_bsize into account, so that the
actual, non-adjusted sector number could be passed to bread()?

3.1. For a fresh buf, getblk would assign the following:
	bsize = bo->bo_bsize;
	offset = blkno * bsize;
	...
	bp->b_blkno = bp->b_lblkno = blkno;
	bp->b_offset = offset; <--- this is where this discussion started
so b_offset of the buffer is 4*sector*2048. This is the source of the
trouble.
3.2. getblk would set bp->b_flags |= B_VMIO; if the vnode has a vobject,
and our device vnode has one (step 1.3).
3.3. Consequently, allocbuf will execute the code for the B_VMIO case.
3.4. allocbuf will look up/allocate pages by an index calculated from
the "base index" of OFF_TO_IDX(bp->b_offset), which is, in essence,
bp->b_offset divided by the page size, 4096.

So our "base index" is 2*sector.
So our "base index" is 2*sector. 4.1. Let's assume we bread() (via the device vnode, as described above) from physical sector N and we read 6 physical sectors (6*2048 bytes). 4.2 Sectors will get "mapped"/tied/bound to VM pages as follows (according to the above calculations): sectors[N, N+1] -> page[2*N], sectors[N+2, N+3] -> page[2*N + 1], /* the next page */ sectors[N+4, N+5] -> page[2*N + 2] /* the next page */ 4.5 Now lets consider "the next read" of X sectors but now starting from sector N+1; repeating the calculations we get the following mapping: sectors[(N+1), (N+1) + 1] -> page[2*(N+1)] = page[2N +2] But this page already has cached data from sectors[N+4, N+5]. Problem!! Theoretical calculations show it and practical testing confirms that. So this is a proof that bread()-ing via a device vnode is broken if: C1) the vnode was "set up" by g_vfs_open(); C2) sector size of underlying geom provider is not 512, but any non-trivial multiple of it; C3) bread size is sufficiently big; Current UDF code for directory reading is the only known to me place that meets all the 3 above conditions (for sufficiently large directories to meet condition C3). So I stated this, now let the wise speak. I already have a patch that makes UDF read directories via the directory vnodes. But the problem in general would remain. Maybe g_vfs_open is a correct place to change, maybe g_vfs_strategy is the place, maybe something, maybe "don't bother". I don't know. -- Andriy Gapon