From owner-freebsd-fs@FreeBSD.ORG Tue Feb 12 11:24:40 2008
Message-ID: <47B181F2.2070808@icyb.net.ua>
Date: Tue, 12 Feb 2008 13:24:34 +0200
From: Andriy Gapon <avg@icyb.net.ua>
To: Poul-Henning Kamp
Cc: freebsd-fs@FreeBSD.org, freebsd-hackers@FreeBSD.org, Pav Lucistnik,
 Bruce Evans
References: <3027.1202806417@critter.freebsd.dk>
In-Reply-To: <3027.1202806417@critter.freebsd.dk>
Subject: Re: fs/udf: vm pages "overlap" while reading large dir

on 12/02/2008 10:53 Poul-Henning Kamp said the
following:
> In message <47B0CE54.6090204@icyb.net.ua>, Andriy Gapon writes:
>> on 06/02/2008 18:29 Andriy Gapon said the following:
>>> Small summary of the above long description.
>>> For directory reading fs/udf performs bread() on a (underlying) device
>>> vnode. It passes block number as if block size was 512 bytes (i.e.
>>> byte_offset_within_dev/512).
>
> We have three sizes of relevance here: the sector size of the provider,
> the block size of the filesystem, and the page size of the system.
>
> In general it is adventurous to have any of them be anything but
> powers of two, and it is at best ill-advised and more likely asking
> for trouble to do requests that are not multiples of, and aligned to,
> the sector size of the provider.
>
> So up front, I would say that it is a UDF bug to do 512-byte reads off
> a 2k-sector provider.
>
> Making the buffer cache handle this is possible, but it is not the
> direction we have planned for the buffer cache (i.e. that it should
> become a wrapper for using the VM system, rather than being a
> separately allocated cache).
>
> So if the objective is to make UDF work in the short term, your
> change might work; if the objective is to move FreeBSD's kernel
> architecture forward, the right thing to do is to fix UDF to not
> access subsectors.

Poul,

I agree with what you say, but I think that I didn't properly explain
what the UDF code is doing and why it might be important in general.
Let me try to do it step by step (sorry if I say something too
obvious). And please correct me if I misunderstand something in the
fundamental steps.

1.1. UDF is typically used with CD/DVD media, so the provider's sector
     size is 2048.
1.2. The udf VFS mount method calls g_vfs_open().
1.3. g_vfs_open() creates a vobject for the device vnode.
1.4. g_vfs_open() sets bo_bsize = pp->sectorsize in the device vnode's
     bufobj.
1.5. g_vfs_open() also overrides bo_ops for the bufobj.

2.1. The UDF directory-reading code performs bread() via the device
     vnode. [*]
2.2.
this code passes to bread() a size that is a multiple of 2048.
2.3. this code passes to bread() a blkno that is calculated as
     4*sector, where sector is the number of a physical 2048-byte
     sector. [**]

[*]  - this seems to be an uncommon thing among filesystems.
[**] - I think that this is a requirement of the buffer cache system,
because internally it performs many calculations that seem to assume
that the block size is always 512. E.g. breadn() has the following:

	bp->b_iooffset = dbtob(bp->b_blkno);	<-- effectively multiplies by 512
	bstrategy(bp);

And g_vfs_strategy() has the following:

	bip->bio_offset = bp->b_iooffset;

So, if the udf code passed blkno == sector, then bio_offset would be
incorrect. Or maybe g_vfs_strategy() should do some translation here
from b_iooffset to bio_offset, taking bo_bsize into account, so that
the actual, non-adjusted sector number could be passed to bread()?

3.1. For a fresh buf, getblk() would assign the following:

	bsize = bo->bo_bsize;
	offset = blkno * bsize;
	...
	bp->b_blkno = bp->b_lblkno = blkno;
	bp->b_offset = offset;	<--- this is where this discussion started

     so b_offset of the buffer is 4*sector*2048. This is the source of
     the trouble.
3.2. getblk() would set bp->b_flags |= B_VMIO if the vnode has a
     vobject, and our device vnode has one (step 1.3).
3.3. Consequently, allocbuf() will execute the code for the B_VMIO case.
3.4. allocbuf() will look up/allocate pages by index, calculated from a
     "base index" of OFF_TO_IDX(bp->b_offset), which is, in essence,
     bp->b_offset divided by the page size, 4096. So our "base index"
     is 2*sector.

4.1. Let's assume we bread() (via the device vnode, as described above)
     starting from physical sector N, and we read 6 physical sectors
     (6*2048 bytes).
4.2. The sectors will get "mapped"/tied/bound to VM pages as follows
     (according to the above calculations):

	sectors[N, N+1]     -> page[2*N],
	sectors[N+2, N+3]   -> page[2*N + 1],	/* the next page */
	sectors[N+4, N+5]   -> page[2*N + 2]	/* the next page */

4.3. Now let's consider "the next read" of X sectors, but this time
     starting from sector N+1; repeating the calculations, we get the
     following mapping:

	sectors[(N+1), (N+1)+1] -> page[2*(N+1)] = page[2*N + 2]

     But this page already has cached data from sectors[N+4, N+5].
     Problem! Theoretical calculations show it, and practical testing
     confirms it.

So this is a proof that bread()-ing via a device vnode is broken if:

C1) the vnode was "set up" by g_vfs_open();
C2) the sector size of the underlying GEOM provider is not 512, but
    some non-trivial multiple of it;
C3) the bread() size is sufficiently big.

The current UDF code for directory reading is the only place known to
me that meets all three of the above conditions (for directories large
enough to meet condition C3).

So I have stated this; now let the wise speak. I already have a patch
that makes UDF read directories via the directory vnodes. But the
problem in general would remain. Maybe g_vfs_open() is the correct
place to change, maybe g_vfs_strategy() is the place, maybe something
else, maybe "don't bother". I don't know.

-- 
Andriy Gapon