From owner-freebsd-hackers@FreeBSD.ORG Tue Feb 12 13:59:18 2008 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 10F2016A41A; Tue, 12 Feb 2008 13:59:18 +0000 (UTC) (envelope-from avg@icyb.net.ua) Received: from falcon.cybervisiontech.com (falcon.cybervisiontech.com [217.20.163.9]) by mx1.freebsd.org (Postfix) with ESMTP id 776DB13C4CE; Tue, 12 Feb 2008 13:59:17 +0000 (UTC) (envelope-from avg@icyb.net.ua) Received: from localhost (localhost [127.0.0.1]) by falcon.cybervisiontech.com (Postfix) with ESMTP id 935FF43D123; Tue, 12 Feb 2008 15:59:15 +0200 (EET) X-Virus-Scanned: Debian amavisd-new at falcon.cybervisiontech.com Received: from falcon.cybervisiontech.com ([127.0.0.1]) by localhost (falcon.cybervisiontech.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id UdP0cuSbTNyG; Tue, 12 Feb 2008 15:59:15 +0200 (EET) Received: from [10.2.1.87] (gateway.cybervisiontech.com.ua [88.81.251.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by falcon.cybervisiontech.com (Postfix) with ESMTP id C7A9A43CBDF; Tue, 12 Feb 2008 15:59:14 +0200 (EET) Message-ID: <47B1A631.1000504@icyb.net.ua> Date: Tue, 12 Feb 2008 15:59:13 +0200 From: Andriy Gapon User-Agent: Thunderbird 2.0.0.9 (X11/20080123) MIME-Version: 1.0 To: Bruce Evans References: <4299.1202816854@critter.freebsd.dk> <20080212234018.O92415@delplex.bde.org> In-Reply-To: <20080212234018.O92415@delplex.bde.org> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Poul-Henning Kamp , Pav Lucistnik , Bruce Evans Subject: Re: fs/udf: vm pages "overlap" while reading large dir X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 12 Feb 2008 13:59:18 -0000 on 12/02/2008 15:11 Bruce Evans said the following: > On Tue, 12 Feb 2008, Poul-Henning Kamp wrote: > >> In message <47B181F2.2070808@icyb.net.ua>, Andriy Gapon writes: >> >>> 2.3. this code passes to bread blkno that is calculated as 4*sector, >>> where sector is a number of a physical 2048-byte sector. [**] >>> [**] - I think that this is a requirement of buffcache system, because >>> internally it performs many calculations that seem to assume that block >>> size is always 512. >> Yes, the buf-cache works in 512 bytes units throughout. > > I missed the dbtob() conversions in vfs_bio.c when I replied previously > So blkno cannot be a cookie; it must be for a 512-block. So how did > the cases where bsize != DEV_BSIZE in getblk() ever work? It seems that this is because specific VOP_STRATEGY and/or BO_STRATEGY kicked in and did the right thing, see below. >>> 3.1. for a fresh buf getlbk would assign the following: >>> bsize = bo->bo_bsize; >>> offset = blkno * bsize; >> That's clearly wrong. > > If units were always 512-blocks, then anything except bsize = DEV_BSIZE > would be clearly wrong. Things aren't that simple (but probably should > be). Even RELENG_3 has bsize = f_iosize (possibly != 512) for non-disks. O, I missed this obvious thing! Bruce, thank you for putting me back on the ground :-) Even in UDF case, when we bread() via a file or directory vnode we pass a logical 2048-byte block number (within the file) to bread(). In this case the code in getblk() does the right thing when it calculates offset as blkno * 2048. Calculating it assuming 512-byte units would be a disaster. And the actual reading works correctly because udf_strategy is called for such vnodes and it translates block numbers from physical to logical and also correctly re-calculates b_iooffset for actual reading. So b_iooffset value set in breadn() (using bdtob) is completely ignored. I remember that this is why g_vfs_* was my primary target. It seems that I revert my opinion back: either g_vfs_open should be smart about setting bo_bsize, or g_vfs_strategy should be smart about calculating bio_offset, or individual filesystems should not depend on g_vfs_* always doing the best thing for them. In any case, it seems that it is the g_vfs_* that is currently inconsistent: it sets bo_bsize to a somewhat arbitrary value, but expects that it would always be provided with correct b_iooffset (the latter being rigidly calculated via bdtob() in buffcache code). So this leaves filesystems dead in the water while reading via device vnode: provide blkno in 512-byte units - screw up VM cache, provide blkno in bo_bsize units - screw up reading from disk. Not sure if the FS-s should have wits to set bo_bsize properly and/or provide proper bo_ops - we are talking about a device vnode here. Should filesystems be involved in the business of setting its properties? Not sure. But it seems that there are many possibilities for "fixups" and various filesystems [have to] do stuff in their unique ways (*_STRATEGY, etc). -- Andriy Gapon