From: Andriy Gapon <avg@freebsd.org>
Date: Thu, 04 Nov 2010 18:56:10 +0200
To: Kostik Belousov, Pawel Jakub Dawidek, Alan Cox
Cc: freebsd-fs@freebsd.org
Subject: Re: vop_getpages for zfs

on 10/10/2010 16:47 Kostik Belousov said the following:
> I think that readahead requests should be satisfied, if possible. Esp.
> in the case the readahead does not cause additional i/o.
So, per Kostik's suggestion, here is a patch for your review and testing:

http://people.freebsd.org/~avg/zfs_getpages.2.diff

The patch tries to fill as many of the passed-in pages as possible, as long as that can be done from a block (or even two blocks) that has to be read anyway to fill the requested page. This means that the optimization kicks in only if a vnode has a block size greater than the page size.

I mentioned "two blocks" above because I tried to optimize the case where, for example, the block size is 6KB, the page size is 4KB, and the requested page covers parts of two adjacent blocks (e.g. the range from 4KB to 8KB). I am not sure if this was worth the trouble. In other words, the code tries to avoid reading more blocks than strictly required, because additional blocks may be far apart on a disk and thus read latency could increase.

You will also notice that I added a vop_bmap method to ZFS vops. This is needed to satisfy the logic in vnode_pager_haspage(); otherwise, for example, vm_fault() would never request any additional pages besides the required one. The zfs_bmap() implementation is really simple: it returns zeros for a_runp and a_runb (and some reasonable-ish but unused values in the other fields), and vnode_pager_haspage() figures out the number of eligible pages from the block size (f_iosize actually [*]) and the page size.

Now that ZFS has a vop_getpages method, it is safe to provide this dummy-ish implementation of vop_bmap, because this operation is not called from any place other than vnode_pager_haspage(). The code in vfs_bio.c and vfs_cluster.c is never called for ZFS, because ZFS does not use the buffer cache layer.

[*] There are a few interesting questions with respect to f_iosize usage in vnode_pager_haspage() and ZFS. For one, ZFS has a variable block size, and f_iosize for ZFS is the currently configured maximum block ('record' in ZFS terms) size.
This should not result in non-optimal behavior for files with smaller block sizes, because those files would have correspondingly smaller sizes too (within the block size). Unless, of course, the recordsize property is modified after some files are created on a ZFS filesystem. In that case there could be large files with a smaller (previously maximum) block size, or files with a block size larger than the current recordsize == f_iosize. Such situations may lead to sub-optimal performance.

Another issue related to the above is that f_iosize for ZFS can change on the fly when an administrator runs zfs set recordsize=NNN, and that may happen concurrently with vnode_pager_haspage() using f_iosize for its calculations. I am not sure how to handle this correctly.

I will appreciate your comments, test reports, reviews and suggestions. Thanks a lot!

-- 
Andriy Gapon