From: Andriy Gapon <avg@freebsd.org>
Date: Thu, 04 Nov 2010 18:56:10 +0200
To: Kostik Belousov, Pawel Jakub Dawidek, Alan Cox
Cc: freebsd-fs@freebsd.org
Subject: Re: vop_getpages for zfs

on 10/10/2010 16:47 Kostik Belousov said the following:
> I think that readahead requests should be satisfied, if possible. Esp.
> in the case the readahead does not cause additional i/o.
So, per Kostik's suggestion, here is a patch for your review and testing:

http://people.freebsd.org/~avg/zfs_getpages.2.diff

The patch tries to fill as many of the passed-in pages as possible, as long as that can be done from a block (or even two blocks) that has to be read anyway to fill the requested page. This means that the optimization kicks in only if a vnode has a block size greater than the page size.

I mentioned "two blocks" above because I tried to optimize the case where, for example, the block size is 6KB, the page size is 4KB, and the requested page covers parts of two adjacent blocks (e.g. the range from 4KB to 8KB). I am not sure if this was worth the trouble. In other words, the code tries to avoid reading more blocks than strictly required, because additional blocks may be far apart on a disk and thus read latency could increase.

You will also notice that I added a vop_bmap method to ZFS vops. This is needed to satisfy the logic in vnode_pager_haspage(); otherwise, for example, vm_fault() would never request any additional pages besides the required one. The zfs_bmap() implementation is really simple: it returns zeros for a_runp and a_runb (and some reasonable-ish but unused values in the other fields), and vnode_pager_haspage() figures out the number of eligible pages from the block size (f_iosize actually [*]) and the page size.

Now that ZFS has a vop_getpages method, it is safe to provide this dummy-ish implementation of vop_bmap, because this operation is not called from any place other than vnode_pager_haspage(). The code in vfs_bio.c and vfs_cluster.c is never called for ZFS, because ZFS does not use the buffer cache layer.

[*] There are a few interesting questions with respect to f_iosize usage in vnode_pager_haspage() and ZFS. For one, ZFS has a variable block size, and f_iosize for ZFS is the currently configured maximum block ('record' in ZFS terms) size.
This should not result in non-optimal behavior for files with smaller block sizes, because those files would have correspondingly smaller sizes too (within the block size). Unless, of course, the recordsize property is modified after some files are created on a ZFS filesystem. In that case there could be large files with a smaller (previously maximum) block size, or files with a block size larger than the current recordsize == f_iosize. Such situations may lead to sub-optimal performance.

Another issue related to the above is that f_iosize for ZFS can change on the fly when an administrator runs zfs set recordsize=NNN, and that may happen concurrently with vnode_pager_haspage() using f_iosize for its calculations. I am not sure how to handle this correctly.

I will appreciate your comments, test reports, reviews and suggestions. Thanks a lot!

-- 
Andriy Gapon