Date: Wed, 12 Jun 2013 11:48:11 +1000 (EST)
From: Bruce Evans
To: "Kenneth D. Merry"
Cc: freebsd-fs@FreeBSD.org
Subject: Re: An order of magnitude higher IOPS needed with ZFS than UFS
Message-ID: <20130612104903.A1146@besplex.bde.org>
In-Reply-To: <20130611232124.GA42577@nargothrond.kdm.org>
References: <51B79023.5020109@fsn.hu>
 <253074981.119060.1370985609747.JavaMail.root@erie.cs.uoguelph.ca>
 <20130611232124.GA42577@nargothrond.kdm.org>

On Tue, 11 Jun 2013, Kenneth D. Merry wrote:

> On Tue, Jun 11, 2013 at 17:20:09 -0400, Rick Macklem wrote:
>> Attila Nagy wrote:
>>> ...
>>> I've seen a lot of cases where ZFS required more memory and CPU (and
>>> even IO) to handle the same load, but they were nowhere this bad
>>> (often a 10x increase).
>>>
>>> Any ideas?
>>>
>> ken@ recently committed a change to the new NFS server to add file
>> handle affinity support to it.  He reported that he had found that,
>> without file handle affinity, ZFS's sequential reading heuristic
>> broke badly (or something like that; you can probably find the email
>> thread, or maybe he will chime in).
>
> That is correct.  The problem, if the I/O is sequential, is that
> simultaneous requests for adjacent blocks in a file will get farmed
> out to different threads in the NFS server.  These can easily go down
> into ZFS out of order, and make the ZFS prefetch code think that the
> file is not being read sequentially.  It blows away the zfetch stream,
> and you wind up with a lot of I/O bandwidth getting used (with a lot
> of prefetching done and then re-done), but not much performance.

I saw the nfsds getting in each other's way when debugging nfs write
slowness some time ago.  I used the "fix" of running only 1 nfsd.  This
worked fine on a lightly loaded nfs server and client whose other uses
combined were nowhere near as heavy as the write benchmark.  With this
and some other changes that are supposed to be in -current now, the
write performance for large files was close to the drive's maximum, but
reads were at best 75% of the maximum.  Maybe FHA fixes the read case.

More recently, I noticed that vfs clustering works poorly partly because
it has too many sequentiality pointers, yet not the right ones.
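To make the rest of this concrete, here is a toy model of the kind of
offset-based detector that both zfetch and our vfs code use.  The names
(struct seqdet, seqdet_access()) are invented for illustration; this is
not code from either tree.

#include <sys/types.h>

/*
 * Toy offset-based sequential detector; all names are invented.
 */
struct seqdet {
        off_t   sd_nextoff;     /* offset expected if i/o is sequential */
        int     sd_streak;      /* consecutive sequential accesses seen */
};

static void
seqdet_access(struct seqdet *sd, off_t offset, size_t len)
{

        if (offset == sd->sd_nextoff)
                sd->sd_streak++;        /* looks sequential: ramp up read-ahead */
        else
                sd->sd_streak = 0;      /* looks random: throw the stream away */
        sd->sd_nextoff = offset + len;
}

A single reader issuing 0, 64K, 128K, ... keeps the streak growing.
With several nfsds racing, the same requests can reach the detector as
64K, 0, 192K, 128K, ..., so almost every access resets the streak and no
prefetch stream survives, even though the client read the file strictly
sequentially.  Our per-open-file heuristic is the same kind of thing.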
There is a pointer (fp->f_nextoff and fp->f_seqcount) for the sequential
heuristic at the struct file level.  This is shared between reads and
writes, so mixing reads, writes and seeks breaks the heuristic for both
the reads and the writes when the seeks are only there to get back into
position after the previous write (bonnie's rewrite benchmark does
this).  The seeks mean that the i/o is not really sequential, although
it is sequential for the read part and for the write part separately.
FreeBSD only tries to guess whether these parts are sequential per-file.
Mixed reads and writes on the same file shouldn't affect the guess any
more than non-mixed reads or writes on different files do, or than mixed
reads and writes on the same file do when the kernel itself reads to
fill buffers before partial writes.

However, at a lower level the only seeks that matter are physical ones.
The per-file pointers should be combined somehow to predict and minimize
the physical seeks, but nothing is done.  The kernel's read-before-write
does significant physical seeks, but since everything happens below the
file level the per-file pointer is not clobbered, so purely sequential
writes are still guessed to be sequential although they aren't really.

There is also a pointer (vp->v_lastw and vp->v_lasta) for cluster
writes.  This is closer to the physical disk pointer that is needed, but
since it is per-vnode it shares a fundamental design error with the
buffer cache (buffer cache code wants to access one vnode at a time,
while vnode data and metadata may be very non-sequential).  vnodes are
below the file level, so this pointer gets clobbered by writes (but not
reads) on separate open files.  The clobbering keeps the vnode pointer
closer to the physical disk pointer if and only if all accesses are to
the same vnode.

I think it mostly helps not to track per-file sequentiality for writes,
but the per-file sequentiality guess is used for cluster writing too.
The 2 types of sequentiality combine in a confusing way even if there is
only 1 writer (but a reader on the same file).  Write accesses are then
sequential from the point of view of the vnode pointer, but random from
the point of view of the file pointer, since only the latter is
clobbered by the intermediate reads.  As mentioned above, bonnie's
atypical i/o pattern clobbers the file pointer, but the kernel's more
typical i/o pattern for read-before-write doesn't.  I first thought that
clobbering the pointer was a bug, but now I think it is a feature.  The
i/o really is non-sequential.

Basing most i/o sequentiality guesses on a single per-disk pointer
(shared across different partitions on the same disk) might work better
than all the separate pointers.  Accesses that are sequential at the
file level would only be considered sequential if no other physical
accesses intervene.  After getting that right, use the sequentiality
guesses again to delay some physical accesses if they would intervene.
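Here is a rough sketch of that single-pointer idea.  Everything in it
(struct diskseq, diskseq_access()) is invented for illustration; nothing
like it exists in the tree.

#include <sys/types.h>

/*
 * One pointer per disk, shared by all partitions, vnodes and open
 * files on it; all names are invented.
 */
struct diskseq {
        daddr_t ds_nextblk;     /* next physical block if i/o stays sequential */
};

/*
 * Called for every physical access to the disk.  Returns nonzero only
 * if no other physical access intervened since the previous one in
 * this stream, which is the only kind of sequentiality the drive
 * actually cares about.
 */
static int
diskseq_access(struct diskseq *ds, daddr_t blkno, int nblks)
{
        int sequential;

        sequential = (blkno == ds->ds_nextblk);
        ds->ds_nextblk = blkno + nblks;
        return (sequential);
}

Reads on one file, writes on another and the kernel's read-before-write
would then all update the same view of the disk position, so the guess
would track physical seeks instead of per-file bookkeeping.  The second
step (delaying physical accesses that would intervene) would sit on top
of this.

Bruce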