Date: Wed, 12 Jun 2013 11:48:11 +1000 (EST)
From: Bruce Evans
To: "Kenneth D. Merry"
Cc: freebsd-fs@FreeBSD.org
Subject: Re: An order of magnitude higher IOPS needed with ZFS than UFS
Message-ID: <20130612104903.A1146@besplex.bde.org>
In-Reply-To: <20130611232124.GA42577@nargothrond.kdm.org>
References: <51B79023.5020109@fsn.hu>
 <253074981.119060.1370985609747.JavaMail.root@erie.cs.uoguelph.ca>
 <20130611232124.GA42577@nargothrond.kdm.org>

On Tue, 11 Jun 2013, Kenneth D. Merry wrote:

> On Tue, Jun 11, 2013 at 17:20:09 -0400, Rick Macklem wrote:
>> Attila Nagy wrote:
>>> ...
>>> I've seen a lot of cases where ZFS required more memory and CPU (and
>>> even IO) to handle the same load, but they were nowhere this bad
>>> (often a 10x increase).
>>>
>>> Any ideas?
>>>
>> ken@ recently committed a change to the new NFS server to add file
>> handle affinity support to it.  He reported that he had found that,
>> without file handle affinity, ZFS's sequential reading heuristic
>> broke badly (or something like that; you can probably find the email
>> thread, or maybe he will chime in).
>
> That is correct.  The problem, if the I/O is sequential, is that
> simultaneous requests for adjacent blocks in a file will get farmed
> out to different threads in the NFS server.  These can easily go down
> into ZFS out of order, and make the ZFS prefetch code think that the
> file is not being read sequentially.  It blows away the zfetch stream,
> and you wind up with a lot of I/O bandwidth getting used (with a lot
> of prefetching done and then re-done), but not much performance.

I saw the nfsds getting in each other's way when debugging nfs write
slowness some time ago.  I used the "fix" of running only 1 nfsd.  This
worked fine on a lightly loaded nfs server and client whose other uses
combined were nowhere near as heavy as the write benchmark.  With this
and some other changes that are supposed to be in -current now, the
write performance for large files was close to the drive's maximum, but
reads were at best 75% of the maximum.  Maybe FHA fixes the read case.

More recently, I noticed that vfs clustering works poorly partly because
it has too many sequentiality pointers, yet not the right ones.
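To make the rest of this concrete, here is a toy model of the kind of
offset-based detector that both zfetch and our vfs code use.  The names
(struct seqdet, seqdet_access()) are invented for illustration; this is
not code from either tree.

#include <sys/types.h>

/*
 * Toy offset-based sequential detector; all names are invented.
 */
struct seqdet {
        off_t   sd_nextoff;     /* offset expected if i/o is sequential */
        int     sd_streak;      /* consecutive sequential accesses seen */
};

static void
seqdet_access(struct seqdet *sd, off_t offset, size_t len)
{

        if (offset == sd->sd_nextoff)
                sd->sd_streak++;        /* looks sequential: ramp up read-ahead */
        else
                sd->sd_streak = 0;      /* looks random: throw the stream away */
        sd->sd_nextoff = offset + len;
}

A single reader issuing 0, 64K, 128K, ... keeps the streak growing.
With several nfsds racing, the same requests can reach the detector as
64K, 0, 192K, 128K, ..., so almost every access resets the streak and no
prefetch stream survives, even though the client read the file strictly
sequentially.  Our per-open-file heuristic is the same kind of thing.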
There is a pointer (fp->f_nextoff and fp->f_seqcount) for the sequential
heuristic at the struct file level.  This is shared between reads and
writes, so mixing reads, writes and seeks breaks the heuristic for both
the reads and the writes when the seeks are only there to get back into
position after the previous write (bonnie's rewrite benchmark does
this).  The seeks mean that the i/o is not really sequential, although
it is sequential for the read part and for the write part separately.
FreeBSD only tries to guess whether these parts are sequential per-file.
Mixed reads and writes on the same file shouldn't affect the guess any
more than non-mixed reads or writes on different files do, or than mixed
reads and writes on the same file do when the kernel itself reads to
fill buffers before partial writes.

However, at a lower level the only seeks that matter are physical ones.
The per-file pointers should be combined somehow to predict and minimize
the physical seeks, but nothing is done.  The kernel's read-before-write
does significant physical seeks, but since everything happens below the
file level the per-file pointer is not clobbered, so purely sequential
writes are still guessed to be sequential although they aren't really.

There is also a pointer (vp->v_lastw and vp->v_lasta) for cluster
writes.  This is closer to the physical disk pointer that is needed, but
since it is per-vnode it shares a fundamental design error with the
buffer cache (buffer cache code wants to access one vnode at a time,
while vnode data and metadata may be very non-sequential).  vnodes are
below the file level, so this pointer gets clobbered by writes (but not
reads) on separate open files.  The clobbering keeps the vnode pointer
closer to the physical disk pointer if and only if all accesses are to
the same vnode.

I think it mostly helps not to track per-file sequentiality for writes,
but the per-file sequentiality guess is used for cluster writing too.
The 2 types of sequentiality combine in a confusing way even if there is
only 1 writer (but a reader on the same file).  Write accesses are then
sequential from the point of view of the vnode pointer, but random from
the point of view of the file pointer, since only the latter is
clobbered by the intermediate reads.  As mentioned above, bonnie's
atypical i/o pattern clobbers the file pointer, but the kernel's more
typical i/o pattern for read-before-write doesn't.  I first thought that
clobbering the pointer was a bug, but now I think it is a feature.  The
i/o really is non-sequential.

Basing most i/o sequentiality guesses on a single per-disk pointer
(shared across different partitions on the same disk) might work better
than all the separate pointers.  Accesses that are sequential at the
file level would only be considered sequential if no other physical
accesses intervene.  After getting that right, use the sequentiality
guesses again to delay some physical accesses if they would intervene.
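Here is a rough sketch of that single-pointer idea.  Everything in it
(struct diskseq, diskseq_access()) is invented for illustration; nothing
like it exists in the tree.

#include <sys/types.h>

/*
 * One pointer per disk, shared by all partitions, vnodes and open
 * files on it; all names are invented.
 */
struct diskseq {
        daddr_t ds_nextblk;     /* next physical block if i/o stays sequential */
};

/*
 * Called for every physical access to the disk.  Returns nonzero only
 * if no other physical access intervened since the previous one in
 * this stream, which is the only kind of sequentiality the drive
 * actually cares about.
 */
static int
diskseq_access(struct diskseq *ds, daddr_t blkno, int nblks)
{
        int sequential;

        sequential = (blkno == ds->ds_nextblk);
        ds->ds_nextblk = blkno + nblks;
        return (sequential);
}

Reads on one file, writes on another and the kernel's read-before-write
would then all update the same view of the disk position, so the guess
would track physical seeks instead of per-file bookkeeping.  The second
step (delaying physical accesses that would intervene) would sit on top
of this.

Bruce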