From owner-freebsd-fs@freebsd.org Mon Jan 4 08:30:20 2016
Date: Mon, 4 Jan 2016 19:30:08 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Rick Macklem
cc: "Mikhail T.", freebsd-fs@freebsd.org
Subject: Re: NFS reads vs. writes

On Sun, 3 Jan 2016, Rick Macklem wrote:

> Mikhail T. wrote:
>> On 03.01.2016 02:16, Karli Sjöberg wrote:
>>> The difference between "mount" and "mount -o async" should tell you if
>>> you'd benefit from a separate log device in the pool.
>>>
>> This is not a ZFS problem. The same filesystem is being read in both
>> cases. The same data is being read from and written to the same
>> filesystems. For some reason, it is much faster to read via NFS than to
>> write to it, however.
>>
> This issue isn't new. It showed up when Sun introduced NFS in 1985.

nfs writes are slightly faster than reads in most configurations for me.
This is because writes are easier to stream and most or all configurations
don't do a very good job of trying to stream reads.

> NFSv3 did change things a little, by allowing UNSTABLE writes.

Of course I use async mounts (and ffs) if I want writes to be fast. Both
the server and the client fs should be mounted async. This is most
important for the client.

> Here's what an NFSv3 or NFSv4 client does when writing:

nfs also has a badly designed sysctl, vfs.nfsd.async, which does something
more hackish for nfsv2 and might have undesirable side effects for nfsv3+.
Part of its bad design is that it is global: it affects all clients. This
might be a feature if the clients don't support async mounts. I never use
this.

> - Issues some # of UNSTABLE writes. The server need only have these in
>   server RAM before replying NFS_OK.
> - Then the client does a Commit. At this point the NFS server is required
>   to store all the data written in the above writes and related metadata
>   on stable storage before replying NFS_OK.
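For concreteness, the sequence above maps onto the NFSv3 stability levels
defined in RFC 1813 roughly as in the minimal sketch below. It is an
illustration only: nfs_write_rpc() and nfs_commit_rpc() are hypothetical
stand-ins, not the actual FreeBSD client functions.

/*
 * Minimal sketch of the UNSTABLE-write-then-Commit sequence, using the
 * stability levels from RFC 1813.  Illustrative only: the RPC helpers are
 * hypothetical stand-ins, not the FreeBSD client code.
 */
#include <stdint.h>
#include <stdio.h>

enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/* Pretend RPC stubs; a real client sends these over ONC RPC. */
static uint64_t
nfs_write_rpc(uint64_t off, size_t len, enum stable_how stable)
{
	printf("WRITE  off=%ju len=%zu stable=%d\n", (uintmax_t)off, len,
	    (int)stable);
	return (0x1122334455667788ULL);		/* server's write verifier */
}

static uint64_t
nfs_commit_rpc(uint64_t off, size_t len)
{
	printf("COMMIT off=%ju len=%zu\n", (uintmax_t)off, len);
	return (0x1122334455667788ULL);		/* verifier must match */
}

int
main(void)
{
	size_t wsize = 65536, total = 4 * 65536;
	uint64_t off, wverf = 0, cverf;

	/*
	 * 1. Some number of UNSTABLE writes; the server may keep the data
	 *    in RAM and reply NFS_OK immediately.
	 */
	for (off = 0; off < total; off += wsize)
		wverf = nfs_write_rpc(off, wsize, UNSTABLE);

	/*
	 * 2. Commit: now the server must put the data (and related
	 *    metadata) on stable storage before replying NFS_OK.
	 */
	cverf = nfs_commit_rpc(0, total);

	/*
	 * 3. A changed verifier means the server rebooted and may have lost
	 *    the uncommitted data, so the client must re-send the writes.
	 */
	if (wverf != cverf)
		printf("verifier changed: re-send the writes\n");
	return (0);
}

The point is that the UNSTABLE writes may be acknowledged out of server
RAM; only the Commit, with its verifier check for a rebooted server, forces
the data to stable storage.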
async mounts in the FreeBSD client are implemented by 2 lines of code (and
"async" in the list of supported options) that seem to work by pretending
that UNSTABLE writes are FILESYNC so the Commit step is null. Thus
everything except possibly metadata is async and unstable, but the client
doesn't know this. If the server fs is mounted with inconsistent async
flags, or the async flags give inconsistent policies, some async writes may
turn into sync and vice versa. The worst inconsistencies are with a default
(delayed Commit) client and an async (non-soft-updates) server. Then async
breaks the Commits by writing sync data but still writing async metadata.
My version has partial fixes (it syncs inodes but not directories in
fsync() for async mounts).

> --> This is where the "sync" vs "async" is a big issue. If you use
>     "sync=disabled" (I'm not a ZFS guy, but I think that is what the ZFS
>     option looks like) you *break* the NFS protocol (ie. violate the RFC)
>     and put your data at some risk, but you will typically get better
>     (often much better) write performance.

Is zfs really as broken as ffs with async mounts? It takes ignoring
FSYNC/IO_SYNC flags when mounted async to get full brokenness. async for
ffs was originally a hack to do something like that. I think it now honors
the sync flags for everything except inodes and directories. Syncing
everything is too slow to use, but the delayed Commit should make it
usable, depending on how long the delay is. Perhaps it can interact badly
with the server fs's delays. Something like a pipeline stall on a CPU --
to satisfy a synchronization request for 1 file, it might be necessary to
wait for many MB of i/o for other files first.

> Also, the NFS server was recently tweaked so that it could handle 128K
> rsize/wsize, but the FreeBSD client is limited to MAXBSIZE and this has
> not been increased beyond 64K. To do so, you have to change the value of
> this in the kernel sources

Larger i/o sizes give negative benefits for me. Changes in the default
sizes give confusing performance differences, with larger sizes mostly
worse, but there are too many combinations to test and I never figured out
the details, so I now force small sizes at mount time. This depends on
having a fast network. With a really slow network, the i/o sizes must be
very large or the streaming must be good.

> and rebuild your kernel. (The problem is that increasing MAXBSIZE makes
> the kernel use more KVM for the buffer cache and if a system isn't doing
> significant client side NFS, this is wasted.)
> Someday, I should see if MAXBSIZE can be made a TUNABLE, but I haven't
> done that.
> --> As such, unless you use a Linux NFS client, the reads/writes will be
>     64K, whereas 128K would work better for ZFS.

Not for ffs with 16K-blocks. Clustering usually turns these into 128K-blocks
but nfs clients see little difference and may even work better with
8K-blocks.
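A rough way to redo the read-vs-write comparison behind this thread is a
small test program such as the sketch below, run against a file on the NFS
mount with the i/o size of interest. It is illustrative only: the file
name, transfer size and output are arbitrary, and the read pass should be
done against an uncached file (e.g. after a remount), otherwise it mostly
measures the client's cache.

/*
 * Rough read-vs-write throughput check for a given file and i/o size.
 * Illustrative only; names and sizes are arbitrary.
 * Usage: ./rwtest /mnt/testfile 65536
 */
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define	TOTAL	(256UL * 1024 * 1024)		/* 256 MB per pass */

static double
timed_pass(const char *path, size_t bs, int writing)
{
	struct timeval t0, t1;
	char *buf;
	size_t done;
	ssize_t n;
	int fd;

	if ((buf = malloc(bs)) == NULL)
		err(1, "malloc");
	memset(buf, 'x', bs);
	fd = open(path, writing ? O_WRONLY | O_CREAT | O_TRUNC : O_RDONLY,
	    0644);
	if (fd < 0)
		err(1, "%s", path);
	gettimeofday(&t0, NULL);
	for (done = 0; done < TOTAL; done += bs) {
		n = writing ? write(fd, buf, bs) : read(fd, buf, bs);
		if (n != (ssize_t)bs)
			err(1, "%s", writing ? "write" : "read");
	}
	if (writing)
		fsync(fd);		/* charge the flush to the writes */
	gettimeofday(&t1, NULL);
	close(fd);
	free(buf);
	return ((t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
}

int
main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile";
	size_t bs = argc > 2 ? (size_t)atol(argv[2]) : 65536;

	printf("write: %.1f MB/s\n", TOTAL / timed_pass(path, bs, 1) / 1e6);
	printf("read:  %.1f MB/s\n", TOTAL / timed_pass(path, bs, 0) / 1e6);
	return (0);
}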
Bruce

From owner-freebsd-fs@freebsd.org Mon Jan 4 09:02:11 2016
Date: Mon, 4 Jan 2016 20:02:02 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: "Mikhail T."
cc: Rick Macklem, freebsd-fs@freebsd.org
Subject: Re: NFS reads vs. writes

On Mon, 4 Jan 2016, Mikhail T. wrote:

> On 03.01.2016 20:37, Rick Macklem wrote:
>> This issue isn't new. It showed up when Sun introduced NFS in 1985.
>> NFSv3 did change things a little, by allowing UNSTABLE writes.
> Thank you very much, Rick, for the detailed explanation.
>> If you use "sync=disabled" (I'm not a ZFS guy, but I think that is what
>> the ZFS option looks like) you *break* the NFS protocol (ie. violate the
>> RFC) and put your data at some risk, but you will typically get better
>> (often much better) write performance.
> Yes, indeed. Disabling sync got the writing throughput all the way up to
> about 86Mb/s... I still don't fully understand why local writes are able
> to achieve this speed without async and without being considered
> dangerous.

86 Mbits/S is still slow. Do you mean Mbytes/S? Try fsync() to make the
local writes slow too.

There is considerable confusion between sync, async and neither. "neither"
used to mean to write using the bawrite() ("async" write) function.
"async" means to write _not_ using bawrite(), but using the bdwrite()
("delayed" write) function. Soft updates obfuscate this more. "neither"
with them means to write with more order than bawrite() and with less delay
than with bdwrite(), so that writes are more robust and also faster than
with simple bawrite().

"neither" writes are dangerous in ffs with soft updates only if the system
crashes so that the delayed writes are never done. In zfs they are supposed
to be safe by writing the delayed writes to small fast storage. I forget
what this is named. This is supposed to work without any async hacks too.
Apparently it doesn't. Maybe the nfs Commits are too large.

Bruce
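A minimal way to try the fsync() suggestion above on a local filesystem is
sketched below: it writes the same 64 MB once with delayed writes and a
single final flush, and once with an fsync() after every 64k chunk. The
file name and sizes are arbitrary, illustrative choices.

/*
 * Compare delayed local writes against fsync() after every chunk.
 * Illustrative test only.
 */
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double
fill(const char *path, int sync_each)
{
	static char buf[65536];
	struct timeval t0, t1;
	int fd, i;

	memset(buf, 'x', sizeof(buf));
	if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0)
		err(1, "%s", path);
	gettimeofday(&t0, NULL);
	for (i = 0; i < 1024; i++) {		/* 1024 * 64k = 64 MB */
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			err(1, "write");
		if (sync_each)
			fsync(fd);	/* force each chunk to stable storage */
	}
	fsync(fd);			/* one final flush, like a Commit */
	gettimeofday(&t1, NULL);
	close(fd);
	return ((t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
}

int
main(void)
{
	printf("delayed writes, one flush: %.2f s\n", fill("testfile", 0));
	printf("fsync() every 64k:         %.2f s\n", fill("testfile", 1));
	return (0);
}

The gap between the two runs is the cost of forcing data to stable storage
as it is written, which is roughly what an NFS server has to pay when
writes arrive as FILE_SYNC or are followed promptly by Commits.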