From owner-freebsd-fs@freebsd.org Mon Jan 4 08:30:20 2016
Date: Mon, 4 Jan 2016 19:30:08 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Rick Macklem
cc: "Mikhail T.", freebsd-fs@freebsd.org
Subject: Re: NFS reads vs. writes

On Sun, 3 Jan 2016, Rick Macklem wrote:

> Mikhail T. wrote:
>> On 03.01.2016 02:16, Karli Sjöberg wrote:
>>> The difference between "mount" and "mount -o async" should tell you if
>>> you'd benefit from a separate log device in the pool.
>>>
>> This is not a ZFS problem. The same filesystem is being read in both
>> cases. The same data is being read from and written to the same
>> filesystems. For some reason, it is much faster to read via NFS than to
>> write to it, however.
>>
> This issue isn't new. It showed up when Sun introduced NFS in 1985.

nfs writes are slightly faster than reads in most configurations for me.
This is because writes are easier to stream and most or all configurations
don't do a very good job of trying to stream reads.

> NFSv3 did change things a little, by allowing UNSTABLE writes.

Of course I use async mounts (and ffs) if I want writes to be fast. Both
the server and the client fs should be mounted async. This is most
important for the client.

> Here's what an NFSv3 or NFSv4 client does when writing:

nfs also has a badly designed sysctl, vfs.nfsd.async, which does something
more hackish for nfsv2 and might have undesirable side effects for nfsv3+.
Part of its bad design is that it is global: it affects all clients. This
might be a feature if the clients don't support async mounts. I never use
this.

> - Issues some # of UNSTABLE writes. The server need only have these in
>   server RAM before replying NFS_OK.
> - Then the client does a Commit. At this point the NFS server is required
>   to store all the data written in the above writes and related metadata
>   on stable storage before replying NFS_OK.
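For concreteness, the sequence above maps onto the NFSv3 stability levels
defined in RFC 1813 roughly as in the minimal sketch below. It is an
illustration only: nfs_write_rpc() and nfs_commit_rpc() are hypothetical
stand-ins, not the actual FreeBSD client functions.

/*
 * Minimal sketch of the UNSTABLE-write-then-Commit sequence, using the
 * stability levels from RFC 1813.  Illustrative only: the RPC helpers are
 * hypothetical stand-ins, not the FreeBSD client code.
 */
#include <stdint.h>
#include <stdio.h>

enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/* Pretend RPC stubs; a real client sends these over ONC RPC. */
static uint64_t
nfs_write_rpc(uint64_t off, size_t len, enum stable_how stable)
{
	printf("WRITE  off=%ju len=%zu stable=%d\n", (uintmax_t)off, len,
	    (int)stable);
	return (0x1122334455667788ULL);		/* server's write verifier */
}

static uint64_t
nfs_commit_rpc(uint64_t off, size_t len)
{
	printf("COMMIT off=%ju len=%zu\n", (uintmax_t)off, len);
	return (0x1122334455667788ULL);		/* verifier must match */
}

int
main(void)
{
	size_t wsize = 65536, total = 4 * 65536;
	uint64_t off, wverf = 0, cverf;

	/*
	 * 1. Some number of UNSTABLE writes; the server may keep the data
	 *    in RAM and reply NFS_OK immediately.
	 */
	for (off = 0; off < total; off += wsize)
		wverf = nfs_write_rpc(off, wsize, UNSTABLE);

	/*
	 * 2. Commit: now the server must put the data (and related
	 *    metadata) on stable storage before replying NFS_OK.
	 */
	cverf = nfs_commit_rpc(0, total);

	/*
	 * 3. A changed verifier means the server rebooted and may have lost
	 *    the uncommitted data, so the client must re-send the writes.
	 */
	if (wverf != cverf)
		printf("verifier changed: re-send the writes\n");
	return (0);
}

The point is that the UNSTABLE writes may be acknowledged out of server
RAM; only the Commit, with its verifier check for a rebooted server, forces
the data to stable storage.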
async mounts in the FreeBSD client are implemented by 2 lines of code (and
"async" in the list of supported options) that seem to work by pretending
that UNSTABLE writes are FILESYNC so the Commit step is null. Thus
everything except possibly metadata is async and unstable, but the client
doesn't know this. If the server fs is mounted with inconsistent async
flags, or the async flags give inconsistent policies, some async writes may
turn into sync and vice versa. The worst inconsistencies are with a default
(delayed Commit) client and an async (non-soft-updates) server. Then async
breaks the Commits by writing sync data but still writing async metadata.
My version has partial fixes (it syncs inodes but not directories in
fsync() for async mounts).

> --> This is where the "sync" vs "async" is a big issue. If you use
>     "sync=disabled" (I'm not a ZFS guy, but I think that is what the ZFS
>     option looks like) you *break* the NFS protocol (ie. violate the RFC)
>     and put your data at some risk, but you will typically get better
>     (often much better) write performance.

Is zfs really as broken as ffs with async mounts? It takes ignoring
FSYNC/IO_SYNC flags when mounted async to get full brokenness. async for
ffs was originally a hack to do something like that. I think it now honors
the sync flags for everything except inodes and directories. Syncing
everything is too slow to use, but the delayed Commit should make it
usable, depending on how long the delay is. Perhaps it can interact badly
with the server fs's delays. Something like a pipeline stall on a CPU --
to satisfy a synchronization request for 1 file, it might be necessary to
wait for many MB of i/o for other files first.

> Also, the NFS server was recently tweaked so that it could handle 128K
> rsize/wsize, but the FreeBSD client is limited to MAXBSIZE and this has
> not been increased beyond 64K. To do so, you have to change the value of
> this in the kernel sources

Larger i/o sizes give negative benefits for me. Changes in the default
sizes give confusing performance differences, with larger sizes mostly
worse, but there are too many combinations to test and I never figured out
the details, so I now force small sizes at mount time. This depends on
having a fast network. With a really slow network, the i/o sizes must be
very large or the streaming must be good.

> and rebuild your kernel. (The problem is that increasing MAXBSIZE makes
> the kernel use more KVM for the buffer cache and if a system isn't doing
> significant client side NFS, this is wasted.)
> Someday, I should see if MAXBSIZE can be made a TUNABLE, but I haven't
> done that.
> --> As such, unless you use a Linux NFS client, the reads/writes will be
>     64K, whereas 128K would work better for ZFS.

Not for ffs with 16K-blocks. Clustering usually turns these into 128K-blocks
but nfs clients see little difference and may even work better with
8K-blocks.
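A rough way to redo the read-vs-write comparison behind this thread is a
small test program such as the sketch below, run against a file on the NFS
mount with the i/o size of interest. It is illustrative only: the file
name, transfer size and output are arbitrary, and the read pass should be
done against an uncached file (e.g. after a remount), otherwise it mostly
measures the client's cache.

/*
 * Rough read-vs-write throughput check for a given file and i/o size.
 * Illustrative only; names and sizes are arbitrary.
 * Usage: ./rwtest /mnt/testfile 65536
 */
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define	TOTAL	(256UL * 1024 * 1024)		/* 256 MB per pass */

static double
timed_pass(const char *path, size_t bs, int writing)
{
	struct timeval t0, t1;
	char *buf;
	size_t done;
	ssize_t n;
	int fd;

	if ((buf = malloc(bs)) == NULL)
		err(1, "malloc");
	memset(buf, 'x', bs);
	fd = open(path, writing ? O_WRONLY | O_CREAT | O_TRUNC : O_RDONLY,
	    0644);
	if (fd < 0)
		err(1, "%s", path);
	gettimeofday(&t0, NULL);
	for (done = 0; done < TOTAL; done += bs) {
		n = writing ? write(fd, buf, bs) : read(fd, buf, bs);
		if (n != (ssize_t)bs)
			err(1, "%s", writing ? "write" : "read");
	}
	if (writing)
		fsync(fd);		/* charge the flush to the writes */
	gettimeofday(&t1, NULL);
	close(fd);
	free(buf);
	return ((t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
}

int
main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile";
	size_t bs = argc > 2 ? (size_t)atol(argv[2]) : 65536;

	printf("write: %.1f MB/s\n", TOTAL / timed_pass(path, bs, 1) / 1e6);
	printf("read:  %.1f MB/s\n", TOTAL / timed_pass(path, bs, 0) / 1e6);
	return (0);
}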
Bruce

From owner-freebsd-fs@freebsd.org Mon Jan 4 09:02:11 2016
Date: Mon, 4 Jan 2016 20:02:02 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: "Mikhail T."
cc: Rick Macklem, freebsd-fs@freebsd.org
Subject: Re: NFS reads vs. writes

On Mon, 4 Jan 2016, Mikhail T. wrote:

> On 03.01.2016 20:37, Rick Macklem wrote:
>> This issue isn't new. It showed up when Sun introduced NFS in 1985.
>> NFSv3 did change things a little, by allowing UNSTABLE writes.
> Thank you very much, Rick, for the detailed explanation.
>> If you use "sync=disabled" (I'm not a ZFS guy, but I think that is what
>> the ZFS option looks like) you *break* the NFS protocol (ie. violate the
>> RFC) and put your data at some risk, but you will typically get better
>> (often much better) write performance.
> Yes, indeed. Disabling sync got the writing throughput all the way up to
> about 86Mb/s... I still don't fully understand why local writes are able
> to achieve this speed without async and without being considered
> dangerous.

86 Mbits/S is still slow. Do you mean Mbytes/S? Try fsync() to make the
local writes slow too.

There is considerable confusion between sync, async and neither. "neither"
used to mean to write using the bawrite() ("async" write) function.
"async" means to write _not_ using bawrite(), but using the bdwrite()
("delayed" write) function. Soft updates obfuscate this more. "neither"
with them means to write with more order than bawrite() and with less delay
than with bdwrite(), so that writes are more robust and also faster than
with simple bawrite().

"neither" writes are dangerous in ffs with soft updates only if the system
crashes so that the delayed writes are never done. In zfs they are supposed
to be safe by writing the delayed writes to small fast storage. I forget
what this is named. This is supposed to work without any async hacks too.
Apparently it doesn't. Maybe the nfs Commits are too large.

Bruce
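A minimal way to try the fsync() suggestion above on a local filesystem is
sketched below: it writes the same 64 MB once with delayed writes and a
single final flush, and once with an fsync() after every 64k chunk. The
file name and sizes are arbitrary, illustrative choices.

/*
 * Compare delayed local writes against fsync() after every chunk.
 * Illustrative test only.
 */
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double
fill(const char *path, int sync_each)
{
	static char buf[65536];
	struct timeval t0, t1;
	int fd, i;

	memset(buf, 'x', sizeof(buf));
	if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0)
		err(1, "%s", path);
	gettimeofday(&t0, NULL);
	for (i = 0; i < 1024; i++) {		/* 1024 * 64k = 64 MB */
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			err(1, "write");
		if (sync_each)
			fsync(fd);	/* force each chunk to stable storage */
	}
	fsync(fd);			/* one final flush, like a Commit */
	gettimeofday(&t1, NULL);
	close(fd);
	return ((t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
}

int
main(void)
{
	printf("delayed writes, one flush: %.2f s\n", fill("testfile", 0));
	printf("fsync() every 64k:         %.2f s\n", fill("testfile", 1));
	return (0);
}

The gap between the two runs is the cost of forcing data to stable storage
as it is written, which is roughly what an NFS server has to pay when
writes arrive as FILE_SYNC or are followed promptly by Commits.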