From: Bruce Evans
Date: Fri, 22 Apr 2011 09:55:37 +1000 (EST)
To: Rick Macklem
Cc: freebsd-fs@freebsd.org
Subject: Re: make the experimental NFS subsystem the default one
Message-ID: <20110422075847.Y997@besplex.bde.org>
In-Reply-To: <1143723691.393441.1303384281461.JavaMail.root@erie.cs.uoguelph.ca>

On Thu, 21 Apr 2011, Rick Macklem wrote:

>> I've hacked up the old NFS server a bit to better serve my ESX/NFS
>> needs (forcing it to always write async so it works faster with ZFS
>> and a ZIL). I guess I could do the same to the new code as well...
>>
>> ..However, if you didn't mind adding a switch to force the NFS server to
>> quietly only do an async VOP_WRITE, that would be really handy.
>> Very dangerous for some people, but then again, so are most switches if
>> you don't understand what they do. ESX will only open its files with
>> O_SYNC, and it's not an editable option, so people like me running
>> ESX+NFS over a ZFS share that has a ZIL are making the system do a lot
>> of unnecessary sync writes. Yes, I know about SSD/RAM for the ZIL, but
>> it still slows it down quite a bit over an async write.
>
> Have you tried the "async" mount option on the client side?
> (or did you say you couldn't do that above?)

The async mount option should work quite well for this. Except in my
version, it almost completely breaks O_SYNC (and fsync()). Thus
applications and nfs have no way to force a sync. This is mostly fixed
in my version -- O_SYNC and fsync() sync everything related to the file
except possibly its directory entry (and its parent's directory
entry...). Thus applications and nfs can force a not-quite-complete
sync, but doing so gratuitously is unnecessarily slow.

> I would be very hesitant to do it on the server side, since it would
> violate the RFC (and you do run a risk of losing data, which many might
> not realize until it is too late).

As you know, nfs has always had problems with syncing things, especially
if the local file system is mounted async. In [Free]BSD there is just
no way for the server to sync metadata independently of syncing data,
although at least the v3 protocol supports this. Syncing everything
using IO_SYNC should work, but is too slow in general, and even IO_SYNC
is defeated by the above brokenness.

In the old nfs server, the code for this is:

From nfs.h:
% /*
%  * The IO_METASYNC flag should be implemented for local filesystems.
%  * (Until then, it is nothin at all.)
%  */
% #ifndef IO_METASYNC
% #define IO_METASYNC 0
% #endif

From nfs_serv.c:
%
%     /*
%      * XXX
%      * The IO_METASYNC flag indicates that all metadata (and not just
%      * enough to ensure data integrity) must be written to stable storage
%      * synchronously.
%      * (IO_METASYNC is not yet implemented in 4.4BSD-Lite.)
%      */
%     if (stable == NFSV3WRITE_UNSTABLE)
%         ioflags = IO_NODELOCKED;
%     else if (stable == NFSV3WRITE_DATASYNC)
%         ioflags = (IO_SYNC | IO_NODELOCKED);
%     else
%         ioflags = (IO_METASYNC | IO_SYNC | IO_NODELOCKED);

The experimental nfs server seems to have even less support for this:

From nfs_nfsdport.c:
%     if (stable == NFSWRITE_UNSTABLE)
%         ioflags = IO_NODELOCKED;
%     else
%         ioflags = (IO_SYNC | IO_NODELOCKED);

Since IO_METASYNC is 0, the new code is equivalent to the old code, but
the old code would work better if IO_METASYNC actually worked. OTOH,
IO_SYNC implies syncing metadata in FreeBSD, so IO_METASYNC is needed
for something quite different -- to sync metadata when data is _not_
being synced. This should be done by a state between completely
unstable and unstable-data.

nfs is basically assuming that metadata is sync by default. This is the
case with old ffs, but not always:

- ffs with soft updates. Sync of metadata is delayed, and might not
  happen due to a crash. Only integrity of the file system is
  guaranteed. This rarely matters. It might matter if an application
  or nfs client thinks it has completely written a critical chown().
- ffs with async mounts. Now nothing is synced by default, and unless
  you have fixed it, almost nothing is synced by IO_SYNC. It would be
  useful for IO_METASYNC to sync all the metadata. This would give the
  same stability as the default for old ffs (sync metadata, async data).
- msdosfs with defaults. Now the most critical metadata (the FAT) is
  async, while less critical metadata (directory entries and
  pseudo-inodes) are sync. Sync FAT would be too slow, so it takes a
  sync mount to get it. Again IO_METASYNC (applied to the file and
  enough of its surrounding metadata) would be useful for giving almost
  the same stability as old ffs.

The nfs protocol seems to be inadequate for handling various
combinations of asyncness on the client and the server.
E.g., suppose that the server file system is mounted async. You probably
want most operations to be async. But clients know nothing of asyncness
on servers, and the spec requires some operations to be sync, so clients
will issue sync operations which break the default of asyncness unless
the server dishonors IO_SYNC. But you want some operations like fsync()
on the client to be sync, so you don't want the server to dishonor all
IO_SYNCs. One way to control this would be to make async mounts on the
client work (async mounts on clients are now handled bogusly by
accepting them in mount options although they have no effect):

async client, async server: client only asks for sync operations when
    the application asks for them; server uses default which keeps other
    operations async; server should honor IO_SYNC otherwise

default client, async server: same as now -- client gets server's
    defaults even when they involve dishonoring IO_SYNC

async client, default or sync server: client should have precedence,
    giving async for everything. Need another protocol stability value
    and more server support for it, so that the server doesn't use its
    configuration of sync for at least metadata, but uses the client's
    configuration of async everything. The NFSWRITE_UNSTABLE value
    permits but doesn't request the server to be unstable, but with an
    async client we want to request the server to be unstable :-)
    (really just to optimize for speed instead of stability)

default client, default server: some combination of the client and
    server's defaults. Must include honoring the protocol and IO_SYNC
    and fsync() on the client

sync client, async server: client should have precedence, giving sync
    for everything.
    Same as now, but quite broken if the server dishonors IO_SYNC

sync client or server, default other side: as for async, explicitly
    asking for sync has precedence over defaults

When I started writing the above (consolidating old ideas), I thought
that the client would need to negotiate the [a]syncness with the server,
but I now see that several useful cases can be handled by the client
just expressing its preferences for async by not asking for FILESYNC or
DATASYNC.

Grepping for NFS.*WRITE_UNSTABLE in the new nfs client and server shows
that it is now spelled without V3, except 3 comments have the old
spelling.

Bruce
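
For readers following the thread, the two mappings discussed above can be sketched in user-space C. This is only an illustration under stated assumptions: the flag values are stand-ins rather than the kernel's definitions, IO_METASYNC is given a hypothetical nonzero value (the quoted nfs.h defines it as 0), and the names `stable_to_ioflags()`, `client_stability()` and `enum client_mode` are made up here. The first function mirrors the old server's if/else chain; the second shows a client expressing a preference for async simply by not asking for FILESYNC or DATASYNC unless the application wants a sync.

```c
/* Stand-in flag values for illustration only; the kernel's real
 * definitions live in its own headers. */
#define IO_NODELOCKED 0x0008
#define IO_SYNC       0x0080
/* Hypothetical: nonzero here, whereas the quoted nfs.h defines it as 0. */
#define IO_METASYNC   0x1000

/* Write stability levels, numbered as stable_how in the NFSv3 protocol. */
enum stable_how {
    WRITE_UNSTABLE = 0,
    WRITE_DATASYNC = 1,
    WRITE_FILESYNC = 2
};

/* Hypothetical client mount preference. A "default" mode is left out
 * because the message does not pin down its behaviour. */
enum client_mode { CLIENT_ASYNC, CLIENT_SYNC };

/* Server side: same shape as the old server's if/else chain. */
int stable_to_ioflags(enum stable_how stable)
{
    if (stable == WRITE_UNSTABLE)
        return IO_NODELOCKED;
    if (stable == WRITE_DATASYNC)
        return IO_SYNC | IO_NODELOCKED;
    return IO_METASYNC | IO_SYNC | IO_NODELOCKED;
}

/* Client side: a sync mount always asks for FILESYNC; an async mount
 * asks for it only when the application did (O_SYNC or fsync()). */
enum stable_how client_stability(enum client_mode mode, int app_wants_sync)
{
    if (mode == CLIENT_SYNC || app_wants_sync)
        return WRITE_FILESYNC;
    return WRITE_UNSTABLE;
}
```

With these stand-ins, `stable_to_ioflags(client_stability(CLIENT_ASYNC, 0))` yields IO_NODELOCKED alone, so an async server is not pushed into sync writes unless the application on the client asked for one.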