From: Rick Macklem
To: Eric Anderson
Cc: freebsd-fs@freebsd.org
Date: Tue, 17 Jul 2007 15:04:03 -0400 (EDT)
Subject: Re: NFS on NFS?
In-Reply-To: <469CE76F.9040105@freebsd.org>
References: <469CAE7D.8090609@freebsd.org> <469CE76F.9040105@freebsd.org>

On Tue, 17 Jul 2007, Eric Anderson wrote:

> Rick Macklem wrote:
>
> Is that really true?  It looked like the NFS handle was created by various
> file system goo, which could come up again some time in the future.  For
> instance, fill a file system's inode table, rm all the files, do it again
> (with different data in the files).
> Wouldn't the NFS handle look the same to
> the client then, but be a different file?  Or when we say 'file' do we mean
> 'inode' on a file system?
>
The file handle also has di_gen (the generation #) in it, which is there
specifically to prevent the file handle from accidentally referring to
a new file with the same i-node #. The server is expected to return ESTALE
when a client tries to use a file handle after the file is deleted and this
error is returned when the generation # in the file handle is not the same
as di_gen in the i-node. (di_gen is incremented each time the i-node is
re-used.)

File systems that do not have the equivalent of di_gen cannot be exported
via NFS correctly (but some people/systems do so anyhow). Ok if the file
system is read-only.

> Also, by 'T stable', does 'T' mean 'time' here?

Yep. Capital T for a looonnngggg time.

> I'm not certain I completely understand why the clients would get confused.
> Wouldn't it look something like this:
>
> [File system->NFS server->NFS handle]
>         |
>         V
> [NFS client->virtual file system->NFS server->NFS handle2]
>         |
>         V
> [NFS Client->virtual file system->application]
>
So long as the intermediate server obeys all the rules, it can work:
- File Handle is T-stable (recognized as ESTALE after the file is deleted)
  and still works the same after server reboots, etc.
- fsid in getattr remains the same throughout the file system, even after
  server reboots, etc.
- handles RPCs in an atomic way, so that they are either done or not
  (can't leave things half created after a crash)
- NFSv2 and v3 clients don't expect servers to maintain any state and
  don't know the server rebooted. They simply retry the RPC until they
  get success or failure back from the server.

Where these schemes usually break down is when the intermediate server
reboots and no longer does the same file handle translations or assigns
a new, different fsid to the file system or crosses a mount point boundary
and changes the fsid or ???
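
To make the di_gen check concrete, here is a toy sketch in Python. It is
not the FreeBSD code: the ToyServer and FileHandle names and their methods
are made up for illustration. It assumes the file handle carries the fsid,
the i-node # and a copy of di_gen, and that the server bumps di_gen each
time an i-node is re-used.

```python
import errno
from dataclasses import dataclass


@dataclass(frozen=True)
class FileHandle:
    fsid: int  # identifies the exported file system
    ino: int   # i-node number
    gen: int   # copy of di_gen taken when the handle was issued


class ToyServer:
    """In-memory model of the generation check (hypothetical, not a real API)."""

    def __init__(self, fsid):
        self.fsid = fsid
        self.di_gen = {}     # ino -> current generation number
        self.in_use = set()  # i-nodes currently allocated to a file

    def create(self, ino):
        # Re-using an i-node increments its generation number.
        self.di_gen[ino] = self.di_gen.get(ino, 0) + 1
        self.in_use.add(ino)
        return FileHandle(self.fsid, ino, self.di_gen[ino])

    def remove(self, ino):
        self.in_use.discard(ino)

    def lookup(self, fh):
        # Return 0 if the handle is still valid, errno.ESTALE otherwise.
        if fh.fsid != self.fsid or fh.ino not in self.in_use:
            return errno.ESTALE
        if fh.gen != self.di_gen[fh.ino]:
            return errno.ESTALE
        return 0


srv = ToyServer(fsid=1)
old = srv.create(ino=42)  # handle for the original file
srv.remove(42)            # file deleted
new = srv.create(ino=42)  # same i-node #, different file
print(srv.lookup(old) == errno.ESTALE, srv.lookup(new) == 0)
```

In this model the old handle stays stale even after i-node 42 is handed to
a new file, which is the property a file system without a di_gen
equivalent cannot provide.
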
Like I said, seems like a simple proxy that passes along the RPCs is easier
to do. For NFSv3 (not v2) the intermediary can grow the size of the file
handle (to a maximum of 64 bytes) so, if the real server creates file
handles less than 64 bytes in size, it can add/remove stuff, but...
- it then becomes useful for only certain servers
- it has to do lots of copying of args, since the size changes

rick