From owner-freebsd-hackers Mon Jul 27 07:58:39 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id HAA29891 for freebsd-hackers-outgoing; Mon, 27 Jul 1998 07:58:39 -0700 (PDT) (envelope-from owner-freebsd-hackers@FreeBSD.ORG) Received: from Kitten.mcs.com (Kitten.mcs.com [192.160.127.90]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id HAA29876 for ; Mon, 27 Jul 1998 07:58:36 -0700 (PDT) (envelope-from karl@Mars.mcs.net) Received: from Mars.mcs.net (karl@Mars.mcs.net [192.160.127.85]) by Kitten.mcs.com (8.8.7/8.8.2) with ESMTP id JAA00835; Mon, 27 Jul 1998 09:58:07 -0500 (CDT) Received: (from karl@localhost) by Mars.mcs.net (8.8.7/8.8.2) id JAA24425; Mon, 27 Jul 1998 09:58:07 -0500 (CDT) Message-ID: <19980727095806.05337@mcs.net> Date: Mon, 27 Jul 1998 09:58:06 -0500 From: Karl Denninger To: Mike Smith Cc: Terry Lambert , hackers@FreeBSD.ORG Subject: Re: Odd problem we're seeing here References: <199803140858.BAA17978@usr08.primenet.com> <199807270539.WAA14548@antipodes.cdrom.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.84 In-Reply-To: <199807270539.WAA14548@antipodes.cdrom.com>; from Mike Smith on Sun, Jul 26, 1998 at 10:39:05PM -0700 Sender: owner-freebsd-hackers@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Sun, Jul 26, 1998 at 10:39:05PM -0700, Mike Smith wrote: > > > I'm seeing an interesting problem here. > > > > No you're not. You are seeng a problem I already knew about, > > and provided a patch for, but the patch was not checked in by > > anyone (no, not the NFS locking patch; that should be checked > > in, too, though). > > Was this issue ever resolved to your satisfaction? Not really. I turned off NVSV3 on the affected machines; that "fixed" it, but its a hack, not a real fix. -- -- Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin http://www.mcs.net/ | T1's from $600 monthly / All Lines K56Flex/DOV | NEW! Corporate ISDN Prices dropped by up to 50%! Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS Fax: [+1 312 803-4929] | *SPAMBLOCK* Technology now included at no cost > > The problem is in the lease code, and has been exaggerated by > > recent changes that cause the paging path to be invoked for > > writes, as well. Leases are the same thing as opportunistic > > locks, and... "Opportunity only locks once". 8-) 8-). > > > > > > > Assuming two writers to an NFS file from the SAME machine. > > > > > > Writer #1 has opened the file O_RDWR and intends to do a set of random > > > operations on it. > > > > > > Writer #2 has opened the file O_WRONLY|O_APPEND and intends only to add > > > to the end of the file. > > > > > > Writer #1 gets an flock on the file, and does things to the file. He > > > might even do an ftruncate at some point to roll back the contents. > > > > > > Writer #2, in the meantime, attempts to get an flock on the file and > > > (correctly) blocks, waiting for Writer #1 to finish. > > > > > > Writer #1 gets done, and releases his lock. > > > > > > Writer #2 *SOMETIMES DOES NOT WRITE TO THE CORRECT (END OF FILE) PLACE*. > > > > > > This is difficult to reproduce, but it can be done. It appears that the > > > O_APPEND isn't causing the implied lseek(....., 0, SEEK_END) to be done > > > before each write in this situation. > > > > This behaviour is a result of a race condition. > > > > When you open a file, you are saying > > > > "Give me the index into my open file table of the struct > > file * whose f_data points to the struct vnode for the > > file I am opening, and whose f_offset is set to the end > > of the file *at the time the write is issued*" > > > > Now what does O_APPEND mean? > > > > "When you can vn_open on the nameidata pointer, when you > > call the VOP_OPEN, pass the FAPPEND bit in the fmode > > argument, where all the FS's will (incorrectly) compare > > it to IO_APPEND, which will work because FAPPEND is > > defined as IO_APPEND in fcntl.h" > > > > When nfs_write() is called (in /sys/nfs/nfs_bio.c), if the ioflag > > contains IO_APPEND, then a VOP_GETATTR is issued on the vp, and > > the uio->uio_offset is set to the value of n_size, which came > > from the cache. > > > > The cache value is set from va_size from a proxied VOP_GETATTR > > to the remote system, but *ONLY* if the lease has expired on its > > own, or has explicitly been expired. > > > > What you are seeing is the result of n_attrstamp not being updated > > when it should be. > > > > In fact, you are seeing a race between the flock( fd, LOCK_UN), and > > the atime/mtime updates as a result of the operations by write #1. > > > > > > And if you trace down this race window, you will see that what is > > happening is that your write call is not calling VOP_LEASE. > > > > You can get the patch for the files: > > > > /sys/kern/kern_ktrace.c > > /sys/kern/link_aout.c > > /src/sys/kern/tty_tty.c > > /sys/kern/vfs_vnops.c > > > > From: > > > > http://www.freebsd.org/~terry/DIFF.LEASE > > > > Note: I am not terribly happy about this code. I believe VOP_LEASE > > should be called with the vp locked, but I don't do this in two > > vn_rdwr cases because of the indeterminate vp lock state. Of course, > > the existing code blows this all to hell (SMP problems with NFS, > > anyone?). > > > > The only way to really fix this reasonable would be to get rid of > > vn_rdwr, and replace it with vn_read or vn_write, as appropriate > > to the situation. Luckily this is really only a problem for the > > ktrace and other unlikely code (it used to be a problem for the > > image activator, too, until John fixed it after I complained about > > it). > > > > Pretty clearly vn_rdwr was a hack to avoid locking and unlocking > > the vp over every call (cough, cough, lazy thinking, cough, cough). > > > > > > > I'm trying to nail this one down. The latest patches to the NFS code appear > > > to make this happen more often than it did before, but it did occasionally > > > happen even before the patches. > > > > Writes to mapped NFS pages need to trigger the lease revocation code > > as well. This basically means that there needs to be a VOP_LEASE > > call for each and every VOP_PUTPAGES in to NFS. > > > > This should probably go in the vnode pager code before VOP_PUTPAGES > > is called. Before the recent changes, paging *to* an NFS FS was a > > hopeless case. Now, it can cause problems (though probably not this > > one; I don't believe a putpages can be used to extend a file). The > > generic routine should be used in case we ever want to export an > > opportunity locking mechanism with full coherency to user space > > programs. Like SAMBA. But only if we want O_APPEND to work for > > SMB and NFS clients accessing the same file (and what idiot would > > want a consistently functional system? Where's the adventure? 8-)). > > > > > > > This NEVER happens on locally mounted files, but occasionally happens > > > on NFS mounted files. > > > > Yes, well, it's a real coding error in the VFS client code; what do > > you expect? ;-). > > > > Like I said, though, doing the patches won't fix all the races with > > putpages (consider a record being written, and making the decision of > > whether or not to read before write, and a record in an adjacent page > > was just touched), and it won't work well under SMP unless vn_rdwr > > dies (but then the existing code can't work well under SMP, since most > > of the other vn_rdwr references have the same bug). > > > > > > Regards, > > Terry Lambert > > terry@lambert.org > > --- > > Any opinions in this posting are my own and not those of my present > > or previous employers. > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > > with "unsubscribe freebsd-hackers" in the body of the message > > > > -- > \\ Sometimes you're ahead, \\ Mike Smith > \\ sometimes you're behind. \\ mike@smith.net.au > \\ The race is long, and in the \\ msmith@freebsd.org > \\ end it's only with yourself. \\ msmith@cdrom.com > > > > To Unsubscribe: send mail to majordomo@FreeBSD.org > with "unsubscribe freebsd-hackers" in the body of the message To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-hackers" in the body of the message