From owner-freebsd-hackers Sat Mar 14 00:58:46 1998 Return-Path: Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id AAA06159 for freebsd-hackers-outgoing; Sat, 14 Mar 1998 00:58:46 -0800 (PST) (envelope-from owner-freebsd-hackers@FreeBSD.ORG) Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id AAA06141 for ; Sat, 14 Mar 1998 00:58:35 -0800 (PST) (envelope-from tlambert@usr08.primenet.com) Received: (from daemon@localhost) by smtp04.primenet.com (8.8.8/8.8.8) id BAA08404; Sat, 14 Mar 1998 01:58:35 -0700 (MST) Received: from usr08.primenet.com(206.165.6.208) via SMTP by smtp04.primenet.com, id smtpd008386; Sat Mar 14 01:58:27 1998 Received: (from tlambert@localhost) by usr08.primenet.com (8.8.5/8.8.5) id BAA17978; Sat, 14 Mar 1998 01:58:26 -0700 (MST) From: Terry Lambert Message-Id: <199803140858.BAA17978@usr08.primenet.com> Subject: Re: Odd problem we're seeing here To: karl@mcs.net (Karl Denninger) Date: Sat, 14 Mar 1998 08:58:25 +0000 (GMT) Cc: hackers@FreeBSD.ORG In-Reply-To: <19980312223102.50360@mcs.net> from "Karl Denninger" at Mar 12, 98 10:31:02 pm X-Mailer: ELM [version 2.4 PL25] MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-freebsd-hackers@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > I'm seeing an interesting problem here. No you're not. You are seeng a problem I already knew about, and provided a patch for, but the patch was not checked in by anyone (no, not the NFS locking patch; that should be checked in, too, though). The problem is in the lease code, and has been exaggerated by recent changes that cause the paging path to be invoked for writes, as well. Leases are the same thing as opportunistic locks, and... "Opportunity only locks once". 8-) 8-). > Assuming two writers to an NFS file from the SAME machine. > > Writer #1 has opened the file O_RDWR and intends to do a set of random > operations on it. > > Writer #2 has opened the file O_WRONLY|O_APPEND and intends only to add > to the end of the file. > > Writer #1 gets an flock on the file, and does things to the file. He > might even do an ftruncate at some point to roll back the contents. > > Writer #2, in the meantime, attempts to get an flock on the file and > (correctly) blocks, waiting for Writer #1 to finish. > > Writer #1 gets done, and releases his lock. > > Writer #2 *SOMETIMES DOES NOT WRITE TO THE CORRECT (END OF FILE) PLACE*. > > This is difficult to reproduce, but it can be done. It appears that the > O_APPEND isn't causing the implied lseek(....., 0, SEEK_END) to be done > before each write in this situation. This behaviour is a result of a race condition. When you open a file, you are saying "Give me the index into my open file table of the struct file * whose f_data points to the struct vnode for the file I am opening, and whose f_offset is set to the end of the file *at the time the write is issued*" Now what does O_APPEND mean? "When you can vn_open on the nameidata pointer, when you call the VOP_OPEN, pass the FAPPEND bit in the fmode argument, where all the FS's will (incorrectly) compare it to IO_APPEND, which will work because FAPPEND is defined as IO_APPEND in fcntl.h" When nfs_write() is called (in /sys/nfs/nfs_bio.c), if the ioflag contains IO_APPEND, then a VOP_GETATTR is issued on the vp, and the uio->uio_offset is set to the value of n_size, which came from the cache. The cache value is set from va_size from a proxied VOP_GETATTR to the remote system, but *ONLY* if the lease has expired on its own, or has explicitly been expired. What you are seeing is the result of n_attrstamp not being updated when it should be. In fact, you are seeing a race between the flock( fd, LOCK_UN), and the atime/mtime updates as a result of the operations by write #1. And if you trace down this race window, you will see that what is happening is that your write call is not calling VOP_LEASE. You can get the patch for the files: /sys/kern/kern_ktrace.c /sys/kern/link_aout.c /src/sys/kern/tty_tty.c /sys/kern/vfs_vnops.c From: http://www.freebsd.org/~terry/DIFF.LEASE Note: I am not terribly happy about this code. I believe VOP_LEASE should be called with the vp locked, but I don't do this in two vn_rdwr cases because of the indeterminate vp lock state. Of course, the existing code blows this all to hell (SMP problems with NFS, anyone?). The only way to really fix this reasonable would be to get rid of vn_rdwr, and replace it with vn_read or vn_write, as appropriate to the situation. Luckily this is really only a problem for the ktrace and other unlikely code (it used to be a problem for the image activator, too, until John fixed it after I complained about it). Pretty clearly vn_rdwr was a hack to avoid locking and unlocking the vp over every call (cough, cough, lazy thinking, cough, cough). > I'm trying to nail this one down. The latest patches to the NFS code appear > to make this happen more often than it did before, but it did occasionally > happen even before the patches. Writes to mapped NFS pages need to trigger the lease revocation code as well. This basically means that there needs to be a VOP_LEASE call for each and every VOP_PUTPAGES in to NFS. This should probably go in the vnode pager code before VOP_PUTPAGES is called. Before the recent changes, paging *to* an NFS FS was a hopeless case. Now, it can cause problems (though probably not this one; I don't believe a putpages can be used to extend a file). The generic routine should be used in case we ever want to export an opportunity locking mechanism with full coherency to user space programs. Like SAMBA. But only if we want O_APPEND to work for SMB and NFS clients accessing the same file (and what idiot would want a consistently functional system? Where's the adventure? 8-)). > This NEVER happens on locally mounted files, but occasionally happens > on NFS mounted files. Yes, well, it's a real coding error in the VFS client code; what do you expect? ;-). Like I said, though, doing the patches won't fix all the races with putpages (consider a record being written, and making the decision of whether or not to read before write, and a record in an adjacent page was just touched), and it won't work well under SMP unless vn_rdwr dies (but then the existing code can't work well under SMP, since most of the other vn_rdwr references have the same bug). Regards, Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-hackers" in the body of the message