From owner-freebsd-hackers  Sat Mar 14 00:58:46 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id AAA06159
          for freebsd-hackers-outgoing; Sat, 14 Mar 1998 00:58:46 -0800 (PST)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id AAA06141
          for <hackers@FreeBSD.ORG>; Sat, 14 Mar 1998 00:58:35 -0800 (PST)
          (envelope-from tlambert@usr08.primenet.com)
Received: (from daemon@localhost)
	by smtp04.primenet.com (8.8.8/8.8.8) id BAA08404;
	Sat, 14 Mar 1998 01:58:35 -0700 (MST)
Received: from usr08.primenet.com(206.165.6.208)
 via SMTP by smtp04.primenet.com, id smtpd008386; Sat Mar 14 01:58:27 1998
Received: (from tlambert@localhost)
	by usr08.primenet.com (8.8.5/8.8.5) id BAA17978;
	Sat, 14 Mar 1998 01:58:26 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199803140858.BAA17978@usr08.primenet.com>
Subject: Re: Odd problem we're seeing here
To: karl@mcs.net (Karl Denninger)
Date: Sat, 14 Mar 1998 08:58:25 +0000 (GMT)
Cc: hackers@FreeBSD.ORG
In-Reply-To: <19980312223102.50360@mcs.net> from "Karl Denninger" at Mar 12, 98 10:31:02 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> I'm seeing an interesting problem here.

No you're not.  You are seeng a problem I already knew about,
and provided a patch for, but the patch was not checked in by
anyone (no, not the NFS locking patch; that should be checked
in, too, though).

The problem is in the lease code, and has been exaggerated by
recent changes that cause the paging path to be invoked for
writes, as well.  Leases are the same thing as opportunistic
locks, and...  "Opportunity only locks once".  8-) 8-).


> Assuming two writers to an NFS file from the SAME machine.
> 
> Writer #1 has opened the file O_RDWR and intends to do a set of random
> operations on it.
> 
> Writer #2 has opened the file O_WRONLY|O_APPEND and intends only to add
> to the end of the file.
> 
> Writer #1 gets an flock on the file, and does things to the file.  He
> might even do an ftruncate at some point to roll back the contents.
> 
> Writer #2, in the meantime, attempts to get an flock on the file and 
> (correctly) blocks, waiting for Writer #1 to finish.
> 
> Writer #1 gets done, and releases his lock.
> 
> Writer #2 *SOMETIMES DOES NOT WRITE TO THE CORRECT (END OF FILE) PLACE*.
> 
> This is difficult to reproduce, but it can be done.  It appears that the
> O_APPEND isn't causing the implied lseek(....., 0, SEEK_END) to be done
> before each write in this situation.

This behaviour is a result of a race condition.

When you open a file, you are saying

	"Give me the index into my open file table of the struct
	 file * whose f_data points to the struct vnode for the
	 file I am opening, and whose f_offset is set to the end
	 of the file *at the time the write is issued*"

Now what does O_APPEND mean?

	"When you can vn_open on the nameidata pointer, when you
	 call the VOP_OPEN, pass the FAPPEND bit in the fmode
	 argument, where all the FS's will (incorrectly) compare
	 it to IO_APPEND, which will work because FAPPEND is
	 defined as IO_APPEND in fcntl.h"

When nfs_write() is called (in /sys/nfs/nfs_bio.c), if the ioflag
contains IO_APPEND, then a VOP_GETATTR is issued on the vp, and
the uio->uio_offset is set to the value of n_size, which came
from the cache.

The cache value is set from va_size from a proxied VOP_GETATTR
to the remote system, but *ONLY* if the lease has expired on its
own, or has explicitly been expired.

What you are seeing is the result of n_attrstamp not being updated
when it should be.

In fact, you are seeing a race between the flock( fd, LOCK_UN), and
the atime/mtime updates as a result of the operations by write #1.


And if you trace down this race window, you will see that what is
happening is that your write call is not calling VOP_LEASE.

You can get the patch for the files:

	/sys/kern/kern_ktrace.c
	/sys/kern/link_aout.c
	/src/sys/kern/tty_tty.c
	/sys/kern/vfs_vnops.c

From:

	http://www.freebsd.org/~terry/DIFF.LEASE

Note: I am not terribly happy about this code.  I believe VOP_LEASE
should be called with the vp locked, but I don't do this in two
vn_rdwr cases because of the indeterminate vp lock state.  Of course,
the existing code blows this all to hell (SMP problems with NFS,
anyone?).

The only way to really fix this reasonable would be to get rid of
vn_rdwr, and replace it with vn_read or vn_write, as appropriate
to the situation.  Luckily this is really only a problem for the
ktrace and other unlikely code (it used to be a problem for the
image activator, too, until John fixed it after I complained about
it).

Pretty clearly vn_rdwr was a hack to avoid locking and unlocking
the vp over every call (cough, cough, lazy thinking, cough, cough).


> I'm trying to nail this one down.  The latest patches to the NFS code appear 
> to make this happen more often than it did before, but it did occasionally
> happen even before the patches.

Writes to mapped NFS pages need to trigger the lease revocation code
as well.  This basically means that there needs to be a VOP_LEASE
call for each and every VOP_PUTPAGES in to NFS.

This should probably go in the vnode pager code before VOP_PUTPAGES
is called.  Before the recent changes, paging *to* an NFS FS was a
hopeless case.  Now, it can cause problems (though probably not this
one; I don't believe a putpages can be used to extend a file).  The
generic routine should be used in case we ever want to export an
opportunity locking mechanism with full coherency to user space
programs.  Like SAMBA.  But only if we want O_APPEND to work for
SMB and NFS clients accessing the same file (and what idiot would
want a consistently functional system?  Where's the adventure?  8-)).


> This NEVER happens on locally mounted files, but occasionally happens
> on NFS mounted files.

Yes, well, it's a real coding error in the VFS client code; what do
you expect?  ;-).

Like I said, though, doing the patches won't fix all the races with
putpages (consider a record being written, and making the decision of
whether or not to read before write, and a record in an adjacent page
was just touched), and it won't work well under SMP unless vn_rdwr
dies (but then the existing code can't work well under SMP, since most
of the other vn_rdwr references have the same bug).


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message