From owner-freebsd-arch@FreeBSD.ORG  Wed Dec 20 16:22:14 2006
Date: Wed, 20 Dec 2006 15:48:59 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Daniel Eischen <deischen@freebsd.org>
In-Reply-To: <Pine.GSO.4.64.0612130918140.13170@sea.ntplx.net>
Message-ID: <20061220153126.G85384@fledge.watson.org>
References: <32874.1165905843@critter.freebsd.dk>
	<Pine.GSO.4.64.0612121543220.8780@sea.ntplx.net>
	<200612132010.49601.davidxu@freebsd.org>
	<Pine.GSO.4.64.0612130918140.13170@sea.ntplx.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: David Xu <davidxu@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: close() of active socket does not work on FreeBSD 6


On Wed, 13 Dec 2006, Daniel Eischen wrote:

> [CC trimmed]
>
> On Wed, 13 Dec 2006, David Xu wrote:
>
>> On Wednesday 13 December 2006 04:49, Daniel Eischen wrote:
>>> 
>>> Well, if threads waiting on IO are interruptible by signals, can't we make 
>>> a new signal that's only used by the kernel and send it to all threads 
>>> waiting on IO for that descriptor? When it gets out to actually set up the 
>>> signal handler, it just resumes like it is returning from an SA_RESTART 
>>> signal handler (which according to another posting would reissue the IO 
>>> command and get EBADF).
>> 
>> Even if you implement close() with interruption, another thread opening a 
>> file can still reuse the file handle immediately: according to the 
>> specifications, the lowest free file handle is returned. If SA_RESTART is 
>> used and the interrupted thread restarts the syscall, it will be using the 
>> wrong file. I think even if we implement the feature in the kernel, 
>> userland threads still have a serious race to fix.
>
> If you use a special signal that is only used for this purpose, there is no 
> reason you have to try the IO operation again.  You can just return EBADF.
>
> Anyway, this was just a thought/idea.  I don't mean to argue against any of 
> the other reasons why this isn't a good idea.
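
To make the descriptor-number reuse mentioned in the quoted text concrete, 
here is a minimal userland sketch.  The function name is made up for 
illustration and the use of /dev/null is an assumption; it relies only on 
the POSIX rule that open() returns the lowest-numbered free descriptor.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Illustrative helper (name is hypothetical): returns 1 if a descriptor
 * opened right after close() gets the same number as the closed one,
 * 0 if it does not, and -1 on error.  Single-threaded on purpose.
 */
int
fd_number_is_reused(void)
{
	int first, second, reused;

	first = open("/dev/null", O_RDONLY);
	if (first < 0)
		return (-1);
	close(first);			/* the index is free again */

	second = open("/dev/null", O_RDWR);
	if (second < 0)
		return (-1);
	reused = (second == first);	/* lowest free index is handed out */
	close(second);
	return (reused);
}
```

In a threaded program the second open() could come from a different thread, 
so a system call restarted on the old index would act on an unrelated file.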

Whatever may be implemented to solve this issue will require a fairly serious 
re-working of how we implement file descriptor reference counting in the 
kernel.  Do you propose similar "cancellation" of other system calls blocked 
on the file descriptor, including select(), etc?  Typically these system calls 
interact with the underlying object associated with the file descriptor, not 
the file descriptor itself, and often, they act directly on the object and 
release the file descriptor before performing their operation.  I think before 
we can put any reasonable implementation proposal on the table, we need a 
clear set of requirements:

- What is the scope of cancellation?  Are we cancelling outstanding
   simultaneous I/O operations on the same fd index in the process, use of any
   fd pointing at the same open file entry in the process (i.e., all dup'd
   instances), or the same open file entry across all processes?  I've been
   presuming only use of the same fd index in the same process is relevant, but
   if so, let's make sure we state that.  If not, what do we mean?

- Exactly which potentially blocking operations will be cancelled as a result
   of close() of an "in use" file descriptor?  read()?  write()?  sendfile()?
   connect()?  ioctl()?  select()?  poll()?  close()?  Is the set of possible
   cancellation points equal to the existing set of interruptible sleeps?
   Notice that in our current implementation, objects are often reached using a
   file descriptor, but then separately referenced for the duration of the
   operation, with the file descriptor being released.  This means that we
   currently don't maintain any useful list of threads currently interacting
   with the file descriptor, and only have a limited notion of which threads
   are interacting with the underlying object.

- What semantics are expected regarding the underlying object when an
   operation is cancelled due to simultaneous close() on the same file
   descriptor?  Keep in mind that the underlying object may be referenced by
   other file descriptor indexes pointing at the same open file state (shared
   offset, etc).  For example, if we cancel connect(), is it safe to say that
   what we've done is cancel the wait for connect() to complete, rather than
   the connection operation itself, which may continue and be visible on other
   file descriptor indexes referencing the same object, or to other processes
   also referencing it?

While providing Solaris-like semantics here makes some amount of sense, this 
is a very tricky area, and one where we're still refining performance 
behavior, reference counting behavior, etc.  I don't think there will be any 
easy answers, and we need to think through the semantic and performance 
implications of any change very carefully before starting to implement.
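
As a rough illustration of the reference pattern described above, where an 
object is reached through a descriptor and then referenced separately, here 
is a toy userland model.  It is not the FreeBSD kernel code; every name in 
it (toy_object, toy_install, toy_op_begin, toy_op_end, toy_close) is 
invented for this sketch, and locking is left out entirely.

```c
#include <stddef.h>

#define TOY_NFILES 8

struct toy_object {
	int refs;		/* references currently held on the object */
};

/* Toy descriptor table: fd index -> object, NULL when the index is free. */
static struct toy_object *toy_table[TOY_NFILES];

/* Install obj at index fd; the table slot holds one reference. */
void
toy_install(int fd, struct toy_object *obj)
{
	obj->refs++;
	toy_table[fd] = obj;
}

/* Start an operation: look the object up and take a separate reference. */
struct toy_object *
toy_op_begin(int fd)
{
	struct toy_object *obj;

	if (fd < 0 || fd >= TOY_NFILES)
		return (NULL);
	obj = toy_table[fd];
	if (obj != NULL)
		obj->refs++;	/* from here on the fd index is not consulted */
	return (obj);
}

/* Finish an operation: drop the reference taken by toy_op_begin(). */
void
toy_op_end(struct toy_object *obj)
{
	obj->refs--;
}

/*
 * Close an index: clear the slot and drop its reference.  Returns the
 * number of references still outstanding, or -1 if the index was free.
 * Note that close only learns a count, not which callers hold them.
 */
int
toy_close(int fd)
{
	struct toy_object *obj;

	if (fd < 0 || fd >= TOY_NFILES || toy_table[fd] == NULL)
		return (-1);
	obj = toy_table[fd];
	toy_table[fd] = NULL;
	return (--obj->refs);
}
```

In this model, a caller that has passed toy_op_begin() keeps the object 
alive across toy_close(), and toy_close() has nothing to walk in order to 
find and wake that caller.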

Robert N M Watson
Computer Laboratory
University of Cambridge