From owner-freebsd-arch@FreeBSD.ORG  Thu Dec 21 13:45:58 2006
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: freebsd-arch@freebsd.org
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 984F716A509;
	Thu, 21 Dec 2006 13:45:58 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 4560C13C475;
	Thu, 21 Dec 2006 13:45:58 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id E39DC47112;
	Thu, 21 Dec 2006 05:38:33 -0500 (EST)
Date: Thu, 21 Dec 2006 10:38:33 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: David Xu <davidxu@freebsd.org>
In-Reply-To: <200612210820.09955.davidxu@freebsd.org>
Message-ID: <20061221102909.O83974@fledge.watson.org>
References: <32874.1165905843@critter.freebsd.dk>
	<20061220153126.G85384@fledge.watson.org>
	<Pine.GSO.4.64.0612201308220.23942@sea.ntplx.net>
	<200612210820.09955.davidxu@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Daniel Eischen <deischen@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: close() of active socket does not work on FreeBSD 6
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 21 Dec 2006 13:45:58 -0000


On Thu, 21 Dec 2006, David Xu wrote:

> On Thursday 21 December 2006 02:18, Daniel Eischen wrote:
>> On Wed, 20 Dec 2006, Robert Watson wrote:
>>> On Wed, 13 Dec 2006, Daniel Eischen wrote:
>>>> Anyway, this was just a thought/idea.  I don't mean to argue against any
>>>> of the other reasons why this isn't a good idea.
>>>
>>> Whatever may be implemented to solve this issue will require a fairly 
>>> serious re-working of how we implement file descriptor reference counting 
>>> in the kernel.  Do you propose similar "cancellation" of other system 
>>> calls blocked on the file descriptor, including select(), etc?  Typically 
>>> these system calls interact with the underlying object associated with the 
>>> file descriptor, not the file descriptor itself, and often, they act 
>>> directly on the object and release the file descriptor before performing 
>>> their operation. I think before we can put any reasonable implementation 
>>> proposal on the table, we need a clear set of requirements:
>>
>> [ ... ]
>>
>>> While providing Solaris-like semantics here makes some amount of sense, 
>>> this is a very tricky area, and one where we're still refining performance 
>>> behavior, reference counting behavior, etc.  I don't think there will be 
>>> any easy answers, and we need to think through the semantic and 
>>> performance implications of any change very carefully before starting to 
>>> implement.
>>
>> I don't think the behavior here has to be any different than what we 
>> currently (or desire to) do with regard to (unblocked) signals interrupting 
>> threads waiting on IO.  You can spend a lot of time thinking about how 
>> close() should affect IO operations on the same file descriptor, but a very 
>> simple approach is to treat them the same as if the operations were 
>> interrupted by a signal.  I'm not suggesting it is implemented the same 
>> way, just that it seems to make a lot of sense to me that the behavior is 
>> consistent between the two.
>
> I think the main concern is if we will record every thread using a fd, that 
> means, when you call read() on a fd, you record your thread pointer into the 
> fd's thread list, when one wants to close the fd, it has to notify all the 
> threads in the list, set a flag for each thread, the flag indicates a thread 
> is interrupted because the fd was closed, when the thread returns from deep 
> code path to read() syscall, it should check the flag, and return EBADF to 
> user if it was set. whatever, a reserved signal or TDF_INTERRUPT may 
> interrupt a thread. but since there are many file operations, I don't know 
> if we are willing to pay such overheads to every file syscall, extra locking 
> is not welcomed.

Yes, as well as adding quite a bit of complexity and opening the door for some 
rather odd/unfortunate races.  You can inspect the bulk of the Solaris 
implementation by looking at three spots:

http://fxr.watson.org/fxr/ident?v=OPENSOLARIS;i=closeandsetf 
http://fxr.watson.org/fxr/ident?v=OPENSOLARIS;i=post_syscall 
http://fxr.watson.org/fxr/search?v=OPENSOLARIS&string=MUSTRETURN

In closeandsetf(), you can see that an additional layer of indirection 
associated with the file descriptor is maintained in order to count consumers 
of a particular fd, not just the open file record, and the set of active fds 
for each thread is maintained.  When a close() is performed and there are 
still other open consumers, the process is suspended and all threads are 
inspected to see if the fd is active for the thread, in which case a thread 
flag is set to indicate that the fd is stale.  I believe that the interrupt 
here is an implicit part of the process suspend/restart, and in 
post_syscall() the EINTR returns are remapped to EBADF.
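
To make the bookkeeping concrete, here is a rough user-space model of the 
scheme as I read it.  It is only a sketch: the structure and function names 
(model_thread, fd_enter, close_mark_stale, post_syscall_errno) are made up 
for illustration and are not the Solaris identifiers, and all locking and the 
process suspend/restart step are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_THREADS 8
#define MAX_ACTIVE  4

/* Hypothetical per-thread record; names are illustrative, not Solaris's. */
struct model_thread {
    int  active_fds[MAX_ACTIVE]; /* fds this thread is currently using */
    int  nactive;
    bool stale_fd;               /* set by close() if an active fd went away */
};

struct model_proc {
    struct model_thread threads[MAX_THREADS];
    int nthreads;
};

/* Syscall entry: record that the thread is about to use fd. */
static void fd_enter(struct model_thread *td, int fd)
{
    assert(td->nactive < MAX_ACTIVE);
    td->active_fds[td->nactive++] = fd;
}

/* Syscall exit: drop the most recent record of fd for this thread. */
static void fd_leave(struct model_thread *td, int fd)
{
    for (int i = td->nactive - 1; i >= 0; i--) {
        if (td->active_fds[i] == fd) {
            td->active_fds[i] = td->active_fds[--td->nactive];
            return;
        }
    }
}

/* close(): flag every thread that still has fd active; return the count. */
static int close_mark_stale(struct model_proc *p, int fd)
{
    int marked = 0;

    for (int t = 0; t < p->nthreads; t++) {
        struct model_thread *td = &p->threads[t];

        for (int i = 0; i < td->nactive; i++) {
            if (td->active_fds[i] == fd) {
                td->stale_fd = true;
                marked++;
                break;
            }
        }
    }
    return marked;
}

/* Return path: remap EINTR to EBADF if the thread was flagged as stale. */
static int post_syscall_errno(struct model_thread *td, int error)
{
    if (td->stale_fd) {
        td->stale_fd = false;
        if (error == EINTR)
            return EBADF;
    }
    return error;
}
```

Even in this toy form, every syscall pays for fd_enter()/fd_leave(), and 
close() has to walk every thread in the process.  That is the per-call 
overhead being discussed here.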

That extra level of indirection and use tracking will be both complex and a 
performance hit in a critical kernel path.  I'm not opposed to investigating 
implementing something along these lines, but I think we should defer this for 
some time while we sort out more pressing issues in our kernel file 
descriptor/socket/etc code and revisit this in a few months.  We will need to 
carefully evaluate the performance costs, and if they are significant, figure 
out how to avoid this causing a significant hit.  It's worth observing that 
removing one level of reference counting from the socket send/receive paths 
(using the file descriptor reference instead of the socket reference) made a 
5%+ difference in high speed send performance.

Robert N M Watson
Computer Laboratory
University of Cambridge