From owner-freebsd-hardware@FreeBSD.ORG  Mon Jul 24 16:19:06 2006
Date: Mon, 24 Jul 2006 09:19:04 -0700
From: Jo Rhett <jrhett@svcolo.com>
To: Bruce Evans <bde@zeta.org.au>
Message-ID: <20060724161904.GA86330@svcolo.com>
References: <20060721000018.GA99237@svcolo.com>
	<20060721001607.GA64376@megan.kiwi-computer.com>
	<20060721004731.GC8868@svcolo.com>
	<20060724154856.I58894@delplex.bde.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20060724154856.I58894@delplex.bde.org>
Organization: svcolo.com
User-Agent: Mutt/1.5.9i
Cc: "Rick C. Petty" <rick-freebsd@kiwi-computer.com>,
	freebsd-hardware@freebsd.org
Subject: Re: device busy -- no locks?

Thank you for the detailed reply.  My answers are inline.

On Mon, Jul 24, 2006 at 04:57:57PM +1000, Bruce Evans wrote:
> This is related to a longstanding (design) bug in vfs (first-open/last-close
> semantics).  vfs counts devices as being open when it calls the device
> open routine, despite devices not actually being open until the device
> open routine returns successfully, which may happen much later (or not
> at all in the case of failure, but this doesn't cause any additional
> problems).  This causes the device close routine to not be called in
> some cases where it should be.  For bidirectional serial devices, not
> calling the device close routine results in "callin" devices that are
> sleeping in open being treated the same as "callin" devices that have
> successfully completed the open.  The former shouldn't give EBUSY for
> opens of the corresponding "callout" device, but the latter should and
> do.
> 
> FreeBSD-1 has a hack to vfs to work around the bug, but the hack was
> lost in FreeBSD-2.  I still use the hack locally.  Starting with about
> FreeBSD-4, there is a D_TRACKCLOSE device flag that can be used to fix
> the problem less hackishly (but still not in the right way, since it
> requires individual drivers to do generic things).  I haven't got around
> to using it to fix sio even locally.  D_TRACKCLOSE is mostly unused and
> mostly used bogusly when it is used.
> 
> The bug rarely causes problems since it is only activated by doing
> something like the following:
> 
>     thread1: "open" /dev/ttyd0.  Actually, block in open waiting for
> 	     carrier.
>     thread2: open /dev/ttyd0 using O_NONBLOCK to prevent blocking.
>     thread2: perhaps actually use /dev/ttyd0
>     thread2: "close" /dev/ttyd0.  Actually, don't complete the close due to
> 	     the bogus vfs close.
>     thread1: remain blocked in open through all the above.
>     thread3: try to open /dev/cuaa^Hd0.  Get EBUSY because the non-open by
> 	     thread1 is seen as an open.
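
A toy model of the sequence above, as a sanity check on the description.
It is not kernel code: the class, its fields and its method names are all
hypothetical, and the only thing it encodes is the rule as described, that
the driver close runs only when the vfs open count drops to zero while a
sleeping open is already included in that count.

```python
import errno


class ToyPort:
    """Toy stand-in for one serial port with a callin and a callout node."""

    def __init__(self):
        self.vfs_count = 0        # opens as counted by vfs, sleepers included
        self.callin_open = False  # driver's view: a callin open has completed

    def open_callin(self, nonblock=False):
        self.vfs_count += 1       # counted before the driver open returns
        if nonblock:
            self.callin_open = True
            return "open"
        return "sleeping"         # blocked waiting for carrier

    def close_callin(self):
        self.vfs_count -= 1
        if self.vfs_count == 0:   # driver close only on the "last" close
            self.callin_open = False

    def open_callout(self):
        return errno.EBUSY if self.callin_open else 0


def replay():
    """Replay the thread1/thread2/thread3 steps and return thread3's result."""
    port = ToyPort()
    port.open_callin()               # thread1: blocks in open
    port.open_callin(nonblock=True)  # thread2: O_NONBLOCK open succeeds
    port.close_callin()              # thread2: close, but count is still 1
    return port.open_callout()       # thread3: callout open
```

In this toy, replay() returns errno.EBUSY, while the same open and close
without the sleeper leaves the callout side free.  One further
close_callin() for the sleeper brings the count to zero and clears the
state, which is how the toy represents the sleeper going away.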
 
Well, in this case it's a simple/standard/stock "getty" and qpage trying to
use the same phone line.

> Starting with about FreeBSD-5, there may be additional problems from races.
> First-open/last-close semantics basically require opens to be synchronous.
> Sleeping in open for serial device drivers gives large race windows in
> which the open may race open/close of the same device in other threads.
> Preempting the kernel gives small race windows.  In practice, Giant locking
> limits problems.  For serial drivers, open/close should still be Giant-
> locked since the whole tty subsystem is still Giant-locked.  (Note that
> all vfs locks are dropped before calling device open/close.  The bogus vfs
> count provides some pseudo-locking.)
 
Okay, you lost me here.  Is there anything I can do about this?  Patch
qpage to use giant-locks?

> I think fstat and lsof can't see threads sleeping in open since the open
> hasn't really completed -- the open has completed enough to confuse vfs
> but not for vfs to report its confusion to userland.  It should be possible
> to see threads sleeping in open using "ps -lax | grep ttydcd" ("ttydcd" is
> the string for -current; the string for sio used to be "siodcd".  Grep for
> "tty" and "dcd" too).  This won't distinguish between threads sleeping
> normally in open (ttyopen) and ones that are in a bogus state due to a
> missing close.
 
For the record, 6.0-REL is apparently using ttydcd:

root@arran 3# ps -lax |grep dcd
    0 11571     1   0   5  0  1260   792 ttydcd S     ??    0:00.00 /usr/libexec/getty std.9600 ttyd0
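
The same check can be scripted.  This is a minimal sketch that assumes the
"ps -lax" column layout seen in the output above, with the wait channel as
the ninth field; the function name and the set of channel names (taken from
the strings Bruce mentions) are my own choices.

```python
WAIT_CHANNELS = {"ttydcd", "siodcd"}  # strings quoted in this thread


def sleepers_in_open(ps_output, channels=WAIT_CHANNELS):
    """Return (pid, wchan, command) for processes sleeping on a listed channel.

    Expects text shaped like `ps -lax` output: twelve whitespace-separated
    columns followed by the command, with the wait channel in column nine.
    """
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 12)
        if len(fields) < 13 or not fields[1].isdigit():
            continue  # header line or something that is not a process row
        if fields[8] in channels:
            found.append((int(fields[1]), fields[8], fields[12]))
    return found
```

Feeding it the text of "ps -lax" (for example from
subprocess.run(["ps", "-lax"], capture_output=True, text=True).stdout)
lists the candidates.  As Bruce notes, this cannot tell a normal sleeper
in open from one left in a bogus state by a missing close.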

> Quite likely, but login doesn't use O_NONBLOCK so I don't know how it
> could trigger the bug.  Maybe noise on DCD could do it.  The easiest
> way to trigger the bug is "stty -f /dev/ttyd0" while there is a login
> blocked in open on ttyd0.
 
Ah... so that's why it happened so often while I was testing last Thursday.
I was checking the tty state while working on the problem, and...

> Killing all processes sleeping in serial device open unwedges the port for
> the bug that I know about (provided the close doesn't hang).  This and
 
Hm.  We have repeatedly seen, and can often reproduce, a situation where
getty starts login on the port and we try to kill login to clear it.  login
hangs and stays in a zombie state for up to a full day.  Related?

> making the open succeed by raising DCD in hardware are the only ways that
> I know of to unwedge the port once the open gets stuck.

Very good to know.  That will make testing this problem much easier.
(reboots must be scheduled weeks in advance, so...)

Yes, I am trying to build a test system we can use to replicate/play
with this bug.

-- 
Jo Rhett
senior geek
SVcolo : Silicon Valley Colocation