Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 12 Oct 2006 14:24:22 -0700
From:      Jo Rhett <jrhett@svcolo.com>
To:        Bruce Evans <bde@zeta.org.au>
Cc:        "Rick C. Petty" <rick-freebsd@kiwi-computer.com>, freebsd-hardware@FreeBSD.org
Subject:   Bounty offered to fix sio device lock problem
Message-ID:  <452EB286.8000503@svcolo.com>
In-Reply-To: <20060724154856.I58894@delplex.bde.org>
References:  <20060721000018.GA99237@svcolo.com> <20060721001607.GA64376@megan.kiwi-computer.com> <20060721004731.GC8868@svcolo.com> <20060724154856.I58894@delplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
Bruce -- who owns getting this fixed?  Or who should own it?   Or who 
will take on getting it fixed if we offer a bounty on it?

Replication scenario:
    Modem on sio0 (or sio1 or any normal i386 serial port)
    /etc/ttys has port enabled with "dialup"
    qpage (from ports, unchanged) uses modem for dialout
	** or just write a script that periodically dials out using tip

Within a day and often within a few hours, the serial port will go awol. 
  You can't talk to the modem any more.   Modem is just fine.
Rebooting the system solves the problem.   Rebooting the modem does not 
solve it.

100% replicable, and sooner versus later if you call out more often.

Bruce Evans wrote:
> On Thu, 20 Jul 2006, Jo Rhett wrote:
> 
>> Thanks for the super-quick reply!  Responses are inline...
>>
>> On Thu, Jul 20, 2006 at 07:16:07PM -0500, Rick C. Petty wrote:
>>>> root@scapa 47# fstat /dev/ttyd0
>>>> USER     CMD          PID   FD MOUNT      INUM MODE         SZ|DV 
>>>> R/W NAME
>>>
>>> What about "fstat /dev/cuad0" ?  Anyway, I've found that fstat is 
>>> useless,
>>> try using sysutils/lsof instead.
> 
> This is related to a longstanding (design) bug in vfs 
> (first-open/last-close
> semantics).  vfs counts devices as being open when it calls the device
> open routine, despite devices not actually being open until the device
> open routine returns successfully, which may happen much later (or not
> at all in the case of failure, but this doesn't cause any additional
> problems).  This causes the device close routine to not be called in
> some cases where it should be.  For bidirectional serial devices, not
> calling the device close routine results in "callin" devices that are
> sleeping in open being treated the same as "callin" devices that have
> successfully completed the open.  The former shouldn't give EBUSY for
> opens of the corresponding "callout" device, but the latter should and
> do.
> 
> FreeBSD-1 has a hack to vfs to work around the bug, but the hack was
> lost in FreeBSD-2.  I still use the hack locally.  Startting with about
> FreeBSD-4, there is a D_TRACKCLOSE device flag that can be used to fix
> the problem less hackishly (but still not in the right way, since it
> requires individual drivers to do generic things).  I haven't got around
> to using it to fix sio even locally.  D_TRACKCLOSE is mostly unused and
> mostly used bogusly when it is used.
> 
> The bug rarely causes problems since it is only activated by doing
> something like the following:
> 
>     thread1: "open" /dev/ttyd0.  Actually, block in open waiting for 
> carrier.
>     thread2: open /dev/ttyd0 using O_NONBLOCK to prevent blocking.
>     thread2: perhaps actually use /dev/ttyd0
>     thread2: "close" /dev/ttyd0.  Actually, don't complete the close due to
>          the bogus vfs close.
>     thread1: remain blocked in open through all the above.
>     thread3: try to open /dev/cuaa^Hd0.  Get EBUSY because the non-open by
>          thread1 is seen as an open.
> 
> Starting with about FreeBSD-5, there may be additional problems from races.
> First-open/last-close semantics basically require opens to be synchronous.
> Sleeping in open for serial device drivers gives large race windows in
> which to open may race open/close of the same device in other threads.
> Prempting the kernel gives small race windows.  In practice, Giant locking
> limits problems.  For serial drivers, open/close should still be Giant-
> locked since the whole tty subsystem is still Giant-locked.  (Note that
> all vfs locks are dropped before calling device open/close.  The bogus vfs
> count provides some psuedo-locking.)
> 
> Rearrangement of serial drivers in -current may have enlarged the bug,
> but I can't see any enlargement except that from more serial drivers
> now supporting bidirectional devices.
> 
>> Sorry, yes. Same results.  And if lsof shows things that fstat doesn't,
>> then this is a bug in FreeBSD.
>>
>> But anyway,
>> root@scapa 63# lsof /dev/cuad0
>> root@scapa 64# lsof /dev/ttyd0
>>
>> Nada.
> 
> I think fstat and lsof can't see threads sleeping in open since the open
> hasn't really completed -- the open has completed enough to confuse vfs
> but not for vfs to report its confusion to userland.  It should be possible
> to see threads sleeping in open using "ps -lax | grep ttydcd" ("ttydcd" is
> the string for -current; the string for sio used to be "siodcd".  Grep for
> "tty" and "dcd" too).  This won't distinguish between threads sleeping
> normally in open (ttyopen) and ones that are in a bogus state due to a
> missing close.
> 
>> Also note that this system is pretty bone stock.  Standard install, plus
>> mysql and apache.  Nothing else would be using the port.  It's something
>> that left it locked, and really only "login" could be the culprit.
> 
> Quite likely, but login doesn't use O_NONBLOCK so I don't know how it
> could trigger the bug.  Maybe nopise on DCD cound do it.  The easiest
> way to trigger the bug is "stty -f /dev/ttyd0" while there is a login
> blocked in open on ttyd0.
> 
>>>> No locks? No processes using it.  Okay, this is uncool.
>>>> And yet "ktrace tip com1" and "kdump -f ktrace.out" clearly show:
>>>>
>>>>  50461 tip      CALL  open(0x8059030,0x6,0)
>>>>  50461 tip      NAMI  "/dev/cuad0"
>>>>  50461 tip      RET   open -1 errno 16 Device busy
>>>
>>> This isn't very useful.  A ktrace on the process that's locking the file
>>> would be.  :-P
>>
>> See above.  I can't find it. :-(
> 
> You might need to start ktracing very early to locate the original
> problematic open/open/close sequence.
> 
>>>> NOTE: at this time I am suspecting that CD is being misread (it's not
>>>> present - I have a break out box on the line) and that this problem is
>>>> somehow tied to that.  This problem appears at random after login has
>>>> exerted itself on the system.  I've disabled the getty on ttyd0 and 
>>>> login
>>>> has timed out, but it continues to show "device busy".
>>>
>>> How did you disable the getty?  Was this prior to or after a 
>>> restart?  It
>>> sounds like /etc/ttys is maybe running a process on it.  You need to
>>> "killall -HUP init" after changing /etc/ttys.  But you probably already
>>> know that.
>>
>> Yes, I change "on" to "off" in /etc/ttys and "kill -1 1" :-)
> 
> Killing all processes sleeping in serial device open unwedges the port for
> the bug that I know about (provided the close doesn't hang).  This and
> making the open succeed by raising DCD in hardware are the only ways that
> I know of to unwedge the port once the open gets stuck.
> 
> Bruce

-- 
Jo Rhett
senior geek
Silicon Valley Colocation



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?452EB286.8000503>