Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 19 Dec 2015 03:05:32 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        eugen@grosbein.net
Cc:        freebsd-bugs@freebsd.org
Subject:   Re: [Bug 205398] [regression] [tty] tty_drain() kernel function lacks timeout support it had before
Message-ID:  <20151219013916.N930@besplex.bde.org>
In-Reply-To: <bug-205398-8@https.bugs.freebsd.org/bugzilla/>
References:  <bug-205398-8@https.bugs.freebsd.org/bugzilla/>

next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, 18 Dec 2015 a bug system that doesn't want replies wrote:

> Revision 181905 by ed@freebsd.org brought the new MPSAFE TTY layer and removed
> "drainwain" timeout support. Now applications working with serial port can hang
> forever on close() system call:

It brought many other bugs.  About 20 more related to draining.

Some of the other bugs accidentally ameliorate this one.  The tty layer
never waits long enough for the last few characters to drain (though
I finished fixing this for sio in 1996).  So it takes a large buffer
to possibly give an endless wait.  Flow control must be on for the
wait to be endless.  Flow control is also broken...

There is a hack for last-close that is supposed to give a hard-coded timeout
of 1 second.  Not sure why this doesn't work for you.  My quick fix that
restores the timeout uses slightly different logic where this hack was.

The timeout is also a hack (breaks POSIX conformance), but at least the
user can control it and it doesn't default to a too small value.  The
old default of 300 seconds was a bit too large, but I kept it.  My systems
have always changed this to 180 seconds in /etc/rc.d.  I set it to 1
second per-device only transiently.

> - an application opens /dev/cuau0 in non-blocing i/o mode and tries to detect
> GSM gateway there writing commands like ATZ, ATE1 etc. to the device;
> - the device may be dead (lost power, broken, disconnected etc.) and does not
> answer back;

Old versions also had a hack by me that breaks waiting in last-close if
the device is in non-blocking mode.

If the device is really disconnected, then the tty should be in a zombie
state and should not wait.  I think this still works.  CLOCAL or lack of
modem signals may break detection of last-close.

Did you have CRTSCTS flow control enabled?  This is probably the main
source of hangs.  The RTS and CTS signals are not ignored in CLOCAL mode,
flow control should be invoked when they go away when th device goes
away.

> - application timeouts waiting for answer and closes device with close()
> - tty layer tries to drain output "forever", until a signal arrives.

Perhaps the hard-coded 1 second timeout only works for close() in exit().
So it helps more for sloppy applications that exit without waiting for
their data to go out.

Applications that do the above are still sloppy.  POSIX specifies waiting
"forever" again to drain in close().  A non-buggy application would do:

      write();
      // set up timeout for draining
      tcdrain();
      // when timeout expires, try to recover
      // when recovery is impossible, clean up and exit
      tcflush();		// this is a critical step in the cleanup
      // set up timeout for closing, just in case there is a kernel bug
      close();		// now it can't block unless there was a kernel bug

> gnokii (comms/gnokii) suffers from this problem.
>
> Please re-implement tunable timeout and TIOCSDRAINWAIT syscall kernel has
> before.

This is mostly fixed in my version.  I started to cut out the patches,
but they were too entwined with other fixes.  Here is the part that
replaces the hard-coded 1 second timeout:

X diff -c2 ./kern/tty.c~ ./kern/tty.c
X *** ./kern/tty.c~	Thu Mar 19 18:23:08 2015
X --- ./kern/tty.c	Sat Aug  8 11:40:23 2015
X ***************
X *** 133,155 ****
X   		return (0);
X 
X ! 	while (ttyoutq_bytesused(&tp->t_outq) > 0) {
X   		ttydevsw_outwakeup(tp);
X   		/* Could be handled synchronously. */
X   		bytesused = ttyoutq_bytesused(&tp->t_outq);
X ! 		if (bytesused == 0)
X   			return (0);
X 
X   		/* Wait for data to be drained. */
X ! 		if (leaving) {
X   			revokecnt = tp->t_revokecnt;
X ! 			error = tty_timedwait(tp, &tp->t_outwait, hz);
X   			switch (error) {
X   			case ERESTART:
X   				if (revokecnt != tp->t_revokecnt)
X   					error = 0;
X   				break;
X   			case EWOULDBLOCK:
X ! 				if (ttyoutq_bytesused(&tp->t_outq) < bytesused)
X   					error = 0;
X   				break;
X   			}
X --- 196,225 ----
X   		return (0);
X 
X ! 	while (ttyoutq_bytesused(&tp->t_outq) != 0 || tp->t_flags & TS_BUSY) {

This also restores some TS_BUSY logic.  TS_BUSY must be set by drivers to
inform the tty layer that the downstream layer(s) have output in progress.
Without that, upper layers cannot wait as necessary.

X   		ttydevsw_outwakeup(tp);
X   		/* Could be handled synchronously. */
X   		bytesused = ttyoutq_bytesused(&tp->t_outq);
X ! 		busy = tp->t_flags & TS_BUSY;
X ! 		if (bytesused == 0 && busy == 0)
X   			return (0);
X 
X   		/* Wait for data to be drained. */
X ! 		if (leaving || tp->t_drainwait != 0) {
X   			revokecnt = tp->t_revokecnt;
X ! 			waittime = (tp->t_drainwait != 0) ? tp->t_drainwait :
X ! 			    hz;

This is bug for bug compatible with the previous version but not with old
versions.  It maps t_drainwait = 0 to a timeout of a seconds.  In old
versions, t_drainwait = 0 was the way of getting the timeout of infinity
for POSIX conformance.

X ! 			error = tty_timedwait(tp, &tp->t_outwait, waittime);
X   			switch (error) {
X   			case ERESTART:
X + 				/* XXX this seems wrong. */
X   				if (revokecnt != tp->t_revokecnt)
X   					error = 0;
X   				break;

This is the revoke(2) case.  I am now working on fixing all old bugs in
revoke().  Here mapping the error to 0 seems wrong.  revoke() should
forcibly break the draining and perhaps make it always return EIO.  But
mapping the error to 0 makes us wait again.  Waiting is especially wrong
for revoked ttys, but the timeout limits the damage.

X   			case EWOULDBLOCK:
X ! 				if (ttyoutq_bytesused(&tp->t_outq) !=
X ! 				    bytesused ||(tp->t_flags & TS_BUSY) != busy)
X   					error = 0;
X + 				else
X + 					error = EIO;
X   				break;
X   			}

This maps the error to EIO like it used to be.  Neither EWOULDBLOCK
(EAGAIN) or EIO is quite right.

For a quick fix, try turning off flow control (both hardware and software)
in last-close.  This should limit the wait.  Only large buffers or small
speeds take very long to drain if draining is not blocked completely by
flow control.  I use small speeds to test bugs in this area.  E.g., at 50
bps, a 4K buffer takes 800 seconds to drain; at 1 bps, it takes 40960
seconds to drain.  This shows how broken a hard-coded timeout of 1 or
even 300 seconds is.  Also, how broken an application that doesn't do
its own draining and error handling is.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20151219013916.N930>