Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 9 Nov 2003 20:43:33 +1100 (EST)
From:      Bruce Evans <bde@zeta.org.au>
To:        Don Lewis <truckman@FreeBSD.org>
Cc:        current@FreeBSD.org
Subject:   Re: serial console oddity
Message-ID:  <20031109192046.L2537@gamplex.bde.org>
In-Reply-To: <200311090735.hA97ZLeF063127@gw.catspoiler.org>
References:  <200311090735.hA97ZLeF063127@gw.catspoiler.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 8 Nov 2003, Don Lewis wrote:

> I've been seeing some wierd things for many months when using a serial
> console on my -CURRENT box.  I finally had a chance to take a closer
> look today.
>
> It looks like the problem is some sort of interference between kernel
> output to the console and userland writes to /dev/console.  I typically
> see syslogd output to the console get corrupted.  Each message that
> syslogd writes seems to get truncated or otherwise corrupted.  The most
> common thing I see is that each syslog message is reduced to a space and
> the first character of the month, or sometimes just a space, or
> sometimes nothing at all.

This is (at least primarily) a longstanding bug in ttymsg().  It uses
nonblocking mode so that it doesn't block in write() or close().  For
the same reason, it doesn't wait for output to drain before close().
If the close happens to be the last one on the device, this causes any
data buffered in the tty and lower software layers to be discarded
cleanly and any data in lower hardware layers to by discarded in a
driver plus hardware-dependent way (usually not so cleanly, especially
for the character being transmitted).

> This is totally consistent until I "kill
> -HUP" syslogd, which I believe causes syslogd to close and open
> /dev/console, after which the syslog output appears correct on the
> console. When the syslogd output is being corrupted, I can cat a file to
> /dev/console and the output appears to be correct.

When I debugged this, syslogd didn't seem to keep the console open,
so the open()/close() in ttymsg() always caused the problem.  I didn't
notice killing syslogd makes a difference.  Perhaps it helps due to a
missing close.  Holding the console open may be a workaround or even
the correct fix.  It's not clear where this should be done (should all
clients of ttymsg() do it?).  Running getty on the console or on the
underlying tty device should do it accidentally.

> I truss'ed syslogd, and it appears to be working normally, the writev()
> call that writes the data to the console appears to be writing the
> correct character count, so it would appear that the fault is in the
> kernel.

If there are any kernel bugs in this area, then they would be that
last close of the console affects the underlying tty.  The multiple
console changes are quite likely to have broken this if getty is run
on the underlying tty (they silently discarded the half-close of the
underlying tty which was needed to avoided trashing some of its state
when only the console is closed).

> The problem doesn't appear to be specific to syslogd, because I have
> seen the output from the shutdown scripts that goes to the console get
> truncated as well.

Yes, in theory it should affect anything that uses ttymsg() or does
direct non-blocking writes without waiting for the output to drain.

> I have my serial console running at the default 9600 bps.

I always use 115200 bps and the symptoms are similar right down to
normally getting only the first character of the month name followed
by 0-1 bytes of garbage.  The first character of the month name is
just the first character of the message.  Apparently my systems are
fast enough for close() to be called before transmission of the second
character has completed (2 * 87+ usec at 115200 bps).

Here are some half-baked fixes.  The part that clears O_NONBLOCK is
wrong, and the usleep() part is obviously a hack.  ttymsg() shouldn't
block even in close(), since if the close is in the parent ttymsg()
might block forever and if the close() is in a forked child then
blocking could create zillions of blocked children.

Another part of the patch is concerned with limiting forked children.
If I were happy with that part then blocking would not be so bad.  In
practice, I don't have enough system activity for blocked children to
be a problem.  To see the problem with blocked children, do something
like the following:
- turn off clocal on the console so that the console can block better.
  For sio consoles this often requires turning it off in the lock-state
  device, since the driver defends against this foot shooting by locking
  it on.
- hold the console open or otherwise avoid the original bug in this
  thread, else messages will just be discarded in close() faster than
  they can pile up.
- turn off your external console device or otherwise drop carrier.
- send lots of messages.

%%%
Index: ttymsg.c
===================================================================
RCS file: /home/ncvs/src/usr.bin/wall/ttymsg.c,v
retrieving revision 1.11
diff -u -2 -r1.11 ttymsg.c
--- ttymsg.c	11 Oct 2002 14:58:34 -0000	1.11
+++ ttymsg.c	11 Oct 2002 18:13:51 -0000
@@ -32,14 +32,16 @@
  */

-#include <sys/cdefs.h>
-
-__FBSDID("$FreeBSD: src/usr.bin/wall/ttymsg.c,v 1.11 2002/10/11 14:58:34 mike Exp $");
-
+#if 0
 #ifndef lint
-static const char sccsid[] = "@(#)ttymsg.c	8.2 (Berkeley) 11/16/93";
+static char sccsid[] = "@(#)ttymsg.c	8.2 (Berkeley) 11/16/93";
+#endif /* not lint */
 #endif

+#include <sys/cdefs.h>
+__FBSDID("$FreeBSD: src/usr.bin/wall/ttymsg.c,v 1.11 2002/10/11 14:58:34 mike Exp $");
+
 #include <sys/types.h>
 #include <sys/uio.h>
+#include <sys/wait.h>
 #include <dirent.h>
 #include <errno.h>
@@ -66,8 +68,24 @@
 	struct iovec localiov[7];
 	ssize_t left, wret;
-	int cnt, fd;
+	static pid_t cpid;
+	pid_t pid;
+	int cnt, fd, forked, status;
 	static char device[MAXNAMLEN] = _PATH_DEV;
 	static char errbuf[1024];
-	int forked;
+
+	/*
+	 * Don't do anything if a child is still alive.  Interleaved
+	 * output would not be good, and filling the process table with
+	 * as many children as we can fork before the timeout kills them
+	 * would be worse.  Waiting here also reduces the need to handle
+	 * SIGCHLD in callers.
+	 */
+	if (cpid) {
+		pid = waitpid(cpid, &status, WNOHANG);
+		if (pid == cpid || (pid == -1 || errno == ECHILD))
+			cpid = 0;
+		else
+			return (NULL);
+	}

 	forked = 0;
@@ -121,6 +139,4 @@
 		}
 		if (errno == EWOULDBLOCK) {
-			int cpid;
-
 			if (forked) {
 				(void) close(fd);
@@ -161,4 +177,8 @@
 	}

+	(void) fcntl(fd, F_SETFL, 0);	/* Clear O_NONBLOCK. */
+#if 1
+	usleep(100000);
+#endif
 	(void) close(fd);
 	if (forked)
%%%

For a non-half-baked fix, do somethng like:
- never block in ttymsg(), but always wait for output to drain using
  tcdrain() in a single child process.  It's probably acceptable for
  this to not report errors to ttymsg()'s caller.
- limit children better.  I think we now fork children iff writev()
  returns EWOULDBLOCK and this happens mainly when the tty buffers
  fill up due to clocal being off and the external console not
  listening.  Handling this right seems to require handing off the
  messages to a single child process that can buffer the messages
  in userland and can block writing and draining them.  Blocked write()s
  and tcdrain()s are easy enough to handle in a specialized process by
  sending signals to abort them.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20031109192046.L2537>