Date: Sun, 9 Nov 2003 20:43:33 +1100 (EST) From: Bruce Evans <bde@zeta.org.au> To: Don Lewis <truckman@FreeBSD.org> Cc: current@FreeBSD.org Subject: Re: serial console oddity Message-ID: <20031109192046.L2537@gamplex.bde.org> In-Reply-To: <200311090735.hA97ZLeF063127@gw.catspoiler.org> References: <200311090735.hA97ZLeF063127@gw.catspoiler.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 8 Nov 2003, Don Lewis wrote: > I've been seeing some wierd things for many months when using a serial > console on my -CURRENT box. I finally had a chance to take a closer > look today. > > It looks like the problem is some sort of interference between kernel > output to the console and userland writes to /dev/console. I typically > see syslogd output to the console get corrupted. Each message that > syslogd writes seems to get truncated or otherwise corrupted. The most > common thing I see is that each syslog message is reduced to a space and > the first character of the month, or sometimes just a space, or > sometimes nothing at all. This is (at least primarily) a longstanding bug in ttymsg(). It uses nonblocking mode so that it doesn't block in write() or close(). For the same reason, it doesn't wait for output to drain before close(). If the close happens to be the last one on the device, this causes any data buffered in the tty and lower software layers to be discarded cleanly and any data in lower hardware layers to by discarded in a driver plus hardware-dependent way (usually not so cleanly, especially for the character being transmitted). > This is totally consistent until I "kill > -HUP" syslogd, which I believe causes syslogd to close and open > /dev/console, after which the syslog output appears correct on the > console. When the syslogd output is being corrupted, I can cat a file to > /dev/console and the output appears to be correct. When I debugged this, syslogd didn't seem to keep the console open, so the open()/close() in ttymsg() always caused the problem. I didn't notice killing syslogd makes a difference. Perhaps it helps due to a missing close. Holding the console open may be a workaround or even the correct fix. It's not clear where this should be done (should all clients of ttymsg() do it?). Running getty on the console or on the underlying tty device should do it accidentally. > I truss'ed syslogd, and it appears to be working normally, the writev() > call that writes the data to the console appears to be writing the > correct character count, so it would appear that the fault is in the > kernel. If there are any kernel bugs in this area, then they would be that last close of the console affects the underlying tty. The multiple console changes are quite likely to have broken this if getty is run on the underlying tty (they silently discarded the half-close of the underlying tty which was needed to avoided trashing some of its state when only the console is closed). > The problem doesn't appear to be specific to syslogd, because I have > seen the output from the shutdown scripts that goes to the console get > truncated as well. Yes, in theory it should affect anything that uses ttymsg() or does direct non-blocking writes without waiting for the output to drain. > I have my serial console running at the default 9600 bps. I always use 115200 bps and the symptoms are similar right down to normally getting only the first character of the month name followed by 0-1 bytes of garbage. The first character of the month name is just the first character of the message. Apparently my systems are fast enough for close() to be called before transmission of the second character has completed (2 * 87+ usec at 115200 bps). Here are some half-baked fixes. The part that clears O_NONBLOCK is wrong, and the usleep() part is obviously a hack. ttymsg() shouldn't block even in close(), since if the close is in the parent ttymsg() might block forever and if the close() is in a forked child then blocking could create zillions of blocked children. Another part of the patch is concerned with limiting forked children. If I were happy with that part then blocking would not be so bad. In practice, I don't have enough system activity for blocked children to be a problem. To see the problem with blocked children, do something like the following: - turn off clocal on the console so that the console can block better. For sio consoles this often requires turning it off in the lock-state device, since the driver defends against this foot shooting by locking it on. - hold the console open or otherwise avoid the original bug in this thread, else messages will just be discarded in close() faster than they can pile up. - turn off your external console device or otherwise drop carrier. - send lots of messages. %%% Index: ttymsg.c =================================================================== RCS file: /home/ncvs/src/usr.bin/wall/ttymsg.c,v retrieving revision 1.11 diff -u -2 -r1.11 ttymsg.c --- ttymsg.c 11 Oct 2002 14:58:34 -0000 1.11 +++ ttymsg.c 11 Oct 2002 18:13:51 -0000 @@ -32,14 +32,16 @@ */ -#include <sys/cdefs.h> - -__FBSDID("$FreeBSD: src/usr.bin/wall/ttymsg.c,v 1.11 2002/10/11 14:58:34 mike Exp $"); - +#if 0 #ifndef lint -static const char sccsid[] = "@(#)ttymsg.c 8.2 (Berkeley) 11/16/93"; +static char sccsid[] = "@(#)ttymsg.c 8.2 (Berkeley) 11/16/93"; +#endif /* not lint */ #endif +#include <sys/cdefs.h> +__FBSDID("$FreeBSD: src/usr.bin/wall/ttymsg.c,v 1.11 2002/10/11 14:58:34 mike Exp $"); + #include <sys/types.h> #include <sys/uio.h> +#include <sys/wait.h> #include <dirent.h> #include <errno.h> @@ -66,8 +68,24 @@ struct iovec localiov[7]; ssize_t left, wret; - int cnt, fd; + static pid_t cpid; + pid_t pid; + int cnt, fd, forked, status; static char device[MAXNAMLEN] = _PATH_DEV; static char errbuf[1024]; - int forked; + + /* + * Don't do anything if a child is still alive. Interleaved + * output would not be good, and filling the process table with + * as many children as we can fork before the timeout kills them + * would be worse. Waiting here also reduces the need to handle + * SIGCHLD in callers. + */ + if (cpid) { + pid = waitpid(cpid, &status, WNOHANG); + if (pid == cpid || (pid == -1 || errno == ECHILD)) + cpid = 0; + else + return (NULL); + } forked = 0; @@ -121,6 +139,4 @@ } if (errno == EWOULDBLOCK) { - int cpid; - if (forked) { (void) close(fd); @@ -161,4 +177,8 @@ } + (void) fcntl(fd, F_SETFL, 0); /* Clear O_NONBLOCK. */ +#if 1 + usleep(100000); +#endif (void) close(fd); if (forked) %%% For a non-half-baked fix, do somethng like: - never block in ttymsg(), but always wait for output to drain using tcdrain() in a single child process. It's probably acceptable for this to not report errors to ttymsg()'s caller. - limit children better. I think we now fork children iff writev() returns EWOULDBLOCK and this happens mainly when the tty buffers fill up due to clocal being off and the external console not listening. Handling this right seems to require handing off the messages to a single child process that can buffer the messages in userland and can block writing and draining them. Blocked write()s and tcdrain()s are easy enough to handle in a specialized process by sending signals to abort them. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20031109192046.L2537>