Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 17 Apr 2012 18:25:03 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        freebsd-stable@freebsd.org
Subject:   Re: top not restoring terminal echo/icanon correctly
Message-ID:  <20120418012503.GA25161@icarus.home.lan>
In-Reply-To: <20120417200252.GA60718@icarus.home.lan>
References:  <20120417182242.GA58449@icarus.home.lan> <20120417195115.GA60447@icarus.home.lan> <20120417200252.GA60718@icarus.home.lan>

next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, Apr 17, 2012 at 01:02:52PM -0700, Jeremy Chadwick wrote:
> On Tue, Apr 17, 2012 at 12:51:15PM -0700, Jeremy Chadwick wrote:
> > On Tue, Apr 17, 2012 at 11:22:42AM -0700, Jeremy Chadwick wrote:
> > > (Please keep me CC'd as I'm not subscribed to the list)
> > > 
> > > I'd like to request that folks running RELENG_8 (and RELENG_9, though I
> > > do not use it) please check the behaviour of their terminal after each
> > > of following commands are run (check terminal after each command):
> > > 
> > > top -a  (press "q" after 1 screen refresh)
> > > top -b
> > > 
> > > If you find that your input characters in your shell aren't being echo'd
> > > back after one of the above commands, blindly type "stty icanon echo"
> > > and hit <Enter> and things should be back to normal.
> > > 
> > > What I'm looking for is confirmation from others of the problem.
> > > 
> > > Also very important: please provide uname -a output, specifically world
> > > rebuild date.  It greatly matters, because a commit was recently done
> > > where now -b functions fine (was previously busted in this way), but now
> > > -a behaves like -b did.  So src/world date matters.
> > > 
> > > All of this is documented in PR 161739.  I urge anyone experiencing this
> > > problem to read that PR in full, as I spent many hours today writing a
> > > debug routine to confirm that top is sometimes not calling tcsetattr()
> > > with the original terminal parameters when it exits, and what the
> > > condition seems to be.
> > > 
> > > http://www.freebsd.org/cgi/query-pr.cgi?pr=bin/161739
> > > 
> > > Finally, if anyone want to tackle the problem (work out the logic bug
> > > that is in there which causes it), please be my guest.  I have other
> > > things going on right now (doctors appointments) so I don't have as much
> > > time as I'd like.
> > > 
> > > Thanks everyone.
> > 
> > Thanks to all those who have responded, including kib@.
> > 
> > The problem is very odd and appears specific to the bash shell, but
> > only "somewhat".  Below is a part of what I sent kib@ on the matter.
> > A chart showing what I've found:
> > 
> > Location      Username   TERM     Shell                 bug?
> > =============================================================
> > VGA console   root       cons25   /bin/csh              no
> > VGA console   jdc        cons25   /usr/local/bin/bash   no
> > SSH (PuTTY)   root       xterm    /bin/csh              no
> > SSH (PuTTY)   jdc        xterm    /bin/csh              no
> > SSH (PuTTY)   jdc        xterm    /usr/local/bin/bash   yes
> > =============================================================
> > 
> > In the last case (and only that case): if I move my dotfiles (.bashrc
> > and .bash_profile) aside and log in (SSH), top behaves normally.
> > Naturally this made me think "something is wonky with my dotfiles!"...
> > 
> > But the problem *doesn't* happen if my dotfiles are intact and I
> > log in via VGA console (cons25) as myself with a bash shell.  So it's
> > almost like there's some bizarre combination of things that causes this
> > problem.
> > 
> > I'll continue to try and narrow it down.
> 
> Boy this is a weird one.
> 
> In my .bashrc I've been using the following statement to show all
> arguments and set the update interval to 1 second:
> 
> export TOP="-a -s 1"
> 
> Removing this completely and logging back in results in no problems
> when running "top" in any way (top -a, top -s 1, or top -a -s 1).
> 
> However, the problem I describe with icanon/echo never getting restored
> rears its ugly head when running "top -a -b", but only under bash
> (with no dotfiles too).  It never happens under csh.
> 
> I can also reproduce this on VGA console when logged in (regardless of
> user), e.g. log in as root (get csh shell), run bash, "top -a -b", bam,
> issue happens.
> 
> I don't know what's different about the two shells at this level that
> would cause oddities like this, but um, yeah.....  *blink*

After further conversations with kib@, and my own prodding at the
code at his request, I did what's usually required these days: down the
rabbit hole.

First, it actually has nothing to do with the shell.  It's interesting
that I can reproduce it but only with bash and when I have specific
environment settings or flags used when passed to top, and not with
csh/zsh.  Maybe it's luck of the draw.  Maybe it has to do with my
terminal size (132x43).  Maybe it's something else.  Not sure.

Anyway, kib@ pointed me to the fact that top(1) corrupts memory, meaning
it overwrites/destroys its own variables in some cases.

When top is run, it does a series of variables to determine the environment
its being run from.  It makes use of termios(4), and uses tcgetattr(3)
to save terminal device settings (stuff you'd see via stty -a) before
modifying them -- specifically disabling the ICANON and ECHO
capabilities.

Normally top restores these on exit.  However, as my PR and Email
suggests, there are times where it doesn't.  How top decides whether
or not to do this is via a simple if statement based on a global
variable called is_a_terminal.  If this is true/1, then it issues
tcsetattr(3) with the original terminal device settings it fetched
from when top was run initially.  If false, it doesn't.

kib@ recommended that I set a hardware watchpoint in gdb for
is_a_terminal and see if it was changing while top ran.  Sure enough
that's the case.  (I had to use a software watchpoint since a hardware
one wasn't permitted by gdb, but that's not important; software ones
just result in the program running significantly slower).

Here's the results, which confirms his claims.

(gdb) watch is_a_terminal
Watchpoint 1: is_a_terminal
(gdb) x/x &is_a_terminal
0x511de0 <is_a_terminal>:       0x00000000

is_a_terminal defaults to 0.  It's also very important to note that the
variable is at address 0x511de0.

(gdb) p &fmt
$1 = (char (*)[128]) 0x511d60
(gdb) p fmt
$2 = '\0' <repeats 127 times>

The above will make more sense in a moment...

(gdb) run -a -b
Starting program: /home/jdc/usr.bin/top/top -a -b
Watchpoint 1: is_a_terminal
Watchpoint 1: is_a_terminal
Watchpoint 1: is_a_terminal
Watchpoint 1: is_a_terminal
Watchpoint 1: is_a_terminal
Watchpoint 1: is_a_terminal

Old value = 0 '\0'
New value = 1 '\001'
init_screen () at /home/jdc/usr.bin/top/../../contrib/top/screen.c:317
317             putcap(terminal_init);

Here init_screen() has set is_a_terminal to 1.  Good.  Now we
continue...

(gdb) c
Continuing.
Watchpoint 1: is_a_terminal

Old value = 1 '\001'
New value = 0 '\0'
0x00000000a0aa1448 in vsprintf () from /lib/libc.so.7
(gdb) x/x &is_a_terminal
0x511de0 <is_a_terminal>:       0x00000000

Eh?  Something just changed its value to zero?  vsprintf?  Stack trace
time.

(gdb) bt
#0  0x00000000a0aa1448 in vsprintf () from /lib/libc.so.7
#1  0x00000000a0aa04c8 in sprintf () from /lib/libc.so.7
#2  0x0000000000407170 in format_next_process (handle=0x511d50 "\230\003са", get_userid=0x40c100 <username>, flags=1)
    at machine.c:963
#3  0x000000000040af1c in main (argc=3, argv=0x7fffffffea18) at /home/jdc/usr.bin/top/../../contrib/top/top.c:678

Frame 2 looks suspicious...

(gdb) f 2
#2  0x0000000000407170 in format_next_process (handle=0x511d50 "\230\003са", get_userid=0x40c100 <username>, flags=1)
    at machine.c:963
963             sprintf(fmt, proc_fmt,

Uh oh, sprintf(), not even snprintf().  Living dangerously...  Let's see
how dangerous.

(gdb) p fmt
$3 = " 6482 halbot      1  44    0 90940K 75912K select  0   0:22  0.00% /usr/local/bin/perl /home/halbot/hal/halbot.pl /home/halbot/h"
(gdb) x/s fmt
0x511d60 <fmt>:  " 6482 halbot      1  44    0 90940K 75912K select  0   0:22  0.00% /usr/local/bin/perl /home/halbot/hal/halbot.pl /home/halbot/h"
(gdb) p strlen(fmt)
$4 = 128

(gdb) x/s proc_fmt
0x40d828 <_fini+2980>:   "%5d%s %-*.*s %s%3d %4s%7s %6s %-6.6s %2d%7s %5.2f%% %.*s"

The fmt variable is actually a "static char fmt[128]".  So what
comes after that?

(gdb) x/x 0x511d60+128
0x511de0 <is_a_terminal>:       0x0000000000000000

Ah ha... yep: 0x511de0 - 0x511d60 = 128.

Thus as shown, the ECHO and ICANON settings never get restored to the
terminal device because the sprintf() resulted in writing past the
end of the buffer.

We can easily solve this by making use of snprintf() instead, with
a sizeof(fmt), limiting the total string length printed to 128.

And I have confirmed that indeed solves the problem.

There are lots and lots of warnings in this code too, due to lack of
proper #include statements and some other minor annoyances.  I could
clean these up as well.

Thoughts/comments?  Should I close my existing PR and open a new one?

-- 
| Jeremy Chadwick                                 jdc@parodius.com |
| Parodius Networking                     http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, US |
| Making life hard for others since 1977.             PGP 4BD6C0CB |



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120418012503.GA25161>