From owner-freebsd-questions@FreeBSD.ORG  Fri May 28 18:28:30 2010
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 552421065678
	for <freebsd-questions@freebsd.org>;
	Fri, 28 May 2010 18:28:30 +0000 (UTC) (envelope-from carlj@peak.org)
Received: from redcondor1.peak.org (redcondor1.peak.org [69.59.192.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 234728FC1C
	for <freebsd-questions@freebsd.org>;
	Fri, 28 May 2010 18:28:29 +0000 (UTC)
Received: from peak-mail-gateway.peak.org ([69.59.192.41])
	by redcondor1.peak.org ({e03e86cd-14ae-47ce-9578-3c080ce9c462})
	via TCP (outbound) with ESMTP id 20100528182817348
	for <freebsd-questions@freebsd.org>; Fri, 28 May 2010 18:28:17 +0000
X-RC-FROM: <carlj@peak.org>
X-RC-RCPT: <freebsd-questions@freebsd.org>
Received: from cjlinux.localnet (207.55.91.197.peak.org [207.55.91.197] (may
	be forged))
	(authenticated bits=0)	by peak-mail-gateway.peak.org (8.12.10/8.12.8)
	with ESMTP id o4SISGLn020323
	(version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=NO)	for
	<freebsd-questions@freebsd.org>; 
	Fri, 28 May 2010 11:28:16 -0700 (PDT)
Received: from carlj by cjlinux.localnet with local (Exim 4.69)	(envelope-from
	<carlj@peak.org>)	id
	1OI4I3-0003mn-Rg	for freebsd-questions@freebsd.org;
	Fri, 28 May 2010 11:28:15 -0700
To: freebsd-questions@freebsd.org
References: <877hmzbi1f.fsf@cjlinux.localnet> <878w7d7qbb.fsf@kobe.laptop>
	<87pr0przea.fsf@cjlinux.localnet> <8739xlnhzz.fsf@kobe.laptop>
From: Carl Johnson <carlj@peak.org>
Date: Fri, 28 May 2010 11:28:15 -0700
In-Reply-To: <8739xlnhzz.fsf@kobe.laptop> (Giorgos Keramidas's message of
	"Fri\, 21 May 2010 22\:59\:28 +0300")
Message-ID: <87mxvjridc.fsf@cjlinux.localnet>
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Subject: Re: Bash lockups
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 May 2010 18:28:30 -0000

Giorgos Keramidas <keramida@ceid.upatras.gr> writes:

> On Fri, 21 May 2010 09:30:05 -0700, Carl Johnson <carlj@peak.org> wrote:
>> Giorgos Keramidas <keramida@ceid.upatras.gr> writes:
>>> Does this lock-up happen if you leave the shell 'idle' for too long
>>> over an ssh session?  There may be problems with stateful connection
>>> tracking between your terminal and the remote shell :-/
>>
>> No, I don't think that could be the problem.  I am just using ssh
>> between local machines and there is no firewall between them.  It also
>> often seems to happen to a shell as I switch away from it to another
>> one.  One suspicion is that something is sending a signal to the shell
>> as it switches, and bash sometimes doesn't handle that signal
>> properly.
>>
>> I also should have mentioned that I have been running bash as my
>> default shell for years under Linux and have never seen this problem
>> there.
>>
>> Thanks for the suggestion.
>
> That's ok.  If you can attach to the bash process with ktrace please try
> to grab a ktrace file from a deadlocked shell.  We may be able to see
> why it gets deadlocked by running kdump(8) on the shell trace file.
>
> You can run a second shell under ktrace (and hope that the parent
> doesn't deadlock before the traced child shell), by running:
>
>     bash$ ktrace -f bash.trace bash --login
>
> When you exit from the child shell you can dump ktrace(8) events from
> the bash.trace file with:
>
>     bash$ kdump -f bash.trace > logfile 2>&1
>
> Looking near the last records dumped in 'logfile' should be quite
> informative if the process is dead-locked or spinning around the same
> code over and over again.

I finally got one after starting ktrace a few days ago.  It is
informative, but it raises as many questions as it answers.  It
basically just wrote out the prompt, *started* to setup for reading
the input and just stopped.  I ran gdb on it and it is stuck looping
somewhere in getenv.  I don't have the system compiled with debugging,
so I have limited information on what it is doing there.  I checked
multiple times, and I also saw getenv running routines such as memset,
strlen, mbrtowc, and wcsnrtombs.

The following is the tail end of the 'kdump -Ef' output:
  67263 bash     61412.013860 GIO   fd 2 wrote 28 bytes
       0x0000 0d0f 1b5b 316d 5b63 6172 6c6a 4063 6a62 7364 3874 207e 5d24 1b5b  |...[1m[carlj@cjbsd8t ~]$.[|
       0x001a 6d20                                                              |m |

  67263 bash     61412.013867 RET   write 28/0x1c
  67263 bash     61412.013874 CALL  sigprocmask(SIG_SETMASK,0x80e133c,0)
  67263 bash     61412.013880 RET   sigprocmask 0

and the following is the similar section of a normal prompt:
  67263 bash     61403.461469 GIO   fd 2 wrote 27 bytes
       0x0000 0f1b 5b31 6d5b 6361 726c 6a40 636a 6273 6438 7420 7e5d 241b 5b6d  |..[1m[carlj@cjbsd8t ~]$.[m|
       0x001a 20                                                                | |
  67263 bash     61403.461476 RET   write 27/0x1b
  67263 bash     61403.461483 CALL  sigprocmask(SIG_SETMASK,0x80e133c,0)
  67263 bash     61403.461489 RET   sigprocmask 0
  67263 bash     61403.461497 CALL  sigprocmask(SIG_BLOCK,0,0x80e1e3c)
  67263 bash     61403.461504 RET   sigprocmask 0
  67263 bash     61403.461513 CALL  read(0,0xbfbfd95f,0x1)

I just realized there is an extra CR at the beginning of that prompt
(28 bytes instead of 27) that I don't see elsewhere, but nothing else
before that looks different.  This one is an i368 8.0 release, but I
also have another hung shell in a amd64 7.3 release system in
VirtualBox.  I just checked my other ktrace logs and I found one
other place where that extra CR occurs, but there is no lockup there
and that was my other system.

The following is a section of a backtrace from gdb: 
  #0  0x28308540 in mbrtowc () from /lib/libc.so.7
  #1  0x080c7ce6 in getenv ()
  #2  0x080c1335 in getenv ()
  #3  0x080ae1d4 in getenv ()
  #4  0x080ac4b0 in getenv ()
  #5  0x080ac815 in getenv ()
  #6  0x080c3955 in getenv ()
  #7  0x080c3ac9 in getenv ()
  #8  0x080ac4b0 in getenv ()
  #9  0x080ac815 in getenv ()
  #10 0x080acb6c in getenv ()
  #11 0x080acf55 in getenv ()
  #12 0x08054611 in ?? ()
  #13 0x284a9a80 in ?? ()
  ...
  #67 0x2832cbfd in time () from /lib/libc.so.7

The first few entries change when I let it run for a while, but the
last 8-9 getenv addresses and everything before them remain the same.
There are a total of about 65 backtrace entries this time, some of
which are 0x00000000 addresses which seem suspicious.  The backtrace
from the other hung shell is also in getenv, but I didn't have ktrace
running on that one.

I am at the limit of my experience, so does anybody else have any
ideas about what could cause this, or how I could trace it further?  I
am keeping the processes attached to gdb, so I can do further checking
on them if anyone has any other ideas.  Thanks in advance for any
help, and thanks for the help that allowed me to get this far.

-- 
Carl Johnson		carlj@peak.org