From owner-freebsd-questions@FreeBSD.ORG Fri May 28 18:28:30 2010 Return-Path: Delivered-To: freebsd-questions@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 552421065678 for ; Fri, 28 May 2010 18:28:30 +0000 (UTC) (envelope-from carlj@peak.org) Received: from redcondor1.peak.org (redcondor1.peak.org [69.59.192.54]) by mx1.freebsd.org (Postfix) with ESMTP id 234728FC1C for ; Fri, 28 May 2010 18:28:29 +0000 (UTC) Received: from peak-mail-gateway.peak.org ([69.59.192.41]) by redcondor1.peak.org ({e03e86cd-14ae-47ce-9578-3c080ce9c462}) via TCP (outbound) with ESMTP id 20100528182817348 for ; Fri, 28 May 2010 18:28:17 +0000 X-RC-FROM: X-RC-RCPT: Received: from cjlinux.localnet (207.55.91.197.peak.org [207.55.91.197] (may be forged)) (authenticated bits=0) by peak-mail-gateway.peak.org (8.12.10/8.12.8) with ESMTP id o4SISGLn020323 (version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=NO) for ; Fri, 28 May 2010 11:28:16 -0700 (PDT) Received: from carlj by cjlinux.localnet with local (Exim 4.69) (envelope-from ) id 1OI4I3-0003mn-Rg for freebsd-questions@freebsd.org; Fri, 28 May 2010 11:28:15 -0700 To: freebsd-questions@freebsd.org References: <877hmzbi1f.fsf@cjlinux.localnet> <878w7d7qbb.fsf@kobe.laptop> <87pr0przea.fsf@cjlinux.localnet> <8739xlnhzz.fsf@kobe.laptop> From: Carl Johnson Date: Fri, 28 May 2010 11:28:15 -0700 In-Reply-To: <8739xlnhzz.fsf@kobe.laptop> (Giorgos Keramidas's message of "Fri\, 21 May 2010 22\:59\:28 +0300") Message-ID: <87mxvjridc.fsf@cjlinux.localnet> User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.2 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: Bash lockups X-BeenThere: freebsd-questions@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: User questions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 May 2010 18:28:30 -0000 Giorgos Keramidas writes: > On Fri, 21 May 2010 09:30:05 -0700, Carl Johnson wrote: >> Giorgos Keramidas writes: >>> Does this lock-up happen if you leave the shell 'idle' for too long >>> over an ssh session? There may be problems with stateful connection >>> tracking between your terminal and the remote shell :-/ >> >> No, I don't think that could be the problem. I am just using ssh >> between local machines and there is no firewall between them. It also >> often seems to happen to a shell as I switch away from it to another >> one. One suspicion is that something is sending a signal to the shell >> as it switches, and bash sometimes doesn't handle that signal >> properly. >> >> I also should have mentioned that I have been running bash as my >> default shell for years under Linux and have never seen this problem >> there. >> >> Thanks for the suggestion. > > That's ok. If you can attach to the bash process with ktrace please try > to grab a ktrace file from a deadlocked shell. We may be able to see > why it gets deadlocked by running kdump(8) on the shell trace file. > > You can run a second shell under ktrace (and hope that the parent > doesn't deadlock before the traced child shell), by running: > > bash$ ktrace -f bash.trace bash --login > > When you exit from the child shell you can dump ktrace(8) events from > the bash.trace file with: > > bash$ kdump -f bash.trace > logfile 2>&1 > > Looking near the last records dumped in 'logfile' should be quite > informative if the process is dead-locked or spinning around the same > code over and over again. I finally got one after starting ktrace a few days ago. It is informative, but it raises as many questions as it answers. It basically just wrote out the prompt, *started* to setup for reading the input and just stopped. I ran gdb on it and it is stuck looping somewhere in getenv. I don't have the system compiled with debugging, so I have limited information on what it is doing there. I checked multiple times, and I also saw getenv running routines such as memset, strlen, mbrtowc, and wcsnrtombs. The following is the tail end of the 'kdump -Ef' output: 67263 bash 61412.013860 GIO fd 2 wrote 28 bytes 0x0000 0d0f 1b5b 316d 5b63 6172 6c6a 4063 6a62 7364 3874 207e 5d24 1b5b |...[1m[carlj@cjbsd8t ~]$.[| 0x001a 6d20 |m | 67263 bash 61412.013867 RET write 28/0x1c 67263 bash 61412.013874 CALL sigprocmask(SIG_SETMASK,0x80e133c,0) 67263 bash 61412.013880 RET sigprocmask 0 and the following is the similar section of a normal prompt: 67263 bash 61403.461469 GIO fd 2 wrote 27 bytes 0x0000 0f1b 5b31 6d5b 6361 726c 6a40 636a 6273 6438 7420 7e5d 241b 5b6d |..[1m[carlj@cjbsd8t ~]$.[m| 0x001a 20 | | 67263 bash 61403.461476 RET write 27/0x1b 67263 bash 61403.461483 CALL sigprocmask(SIG_SETMASK,0x80e133c,0) 67263 bash 61403.461489 RET sigprocmask 0 67263 bash 61403.461497 CALL sigprocmask(SIG_BLOCK,0,0x80e1e3c) 67263 bash 61403.461504 RET sigprocmask 0 67263 bash 61403.461513 CALL read(0,0xbfbfd95f,0x1) I just realized there is an extra CR at the beginning of that prompt (28 bytes instead of 27) that I don't see elsewhere, but nothing else before that looks different. This one is an i368 8.0 release, but I also have another hung shell in a amd64 7.3 release system in VirtualBox. I just checked my other ktrace logs and I found one other place where that extra CR occurs, but there is no lockup there and that was my other system. The following is a section of a backtrace from gdb: #0 0x28308540 in mbrtowc () from /lib/libc.so.7 #1 0x080c7ce6 in getenv () #2 0x080c1335 in getenv () #3 0x080ae1d4 in getenv () #4 0x080ac4b0 in getenv () #5 0x080ac815 in getenv () #6 0x080c3955 in getenv () #7 0x080c3ac9 in getenv () #8 0x080ac4b0 in getenv () #9 0x080ac815 in getenv () #10 0x080acb6c in getenv () #11 0x080acf55 in getenv () #12 0x08054611 in ?? () #13 0x284a9a80 in ?? () ... #67 0x2832cbfd in time () from /lib/libc.so.7 The first few entries change when I let it run for a while, but the last 8-9 getenv addresses and everything before them remain the same. There are a total of about 65 backtrace entries this time, some of which are 0x00000000 addresses which seem suspicious. The backtrace from the other hung shell is also in getenv, but I didn't have ktrace running on that one. I am at the limit of my experience, so does anybody else have any ideas about what could cause this, or how I could trace it further? I am keeping the processes attached to gdb, so I can do further checking on them if anyone has any other ideas. Thanks in advance for any help, and thanks for the help that allowed me to get this far. -- Carl Johnson carlj@peak.org