From: Robert Watson
Date: Sun, 28 Nov 2004 14:24:02 +0000 (GMT)
To: Claudiu Dragalia-Paraipan
Cc: current@freebsd.org
Subject: Re: ssh & select() problem on 5.3

On Sun, 28 Nov 2004, Claudiu Dragalia-Paraipan wrote:

> Since I upgraded to FreeBSD 5.3 I have the following problem with the
> SSH client: I log in to several FreeBSD 5.2.1 machines, and when I run a
> command that produces a 'large' result (like dmesg, or cat on a file), the
> ssh client locks up. I ran ssh in gdb and found that it locks in select()
> in libc.so.5. I do it like this: run ssh in gdb, connect to the host, run
> dmesg. After this it locks, and I have to send a SIGKILL or SIGTERM
> before I can see this in gdb:
>
> Program received signal SIGTERM, Terminated.
> 0x282b5dd7 in select () from /lib/libc.so.5
> (gdb)
>
> The result of a bt is (if relevant):
> #0  0x282b5dd7 in select () from /lib/libc.so.5
>
> This happens with both SMP and UP kernels. Attached is dmesg for the UP
> kernel. Also, occasionally it hangs at shutdown or reboot, at random
> places (?), and it seems to happen after I have had a locked ssh client
> on the system. If you need more information about this, and you think
> these are related, let me know and I will run a kernel with debugging
> enabled to get more information.

Sounds like a bug, but the interesting question is really whether it's a
kernel bug or an SSH bug. I'm not up on SSH internals, but there are a few
other knobs you might try, and things to look at, that might help determine
whether it's a kernel bug or not:

(1) Try debug.mpsafenet=0 in loader.conf on the 5.3 box. If we're looking
    at a kernel race condition due to a locking bug, that might close the
    race. However, it might also just change the timing. That this happens
    on both SMP and UP suggests that it's not primarily a timing issue.

(2) select() is almost always used to wait for space in a buffer to write,
    or to wait for data in a buffer to read. Using a combination of
    netstat(1) and sockstat(1), it would be useful to know whether there is
    in fact data in either the send or receive buffer. Combined with
    inspecting the state of the select() arguments and the socket buffers
    in the kernel, this might reveal whether there was a missed wakeup.

It's worth noting that we believe we corrected a bug with exactly these
symptoms shortly before the 5.3 release.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Principal Research Scientist, McAfee Research
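Point (2) rests on the two conditions select() waits for: "readable" (data sitting in the receive buffer) and "writable" (space available in the send buffer). The following is a minimal C sketch, not taken from the thread, showing how a program can ask those questions of one socket without blocking, and how many bytes are queued in its receive buffer via the FIONREAD ioctl (roughly the figure netstat reports as Recv-Q). The helper names `fd_ready_state`, `recv_queue_bytes`, `SEL_READABLE` and `SEL_WRITABLE` are hypothetical. Note that this only samples the state at the moment of the call: a fresh select() will see queued data even if a select() that is already asleep missed its wakeup, which is why observing the hung process from outside with netstat(1) and sockstat(1) is the more useful diagnostic.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Bit flags returned by fd_ready_state() (hypothetical names). */
enum { SEL_READABLE = 1, SEL_WRITABLE = 2 };

/*
 * Ask select() whether fd is readable (data waiting in the receive buffer)
 * and/or writable (space in the send buffer) right now.  The zero timeout
 * turns select() into a non-blocking poll.  Returns -1 on error, otherwise
 * a mask of SEL_READABLE | SEL_WRITABLE.
 */
int fd_ready_state(int fd)
{
    fd_set rfds, wfds;
    struct timeval tv = { 0, 0 };   /* do not sleep */
    int state = 0;

    FD_ZERO(&rfds);
    FD_ZERO(&wfds);
    FD_SET(fd, &rfds);
    FD_SET(fd, &wfds);

    if (select(fd + 1, &rfds, &wfds, NULL, &tv) == -1)
        return -1;
    if (FD_ISSET(fd, &rfds))
        state |= SEL_READABLE;
    if (FD_ISSET(fd, &wfds))
        state |= SEL_WRITABLE;
    return state;
}

/*
 * Number of bytes currently queued in fd's receive buffer, as reported by
 * the FIONREAD ioctl, or -1 on error.
 */
int recv_queue_bytes(int fd)
{
    int n = 0;

    if (ioctl(fd, FIONREAD, &n) == -1)
        return -1;
    return n;
}
```

For example, after creating a connected pair with `socketpair(AF_UNIX, SOCK_STREAM, 0, sv)` and writing 5 bytes to `sv[0]`, `fd_ready_state(sv[1])` should report both readable and writable, and `recv_queue_bytes(sv[1])` should return 5.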