From owner-freebsd-bugs@FreeBSD.ORG Thu Feb 24 18:11:39 2005 Return-Path: Delivered-To: freebsd-bugs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id E89F416A4CE for ; Thu, 24 Feb 2005 18:11:39 +0000 (GMT) Received: from smtp.dental-on-line.fr (ras75-5-82-234-245-176.fbx.proxad.net [82.234.245.176]) by mx1.FreeBSD.org (Postfix) with ESMTP id A807843D53 for ; Thu, 24 Feb 2005 18:11:38 +0000 (GMT) (envelope-from celkaim@dental-on-line.fr) Received: from [192.168.1.6] (mercure.dental-on-line.net [192.168.1.6]) by smtp.dental-on-line.fr (Postfix) with ESMTP id D9C1F6021B for ; Thu, 24 Feb 2005 19:11:34 +0100 (CET) Message-ID: <421E185A.8040902@dental-on-line.fr> Date: Thu, 24 Feb 2005 19:09:30 +0100 From: Cyril Elkaim User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041217 X-Accept-Language: en-us, en MIME-Version: 1.0 To: freebsd-bugs@freebsd.org References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: ssh sbwait indfinite timeout on SMP machines X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Feb 2005 18:11:40 -0000 Hi Robert, Thanks for your answer. Unhopefully I can't restart the machines or touch the kernel, these are production machines and under heavy use. And I don't have another SMP machine available right now. The only thing I can assert for the moment is that if you abort an authentication sequence it leaves ssh in CLOSE_WAIT state indefinitely and it always reproductible. The sshd_config file is the default one and the machines have no special configuration in the sysctl.conf for example. Doing the same thing on an uniprocessor do not trigger the problem. Here a top before an ssh connection: last pid: 80505; load averages: 0.06, 0.02, 0.00 up 49+04:25:13 19:00:58 57 processes: 1 running, 56 sleeping CPU states: % user, % nice, % system, % interrupt, % idle Mem: 69M Active, 1606M Inact, 239M Wired, 84M Cache, 112M Buf, 4816K Free Swap: 4071M Total, 340K Used, 4070M Free PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND 76291 admin 96 0 2684K 1908K select 1 1:22 0.00% 0.00% rsync 69111 root 8 0 7796K 4560K nanslp 2 0:47 0.00% 0.00% httpd 971 root 96 0 3460K 2484K select 0 0:44 0.00% 0.00% sendmail 303 root 96 0 1316K 788K select 1 0:38 0.00% 0.00% syslogd 446 root 8 0 1356K 900K nanslp 1 0:32 0.00% 0.00% cron 473 pgsql 96 0 15728K 2084K select 1 0:17 0.00% 0.00% postgres 80080 admin 96 0 2164K 1464K select 1 0:07 0.00% 0.00% rsync 29845 ldap 20 0 40008K 16824K kserel 3 0:03 0.00% 0.00% slapd 39987 bind 20 0 5692K 3980K kserel 3 0:02 0.00% 0.00% named 38045 smmsp 20 0 3324K 2032K pause 0 0:02 0.00% 0.00% sendmail 975 smmsp 20 0 3340K 2292K pause 3 0:01 0.00% 0.00% sendmail 80072 admin 96 0 6092K 2688K select 0 0:01 0.00% 0.00% sshd 69112 www 96 0 7736K 4480K select 3 0:00 0.00% 0.00% httpd 76290 admin 96 0 2408K 1640K select 1 0:00 0.00% 0.00% rsync 77762 root 96 0 3360K 2340K select 1 0:00 0.00% 0.00% sshd 72572 admin 8 0 7844K 4900K wait 1 0:00 0.00% 0.00% python 77672 root 96 0 3360K 2340K select 1 0:00 0.00% 0.00% sshd 80077 admin 8 0 7844K 5008K wait 2 0:00 0.00% 0.00% python 80459 admin 96 0 6088K 2692K select 1 0:00 0.00% 0.00% sshd 80079 admin 96 0 1808K 1124K select 3 0:00 0.00% 0.00% rsync 58472 root 96 0 1424K 876K select 0 0:00 0.00% 0.00% rsync 80073 admin 8 0 2256K 1776K wait 2 0:00 0.00% 0.00% bash 80457 root 4 0 6080K 2624K sbwait 2 0:00 0.00% 0.00% sshd 80069 root 4 0 6084K 2628K sbwait 3 0:00 0.00% 0.00% sshd 72584 pgsql 4 0 16376K 5416K sbwait 0 0:00 0.00% 0.00% postgres 80460 admin 8 0 2252K 1772K wait 2 0:00 0.00% 0.00% bash 76289 admin -8 0 3356K 2408K piperd 1 0:00 0.00% 0.00% sendmail 72567 admin 8 0 1636K 1008K wait 0 0:00 0.00% 0.00% sh 558 root 5 0 1280K 792K ttyin 1 0:00 0.00% 0.00% getty 556 root 5 0 1280K 792K ttyin 0 0:00 0.00% 0.00% getty 559 root 5 0 1280K 792K ttyin 2 0:00 0.00% 0.00% getty 557 root 5 0 1280K 792K ttyin 2 0:00 0.00% 0.00% getty 561 root 5 0 1280K 792K ttyin 2 0:00 0.00% 0.00% getty 563 root 5 0 1280K 792K ttyin 1 0:00 0.00% 0.00% getty 560 root 5 0 1280K 792K ttyin 3 0:00 0.00% 0.00% getty 562 root 5 0 1280K 792K ttyin 0 0:00 0.00% 0.00% getty 80078 pgsql 4 0 16044K 3436K sbwait 1 0:00 0.00% 0.00% postgres 80399 www 4 0 7796K 4708K accept 0 0:00 0.00% 0.00% httpd 80364 www 4 0 7796K 4708K accept 2 0:00 0.00% 0.00% httpd 80425 www 4 0 7796K 4708K accept 2 0:00 0.00% 0.00% httpd 80388 www 4 0 7796K 4708K accept 1 0:00 0.00% 0.00% httpd 80353 www 4 0 7796K 4708K accept 3 0:00 0.00% 0.00% httpd 72562 root -8 0 1400K 1080K piperd 1 0:00 0.00% 0.00% cron 474 pgsql 110 0 7476K 1924K select 3 0:00 0.00% 0.00% postgres 475 pgsql 110 0 6488K 1928K select 2 0:00 0.00% 0.00% postgres 80505 admin 96 0 2288K 1404K CPU0 0 0:00 0.00% 0.00% top 279 root 110 0 516K 224K select 2 0:00 0.00% 0.00% devd --------------------------------------------------------------------------------------------------------------------- _after_ the aborted ssh connection, look a at the end of the top the aborted ssh is the one with the id 80508 in sbwait state. We had such processes waiting since two months before killing them manually. last pid: 80509; load averages: 0.00, 0.00, 0.00 up 49+04:28:53 19:04:38 58 processes: 1 running, 57 sleeping CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle Mem: 70M Active, 1612M Inact, 239M Wired, 78M Cache, 112M Buf, 3024K Free Swap: 4071M Total, 340K Used, 4070M Free PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND 76291 admin 96 0 2684K 1908K select 1 1:22 0.00% 0.00% rsync 69111 root 8 0 7796K 4560K nanslp 0 0:47 0.00% 0.00% httpd 971 root 96 0 3460K 2484K select 2 0:44 0.00% 0.00% sendmail 303 root 96 0 1316K 788K select 2 0:38 0.00% 0.00% syslogd 446 root 8 0 1356K 900K nanslp 3 0:32 0.00% 0.00% cron 473 pgsql 96 0 15728K 2084K select 3 0:17 0.00% 0.00% postgres 80080 admin 96 0 2164K 1464K select 1 0:07 0.00% 0.00% rsync 29845 ldap 20 0 40008K 16824K kserel 0 0:03 0.00% 0.00% slapd 39987 bind 20 0 5692K 3980K kserel 1 0:02 0.00% 0.00% named 38045 smmsp 20 0 3324K 2032K pause 0 0:02 0.00% 0.00% sendmail 975 smmsp 20 0 3340K 2292K pause 3 0:01 0.00% 0.00% sendmail 80072 admin 96 0 6092K 2688K select 0 0:01 0.00% 0.00% sshd 69112 www 96 0 7736K 4480K select 3 0:00 0.00% 0.00% httpd 76290 admin 96 0 2408K 1640K select 0 0:00 0.00% 0.00% rsync 77762 root 96 0 3360K 2340K select 1 0:00 0.00% 0.00% sshd 72572 admin 8 0 7844K 4900K wait 1 0:00 0.00% 0.00% python 77672 root 96 0 3360K 2340K select 1 0:00 0.00% 0.00% sshd 80077 admin 8 0 7844K 5008K wait 2 0:00 0.00% 0.00% python 80459 admin 96 0 6088K 2692K select 3 0:00 0.00% 0.00% sshd 80079 admin 96 0 1808K 1124K select 0 0:00 0.00% 0.00% rsync 80505 admin 96 0 2440K 1584K CPU0 1 0:00 0.00% 0.00% top 58472 root 96 0 1424K 876K select 0 0:00 0.00% 0.00% rsync 80073 admin 8 0 2256K 1776K wait 2 0:00 0.00% 0.00% bash 80457 root 4 0 6080K 2624K sbwait 2 0:00 0.00% 0.00% sshd 80069 root 4 0 6084K 2628K sbwait 3 0:00 0.00% 0.00% sshd 72584 pgsql 4 0 16376K 5416K sbwait 0 0:00 0.00% 0.00% postgres 80460 admin 8 0 2252K 1772K wait 2 0:00 0.00% 0.00% bash 76289 admin -8 0 3356K 2408K piperd 1 0:00 0.00% 0.00% sendmail 72567 admin 8 0 1636K 1008K wait 0 0:00 0.00% 0.00% sh 558 root 5 0 1280K 792K ttyin 1 0:00 0.00% 0.00% getty 556 root 5 0 1280K 792K ttyin 0 0:00 0.00% 0.00% getty 559 root 5 0 1280K 792K ttyin 2 0:00 0.00% 0.00% getty 557 root 5 0 1280K 792K ttyin 2 0:00 0.00% 0.00% getty 561 root 5 0 1280K 792K ttyin 2 0:00 0.00% 0.00% getty 563 root 5 0 1280K 792K ttyin 1 0:00 0.00% 0.00% getty 560 root 5 0 1280K 792K ttyin 3 0:00 0.00% 0.00% getty 562 root 5 0 1280K 792K ttyin 0 0:00 0.00% 0.00% getty 80078 pgsql 4 0 16044K 3436K sbwait 1 0:00 0.00% 0.00% postgres 80399 www 4 0 7796K 4708K accept 0 0:00 0.00% 0.00% httpd 80364 www 4 0 7796K 4708K accept 2 0:00 0.00% 0.00% httpd 80425 www 4 0 7796K 4708K accept 2 0:00 0.00% 0.00% httpd 80388 www 4 0 7796K 4708K accept 1 0:00 0.00% 0.00% httpd 80353 www 4 0 7796K 4708K accept 3 0:00 0.00% 0.00% httpd 72562 root -8 0 1400K 1080K piperd 1 0:00 0.00% 0.00% cron 474 pgsql 110 0 7476K 1924K select 3 0:00 0.00% 0.00% postgres 80508 root 4 0 4852K 2592K sbwait 0 0:00 0.00% 0.00% sshd 475 pgsql 110 0 6488K 1928K select 2 0:00 0.00% 0.00% postgres 279 root 110 0 516K 224K select 2 0:00 0.00% 0.00% devd and here a dump using sockstat _after_ the aborted connection: USER COMMAND PID FD PROTO LOCAL ADDRESS FOREIGN ADDRESS root sshd 80508 3 stream -> ?? root sshd 80508 4 stream (not connected) root sshd 80508 5 tcp4 192.168.254.2:22 192.168.254.1:49023 root sshd 80508 6 stream -> ?? Hope that helps, Cyril Elkaim Robert Watson wrote: >On Thu, 24 Feb 2005, Dental-on-line wrote: > > > >>We are using two SMP machines (Dell 1750) with 5.3_p2 RELENG and have >>timeout problems with SSH. >> >>This problem happens ONLY on SMP machines not uniprocessor ones. >> >> > >Could you try the following things: > >- Boot the box with the debug.mpsafenet=0 tunable set in loader.conf and > see if the problem recurs? > >- Update to 5-STABLE and try with and without debug.mpsafenet=0? A number > of TCP locking changes have been made there, and one may help. > >Does it happen for every SSH session, or only some? > >Thanks, > >Robert N M Watson > > >