Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 28 Mar 2014 16:57:36 -0300
From:      Marcelo Gondim <gondim@bsdinfo.com.br>
To:        freebsd-net@freebsd.org
Subject:   Re: sshd with zombie process on FreeBSD 10.0-STABLE
Message-ID:  <5335D430.8010108@bsdinfo.com.br>
In-Reply-To: <5335B8F7.7050001@bsdinfo.com.br>
References:  <5335B8F7.7050001@bsdinfo.com.br>

next in thread | previous in thread | raw e-mail | index | archive | help
Em 28/03/14 15:01, Marcelo Gondim escreveu:
> Hi all,
>
> I have noticed zombie processes on the system after a few lost 
> connections on ssh.
>
> # ps afx
> [...]
>  8045  -  Is       0:00.01 sshd: unknown [priv] (sshd)
>  8046  -  Z        0:00.01 <defunct>
>  8054  -  IW       0:00.00 sshd: unknown [pam] (sshd)
> 28146  -  Is       0:00.01 sshd: unknown [priv] (sshd)
> 28147  -  Z        0:00.01 <defunct>
> 28155  -  IW       0:00.00 sshd: unknown [pam] (sshd)
> 43320  -  Is       0:00.01 sshd: unknown [priv] (sshd)
> 43321  -  Z        0:00.01 <defunct>
> 43322  -  IW       0:00.00 sshd: unknown [pam] (sshd)
> 73413  -  Is       0:00.01 sshd: unknown [priv] (sshd)
> 73414  -  Z        0:00.01 <defunct>
> 73430  -  IW       0:00.00 sshd: unknown [pam] (sshd)
> [...]
>
> More information discussed in freebsd-stable@list:
>
>
> On Tue, Mar 25, 2014 at 3:52 AM, Karl Pielorz <kpielorz_lst@tdx.co.uk 
> <mailto:kpielorz_lst@tdx.co.uk>> wrote:
>
>
>
>    --On 21 March 2014 22:02 -0700 Kevin Oberman <rkoberman@gmail.com
>    <mailto:rkoberman@gmail.com>> wrote:
>
>        Ideally I'd look to try and capture the packets st the end of
>        the session.
>        Can you do something to trigger this reliably? if so "standard"
>        "tcpdump
>        -pw file.bpf host HOST". I seem to recall that these 
> connections are
>        scheduled. If so, you can put the packet capture in a crontab to
>        run at
>        the same time. If you feed this to a tool like wireshark, you
>        should get
>        a good idea of what is happening, if not why. I understand that 
> the
>        timing of this might be very tricky.
>
>
>    Ok, fwiw as I have this issue as well - I've done a packet capture
>    (it's below). This box has 59 'CLOSED' sockets from ssh on it, and
>    60 sshd stuck.
>
>    Also - initially I thought this was a Xen issue - so there's a
>    couple of posts on that list from a couple of weeks ago, in brief -
>    the sshd processes I have are stuck in 'urdlck' - one of the Xen
>    guys commented "It seems like the process is stuck while trying to
>    acquire a rw mutex in read mode."
>
>    I did a backtrace of a stuck process - I can post that if you want
>    (or check the FreeBSD-Xen list for 'stuck sshd in urdlck'.
>
>    Also, if I ssh into this host, 90% of the time (seems to get worse
>    the longer the box is up) I get:
>
>    "
>    ssh_exchange_identification: Connection closed by remote host
>    "
>
>    That does not leave a lingering CLOSED socket. In fact, a successful
>    ssh login - and logout, does not in testing appear to leave a
>    lingering CLOSED socket, nor sshd stuck in urdlck - so I'm not
>    entirely sure where they're coming from, or how often they are 
> created.
>
>    tcpdump from the start of a successful ssh connect:
>
>    10:47:04.777765 IP (tos 0x0, ttl 61, id 4058, offset 0, flags [DF],
>    proto TCP (6), length 60)
>        192.168.0.37.31139 > 192.168.0.138.22: Flags [S], cksum 0x57ce
>    (correct), seq 634709832, win 65535, options [mss 1368,nop,wscale
>    3,sackOK,TS val 1019391535 ecr 0], length 0
>
>    10:47:04.777776 IP (tos 0x0, ttl 64, id 10060, offset 0, flags [DF],
>    proto TCP (6), length 60)
>        192.168.0.138.22 > 192.168.0.37.31139: Flags [S.], cksum 0x7ef8
>    (incorrect -> 0x69d8), seq 2316386788, ack 634709833, win 65535,
>    options [mss 1368,nop,wscale 6,sackOK,TS val 805368299 ecr
>    1019391535], length 0
>
>    10:47:04.804218 IP (tos 0x0, ttl 61, id 4059, offset 0, flags [DF],
>    proto TCP (6), length 52)
>        192.168.0.37.31139 > 192.168.0.138.22: Flags [.], cksum 0x77d3
>    (correct), ack 1, win 8305, options [nop,nop,TS val 1019391538 ecr
>    805368299], length 0
>
>    10:47:04.809692 IP (tos 0x0, ttl 64, id 10061, offset 0, flags [DF],
>    proto TCP (6), length 99)
>        192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f1f
>    (incorrect -> 0x5799), seq 1:48, ack 1, win 1038, options
>    [nop,nop,TS val 805368328 ecr 1019391538], length 47
>
>    10:47:04.836110 IP (tos 0x0, ttl 61, id 4060, offset 0, flags [DF],
>    proto TCP (6), length 92)
>        192.168.0.37.31139 > 192.168.0.138.22: Flags [P.], cksum 0x8afa
>    (correct), seq 1:41, ack 48, win 8305, options [nop,nop,TS val
>    1019391541 ecr 805368328], length 40
>
>    10:47:04.836669 IP (tos 0x0, ttl 64, id 10062, offset 0, flags [DF],
>    proto TCP (6), length 1596)
>        192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x84f8
>    (incorrect -> 0x4d42), seq 48:1592, ack 41, win 1038, options
>    [nop,nop,TS val 805368358 ecr 1019391541], length 1544
>
>
>    And the end of the session:
>
>    10:47:15.243540 IP (tos 0x10, ttl 61, id 4132, offset 0, flags [DF],
>    proto TCP (6), length 100)
>        192.168.0.37.31139 > 192.168.0.138.22: Flags [P.], cksum 0x6364
>    (correct), seq 2321:2369, ack 3520, win 8305, options [nop,nop,TS
>    val 1019392582 ecr 805375068], length 48
>
>    10:47:15.243736 IP (tos 0x10, ttl 64, id 10125, offset 0, flags
>    [DF], proto TCP (6), length 100)
>        192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f20
>    (incorrect -> 0x3ea8), seq 3520:3568, ack 2369, win 1038, options
>    [nop,nop,TS val 805378765 ecr 1019392582], length 48
>
>    10:47:15.243796 IP (tos 0x10, ttl 64, id 10126, offset 0, flags
>    [DF], proto TCP (6), length 100)
>        192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f20
>    (incorrect -> 0xdd22), seq 3568:3616, ack 2369, win 1038, options
>    [nop,nop,TS val 805378765 ecr 1019392582], length 48
>
>    10:47:15.244627 IP (tos 0x10, ttl 64, id 10127, offset 0, flags
>    [DF], proto TCP (6), length 84)
>        192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f10
>    (incorrect -> 0x86ed), seq 3616:3648, ack 2369, win 1038, options
>    [nop,nop,TS val 805378765 ecr 1019392582], length 32
>
>    10:47:15.244812 IP (tos 0x10, ttl 64, id 10128, offset 0, flags
>    [DF], proto TCP (6), length 212)
>        192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f90
>    (incorrect -> 0x431c), seq 3648:3808, ack 2369, win 1038, options
>    [nop,nop,TS val 805378765 ecr 1019392582], length 160
>
>    10:47:15.271439 IP (tos 0x10, ttl 61, id 4134, offset 0, flags [DF],
>    proto TCP (6), length 52)
>        192.168.0.37.31139 > 192.168.0.138.22: Flags [.], cksum 0x3381
>    (correct), ack 3616, win 8299, options [nop,nop,TS val 1019392585
>    ecr 805378765], length 0
>
>    10:47:15.272238 IP (tos 0x10, ttl 61, id 4135, offset 0, flags [DF],
>    proto TCP (6), length 52)
>        192.168.0.37.31139 > 192.168.0.138.22: Flags [.], cksum 0x32d9
>    (correct), ack 3808, win 8275, options [nop,nop,TS val 1019392585
>    ecr 805378765], length 0
>
>    10:47:15.273515 IP (tos 0x10, ttl 61, id 4137, offset 0, flags [DF],
>    proto TCP (6), length 84)
>        192.168.0.37.31139 > 192.168.0.138.22: Flags [P.], cksum 0x43cc
>    (correct), seq 2369:2401, ack 3808, win 8305, options [nop,nop,TS
>    val 1019392585 ecr 805378765], length 32
>
>    10:47:15.276199 IP (tos 0x10, ttl 61, id 4138, offset 0, flags [DF],
>    proto TCP (6), length 116)
>        192.168.0.37.31139 > 192.168.0.138.22: Flags [P.], cksum 0xb169
>    (correct), seq 2401:2465, ack 3808, win 8305, options [nop,nop,TS
>    val 1019392585 ecr 805378765], length 64
>
>    10:47:15.276220 IP (tos 0x10, ttl 64, id 10129, offset 0, flags
>    [DF], proto TCP (6), length 52)
>        192.168.0.138.22 > 192.168.0.37.31139: Flags [.], cksum 0x7ef0
>    (incorrect -> 0x4ea3), ack 2465, win 1037, options [nop,nop,TS val
>    805378793 ecr 1019392585], length 0
>
>    10:47:15.276970 IP (tos 0x10, ttl 61, id 4140, offset 0, flags [DF],
>    proto TCP (6), length 52)
>        192.168.0.37.31139 > 192.168.0.138.22: Flags [F.], cksum 0x325a
>    (correct), seq 2465, ack 3808, win 8305, options [nop,nop,TS val
>    1019392585 ecr 805378765], length 0
>
>    10:47:15.276978 IP (tos 0x10, ttl 64, id 10130, offset 0, flags
>    [DF], proto TCP (6), length 52)
>        192.168.0.138.22 > 192.168.0.37.31139: Flags [.], cksum 0x7ef0
>    (incorrect -> 0x4e9c), ack 2466, win 1038, options [nop,nop,TS val
>    805378798 ecr 1019392585], length 0
>
>    10:47:15.277212 IP (tos 0x10, ttl 64, id 10131, offset 0, flags
>    [DF], proto TCP (6), length 52)
>        192.168.0.138.22 > 192.168.0.37.31139: Flags [F.], cksum 0x7ef0
>    (incorrect -> 0x4e9b), seq 3808, ack 2466, win 1038, options
>    [nop,nop,TS val 805378798 ecr 1019392585], length 0
>
>    10:47:15.303993 IP (tos 0x10, ttl 61, id 4142, offset 0, flags [DF],
>    proto TCP (6), length 52)
>        192.168.0.37.31139 > 192.168.0.138.22: Flags [.], cksum 0x3235
>    (correct), ack 3809, win 8305, options [nop,nop,TS val 1019392588
>    ecr 805378798], length 0
>
>
>    This box has no services running on it at present, and just sits
>    there idle. I'll periodically check on it and see if the CLOSED
>    socket count, or hung sshd count goes up.
>
>    It's runing as a PVHVM domU under XenServer 6.2
>
>
> Since I am retired, I no longer have access to my beloved "TCP/IP 
> Illustrated Vol. 1", but I believe CLOSED status indicates that the 
> socket is no longer in use.As the name says, it is CLOSED.  I don't 
> know why sshd should be waiting on a CLOSED socket, nor do I 
> understand why a CLOSED server socket should live on for extended times.
>
> That said, I have been monitoring my FreeBSD 10 system and it seems to 
> pick up an occasional case where a socket gets "stuck" in the CLOSED 
> state for a very long time. I can't say forever, but at least many 
> minutes. I don't see this on 9.2 systems. Right now I have two CLOSED 
> sockets, both the the same address on port 80. It appears that the 
> only thing that removes the socket is to kill the owning process (in 
> this case it's Firefox). The owning firefox process does exist just 
> fine when asked and I think sshd should do so as well, but it looks to 
> me like the root problem is that CLOSED socket seem to live on in 10.0 
> and they don't on older versions.
>
> I think it's time to move this over to net@, as I suspect a change 
> made in 10 is triggering this. I'm not sure the changes is wrong or 
> that sshd is wrong in not exiting cleanly when sockets are hanging 
> around in CLOSED state. (Nor am I sure sshd is wrong.)
>
> If I am confused about this, please let me know. I'm going ot read the 
> 10.0 release notes to see if they say anything about this.

Another my server with 12 zombie process related with CLOSED connections:

last pid: 84965;  load averages:  2.74,  2.65, 2.07 up 14+23:43:16  16:10:00
133 processes: 1 running, 120 sleeping, 12 zombie
CPU:  8.5% user,  0.0% nice, 39.3% system, 11.8% interrupt, 40.4% idle
Mem: 194M Active, 2575M Inact, 828M Wired, 77M Cache, 391M Buf, 3056K Free
Swap: 16G Total, 377M Used, 16G Free, 2% Inuse, 8K In

# netstat -n |grep CLOSED
tcp4       0      0 192.168.254.1.22       10.254.0.12.26792 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.60591 CLOSED
tcp4       0      0 192.168.254.1.22       10.254.0.12.26974 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.54034 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.52999 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.65345 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.64099 CLOSED
tcp4       0      0 192.168.254.1.22       10.254.0.12.26457 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.54181 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.53717 CLOSED
tcp4       0      0 186.xxx.xx.2.22        186.xxx.xx.8.57035 CLOSED
tcp4       0      0 192.168.254.1.22       10.254.0.12.26832 CLOSED





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?5335D430.8010108>