Date: Fri, 28 Mar 2014 16:57:36 -0300 From: Marcelo Gondim <gondim@bsdinfo.com.br> To: freebsd-net@freebsd.org Subject: Re: sshd with zombie process on FreeBSD 10.0-STABLE Message-ID: <5335D430.8010108@bsdinfo.com.br> In-Reply-To: <5335B8F7.7050001@bsdinfo.com.br> References: <5335B8F7.7050001@bsdinfo.com.br>
next in thread | previous in thread | raw e-mail | index | archive | help
Em 28/03/14 15:01, Marcelo Gondim escreveu: > Hi all, > > I have noticed zombie processes on the system after a few lost > connections on ssh. > > # ps afx > [...] > 8045 - Is 0:00.01 sshd: unknown [priv] (sshd) > 8046 - Z 0:00.01 <defunct> > 8054 - IW 0:00.00 sshd: unknown [pam] (sshd) > 28146 - Is 0:00.01 sshd: unknown [priv] (sshd) > 28147 - Z 0:00.01 <defunct> > 28155 - IW 0:00.00 sshd: unknown [pam] (sshd) > 43320 - Is 0:00.01 sshd: unknown [priv] (sshd) > 43321 - Z 0:00.01 <defunct> > 43322 - IW 0:00.00 sshd: unknown [pam] (sshd) > 73413 - Is 0:00.01 sshd: unknown [priv] (sshd) > 73414 - Z 0:00.01 <defunct> > 73430 - IW 0:00.00 sshd: unknown [pam] (sshd) > [...] > > More information discussed in freebsd-stable@list: > > > On Tue, Mar 25, 2014 at 3:52 AM, Karl Pielorz <kpielorz_lst@tdx.co.uk > <mailto:kpielorz_lst@tdx.co.uk>> wrote: > > > > --On 21 March 2014 22:02 -0700 Kevin Oberman <rkoberman@gmail.com > <mailto:rkoberman@gmail.com>> wrote: > > Ideally I'd look to try and capture the packets st the end of > the session. > Can you do something to trigger this reliably? if so "standard" > "tcpdump > -pw file.bpf host HOST". I seem to recall that these > connections are > scheduled. If so, you can put the packet capture in a crontab to > run at > the same time. If you feed this to a tool like wireshark, you > should get > a good idea of what is happening, if not why. I understand that > the > timing of this might be very tricky. > > > Ok, fwiw as I have this issue as well - I've done a packet capture > (it's below). This box has 59 'CLOSED' sockets from ssh on it, and > 60 sshd stuck. > > Also - initially I thought this was a Xen issue - so there's a > couple of posts on that list from a couple of weeks ago, in brief - > the sshd processes I have are stuck in 'urdlck' - one of the Xen > guys commented "It seems like the process is stuck while trying to > acquire a rw mutex in read mode." > > I did a backtrace of a stuck process - I can post that if you want > (or check the FreeBSD-Xen list for 'stuck sshd in urdlck'. > > Also, if I ssh into this host, 90% of the time (seems to get worse > the longer the box is up) I get: > > " > ssh_exchange_identification: Connection closed by remote host > " > > That does not leave a lingering CLOSED socket. In fact, a successful > ssh login - and logout, does not in testing appear to leave a > lingering CLOSED socket, nor sshd stuck in urdlck - so I'm not > entirely sure where they're coming from, or how often they are > created. > > tcpdump from the start of a successful ssh connect: > > 10:47:04.777765 IP (tos 0x0, ttl 61, id 4058, offset 0, flags [DF], > proto TCP (6), length 60) > 192.168.0.37.31139 > 192.168.0.138.22: Flags [S], cksum 0x57ce > (correct), seq 634709832, win 65535, options [mss 1368,nop,wscale > 3,sackOK,TS val 1019391535 ecr 0], length 0 > > 10:47:04.777776 IP (tos 0x0, ttl 64, id 10060, offset 0, flags [DF], > proto TCP (6), length 60) > 192.168.0.138.22 > 192.168.0.37.31139: Flags [S.], cksum 0x7ef8 > (incorrect -> 0x69d8), seq 2316386788, ack 634709833, win 65535, > options [mss 1368,nop,wscale 6,sackOK,TS val 805368299 ecr > 1019391535], length 0 > > 10:47:04.804218 IP (tos 0x0, ttl 61, id 4059, offset 0, flags [DF], > proto TCP (6), length 52) > 192.168.0.37.31139 > 192.168.0.138.22: Flags [.], cksum 0x77d3 > (correct), ack 1, win 8305, options [nop,nop,TS val 1019391538 ecr > 805368299], length 0 > > 10:47:04.809692 IP (tos 0x0, ttl 64, id 10061, offset 0, flags [DF], > proto TCP (6), length 99) > 192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f1f > (incorrect -> 0x5799), seq 1:48, ack 1, win 1038, options > [nop,nop,TS val 805368328 ecr 1019391538], length 47 > > 10:47:04.836110 IP (tos 0x0, ttl 61, id 4060, offset 0, flags [DF], > proto TCP (6), length 92) > 192.168.0.37.31139 > 192.168.0.138.22: Flags [P.], cksum 0x8afa > (correct), seq 1:41, ack 48, win 8305, options [nop,nop,TS val > 1019391541 ecr 805368328], length 40 > > 10:47:04.836669 IP (tos 0x0, ttl 64, id 10062, offset 0, flags [DF], > proto TCP (6), length 1596) > 192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x84f8 > (incorrect -> 0x4d42), seq 48:1592, ack 41, win 1038, options > [nop,nop,TS val 805368358 ecr 1019391541], length 1544 > > > And the end of the session: > > 10:47:15.243540 IP (tos 0x10, ttl 61, id 4132, offset 0, flags [DF], > proto TCP (6), length 100) > 192.168.0.37.31139 > 192.168.0.138.22: Flags [P.], cksum 0x6364 > (correct), seq 2321:2369, ack 3520, win 8305, options [nop,nop,TS > val 1019392582 ecr 805375068], length 48 > > 10:47:15.243736 IP (tos 0x10, ttl 64, id 10125, offset 0, flags > [DF], proto TCP (6), length 100) > 192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f20 > (incorrect -> 0x3ea8), seq 3520:3568, ack 2369, win 1038, options > [nop,nop,TS val 805378765 ecr 1019392582], length 48 > > 10:47:15.243796 IP (tos 0x10, ttl 64, id 10126, offset 0, flags > [DF], proto TCP (6), length 100) > 192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f20 > (incorrect -> 0xdd22), seq 3568:3616, ack 2369, win 1038, options > [nop,nop,TS val 805378765 ecr 1019392582], length 48 > > 10:47:15.244627 IP (tos 0x10, ttl 64, id 10127, offset 0, flags > [DF], proto TCP (6), length 84) > 192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f10 > (incorrect -> 0x86ed), seq 3616:3648, ack 2369, win 1038, options > [nop,nop,TS val 805378765 ecr 1019392582], length 32 > > 10:47:15.244812 IP (tos 0x10, ttl 64, id 10128, offset 0, flags > [DF], proto TCP (6), length 212) > 192.168.0.138.22 > 192.168.0.37.31139: Flags [P.], cksum 0x7f90 > (incorrect -> 0x431c), seq 3648:3808, ack 2369, win 1038, options > [nop,nop,TS val 805378765 ecr 1019392582], length 160 > > 10:47:15.271439 IP (tos 0x10, ttl 61, id 4134, offset 0, flags [DF], > proto TCP (6), length 52) > 192.168.0.37.31139 > 192.168.0.138.22: Flags [.], cksum 0x3381 > (correct), ack 3616, win 8299, options [nop,nop,TS val 1019392585 > ecr 805378765], length 0 > > 10:47:15.272238 IP (tos 0x10, ttl 61, id 4135, offset 0, flags [DF], > proto TCP (6), length 52) > 192.168.0.37.31139 > 192.168.0.138.22: Flags [.], cksum 0x32d9 > (correct), ack 3808, win 8275, options [nop,nop,TS val 1019392585 > ecr 805378765], length 0 > > 10:47:15.273515 IP (tos 0x10, ttl 61, id 4137, offset 0, flags [DF], > proto TCP (6), length 84) > 192.168.0.37.31139 > 192.168.0.138.22: Flags [P.], cksum 0x43cc > (correct), seq 2369:2401, ack 3808, win 8305, options [nop,nop,TS > val 1019392585 ecr 805378765], length 32 > > 10:47:15.276199 IP (tos 0x10, ttl 61, id 4138, offset 0, flags [DF], > proto TCP (6), length 116) > 192.168.0.37.31139 > 192.168.0.138.22: Flags [P.], cksum 0xb169 > (correct), seq 2401:2465, ack 3808, win 8305, options [nop,nop,TS > val 1019392585 ecr 805378765], length 64 > > 10:47:15.276220 IP (tos 0x10, ttl 64, id 10129, offset 0, flags > [DF], proto TCP (6), length 52) > 192.168.0.138.22 > 192.168.0.37.31139: Flags [.], cksum 0x7ef0 > (incorrect -> 0x4ea3), ack 2465, win 1037, options [nop,nop,TS val > 805378793 ecr 1019392585], length 0 > > 10:47:15.276970 IP (tos 0x10, ttl 61, id 4140, offset 0, flags [DF], > proto TCP (6), length 52) > 192.168.0.37.31139 > 192.168.0.138.22: Flags [F.], cksum 0x325a > (correct), seq 2465, ack 3808, win 8305, options [nop,nop,TS val > 1019392585 ecr 805378765], length 0 > > 10:47:15.276978 IP (tos 0x10, ttl 64, id 10130, offset 0, flags > [DF], proto TCP (6), length 52) > 192.168.0.138.22 > 192.168.0.37.31139: Flags [.], cksum 0x7ef0 > (incorrect -> 0x4e9c), ack 2466, win 1038, options [nop,nop,TS val > 805378798 ecr 1019392585], length 0 > > 10:47:15.277212 IP (tos 0x10, ttl 64, id 10131, offset 0, flags > [DF], proto TCP (6), length 52) > 192.168.0.138.22 > 192.168.0.37.31139: Flags [F.], cksum 0x7ef0 > (incorrect -> 0x4e9b), seq 3808, ack 2466, win 1038, options > [nop,nop,TS val 805378798 ecr 1019392585], length 0 > > 10:47:15.303993 IP (tos 0x10, ttl 61, id 4142, offset 0, flags [DF], > proto TCP (6), length 52) > 192.168.0.37.31139 > 192.168.0.138.22: Flags [.], cksum 0x3235 > (correct), ack 3809, win 8305, options [nop,nop,TS val 1019392588 > ecr 805378798], length 0 > > > This box has no services running on it at present, and just sits > there idle. I'll periodically check on it and see if the CLOSED > socket count, or hung sshd count goes up. > > It's runing as a PVHVM domU under XenServer 6.2 > > > Since I am retired, I no longer have access to my beloved "TCP/IP > Illustrated Vol. 1", but I believe CLOSED status indicates that the > socket is no longer in use.As the name says, it is CLOSED. I don't > know why sshd should be waiting on a CLOSED socket, nor do I > understand why a CLOSED server socket should live on for extended times. > > That said, I have been monitoring my FreeBSD 10 system and it seems to > pick up an occasional case where a socket gets "stuck" in the CLOSED > state for a very long time. I can't say forever, but at least many > minutes. I don't see this on 9.2 systems. Right now I have two CLOSED > sockets, both the the same address on port 80. It appears that the > only thing that removes the socket is to kill the owning process (in > this case it's Firefox). The owning firefox process does exist just > fine when asked and I think sshd should do so as well, but it looks to > me like the root problem is that CLOSED socket seem to live on in 10.0 > and they don't on older versions. > > I think it's time to move this over to net@, as I suspect a change > made in 10 is triggering this. I'm not sure the changes is wrong or > that sshd is wrong in not exiting cleanly when sockets are hanging > around in CLOSED state. (Nor am I sure sshd is wrong.) > > If I am confused about this, please let me know. I'm going ot read the > 10.0 release notes to see if they say anything about this. Another my server with 12 zombie process related with CLOSED connections: last pid: 84965; load averages: 2.74, 2.65, 2.07 up 14+23:43:16 16:10:00 133 processes: 1 running, 120 sleeping, 12 zombie CPU: 8.5% user, 0.0% nice, 39.3% system, 11.8% interrupt, 40.4% idle Mem: 194M Active, 2575M Inact, 828M Wired, 77M Cache, 391M Buf, 3056K Free Swap: 16G Total, 377M Used, 16G Free, 2% Inuse, 8K In # netstat -n |grep CLOSED tcp4 0 0 192.168.254.1.22 10.254.0.12.26792 CLOSED tcp4 0 0 186.xxx.xx.2.22 186.xxx.xx.8.60591 CLOSED tcp4 0 0 192.168.254.1.22 10.254.0.12.26974 CLOSED tcp4 0 0 186.xxx.xx.2.22 186.xxx.xx.8.54034 CLOSED tcp4 0 0 186.xxx.xx.2.22 186.xxx.xx.8.52999 CLOSED tcp4 0 0 186.xxx.xx.2.22 186.xxx.xx.8.65345 CLOSED tcp4 0 0 186.xxx.xx.2.22 186.xxx.xx.8.64099 CLOSED tcp4 0 0 192.168.254.1.22 10.254.0.12.26457 CLOSED tcp4 0 0 186.xxx.xx.2.22 186.xxx.xx.8.54181 CLOSED tcp4 0 0 186.xxx.xx.2.22 186.xxx.xx.8.53717 CLOSED tcp4 0 0 186.xxx.xx.2.22 186.xxx.xx.8.57035 CLOSED tcp4 0 0 192.168.254.1.22 10.254.0.12.26832 CLOSED
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?5335D430.8010108>