Date:      Sat, 22 Mar 2014 00:18:20 -0700
From:      Kevin Oberman <rkoberman@gmail.com>
To:        Marcelo Gondim <gondim@bsdinfo.com.br>
Cc:        FreeBSD Stable Mailing List <freebsd-stable@freebsd.org>
Subject:   Re: sshd with zombie process on FreeBSD 10.0-STABLE - workaround
Message-ID:  <CAN6yY1uEADbTHyrP7=uEgEUQWR%2BcTW2grq=aK00i9idW=ver%2Bg@mail.gmail.com>
In-Reply-To: <532D2852.1010700@bsdinfo.com.br>
References:  <53016D97.5030909@bsdinfo.com.br> <CAN6yY1uucfkdXxkCF30w1Q9vffRpDLxM90Sz1XVbdn5W69vQMg@mail.gmail.com> <5329D81E.7040709@bsdinfo.com.br> <201403201058.38555.jhb@freebsd.org> <532B7DEC.7010809@bsdinfo.com.br> <CAN6yY1sf0z_jBJgBy2dZX0a3JJnyTnq76_DepXzG32GWgHHO6A@mail.gmail.com> <532D2852.1010700@bsdinfo.com.br>

On Fri, Mar 21, 2014 at 11:06 PM, Marcelo Gondim <gondim@bsdinfo.com.br> wrote:

> On 22/03/14 02:02, Kevin Oberman wrote:
>
>> On Thu, Mar 20, 2014 at 4:46 PM, Marcelo Gondim <gondim@bsdinfo.com.br> wrote:
>>
>>> On 20/03/14 11:58, John Baldwin wrote:
>>>
>>>> On Wednesday, March 19, 2014 1:47:10 pm Marcelo Gondim wrote:
>>>>
>>>>> On 19/03/14 13:01, Kevin Oberman wrote:
>>>>>
>>>>>> On Wed, Mar 19, 2014 at 6:00 AM, Marcelo Gondim <gondim@bsdinfo.com.br> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Until a proper solution appears, I wrote the script below and put it in
>>>>>>> crontab to automatically kill the zombie sshd processes.
>>>>>>>
>>>>>>> the_walking_dead.sh:
>>>>>>>
>>>>>>> #!/bin/sh
>>>>>>> kill -9 `ps afx|grep sshd|grep unknown|awk '{print $1}'`
>>>>>>>
>>>>>>>
>>>>>>> Put this in /etc/crontab:
>>>>>>>
>>>>>>> 00 1 * * *    root    the_walking_dead.sh
>>>>>>>
>>>>>>>
>>>>>> If 'kill -9' works, the process is not really a zombie. It simply still
>>>>>> has a socket open and is waiting for it to be closed before exiting.
>>>>>>
>>>>>> You might take a look at network sockets with sockstat(1) and see if you
>>>>>> can get any indication of why these sockets are not being closed. It may
>>>>>> be that the issue is not sshd but some other issue in the OS leaving
>>>>>> sockets open.
>>>>>>
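
To expand on the point above: a true zombie (state Z) has already exited and
cannot be killed; it goes away only when its parent reaps it. What the cron
job really kills are the idle "sshd: unknown" processes. Below is a minimal
sketch of a more defensive version of that script, assuming the ps afx column
layout seen in this thread (PID, TT, STAT, TIME, COMMAND). The function names
stuck_sshd_pids and reap_stuck_sshd and the --kill switch are made up for
illustration, and matching on awk fields instead of chaining grep is my
substitution, not what the original script does.

```shell
#!/bin/sh
# Hypothetical, more defensive take on the_walking_dead.sh.
# Assumes `ps afx` prints: PID TT STAT TIME COMMAND...

# Read ps output on stdin and print the PID of every line whose command
# starts with "sshd: unknown". Comparing fields (rather than grepping the
# whole line) keeps the filter from matching its own command line.
stuck_sshd_pids() {
    awk '$5 == "sshd:" && $6 == "unknown" { print $1 }'
}

# Kill the matching processes, but only if there is something to kill,
# so kill(1) is never invoked without a PID argument.
reap_stuck_sshd() {
    pids=$(ps afx | stuck_sshd_pids)
    if [ -n "$pids" ]; then
        # Unquoted on purpose: one argument per PID.
        kill -9 $pids
    fi
}

# Only act when explicitly asked to, e.g. from cron:
#   the_walking_dead.sh --kill
if [ "${1:-}" = "--kill" ]; then
    reap_stuck_sshd
fi
```

Note that "sshd: unknown" is also the title of a login that is still being
authenticated, so a job like this may cut off someone who is in the middle of
logging in when it runs.
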
>>>>> Hi Kevin,
>>>>>
>>>>> My ps -afx below:
>>>>>
>>>>> [...]
>>>>> 42139  -  Is       0:00.01 sshd: unknown [priv] (sshd)
>>>>> 42140  -  Z        0:00.01 <defunct>
>>>>> 42141  -  IW       0:00.00 sshd: unknown [pam] (sshd)
>>>>> 58445  -  Is       0:00.01 sshd: unknown [priv] (sshd)
>>>>> 58446  -  Z        0:00.02 <defunct>
>>>>> 58447  -  IW       0:00.00 sshd: unknown [pam] (sshd)
>>>>> 65635  -  Is       0:00.01 sshd: vinicius [priv] (sshd)
>>>>> 65636  -  Z        0:00.01 <defunct>
>>>>> [...]
>>>>>
>>>>> # sockstat | grep 42140
>>>>> #
>>>>>
>>>>> # sockstat | grep 58446
>>>>> #
>>>>>
>>>>> # sockstat | grep 65636
>>>>> #
>>>>>
>>>>> No associated socket with zombie process.
>>>>>
>>>> Do a pstree.  I bet the zombies are children of the other processes that
>>>> are stuck on a socket as Kevin described.
>>>>
>>> # ps afx|grep sshd |grep unk
>>>
>>> 10948  -  Is       0:00.02 sshd: unknown [priv] (sshd)
>>> 10955  -  IW       0:00.00 sshd: unknown [pam] (sshd)       <====
>>> 11701  -  Is       0:00.02 sshd: unknown [priv] (sshd)
>>> 11704  -  IW       0:00.00 sshd: unknown [pam] (sshd)
>>> 25450  -  Is       0:00.01 sshd: unknown [priv] (sshd)
>>> 25452  -  IW       0:00.00 sshd: unknown [pam] (sshd)
>>> 41193  -  Is       0:00.02 sshd: unknown [priv] (sshd)
>>> 41196  -  IW       0:00.00 sshd: unknown [pam] (sshd)
>>> 42193  -  Is       0:00.02 sshd: unknown [priv] (sshd)
>>> 42195  -  IW       0:00.00 sshd: unknown [pam] (sshd)
>>> 80638  -  Is       0:00.02 sshd: unknown [priv] (sshd)
>>> 80640  -  IW       0:00.00 sshd: unknown [pam] (sshd)
>>> 81484  -  Is       0:00.02 sshd: unknown [priv] (sshd)
>>> 81486  -  IW       0:00.00 sshd: unknown [pam] (sshd)
>>>
>>> With procstat I could see the socket as follows:
>>>
>>> # procstat -f 10955
>>>    PID COMM               FD T V FLAGS     REF  OFFSET PRO NAME
>>> 10955 sshd              text v r r-------  -       - - /usr/sbin/sshd
>>> 10955 sshd               cwd v d r-------  -       - - /
>>> 10955 sshd              root v d r-------  -       - - /
>>> 10955 sshd                 0 v c rw------  6       0 - /dev/null
>>> 10955 sshd                 1 v c rw------  6       0 - /dev/null
>>> 10955 sshd                 2 v c rw------  6       0 - /dev/null
>>> 10955 sshd                 3 s - rw---n--  2       0 TCP 186.xxx.xx.2:22 186.xxx.xx.8:57035
>>> 10955 sshd                 5 p - rw------  2       0 - -
>>> 10955 sshd                 6 s - rw------  2       0 UDS -
>>> 10955 sshd                 7 p - rw------  1       0 - -
>>> 10955 sshd                 8 s - rw------  2       0 UDS -
>>>
>>> I do not understand why these connections remain locked in FreeBSD 10.0.
>>>
>>> I'll try this sysctl: net.inet.tcp.delayed_ack=0
>>>
>> If the problem is still showing up, can you see what is going on with the
>> socket? What is the state of the connection? Try "netstat -f inet -p tcp"
>> and see what state the connection is in. I'm wondering if there is some
>> sort of race going on where the socket hangs.
>>
>> Ideally I'd look to try and capture the packets at the end of the session.
>> Can you do something to trigger this reliably? If so, a "standard" "tcpdump
>> -pw file.bpf host HOST" would do. I seem to recall that these connections
>> are scheduled. If so, you can put the packet capture in a crontab to run at
>> the same time. If you feed this to a tool like Wireshark, you should get a
>> good idea of what is happening, if not why. I understand that the timing of
>> this might be very tricky.
>>
> Hi Kevin,
>
> Thanks for your help.
>
> I ran netstat and the state of the connection is CLOSED, as you can see
> below:
>
> # procstat -f 26177
>   PID COMM               FD T V FLAGS     REF  OFFSET PRO NAME
> 26177 sshd              text v r r-------  -       - - /usr/sbin/sshd
> 26177 sshd               cwd v d r-------  -       - - /
> 26177 sshd              root v d r-------  -       - - /
> 26177 sshd                 0 v c rw------  6       0 - /dev/null
> 26177 sshd                 1 v c rw------  6       0 - /dev/null
> 26177 sshd                 2 v c rw------  6       0 - /dev/null
> 26177 sshd                 3 s - rw---n--  2       0 TCP 186.193.48.10:4321 186.193.48.8:50094
> 26177 sshd                 4 s - rw------  1       0 UDS -
> 26177 sshd                 5 p - rw------  2       0 - -
> 26177 sshd                 6 s - rw------  2       0 UDS -
>
> # procstat -f 10110
>   PID COMM               FD T V FLAGS     REF  OFFSET PRO NAME
> 10110 sshd              text v r r-------  -       - - /usr/sbin/sshd
> 10110 sshd               cwd v d r-------  -       - - /
> 10110 sshd              root v d r-------  -       - - /
> 10110 sshd                 0 v c rw------  6       0 - /dev/null
> 10110 sshd                 1 v c rw------  6       0 - /dev/null
> 10110 sshd                 2 v c rw------  6       0 - /dev/null
> 10110 sshd                 3 s - rw---n--  2       0 TCP 186.193.48.10:4321 186.193.48.8:63048
> 10110 sshd                 4 s - rw------  1       0 UDS -
> 10110 sshd                 5 p - rw------  2       0 - -
> 10110 sshd                 6 s - rw------  2       0 UDS -
>
> # netstat -f inet -p tcp
> Active Internet connections
> Proto Recv-Q Send-Q Local Address          Foreign Address (state)
> tcp4       0      0 bart.24173             pppoe17250.8728 ESTABLISHED
> tcp4       0      0 bart.53795             pppoe17249.8728 TIME_WAIT
> tcp4       0      0 bart.54191             pppoe149.8728 TIME_WAIT
> tcp4       0      0 bart.12476             pppoe148.8728 TIME_WAIT
> tcp4       0      0 bart.36846             pppoe142.8728 TIME_WAIT
> tcp4       0      0 bart.39944             186.193.48.22.8728 TIME_WAIT
> tcp4       0      0 bart.60233             186.193.48.25.8728 TIME_WAIT
> tcp4       0      0 bart.50946             186.193.48.9.8728 TIME_WAIT
> tcp4       0      0 bart.13403             186.193.48.19.8728 TIME_WAIT
> tcp4       0      0 bart.36982             zeus.linuxinfo.c.8728 TIME_WAIT
> tcp4       0      0 bart.rwhois            pppoe769.49896 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.15711 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.16087 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.25051 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.59126 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.59051 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.29446 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.45453 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.14938 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.46230 FIN_WAIT_2
> tcp4       0      0 bart.mysql             mail.16930 FIN_WAIT_2
> tcp4       0      0 bart.mysql             mail.28074 FIN_WAIT_2
> tcp4       0      0 bart.mysql             mail.53686 FIN_WAIT_2
> tcp4       0      0 bart.mysql             mail.14448 FIN_WAIT_2
> tcp4       0      0 bart.mysql             mail.52487 ESTABLISHED
> tcp4       0      0 bart.rwhois            186.193.48.8.50094 CLOSED       <====
> tcp4       0      0 bart.mysql             mail.38286 FIN_WAIT_2
> tcp4       0      0 bart.mysql             mail.32387 FIN_WAIT_2
> tcp4       0      0 bart.mysql             mail.52219 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.52144 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.18862 FIN_WAIT_2
> tcp4       0      0 bart.mysql             mail.52636 FIN_WAIT_2
> tcp4       0      0 bart.mysql             mail.51607 FIN_WAIT_2
> tcp4       0      0 bart.mysql             mail.62581 FIN_WAIT_2
> tcp4       0      0 bart.mysql             mail.23071 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.22862 FIN_WAIT_2
> tcp4       0      0 bart.rwhois            186.193.48.8.63048 CLOSED       <====
> tcp4       0      0 bart.mysql             mail.42479 FIN_WAIT_2
> tcp4       0      0 bart.mysql             mail.18146 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.46731 FIN_WAIT_2
> tcp4       0      0 bart.mysql             mail.20498 ESTABLISHED
> tcp4       0      0 bart.62869             186.193.48.2.1190 ESTABLISHED
> tcp4       0      0 bart.mysql             mail.55353 ESTABLISHED
>

I'm sorry. I am now even more confused. Maybe I need to re-read the entire
thread.

I thought that the hung processes were sshd. These are rwhois. Or is there
an ssh tunnel carrying the rwhois connections? (I see no sshd connections
in this list.)
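
One way to check would be to rerun netstat with -n, which prints addresses
and ports numerically instead of as host and service names. Port 4321 in the
procstat output is registered as rwhois, so the "rwhois" lines may well be the
very sockets that sshd is holding. Here is a rough sketch for lining the two
listings up, assuming IPv4 and the procstat -f and netstat -n column layouts
seen in this thread. The names tcp_endpoints and socket_state are ones I made
up.

```shell
#!/bin/sh
# Hypothetical helpers for matching procstat sockets to netstat states.

# Read `procstat -f PID` output on stdin and print "local remote" for each
# TCP socket, rewritten from ip:port to the ip.port form used by
# `netstat -n`. IPv4 only: just the first colon is replaced.
tcp_endpoints() {
    awk '$9 == "TCP" {
        local_ep = $10
        remote_ep = $11
        sub(/:/, ".", local_ep)
        sub(/:/, ".", remote_ep)
        print local_ep, remote_ep
    }'
}

# socket_state LOCAL REMOTE
# Read `netstat -n -f inet -p tcp` output on stdin and print the state
# column of the connection whose endpoints match exactly.
socket_state() {
    awk -v l="$1" -v r="$2" '$4 == l && $5 == r { print $6 }'
}
```

For example, "procstat -f 26177 | tcp_endpoints" should print the two
endpoints of fd 3, and those can then be handed to
"netstat -n -f inet -p tcp | socket_state LOCAL REMOTE" to see whether that
particular connection is the one shown as CLOSED.
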
-- 
R. Kevin Oberman, Network Engineer, Retired
E-mail: rkoberman@gmail.com


