From: Terry Lambert
Date: Sat, 22 Jun 2002 03:38:10 -0700
To: Patrick Thomas
Cc: Nielsen, hackers@freebsd.org
Subject: Re: (jail) problem and a (possible) solution ?
Message-ID: <3D145392.3B09B1D3@mindspring.com>

Patrick Thomas wrote:
> What it does is the userland hangs, but the kernel keeps running.
>
> When the system is crashed, I can still ping it successfully, and I can
> still open sockets (like I can open a connection to a jails httpd or sshd,
> or the sshd of the underlying server itself) but nothing answers on the
> sockets - they just hang open.
>
> So everything stops running, but it is still "up" - still responds to
> pings...syslog stops logging though, cron stops running....
>
> Two questions for you:
>
> 1) do you allow them write access to their /dev/mem, /dev/kmem, /dev/io ?
>
> 2) does this sound like what you see?  Can you still ping the crashed
> server ?
>
> I'm mostly just curious if this kind of crash (userland hung but kernel
> running) is a possible outcome of someone in a jail fiddling with those
> /dev nodes, or if fiddling with dev/mem or /dev/kmem or io would just lock
> the machine up hard and completely.
>
> Terry?

I've kept quiet so far because I'm not the "jail" expert; Poul actually
wrote the jail code, and there was someone else who understood it well
enough to recently add multiple IP support.

Given your symptoms, I can pretty much guess where the problem is, but
not really how to fix it, other than trial-and-error, since I tend to
run jails on a number of my machines, and make them do things they
aren't supposed to do...

Knowing what version of FreeBSD you are running would be helpful.

That you can still ping indicates that both hardware interrupts and
NETISR are running. That NETISR runs indicates that things are still
calling "splx()", which means things are still calling "spl*()" and
coming back from it.

The fact that you can still connect to servers that have active listens
posted, but that you get no data, also indicates that the NETISR is
running, at least up to the accept.

It would be interesting to attempt a large number of connections, to
see if the connections stop being accepted once you've made more
attempts than the queue depth you passed to listen(2), which is the
number of sockets allowed to sit there pending accept. If this happens
(connection attempts start hanging, rather than being accepted), you
know for certain that the process you are trying to talk to is not
being scheduled to run.
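One way to run the connection test described above is sketched here in
Python. This is only an illustration: the function names, the timeout,
and the attempt limit are made up for the example, and the language is
an arbitrary choice. It opens connections one at a time, holds each one
open, and stops at the first attempt that times out or is refused.

```python
import socket


def count_accepted_connections(host, port, max_attempts=1024, timeout=3.0):
    """Open TCP connections until one fails to complete.

    Returns (count, sockets). The sockets are kept open on purpose so
    that completed connections stay in the server's listen queue.
    All parameter names and defaults are illustrative assumptions.
    """
    held = []
    for _ in range(max_attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
        except OSError:
            # Timeout or refusal: the handshake did not complete, which
            # is what a full listen queue is expected to look like.
            sock.close()
            break
        held.append(sock)
    return len(held), held


def close_all(sockets):
    """Close every socket returned by count_accepted_connections()."""
    for sock in sockets:
        sock.close()
```

Pointed at the hung box's sshd or httpd port, a count that stops somewhere
near the backlog the daemon passed to listen() suggests nothing is calling
accept(). Kernels may queue somewhat more than the stated backlog, so the
number is approximate, not exact.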

Basically, this implies one of two things is happening:

1)	Your scheduler "lost" its head entry, so it's not scheduling
	anything to run, OR

2)	You've used up all your resources on the machine (usually
	memory), and all of your processes are hung on a copy-on-write
	or allocate request, pending being serviced by the kernel

If you can, compile the kernel for the box with the kernel debugger
enabled, and "break to debugger" enabled, and break to the debugger on
the console. Then type "ps" and see what you get back as the wait
channel for everything you are trying to connect to. This should be
very informative, and it should be easy to locate the problem from
there.

If you have to, you can look at the scheduler queues to see whether
anything is in a runnable state, and find out what's not there.

Probably, it's not enough RAM, and your tuning parameters are set such
that this isn't fatal to processes, when it should be. That you are
able to ping, etc. guarantees that you are not out of mbufs, and that
you can connect means you aren't out of inpcbs or tcpcbs -- but mbufs
are freelisted, so that's to be expected there (may not need more) and
the pcbs are allocated at boot time (so are sockets, based on
maxfiles), so tuning any of them after boot can get you in trouble.

-- Terry