From owner-freebsd-emulation@FreeBSD.ORG Tue Jun 12 22:25:20 2012
Date: Tue, 12 Jun 2012 18:24:37 -0400
From: Gary Palmer
To: Steve Tuts
Cc: freebsd-emulation@freebsd.org
Subject: Re: Still unresolved - Re: one virtualbox vm disrupts all vms and entire network
Message-ID: <20120612222437.GB14487@in-addr.com>
List-Id: Development of Emulators of other operating systems

On Thu, Jun 07, 2012 at 03:56:22PM -0400, Steve Tuts wrote:
> On Thu, Jun 7, 2012 at 3:54 AM, Steve Tuts wrote:
>
> > On Thu, Jun 7, 2012 at 2:58 AM, Bernhard Fröhlich wrote:
> >
> >> On Thu., 7. Jun.
> >> 2012 01:07:52 CEST, Kevin Oberman wrote:
> >>
> >> > On Wed, Jun 6, 2012 at 3:46 PM, Steve Tuts wrote:
> >> > > On Wed, Jun 6, 2012 at 3:50 AM, Bernhard Froehlich wrote:
> >> > >
> >> > > > On 05.06.2012 20:16, Bernhard Froehlich wrote:
> >> > > >
> >> > > > > On 05.06.2012 19:05, Steve Tuts wrote:
> >> > > > >
> >> > > > > > On Mon, Jun 4, 2012 at 4:11 PM, Rusty Nejdl wrote:
> >> > > > > >
> >> > > > > > > On 2012-06-02 12:16, Steve Tuts wrote:
> >> > > > > > >
> >> > > > > > > > Hi, we have a Dell PowerEdge server with a dozen interfaces.
> >> > > > > > > > It hosts a few guests running web app and email servers with
> >> > > > > > > > VirtualBox-4.0.14. The host and all guests are FreeBSD 9.0
> >> > > > > > > > 64-bit. Each guest is bridged to a distinct interface. The
> >> > > > > > > > host and all guests are on the 10.0.0.0 network, NAT'ed to a
> >> > > > > > > > Cisco router.
> >> > > > > > > >
> >> > > > > > > > This ran well for a couple of months, until we added a new
> >> > > > > > > > guest recently. Every few hours, none of the guests can be
> >> > > > > > > > reached. We can only connect to the host from outside the
> >> > > > > > > > router. We can also get to the console of the guests (except
> >> > > > > > > > the new guest), but from there we can't ping the gateway
> >> > > > > > > > 10.0.0.1 any more. The new guest just froze.
> >> > > > > > > >
> >> > > > > > > > Furthermore, on the host we can see a vboxheadless process
> >> > > > > > > > for each guest, including the new guest. But we cannot kill
> >> > > > > > > > it, not even with "kill -9".
> >> > > > > > > > We looked around the web and someone suggested we should use
> >> > > > > > > > "kill -SIGCONT" first, since the "ps" output has the "T" flag
> >> > > > > > > > for the vboxheadless process of that new guest, but that
> >> > > > > > > > doesn't help. We also tried all the VBoxManage commands to
> >> > > > > > > > poweroff/reset etc. that new guest, but they all failed,
> >> > > > > > > > complaining that the vm is in Aborted state. We also tried
> >> > > > > > > > VBoxManage commands to disconnect the network cable for that
> >> > > > > > > > new guest; it didn't complain, but there was no effect.
> >> > > > > > > >
> >> > > > > > > > A couple of times, on the host we disabled the interface
> >> > > > > > > > bridging that new guest, and then the vboxheadless process
> >> > > > > > > > for that new guest disappeared (we had attempted to kill it
> >> > > > > > > > before that). Immediately all other vms regained their
> >> > > > > > > > connection and went back to normal.
> >> > > > > > > >
> >> > > > > > > > But one time even the above didn't help - the vboxheadless
> >> > > > > > > > process for that new guest stubbornly remained, and we had
> >> > > > > > > > to reboot the host.
> >> > > > > > > >
> >> > > > > > > > This is already a production server, so we can't upgrade
> >> > > > > > > > virtualbox to the latest version until we obtain a test
> >> > > > > > > > server.
> >> > > > > > > >
> >> > > > > > > > Would you advise:
> >> > > > > > > >
> >> > > > > > > > 1. Is there any other way to kill that new guest instead of
> >> > > > > > > >    rebooting?
> >> > > > > > > > 2. What might cause the problem?
> >> > > > > > > > 3. What settings and tests can I use to analyze this problem?
> >> > > > > > >
> >> > > > > > > I haven't seen any comments on this and don't want you to
> >> > > > > > > think you are being ignored. I haven't seen this problem
> >> > > > > > > myself, but the 4.0 branch was buggier for me than the 4.1
> >> > > > > > > releases, so upgrading is probably what you are looking at.
> >> > > > > > >
> >> > > > > > > Rusty Nejdl
> >> > > > > >
> >> > > > > > Sorry, I just realized my reply yesterday didn't go to the
> >> > > > > > list, so I am re-sending it with some updates.
> >> > > > > >
> >> > > > > > Yes, we upgraded all ports and fortunately everything came
> >> > > > > > back; in particular, all vms have run peacefully for two days
> >> > > > > > now. So upgrading to the latest virtualbox 4.1.16 solved that
> >> > > > > > problem.
> >> > > > > >
> >> > > > > > But now we have a new problem with this new version of
> >> > > > > > virtualbox: whenever we try to vnc to any vm, that vm goes to
> >> > > > > > the Aborted state immediately. Actually, merely telnetting
> >> > > > > > from within the host to the vnc port of that vm will
> >> > > > > > immediately abort that vm. This prevents us from adding new
> >> > > > > > vms. Also, when starting a vm with a vnc port, we got this
> >> > > > > > message:
> >> > > > > >
> >> > > > > > rfbListenOnTCP6Port: error in bind IPv6 socket: Address already
> >> > > > > > in use
> >> > > > > >
> >> > > > > > for which we found someone else had provided a patch at
> >> > > > > > http://permalink.gmane.org/gmane.os.freebsd.devel.emulation/10237
> >> > > > > >
> >> > > > > > So it looks like running multiple vms on an IPv6 system (we
> >> > > > > > have 64-bit FreeBSD 9.0) triggers this problem.
> >> > > > >
> >> > > > > Glad to hear that 4.1.16 helps with the networking problem. The
> >> > > > > VNC problem is also a known one, but the mentioned patch does
> >> > > > > not work, at least for a few people. It seems the bug is
> >> > > > > somewhere in libvncserver, so downgrading net/libvncserver to an
> >> > > > > earlier version (and rebuilding virtualbox) should help until we
> >> > > > > come up with a proper fix.
> >> > > >
> >> > > > You are right about the "Address already in use" problem and the
> >> > > > patch for it, so I will commit the fix in a few moments.
> >> > > >
> >> > > > I have also tried to reproduce the VNC crash but I couldn't,
> >> > > > probably because my system is IPv6 enabled. flo@ has seen the same
> >> > > > crash and has no IPv6 in his kernel, which led him to find this
> >> > > > commit in libvncserver:
> >> > > >
> >> > > > commit 66282f58000c8863e104666c30cb67b1d5cbdee3
> >> > > > Author: Kyle J. McKay
> >> > > > Date: Fri May 18 00:30:11 2012 -0700
> >> > > >
> >> > > >     libvncserver/sockets.c: do not segfault when
> >> > > >     listenSock/listen6Sock == -1
> >> > > >
> >> > > > http://libvncserver.git.sourceforge.net/git/gitweb.cgi?p=libvncserver/libvncserver;a=commit;h=66282f5
> >> > > >
> >> > > > It looks promising, so please test this patch if you can reproduce
> >> > > > the crash.
> >> > > >
> >> > > > --
> >> > > > Bernhard Froehlich
> >> > > > http://www.bluelife.at/
> >> > >
> >> > > Sorry, I tried to test this patch, but couldn't figure out how to
> >> > > do that. I use ports to compile everything, and can see the file is
> >> > > at
> >> > > /usr/ports/net/libvncserver/work/LibVNCServer-0.9.9/libvncserver/sockets.c.
> >> > > However, if I edit this file and do "make clean", the patch is
> >> > > wiped out before I can run "make". How do I apply this patch in the
> >> > > ports?
> >> >
> >> > To apply patches to ports:
> >> > # make clean
> >> > # make patch
> >> >
> >> > # make
> >> > # make deinstall
> >> > # make reinstall
> >> >
> >> > Note that the final two steps assume a version of the port is already
> >> > installed. If not: 'make install'
> >> > If you use portmaster, after applying the patch: 'portmaster -C
> >> > net/libvncserver'
> >>
> >> flo has already committed the patch to net/libvncserver so I guess it
> >> fixes the problem. Please update your ports tree and verify that it
> >> works fine.
> >
> > I confirmed that after upgrading all ports (libvncserver was upgraded
> > to 0.9.9_1) and rebooting, I can vnc to the vms now. Also, starting vms
> > with vnc no longer produces that error; instead it prints the following
> > info, so all problems are solved.
> >
> > 07/06/2012 03:49:14 Listening for VNC connections on TCP port 5903
> > 07/06/2012 03:49:14 Listening for VNC connections on TCP6 port 5903
> >
> > Thanks everyone for your great help!
>
> Unfortunately, it seems that the original problem of one vm disrupting
> all vms and the entire network remains, albeit to a lesser extent. After
> running on virtualbox-ose-4.1.16_1 and libvncserver-0.9.9_1 for 12 hours,
> all vms lost connection again. Also, phpvirtualbox stopped responding, and
> attempts to restart vboxwebsrv hung. Trying to kill (-9) the vboxwebsrv
> process didn't work either. The following was the output of
> "ps aux|grep -i box" at that time:
>
> root 3322 78.7 16.9 4482936 4248180 ?? Is 3:42AM 126:00.53 /usr/local/bin/VBoxHeadless --startvm vm1
> root 3377  0.2  4.3 1286200 1078728 ?? Is 3:42AM  15:39.40 /usr/local/bin/VBoxHeadless --startvm vm2
> root 3388  0.1  4.3 1297592 1084676 ??
>   Is 3:42AM 15:06.97 /usr/local/bin/VBoxHeadless --startvm vm7 -n -m 5907 -o jtlgjkrfyh9tpgjklfds
> root 2453  0.0  0.0  141684    7156 ?? Ts 3:38AM   4:14.09 /usr/local/bin/vboxwebsrv
> root 2478  0.0  0.0   45288    2528 ?? S  3:38AM   1:29.99 /usr/local/lib/virtualbox/VBoxXPCOMIPCD
> root 2494  0.0  0.0  121848    5380 ?? S  3:38AM   3:13.96 /usr/local/lib/virtualbox/VBoxSVC --auto-shutdown
> root 3333  0.0  4.3 1294712 1079608 ?? Is 3:42AM  19:35.09 /usr/local/bin/VBoxHeadless --startvm vm3
> root 3355  0.0  4.3 1290424 1079332 ?? Is 3:42AM  16:43.05 /usr/local/bin/VBoxHeadless --startvm vm5
> root 3366  0.0  8.5 2351436 2140076 ?? Is 3:42AM  17:32.35 /usr/local/bin/VBoxHeadless --startvm vm6
> root 3598  0.0  4.3 1294520 1078664 ?? Ds 3:50AM  15:01.04 /usr/local/bin/VBoxHeadless --startvm vm4 -n -m 5904 -o u679y0uojlkdfsgkjtfds
>
> You can see the vboxwebsrv process has the "T" flag there, and the
> vboxheadless process for vm4 has the "D" flag. I can never kill either of
> these processes, not even with "kill -9". So on the host I disabled the
> interface bridged to vm4 and restarted networking, and fortunately both
> the vm4 and the vboxwebsrv processes disappeared. At that point all other
> vms regained network.
>
> There is some hope that the "troublemaker" is limited to one of the vms
> that was started with vnc, although there was no vnc connection at that
> time, and the other vm with vnc was fine. This is just a hopeful guess.
>
> Also, I found no log or error message related to virtualbox in any log
> file. VBoxSVC.log only had some information from startup and nothing
> since.

If this is still a problem then

    ps alxww | grep -i box

may be more helpful, as it will show the wait channel of processes stuck
in the kernel.

Gary
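
A minimal sketch of the check suggested above, narrowed to the processes that matter here. It assumes FreeBSD's ps(1), whose -o option accepts the keywords pid, state, mwchan and command; the helper name stuck_vbox is made up for illustration. It keeps only VirtualBox-related processes whose state starts with "D" (uninterruptible wait) or "T" (stopped) and prints their wait channel next to the command line.

```shell
# Hypothetical helper (not part of VirtualBox or FreeBSD): reads a listing
# with the columns PID, STAT, MWCHAN, COMMAND on stdin and prints only the
# rows whose state begins with D or T and whose line mentions "box".
stuck_vbox() {
    awk 'NR > 1 && $2 ~ /^[DT]/ && tolower($0) ~ /box/ {
        line = $1 " " $2 " " $3
        for (i = 4; i <= NF; i++) line = line " " $i
        print line
    }'
}

# On the FreeBSD host (assumed invocation, run while the vms are hung):
#   ps -axww -o pid,state,mwchan,command | stuck_vbox
```

The third column of each printed row is the wait channel, i.e. what the process is blocked on inside the kernel, which is the piece of information that "ps aux" does not show.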