From: Steve Tuts
To: freebsd-emulation@freebsd.org
Date: Fri, 13 Jul 2012 16:34:12 -0400
Subject: become worse now - Re: one virtualbox vm disrupts all vms and entire network

On Mon, Jul 9, 2012 at 9:11 AM, Steve Tuts wrote:
>
> On Tue, Jun 12, 2012 at 6:24 PM, Gary Palmer
> wrote:
>
>> On Thu, Jun 07, 2012 at 03:56:22PM -0400, Steve Tuts wrote:
>>> On Thu, Jun 7, 2012 at 3:54 AM, Steve Tuts wrote:
>>>
>>>> On Thu, Jun 7, 2012 at 2:58 AM, Bernhard Fröhlich wrote:
>>>>
>>>>> On Thu, 7 Jun 2012 01:07:52 CEST, Kevin Oberman <kob6558@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On Wed, Jun 6, 2012 at 3:46 PM, Steve Tuts wrote:
>>>>>>> On Wed, Jun 6, 2012 at 3:50 AM, Bernhard Froehlich wrote:
>>>>>>>
>>>>>>>> On 05.06.2012 20:16, Bernhard Froehlich wrote:
>>>>>>>>
>>>>>>>>> On 05.06.2012 19:05, Steve Tuts wrote:
>>>>>>>>>
>>>>>>>>>> On Mon, Jun 4, 2012 at 4:11 PM, Rusty Nejdl wrote:
>>>>>>>>>>
>>>>>>>>>>> On 2012-06-02 12:16, Steve Tuts wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, we have a Dell PowerEdge server with a dozen interfaces.
>>>>>>>>>>>> It hosts a few guests of web app and email servers with
>>>>>>>>>>>> VirtualBox-4.0.14. The host and all guests are FreeBSD 9.0
>>>>>>>>>>>> 64-bit. Each guest is bridged to a distinct interface. The
>>>>>>>>>>>> host and all guests are set to 10.0.0.0 network NAT'ed to a
>>>>>>>>>>>> Cisco router.
>>>>>>>>>>>>
>>>>>>>>>>>> This runs well for a couple months, until we added a new
>>>>>>>>>>>> guest recently. Every few hours, none of the guests can be
>>>>>>>>>>>> connected. We can only connect to the host from outside the
>>>>>>>>>>>> router. We can also go to the console of the guests (except
>>>>>>>>>>>> the new guest), but from there we can't ping the gateway
>>>>>>>>>>>> 10.0.0.1 any more. The new guest just froze.
>>>>>>>>>>>>
>>>>>>>>>>>> Furthermore, on the host we can see a vboxheadless process
>>>>>>>>>>>> for each guest, including the new guest. But we can not kill
>>>>>>>>>>>> it, not even with "kill -9". We looked around the web and
>>>>>>>>>>>> someone suggested we should use "kill -SIGCONT" first since
>>>>>>>>>>>> the "ps" output has the "T" flag for that vboxheadless
>>>>>>>>>>>> process for that new guest, but that doesn't help. We also
>>>>>>>>>>>> tried all the VBoxManage commands to poweroff/reset etc that
>>>>>>>>>>>> new guest, but they all failed complaining that vm is in
>>>>>>>>>>>> Aborted state. We also tried VBoxManage commands to
>>>>>>>>>>>> disconnect the network cable for that new guest, it didn't
>>>>>>>>>>>> complain, but there was no effect.
>>>>>>>>>>>>
>>>>>>>>>>>> For a couple times, on the host we disabled the interface
>>>>>>>>>>>> bridging that new guest, then that vboxheadless process for
>>>>>>>>>>>> that new guest disappeared (we attempted to kill it before
>>>>>>>>>>>> that). And immediately all other vms regained connection
>>>>>>>>>>>> back to normal.
>>>>>>>>>>>>
>>>>>>>>>>>> But there is one time even the above didn't help - the
>>>>>>>>>>>> vboxheadless process for that new guest stubbornly remains,
>>>>>>>>>>>> and we had to reboot the host.
>>>>>>>>>>>>
>>>>>>>>>>>> This is already a production server, so we can't upgrade
>>>>>>>>>>>> virtualbox to the latest version until we obtain a test
>>>>>>>>>>>> server.
>>>>>>>>>>>>
>>>>>>>>>>>> Would you advise:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. is there any other way to kill that new guest instead of
>>>>>>>>>>>>    rebooting?
>>>>>>>>>>>> 2. what might cause the problem?
>>>>>>>>>>>> 3. what setting and test I can do to analyze this problem?
>>>>>>>>>>>
>>>>>>>>>>> I haven't seen any comments on this and don't want you to
>>>>>>>>>>> think you are being ignored but I haven't seen this but also,
>>>>>>>>>>> the 4.0 branch was buggier for me than the 4.1 releases so
>>>>>>>>>>> yeah, upgrading is probably what you are looking at.
>>>>>>>>>>>
>>>>>>>>>>> Rusty Nejdl
>>>>>>>>>>
>>>>>>>>>> sorry, just realized my reply yesterday didn't go to the list,
>>>>>>>>>> so am re-sending with some updates.
>>>>>>>>>>
>>>>>>>>>> Yes, we upgraded all ports and fortunately everything went back
>>>>>>>>>> and especially all vms have run peacefully for two days now. So
>>>>>>>>>> upgrading to the latest virtualbox 4.1.16 solved that problem.
>>>>>>>>>>
>>>>>>>>>> But now we got a new problem with this new version of
>>>>>>>>>> virtualbox: whenever we try to vnc to any vm, that vm will go
>>>>>>>>>> to Aborted state immediately. Actually, merely telnet from
>>>>>>>>>> within the host to the vnc port of that vm will immediately
>>>>>>>>>> Abort that vm. This prevents us from adding new vms.
>>>>>>>>>> Also, when starting vm with vnc port, we got this message:
>>>>>>>>>>
>>>>>>>>>> rfbListenOnTCP6Port: error in bind IPv6 socket: Address already
>>>>>>>>>> in use
>>>>>>>>>>
>>>>>>>>>> , which we found someone else provided a patch at
>>>>>>>>>> http://permalink.gmane.org/gmane.os.freebsd.devel.emulation/10237
>>>>>>>>>>
>>>>>>>>>> So looks like when there are multiple vms on an IPv6 system (we
>>>>>>>>>> have 64-bit FreeBSD 9.0) will get this problem.
>>>>>>>>>
>>>>>>>>> Glad to hear that 4.1.16 helps for the networking problem. The
>>>>>>>>> VNC problem is also a known one but the mentioned patch does not
>>>>>>>>> work at least for a few people. It seems the bug is somewhere in
>>>>>>>>> libvncserver so downgrading net/libvncserver to an earlier
>>>>>>>>> version (and rebuilding virtualbox) should help until we come up
>>>>>>>>> with a proper fix.
>>>>>>>>
>>>>>>>> You are right about the "Address already in use" problem and the
>>>>>>>> patch for it so I will commit the fix in a few moments.
>>>>>>>>
>>>>>>>> I have also tried to reproduce the VNC crash but I couldn't.
>>>>>>>> Probably because my system is IPv6 enabled. flo@ has seen the
>>>>>>>> same crash and has no IPv6 in his kernel which led him to find
>>>>>>>> this commit in libvncserver:
>>>>>>>>
>>>>>>>> commit 66282f58000c8863e104666c30cb67b1d5cbdee3
>>>>>>>> Author: Kyle J. McKay
>>>>>>>> Date: Fri May 18 00:30:11 2012 -0700
>>>>>>>>
>>>>>>>>     libvncserver/sockets.c: do not segfault when
>>>>>>>>     listenSock/listen6Sock == -1
>>>>>>>>
>>>>>>>> http://libvncserver.git.sourceforge.net/git/gitweb.cgi?p=libvncserver/libvncserver;a=commit;h=66282f5
>>>>>>>>
>>>>>>>> It looks promising so please test this patch if you can reproduce
>>>>>>>> the crash.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Bernhard Froehlich
>>>>>>>> http://www.bluelife.at/
>>>>>>>
>>>>>>> Sorry, I tried to try this patch, but couldn't figure out how to
>>>>>>> do that. I use ports to compile everything, and can see the file
>>>>>>> is at
>>>>>>> /usr/ports/net/libvncserver/work/LibVNCServer-0.9.9/libvncserver/sockets.c
>>>>>>> . However, if I edit this file and do make clean, this patch is
>>>>>>> wiped out before I can do "make" out of it. How to apply this
>>>>>>> patch in the ports?
>>>>>>
>>>>>> To apply patches to ports:
>>>>>> # make clean
>>>>>> # make patch
>>>>>>
>>>>>> # make
>>>>>> # make deinstall
>>>>>> # make reinstall
>>>>>>
>>>>>> Note that the final two steps assume a version of the port is
>>>>>> already installed. If not: 'make install'
>>>>>> If you use portmaster, after applying the patch: 'portmaster -C
>>>>>> net/libvncserver'
>>>>>
>>>>> flo has already committed the patch to net/libvncserver so I guess
>>>>> it fixes the problem. Please update your portstree and verify that
>>>>> it works fine.
>>>>
>>>> I confirmed after upgrading all ports and noticing libvncserver
>>>> upgraded to 0.9.9_1 and reboot, then I can vnc to the vms now.
>>>> Also, starting vms with vnc doesn't have that error now, instead it
>>>> issues the following info, so all problems are solved.
>>>>
>>>> 07/06/2012 03:49:14 Listening for VNC connections on TCP port 5903
>>>> 07/06/2012 03:49:14 Listening for VNC connections on TCP6 port 5903
>>>>
>>>> Thanks everyone for your great help!
>>>
>>> Unfortunately, seems that the original problem of one vm disrupts all
>>> vms and entire network appears to remain, albeit to less scope. After
>>> running on virtualbox-ose-4.1.16_1 and libvncserver-0.9.9_1 for 12
>>> hours, all vms lost connection again. Also, phpvirtualbox stopped
>>> responding, and attempts to restart vboxwebsrv hung. And trying to
>>> kill (-9) the vboxwebsrv process won't work. The following was the
>>> output of "ps aux|grep -i box" at that time:
>>>
>>> root 3322 78.7 16.9 4482936 4248180 ?? Is 3:42AM 126:00.53 /usr/local/bin/VBoxHeadless --startvm vm1
>>> root 3377  0.2  4.3 1286200 1078728 ?? Is 3:42AM  15:39.40 /usr/local/bin/VBoxHeadless --startvm vm2
>>> root 3388  0.1  4.3 1297592 1084676 ?? Is 3:42AM  15:06.97 /usr/local/bin/VBoxHeadless --startvm vm7 -n -m 5907 -o jtlgjkrfyh9tpgjklfds
>>> root 2453  0.0  0.0  141684    7156 ?? Ts 3:38AM   4:14.09 /usr/local/bin/vboxwebsrv
>>> root 2478  0.0  0.0   45288    2528 ?? S  3:38AM   1:29.99 /usr/local/lib/virtualbox/VBoxXPCOMIPCD
>>> root 2494  0.0  0.0  121848    5380 ?? S  3:38AM   3:13.96 /usr/local/lib/virtualbox/VBoxSVC --auto-shutdown
>>> root 3333  0.0  4.3 1294712 1079608 ?? Is 3:42AM  19:35.09 /usr/local/bin/VBoxHeadless --startvm vm3
>>> root 3355  0.0  4.3 1290424 1079332 ?? Is 3:42AM  16:43.05 /usr/local/bin/VBoxHeadless --startvm vm5
>>> root 3366  0.0  8.5 2351436 2140076 ?? Is 3:42AM  17:32.35 /usr/local/bin/VBoxHeadless --startvm vm6
>>> root 3598  0.0  4.3 1294520 1078664 ?? Ds 3:50AM  15:01.04 /usr/local/bin/VBoxHeadless --startvm vm4 -n -m 5904 -o u679y0uojlkdfsgkjtfds
>>>
>>> You can see the vboxwebsrv process has the "T" flag there, and the
>>> vboxheadless process for vm4 has "D" flag there. Both of such
>>> processes I can never kill, not even with "kill -9". So on the host I
>>> disabled the interface bridged to vm4 and restarted network, and
>>> fortunately both the vm4 and the vboxwebsrv processes disappeared.
>>> And at that point all other vms regained network.
>>>
>>> There may be one hope that the "troublemaker" may be limited to one of
>>> the vms that started with vnc, although there was no vnc connection at
>>> that time, and the other vm with vnc was fine. And this is just a
>>> hopeful guess.
>>>
>>> Also I found no log or error message related to virtualbox in any log
>>> file. The VBoxSVC.log only had some information when started but
>>> never since.
>>
>> If this is still a problem then
>>
>> ps alxww | grep -i box
>>
>> may be more helpful as it will show the wait channel of processes
>> stuck in the kernel.
>>
>> Gary
>
> We avoided this problem by running all vms without vnc. But forgot this
> problem and left one vm on with vnc, together with the other few running
> vms yesterday, and hit this problem again on virtualbox 4.1.16. Only the
> old trick of turning off the host interface corresponding to the vm with
> vnc and then restarting host network got us out of the problem.
>
> We then upgraded virtualbox to 4.1.18, turning off all vms, waiting
> until "ps aux|grep -i box" reported nothing, then started all vms. And
> let no vm with vnc running.
>
> Still the problem hit us again. Here is the output of "ps alxww | grep
> -i box" as you suggested:
>
> 1011 42725 1 0 20 0 1289796 1081064 IPRT S Is ?? 30:53.24 VBoxHeadless --startvm vm5
>
> after "kill -9 42725", the line changed to
>
> 1011 42725 1 0 20 0 1289796 1081064 keglim Ts ??
> 30:53.24 VBoxHeadless --startvm vm5
>
> after "kill -9" for another vm, the line changed to something like
>
> 1011 42754 1 0 20 0 1289796 1081064 - Ts ?? 30:53.24 VBoxHeadless --startvm vm7
>
> and the controlvm commands don't work; these commands got stuck
> themselves. The following are their outputs:
>
> 0 89572 79180 0 21 0 44708 1644 select I+ v6 0:00.01 VBoxManage controlvm projects_outside acpipowerbutton
> 0 89605 89586 0 21 0 44708 2196 select I+ v7 0:00.01 VBoxManage controlvm projects_outside poweroff
>
> We now rebooted the host, and left no vm with vnc running.

The problem has gotten worse now. After rebooting and running
virtualbox-ose-4.1.18, with no vm started with a console, the roughly 10
vms, each bridged to its own dedicated interface, lose their network
connection a couple of times a day. Most times the connection recovers
by itself after about 10 minutes; sometimes we have to restart the host
network, which immediately restores all connections.
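
For anyone who wants to sift this kind of output automatically, here is a minimal sketch in Python that picks out processes in the stopped ("T") or uninterruptible-wait ("D") state together with their wait channel. The column order is an assumption taken from the "ps alxww" lines quoted above (UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND), and the names parse_ps_line and stuck_processes are hypothetical helpers, not part of any VirtualBox or FreeBSD tool.

```python
import re

# TIME column as printed by ps, e.g. "30:53.24" or "0:00.01".
TIME_RE = re.compile(r"^\d+:\d{2}\.\d{2}$")

# Assumed layout, taken from the pasted output:
# UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
NUMERIC_COLUMNS = 8


def parse_ps_line(line):
    """Return pid, wait channel, state and command for one line, or None."""
    tokens = line.split()
    if len(tokens) < NUMERIC_COLUMNS:
        return None
    if not all(t.lstrip("-").isdigit() for t in tokens[:NUMERIC_COLUMNS]):
        return None  # header line or unrelated text
    # The wait channel may contain a space ("IPRT S"), so anchor on the
    # TIME column instead of counting fields; STAT and TT precede it.
    for i in range(NUMERIC_COLUMNS + 3, len(tokens)):
        if TIME_RE.match(tokens[i]):
            break
    else:
        return None
    return {
        "pid": int(tokens[1]),
        "wchan": " ".join(tokens[NUMERIC_COLUMNS:i - 2]),
        "stat": tokens[i - 2],
        "command": " ".join(tokens[i + 1:]),
    }


def stuck_processes(ps_output):
    """List parsed records whose state starts with D (disk wait) or T (stopped)."""
    records = (parse_ps_line(line) for line in ps_output.splitlines())
    return [r for r in records if r and r["stat"][0] in "DT"]


# Sample lines copied from the output quoted in this thread.
SAMPLE = """\
1011 42725 1 0 20 0 1289796 1081064 IPRT S Is ?? 30:53.24 VBoxHeadless --startvm vm5
1011 42725 1 0 20 0 1289796 1081064 keglim Ts ?? 30:53.24 VBoxHeadless --startvm vm5
1011 42754 1 0 20 0 1289796 1081064 - Ts ?? 30:53.24 VBoxHeadless --startvm vm7
0 89572 79180 0 21 0 44708 1644 select I+ v6 0:00.01 VBoxManage controlvm projects_outside acpipowerbutton
"""

if __name__ == "__main__":
    for rec in stuck_processes(SAMPLE):
        print(rec["pid"], rec["stat"], rec["wchan"], rec["command"])
```

Anchoring on the TIME column is a design choice made only because the wait channel in the first sample line contains a space; on a system where ps prints different columns the parsing would need adjusting.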