Date: Fri, 13 Jul 2012 16:34:12 -0400
From: Steve Tuts <yiz5hwi@gmail.com>
To: freebsd-emulation@freebsd.org
Subject: become worse now - Re: one virtualbox vm disrupts all vms and entire network
Message-ID: <CAEXKtDpKxACxbVQYYM9V2FUCt37PzwXx5ZzcqF_c1zFqynkeyw@mail.gmail.com>
On Mon, Jul 9, 2012 at 9:11 AM, Steve Tuts <yiz5hwi@gmail.com> wrote:
> On Tue, Jun 12, 2012 at 6:24 PM, Gary Palmer <gpalmer@freebsd.org> wrote:
>> On Thu, Jun 07, 2012 at 03:56:22PM -0400, Steve Tuts wrote:
>>> On Thu, Jun 7, 2012 at 3:54 AM, Steve Tuts <yiz5hwi@gmail.com> wrote:
>>>> On Thu, Jun 7, 2012 at 2:58 AM, Bernhard Fröhlich <decke@bluelife.at> wrote:
>>>>> On Do., 7. Jun. 2012 01:07:52 CEST, Kevin Oberman <kob6558@gmail.com> wrote:
>>>>>> On Wed, Jun 6, 2012 at 3:46 PM, Steve Tuts <yiz5hwi@gmail.com> wrote:
>>>>>>> On Wed, Jun 6, 2012 at 3:50 AM, Bernhard Froehlich <decke@freebsd.org> wrote:
>>>>>>>> On 05.06.2012 20:16, Bernhard Froehlich wrote:
>>>>>>>>> On 05.06.2012 19:05, Steve Tuts wrote:
>>>>>>>>>> On Mon, Jun 4, 2012 at 4:11 PM, Rusty Nejdl <rnejdl@ringofsaturn.com> wrote:
>>>>>>>>>>> On 2012-06-02 12:16, Steve Tuts wrote:
>>>>>>>>>>>> Hi, we have a Dell PowerEdge server with a dozen interfaces. It
>>>>>>>>>>>> hosts a few guests of web app and email servers with
>>>>>>>>>>>> VirtualBox-4.0.14. The host and all guests are FreeBSD 9.0 64-bit.
>>>>>>>>>>>> Each guest is bridged to a distinct interface. The host and all
>>>>>>>>>>>> guests are on the 10.0.0.0 network, NAT'ed to a Cisco router.
>>>>>>>>>>>>
>>>>>>>>>>>> This ran well for a couple of months, until we recently added a
>>>>>>>>>>>> new guest. Every few hours, none of the guests can be connected
>>>>>>>>>>>> to. We can only connect to the host from outside the router.
>>>>>>>>>>>> We can also go to the console of the guests (except the new
>>>>>>>>>>>> guest), but from there we can't ping the gateway 10.0.0.1 any
>>>>>>>>>>>> more. The new guest just froze.
>>>>>>>>>>>>
>>>>>>>>>>>> Furthermore, on the host we can see a VBoxHeadless process for
>>>>>>>>>>>> each guest, including the new guest, but we cannot kill it, not
>>>>>>>>>>>> even with "kill -9". We looked around the web and someone
>>>>>>>>>>>> suggested we should use "kill -SIGCONT" first, since the "ps"
>>>>>>>>>>>> output shows the "T" flag for the VBoxHeadless process of the new
>>>>>>>>>>>> guest, but that doesn't help. We also tried all the VBoxManage
>>>>>>>>>>>> commands to poweroff/reset etc. the new guest, but they all
>>>>>>>>>>>> failed, complaining that the VM is in the Aborted state. We also
>>>>>>>>>>>> tried the VBoxManage command to disconnect the network cable of
>>>>>>>>>>>> the new guest; it didn't complain, but there was no effect.
>>>>>>>>>>>>
>>>>>>>>>>>> A couple of times we disabled, on the host, the interface
>>>>>>>>>>>> bridging the new guest; then the VBoxHeadless process for that
>>>>>>>>>>>> guest disappeared (we had attempted to kill it before that), and
>>>>>>>>>>>> immediately all other VMs regained their connections.
>>>>>>>>>>>>
>>>>>>>>>>>> But one time even that didn't help - the VBoxHeadless process for
>>>>>>>>>>>> the new guest stubbornly remained, and we had to reboot the host.
>>>>>>>>>>>> This is already a production server, so we can't upgrade
>>>>>>>>>>>> VirtualBox to the latest version until we obtain a test server.
>>>>>>>>>>>>
>>>>>>>>>>>> Would you advise:
>>>>>>>>>>>> 1. Is there any other way to kill that new guest, short of
>>>>>>>>>>>>    rebooting?
>>>>>>>>>>>> 2. What might cause the problem?
>>>>>>>>>>>> 3. What settings and tests can I use to analyze this problem?
>>>>>>>>>>>
>>>>>>>>>>> I haven't seen any comments on this and don't want you to think
>>>>>>>>>>> you are being ignored, but I haven't seen this problem either.
>>>>>>>>>>> Also, the 4.0 branch was buggier for me than the 4.1 releases, so
>>>>>>>>>>> yeah, upgrading is probably what you are looking at.
>>>>>>>>>>>
>>>>>>>>>>> Rusty Nejdl
>>>>>>>>>>
>>>>>>>>>> Sorry, I just realized my reply yesterday didn't go to the list, so
>>>>>>>>>> I am re-sending it with some updates.
>>>>>>>>>>
>>>>>>>>>> Yes, we upgraded all ports and fortunately everything came back; in
>>>>>>>>>> particular, all VMs have run peacefully for two days now. So
>>>>>>>>>> upgrading to the latest VirtualBox 4.1.16 solved that problem.
>>>>>>>>>>
>>>>>>>>>> But now we have a new problem with this new version of VirtualBox:
>>>>>>>>>> whenever we try to VNC to any VM, that VM goes to the Aborted state
>>>>>>>>>> immediately. Actually, merely telnetting from within the host to
>>>>>>>>>> the VNC port of a VM will immediately abort that VM.
>>>>>>>>>> This prevents us from adding new VMs. Also, when starting a VM
>>>>>>>>>> with a VNC port, we get this message:
>>>>>>>>>>
>>>>>>>>>> rfbListenOnTCP6Port: error in bind IPv6 socket: Address already in use
>>>>>>>>>>
>>>>>>>>>> for which we found someone else provided a patch at
>>>>>>>>>> http://permalink.gmane.org/gmane.os.freebsd.devel.emulation/10237
>>>>>>>>>>
>>>>>>>>>> So it looks like multiple VMs on an IPv6 system (we have 64-bit
>>>>>>>>>> FreeBSD 9.0) will hit this problem.
>>>>>>>>>
>>>>>>>>> Glad to hear that 4.1.16 helps with the networking problem. The VNC
>>>>>>>>> problem is also a known one, but the mentioned patch does not work,
>>>>>>>>> at least for a few people. It seems the bug is somewhere in
>>>>>>>>> libvncserver, so downgrading net/libvncserver to an earlier version
>>>>>>>>> (and rebuilding virtualbox) should help until we come up with a
>>>>>>>>> proper fix.
>>>>>>>>
>>>>>>>> You are right about the "Address already in use" problem and the
>>>>>>>> patch for it, so I will commit the fix in a few moments.
>>>>>>>>
>>>>>>>> I have also tried to reproduce the VNC crash, but I couldn't -
>>>>>>>> probably because my system is IPv6 enabled. flo@ has seen the same
>>>>>>>> crash and has no IPv6 in his kernel, which led him to find this
>>>>>>>> commit in libvncserver:
>>>>>>>>
>>>>>>>> commit 66282f58000c8863e104666c30cb67b1d5cbdee3
>>>>>>>> Author: Kyle J. McKay <mackyle@gmail.com>
>>>>>>>> Date:   Fri May 18 00:30:11 2012 -0700
>>>>>>>>
>>>>>>>>     libvncserver/sockets.c: do not segfault when
>>>>>>>>     listenSock/listen6Sock == -1
>>>>>>>>
>>>>>>>> http://libvncserver.git.sourceforge.net/git/gitweb.cgi?p=libvncserver/libvncserver;a=commit;h=66282f5
>>>>>>>>
>>>>>>>> It looks promising, so please test this patch if you can reproduce
>>>>>>>> the crash.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Bernhard Froehlich
>>>>>>>> http://www.bluelife.at/
>>>>>>>
>>>>>>> Sorry, I tried to try this patch but couldn't figure out how. I use
>>>>>>> ports to compile everything, and can see the file at
>>>>>>> /usr/ports/net/libvncserver/work/LibVNCServer-0.9.9/libvncserver/sockets.c.
>>>>>>> However, if I edit this file and do "make clean", the patch is wiped
>>>>>>> out before I can do "make". How do I apply this patch in the ports?
>>>>>>
>>>>>> To apply patches to ports:
>>>>>> # make clean
>>>>>> # make patch
>>>>>> <Apply patch>
>>>>>> # make
>>>>>> # make deinstall
>>>>>> # make reinstall
>>>>>>
>>>>>> Note that the final two steps assume a version of the port is already
>>>>>> installed. If not: 'make install'.
>>>>>> If you use portmaster, after applying the patch:
>>>>>> 'portmaster -C net/libvncserver'
>>>>>
>>>>> flo has already committed the patch to net/libvncserver, so I guess it
>>>>> fixes the problem. Please update your ports tree and verify that it
>>>>> works fine.
>>>>
>>>> I confirmed that after upgrading all ports (noticing libvncserver
>>>> upgraded to 0.9.9_1) and rebooting, I can VNC to the VMs now.
>>>> Also, starting VMs with VNC no longer produces that error; instead it
>>>> prints the following, so all problems are solved.
>>>>
>>>> 07/06/2012 03:49:14 Listening for VNC connections on TCP port 5903
>>>> 07/06/2012 03:49:14 Listening for VNC connections on TCP6 port 5903
>>>>
>>>> Thanks everyone for your great help!
>>>
>>> Unfortunately, the original problem - one VM disrupting all VMs and the
>>> entire network - appears to remain, albeit in a smaller scope. After
>>> running on virtualbox-ose-4.1.16_1 and libvncserver-0.9.9_1 for 12
>>> hours, all VMs lost their connections again. Also, phpvirtualbox stopped
>>> responding, and attempts to restart vboxwebsrv hung. Trying to kill (-9)
>>> the vboxwebsrv process didn't work either. The following was the output
>>> of "ps aux | grep -i box" at the time:
>>>
>>> root 3322 78.7 16.9 4482936 4248180 ?? Is 3:42AM 126:00.53 /usr/local/bin/VBoxHeadless --startvm vm1
>>> root 3377  0.2  4.3 1286200 1078728 ?? Is 3:42AM  15:39.40 /usr/local/bin/VBoxHeadless --startvm vm2
>>> root 3388  0.1  4.3 1297592 1084676 ?? Is 3:42AM  15:06.97 /usr/local/bin/VBoxHeadless --startvm vm7 -n -m 5907 -o jtlgjkrfyh9tpgjklfds
>>> root 2453  0.0  0.0  141684    7156 ?? Ts 3:38AM   4:14.09 /usr/local/bin/vboxwebsrv
>>> root 2478  0.0  0.0   45288    2528 ?? S  3:38AM   1:29.99 /usr/local/lib/virtualbox/VBoxXPCOMIPCD
>>> root 2494  0.0  0.0  121848    5380 ?? S  3:38AM   3:13.96 /usr/local/lib/virtualbox/VBoxSVC --auto-shutdown
>>> root 3333  0.0  4.3 1294712 1079608 ?? Is 3:42AM  19:35.09 /usr/local/bin/VBoxHeadless --startvm vm3
>>> root 3355  0.0  4.3 1290424 1079332 ?? Is 3:42AM  16:43.05 /usr/local/bin/VBoxHeadless --startvm vm5
>>> root 3366  0.0  8.5 2351436 2140076 ?? Is 3:42AM  17:32.35 /usr/local/bin/VBoxHeadless --startvm vm6
>>> root 3598  0.0  4.3 1294520 1078664 ?? Ds 3:50AM  15:01.04 /usr/local/bin/VBoxHeadless --startvm vm4 -n -m 5904 -o u679y0uojlkdfsgkjtfds
>>>
>>> You can see the vboxwebsrv process has the "T" flag, and the
>>> VBoxHeadless process for vm4 has the "D" flag. Neither of these
>>> processes could ever be killed, not even with "kill -9". So on the host
>>> I disabled the interface bridged to vm4 and restarted the network, and
>>> fortunately both the vm4 and vboxwebsrv processes disappeared. At that
>>> point all other VMs regained their network.
>>>
>>> One hopeful sign is that the "troublemaker" may be limited to one of the
>>> VMs that was started with VNC, although there was no VNC connection at
>>> the time, and the other VM with VNC was fine. But this is just a guess.
>>>
>>> Also, I found no log or error message related to VirtualBox in any log
>>> file. VBoxSVC.log only had some information from startup and nothing
>>> since.
>>
>> If this is still a problem then
>>
>> ps alxww | grep -i box
>>
>> may be more helpful, as it will show the wait channel of processes stuck
>> in the kernel.
>>
>> Gary
>
> We avoided this problem by running all VMs without VNC. But yesterday we
> forgot about it and left one VM with VNC running, together with the other
> few running VMs, and hit the problem again on VirtualBox 4.1.16. Only the
> old trick - turning off the host interface corresponding to the VM with
> VNC and then restarting the host network - got us out of it.
>
> We then upgraded VirtualBox to 4.1.18, turned off all VMs, waited until
> "ps aux | grep -i box" reported nothing, and started all VMs again,
> leaving no VM with VNC running.
>
> Still the problem hit us again. Here is the output of
> "ps alxww | grep -i box", as you suggested:
>
> 1011 42725 1 0 20 0 1289796 1081064 IPRT S Is ?? 30:53.24 VBoxHeadless --startvm vm5
>
> After "kill -9 42725", the line changed to
>
> 1011 42725 1 0 20 0 1289796 1081064 keglim Ts ?? 30:53.24 VBoxHeadless --startvm vm5
>
> After "kill -9" for another VM, its line changed to something like
>
> 1011 42754 1 0 20 0 1289796 1081064 -      Ts ?? 30:53.24 VBoxHeadless --startvm vm7
>
> The controlvm commands don't work either; they got stuck themselves. The
> following are their ps entries:
>
> 0 89572 79180 0 21 0 44708 1644 select I+ v6 0:00.01 VBoxManage controlvm projects_outside acpipowerbutton
> 0 89605 89586 0 21 0 44708 2196 select I+ v7 0:00.01 VBoxManage controlvm projects_outside poweroff
>
> We have now rebooted the host and left no VM with VNC running.

The problem has become more rampant now. After rebooting and running
virtualbox-ose-4.1.18, with no VM started with a console, the roughly ten
VMs, each bridged to its own dedicated interface, lose network connection a
couple of times a day. Most times it recovers by itself after about 10
minutes; sometimes we have to restart the host network, which immediately
restores all connections.
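[Editor's note: the port-patching procedure Kevin describes in the thread
above can be collected into a small script. This is a sketch only: the
PORTDIR and PATCHFILE values are hypothetical examples, and DRY_RUN defaults
to "yes" so the script merely prints the command sequence; set DRY_RUN=no on
a real FreeBSD host with a ports tree to execute it.]

```shell
#!/bin/sh
# Sketch of the ports patch workflow from the thread. DRY_RUN=yes (the
# default here) prints each step instead of running it.
DRY_RUN=${DRY_RUN:-yes}
PORTDIR=${PORTDIR:-/usr/ports/net/libvncserver}
PATCHFILE=${PATCHFILE:-/tmp/sockets-fix.diff}   # hypothetical patch file name

run() {
    if [ "$DRY_RUN" = yes ]; then
        echo "$*"          # dry run: show the step
    else
        "$@"               # real run: execute it
    fi
}

run cd "$PORTDIR"
run make clean
run make patch             # extract the port and apply its own patches
run patch -d "$PORTDIR/work/LibVNCServer-0.9.9" -p1 -i "$PATCHFILE"
run make
run make deinstall         # assumes an older version is installed;
run make reinstall         # otherwise a single 'make install' would do
```

Editing sockets.c between "make patch" and "make" (instead of applying a
patch file) follows the same sequence; the key point is that "make clean"
discards the work directory, so local changes must come after "make patch".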
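[Editor's note: Gary's "ps alxww" suggestion works because the MWCHAN column
names what a kernel-stuck process is waiting on, and the STAT column shows
"D" (uninterruptible sleep) or "T" (stopped) for exactly the processes that
SIGKILL cannot remove. A minimal awk sketch of that triage, assuming the
FreeBSD "ps alxww" column order (UID PID PPID CPU PRI NI VSZ RSS MWCHAN
STAT TT TIME COMMAND) and using sample lines adapted from the thread; on a
live host you would pipe in real output, e.g. `ps alxww | grep -i box`.]

```shell
#!/bin/sh
# Print the PID, state, wait channel, and command of processes whose
# STAT contains "D" or "T" - the ones "kill -9" will not remove.
find_stuck() {
    awk '$10 ~ /[DT]/ { printf "pid=%s stat=%s wchan=%s cmd=%s\n", $2, $10, $9, $13 }'
}

# Sample lines adapted from the thread: vm5 is stopped (Ts) waiting on
# "keglim"; the second process is idle (Is) and is filtered out.
find_stuck <<'EOF'
1011 42725 1 0 20 0 1289796 1081064 keglim Ts ?? 30:53.24 VBoxHeadless --startvm vm5
1011 42730 1 0 20 0 1290424 1079332 select Is ?? 16:43.05 VBoxHeadless --startvm vm3
EOF
```

The "keglim" wait channel seen in the thread, for instance, points at a
kernel memory-allocation limit rather than anything VNC-specific.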
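[Editor's note: the shutdown attempts described in the thread (ACPI power
button, then hard poweroff, then "kill -9") can be arranged as an escalating
sequence. A sketch under stated assumptions: the VM name "vm4" and the sleep
intervals are hypothetical, and DRY_RUN defaults to "yes" so the steps are
only printed; set DRY_RUN=no on a host with VirtualBox to execute them.]

```shell
#!/bin/sh
# Escalating shutdown for a wedged headless VM, mildest step first.
DRY_RUN=${DRY_RUN:-yes}
VM=${VM:-vm4}

step() {
    if [ "$DRY_RUN" = yes ]; then echo "$*"; else "$@"; fi
}

step VBoxManage controlvm "$VM" acpipowerbutton   # polite ACPI shutdown request
step sleep 30                                     # give the guest time to halt
step VBoxManage controlvm "$VM" poweroff          # virtual hard power-off
step sleep 10
step pkill -9 -f "VBoxHeadless --startvm $VM"     # last resort
# As the thread shows, even SIGKILL cannot remove a process stuck in the
# kernel ("D" state); at that point only clearing whatever it is waiting
# on (e.g. downing the bridged interface) or rebooting the host helps.
```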