Date:      Tue, 12 Jun 2012 18:24:37 -0400
From:      Gary Palmer <gpalmer@freebsd.org>
To:        Steve Tuts <yiz5hwi@gmail.com>
Cc:        freebsd-emulation@freebsd.org
Subject:   Re: Still unresolved - Re: one virtualbox vm disrupts all vms and entire network
Message-ID:  <20120612222437.GB14487@in-addr.com>
In-Reply-To: <CAEXKtDqVCh-e-WNa2H%2Bj8ah0M0qiXvQ11J0ZLywzfS48UuJdCQ@mail.gmail.com>
References:  <CAEXKtDqVCh-e-WNa2H%2Bj8ah0M0qiXvQ11J0ZLywzfS48UuJdCQ@mail.gmail.com>

On Thu, Jun 07, 2012 at 03:56:22PM -0400, Steve Tuts wrote:
> On Thu, Jun 7, 2012 at 3:54 AM, Steve Tuts <yiz5hwi@gmail.com> wrote:
> 
> >
> >
> > On Thu, Jun 7, 2012 at 2:58 AM, Bernhard Fröhlich <decke@bluelife.at> wrote:
> >
> >> On Thu, 7 Jun 2012 01:07:52 CEST, Kevin Oberman <kob6558@gmail.com>
> >> wrote:
> >>
> >> > On Wed, Jun 6, 2012 at 3:46 PM, Steve Tuts <yiz5hwi@gmail.com> wrote:
> >> > > On Wed, Jun 6, 2012 at 3:50 AM, Bernhard Froehlich
> >> > > <decke@freebsd.org> wrote:
> >> > >
> >> > > > On 05.06.2012 20:16, Bernhard Froehlich wrote:
> >> > > >
> >> > > > > On 05.06.2012 19:05, Steve Tuts wrote:
> >> > > > >
> >> > > > > > On Mon, Jun 4, 2012 at 4:11 PM, Rusty Nejdl
> >> > > > > > <rnejdl@ringofsaturn.com> wrote:
> >> > > > > >
> >> > > > > >  On 2012-06-02 12:16, Steve Tuts wrote:
> >> > > > > > >
> >> > > > > > > > Hi, we have a Dell PowerEdge server with a dozen interfaces.
> >> > > > > > > > It hosts a few guests of web app and email servers with
> >> > > > > > > > VirtualBox-4.0.14.  The host and all guests are FreeBSD 9.0
> >> > > > > > > > 64bit.  Each guest is bridged to a distinct interface.  The
> >> > > > > > > > host and all guests are set to 10.0.0.0 network NAT'ed to a
> >> > > > > > > > Cisco router.
> >> > > > > > > >
> >> > > > > > > > This ran well for a couple of months, until we added a new
> >> > > > > > > > guest recently.  Every few hours, none of the guests can be
> >> > > > > > > > reached.  We can only connect to the host from outside the
> >> > > > > > > > router.  We can also go to the console of the guests (except
> >> > > > > > > > the new guest), but from there we can't ping the gateway
> >> > > > > > > > 10.0.0.1 any more.  The new guest just froze.
> >> > > > > > > >
> >> > > > > > > > Furthermore, on the host we can see a vboxheadless process
> >> > > > > > > > for each guest,
> >> > > > > > > > including the new guest.  But we can not kill it, not even
> >> > > > > > > > with "kill -9".
> >> > > > > > > > We looked around the web and someone suggested we should use
> >> > > > > > > > "kill -SIGCONT" first since the "ps" output has the "T" flag
> >> > > > > > > > for that vboxheadless process for that new guest, but that
> >> > > > > > > > doesn't help.  We also tried all the VBoxManage commands to
> >> > > > > > > > poweroff/reset etc. that new guest, but they all failed
> >> > > > > > > > complaining that the vm is in Aborted state.  We also tried
> >> > > > > > > > VBoxManage commands to disconnect the network cable for that
> >> > > > > > > > new guest; they didn't complain, but there was no effect.
> >> > > > > > > >
> >> > > > > > > > A couple of times, on the host we disabled the interface
> >> > > > > > > > bridging that new guest, and then the vboxheadless process
> >> > > > > > > > for that new guest disappeared (we had attempted to kill it
> >> > > > > > > > before that).  Immediately all other vms regained their
> >> > > > > > > > connections.
> >> > > > > > > >
> >> > > > > > > > But there is one time even the above didn't help - the
> >> > > > > > > > vboxheadless process
> >> > > > > > > > for that new guest stubbornly remained, and we had to reboot
> >> > > > > > > > the host.
> >> > > > > > > >
> >> > > > > > > > This is already a production server, so we can't upgrade
> >> > > > > > > > virtualbox to the
> >> > > > > > > > latest version until we obtain a test server.
> >> > > > > > > >
> >> > > > > > > > Would you advise:
> >> > > > > > > >
> >> > > > > > > > 1. is there any other way to kill that new guest instead of
> >> > > > > > > >    rebooting?
> >> > > > > > > > 2. what might cause the problem?
> >> > > > > > > > 3. what setting and test I can do to analyze this problem?
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > I haven't seen any comments on this and don't want you to
> >> > > > > > > think you are being ignored.  I haven't seen this problem
> >> > > > > > > myself, but the 4.0 branch was buggier for me than the 4.1
> >> > > > > > > releases, so upgrading is probably what you are looking at.
> >> > > > > > >
> >> > > > > > > Rusty Nejdl
> >> > > > > > >
> >> > > > > > >
> >> > > > > > Sorry, just realized my reply yesterday didn't go to the list,
> >> > > > > > so am re-sending with some updates.
> >> > > > > >
> >> > > > > > Yes, we upgraded all ports and fortunately everything went back
> >> > > > > > to normal; in particular all vms have run peacefully for two
> >> > > > > > days now.  So upgrading to the latest virtualbox 4.1.16 solved
> >> > > > > > that problem.
> >> > > > > >
> >> > > > > > But now we got a new problem with this new version of
> >> > > > > > virtualbox: whenever we try to vnc to any vm, that vm will go
> >> > > > > > to Aborted state
> >> > > > > > immediately. Actually, merely telnet from within the host to the
> >> > > > > > vnc port of that vm will immediately Abort that vm.  This
> >> > > > > > prevents us from adding new vms. Also, when starting vm with vnc
> >> > > > > > port, we got this message:
> >> > > > > >
> >> > > > > > rfbListenOnTCP6Port: error in bind IPv6 socket: Address already
> >> > > > > > in use
> >> > > > > >
> >> > > > > > , for which we found someone else provided a patch at
> >> > > > > >
> >> > > > > > http://permalink.gmane.org/gmane.os.freebsd.devel.emulation/10237
> >> > > > > >
> >> > > > > > So it looks like running multiple vms on an IPv6 system (we
> >> > > > > > have 64bit FreeBSD 9.0) triggers this problem.
> >> > > > > >
> >> > > > >
> >> > > > > Glad to hear that 4.1.16 helps for the networking problem. The VNC
> >> > > > > problem is also a known one but the mentioned patch does not work
> >> > > > > at least for a few people. It seems the bug is somewhere in
> >> > > > > libvncserver so downgrading net/libvncserver to an earlier version
> >> > > > > (and rebuilding virtualbox) should help until we come up with a
> >> > > > > proper fix.
> >> > > > >
> >> > > >
> >> > > > You are right about the "Address already in use" problem and the
> >> > > > patch for it so I will commit the fix in a few moments.
> >> > > >
> >> > > > I have also tried to reproduce the VNC crash but I couldn't,
> >> > > > probably because my system is IPv6 enabled. flo@ has seen the
> >> > > > same crash and has no IPv6 in his kernel, which led him to find
> >> > > > this commit in libvncserver:
> >> > > >
> >> > > >
> >> > > > commit 66282f58000c8863e104666c30cb67b1d5cbdee3
> >> > > > Author: Kyle J. McKay <mackyle@gmail.com>
> >> > > > Date:   Fri May 18 00:30:11 2012 -0700
> >> > > >     libvncserver/sockets.c: do not segfault when listenSock/listen6Sock == -1
> >> > > >
> >> > > > http://libvncserver.git.sourceforge.net/git/gitweb.cgi?p=libvncserver/libvncserver;a=commit;h=66282f5
> >> > > >
> >> > > >
> >> > > > It looks promising so please test this patch if you can reproduce
> >> > > > the crash.
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Bernhard Froehlich
> >> > > > http://www.bluelife.at/
> >> > > >
> >> > >
> >> > > Sorry, I tried to apply this patch, but couldn't figure out how to do
> >> > > that. I use ports to compile everything, and can see the file is at
> >> > > /usr/ports/net/libvncserver/work/LibVNCServer-0.9.9/libvncserver/sockets.c.
> >> > > However, if I edit this file and do make clean, this patch is wiped
> >> > > out before I can do "make" out of it.  How to apply this patch in the
> >> > > ports?
> >> >
> >> > To apply patches to ports:
> >> > # make clean
> >> > # make patch
> >> > <Apply patch>
> >> > # make
> >> > # make deinstall
> >> > # make reinstall
> >> >
> >> > Note that the final two steps assume a version of the port is already
> >> > installed. If not: 'make install'
> >> > If you use portmaster, after applying the patch: 'portmaster -C
> >> > net/libvncserver'
> >>
> >> flo has already committed the patch to net/libvncserver so I guess it
> >> fixes the problem. Please update your ports tree and verify that it
> >> works fine.
> >>
> >
> > I confirmed that after upgrading all ports (libvncserver was upgraded
> > to 0.9.9_1) and rebooting, I can vnc to the vms now.  Also, starting vms
> > with vnc no longer gives that error; instead it issues the following
> > info, so all problems are solved.
> >
> > 07/06/2012 03:49:14 Listening for VNC connections on TCP port 5903
> > 07/06/2012 03:49:14 Listening for VNC connections on TCP6 port 5903
> >
> > Thanks everyone for your great help!
> >
> 
> Unfortunately, it seems that the original problem of one vm disrupting
> all vms and the entire network remains, albeit to a lesser extent.  After
> running on virtualbox-ose-4.1.16_1 and libvncserver-0.9.9_1 for 12 hours,
> all vms lost connection again.  Also, phpvirtualbox stopped responding,
> and attempts to restart vboxwebsrv hung.  And trying to kill (-9) the
> vboxwebsrv process won't work.  The following was the output of "ps
> aux|grep -i box" at that time:
> 
> root 3322  78.7 16.9 4482936 4248180  ??  Is    3:42AM   126:00.53 /usr/local/bin/VBoxHeadless --startvm vm1
> root 3377   0.2  4.3 1286200 1078728  ??  Is    3:42AM    15:39.40 /usr/local/bin/VBoxHeadless --startvm vm2
> root 3388   0.1  4.3 1297592 1084676  ??  Is    3:42AM    15:06.97 /usr/local/bin/VBoxHeadless --startvm vm7 -n -m 5907 -o jtlgjkrfyh9tpgjklfds
> root 2453   0.0  0.0 141684   7156  ??  Ts    3:38AM     4:14.09 /usr/local/bin/vboxwebsrv
> root 2478   0.0  0.0  45288   2528  ??  S     3:38AM     1:29.99 /usr/local/lib/virtualbox/VBoxXPCOMIPCD
> root 2494   0.0  0.0 121848   5380  ??  S     3:38AM     3:13.96 /usr/local/lib/virtualbox/VBoxSVC --auto-shutdown
> root 3333   0.0  4.3 1294712 1079608  ??  Is    3:42AM    19:35.09 /usr/local/bin/VBoxHeadless --startvm vm3
> root 3355   0.0  4.3 1290424 1079332  ??  Is    3:42AM    16:43.05 /usr/local/bin/VBoxHeadless --startvm vm5
> root 3366   0.0  8.5 2351436 2140076  ??  Is    3:42AM    17:32.35 /usr/local/bin/VBoxHeadless --startvm vm6
> root 3598   0.0  4.3 1294520 1078664  ??  Ds    3:50AM    15:01.04 /usr/local/bin/VBoxHeadless --startvm vm4 -n -m 5904 -o u679y0uojlkdfsgkjtfds
> 
> You can see the vboxwebsrv process has the "T" flag there, and the
> vboxheadless process for vm4 has the "D" flag there.  I could not kill
> either of these processes, not even with "kill -9".  So on the host I
> disabled the interface bridged to vm4 and restarted the network, and
> fortunately both the vm4 and the vboxwebsrv processes disappeared.  At
> that point all other vms regained network.
> 
> There may be one hope that the "troublemaker" may be limited to one of the
> vms that started with vnc, although there was no vnc connection at that
> time, and the other vm with vnc was fine.  And this is just a hopeful guess.
> 
> Also I found no log or error message related to virtualbox in any log
> file.  The VBoxSVC.log only had some information when started but never
> since.

If this is still a problem then

ps alxww | grep -i box

may be more helpful as it will show the wait channel of processes stuck
in the kernel.
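
As a rough sketch of how to narrow that output down (this is not from the
original thread): the helper below, under the hypothetical name stuck_procs,
reads ps-style text on stdin and keeps the header plus any row whose second
column (the process state) starts with D or T, the two states reported
above.  It assumes the input comes from something like
"ps -ax -o pid,state,wchan,command", so that the state is column 2 and the
wait channel is column 3.

```shell
# Hypothetical helper (not from the thread): filter ps-style output on stdin.
# Assumes column 2 is the process state, as produced by
#   ps -ax -o pid,state,wchan,command
# Keeps the header line plus rows whose state starts with D (uninterruptible
# wait) or T (stopped); the wait channel stays visible in column 3.
stuck_procs() {
  awk 'NR == 1 || $2 ~ /^[DT]/'
}
```

A possible invocation is "ps -ax -o pid,state,wchan,command | stuck_procs".
On FreeBSD, running "procstat -kk <pid>" on one of the listed PIDs should
then show where in the kernel that process is blocked.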

Gary


