Date:      Thu, 7 Jun 2012 15:56:22 -0400
From:      Steve Tuts <yiz5hwi@gmail.com>
To:        freebsd-emulation@freebsd.org
Subject:   Still unresolved - Re: one virtualbox vm disrupts all vms and entire network
Message-ID:  <CAEXKtDqVCh-e-WNa2H%2Bj8ah0M0qiXvQ11J0ZLywzfS48UuJdCQ@mail.gmail.com>

On Thu, Jun 7, 2012 at 3:54 AM, Steve Tuts <yiz5hwi@gmail.com> wrote:

>
>
> On Thu, Jun 7, 2012 at 2:58 AM, Bernhard Fröhlich <decke@bluelife.at> wrote:
>
>> On Do.,   7. Jun. 2012 01:07:52 CEST, Kevin Oberman <kob6558@gmail.com>
>> wrote:
>>
>> > On Wed, Jun 6, 2012 at 3:46 PM, Steve Tuts <yiz5hwi@gmail.com> wrote:
>> > > On Wed, Jun 6, 2012 at 3:50 AM, Bernhard Froehlich
>> > > <decke@freebsd.org>wrote:
>> > >
>> > > > On 05.06.2012 20:16, Bernhard Froehlich wrote:
>> > > >
>> > > > > On 05.06.2012 19:05, Steve Tuts wrote:
>> > > > >
>> > > > > > On Mon, Jun 4, 2012 at 4:11 PM, Rusty Nejdl
>> > > > > > <rnejdl@ringofsaturn.com> wrote:
>> > > > > >
>> > > > > >  On 2012-06-02 12:16, Steve Tuts wrote:
>> > > > > > >
>> > > > > > >  Hi, we have a Dell poweredge server with a dozen interfaces.
>> > > > > > >  It hosts
>> > > > > > > > a
>> > > > > > > > few guests of web app and email servers with
>> > > > > > > > VirtualBox-4.0.14.  The host
>> > > > > > > > and all guests are FreeBSD 9.0 64bit.  Each guest is bridged
>> > > > > > > > to a distinct
>> > > > > > > > interface.  The host and all guests are set to 10.0.0.0
>> > > > > > > > network NAT'ed to
>> > > > > > > > a
>> > > > > > > > Cisco router.
>> > > > > > > >
>> > > > > > > > This runs well for a couple months, until we added a new
>> > > > > > > > guest recently.
>> > > > > > > > Every few hours, none of the guests can be connected.  We
>> > > > > > > > can only connect
>> > > > > > > > to the host from outside the router.  We can also go to the
>> > > > > > > > console of the
>> > > > > > > > guests (except the new guest), but from there we can't ping
>> > > > > > > > the gateway 10.0.0.1 any more.  The new guest just froze.
>> > > > > > > >
>> > > > > > > > Furthermore, on the host we can see a vboxheadless process
>> > > > > > > > for each guest,
>> > > > > > > > including the new guest.  But we can not kill it, not even
>> > > > > > > > with "kill -9".
>> > > > > > > > We looked around the web and someone suggested we should use
>> > > > > > > > "kill -SIGCONT" first since the "ps" output has the "T" flag
>> > > > > > > > for that vboxheadless process for that new guest, but that
>> > > > > > > > doesn't help.  We also
>> > > > > > > > tried all the VBoxManage commands to poweroff/reset etc
>> > > > > > > > that new guest,
>> > > > > > > > but they all failed complaining that vm is in Aborted state.
>> > > > > > > >  We also tried
>> > > > > > > > VBoxManage commands to disconnect the network cable for
>> > > > > > > > that new guest,
>> > > > > > > > it
>> > > > > > > > didn't complain, but there was no effect.
>> > > > > > > >
>> > > > > > > > For a couple times, on the host we disabled the interface
>> > > > > > > > bridging that new
>> > > > > > > > guest, then that vboxheadless process for that new guest
>> > > > > > > > disappeared (we
>> > > > > > > > attempted to kill it before that).  And immediately all
>> > > > > > > > other vms regained
>> > > > > > > > connection back to normal.
>> > > > > > > >
>> > > > > > > > But there is one time even the above didn't help - the
>> > > > > > > > vboxheadless process
>> > > > > > > > for that new guest stubbornly remains, and we had to reboot
>> > > > > > > > the host.
>> > > > > > > >
>> > > > > > > > This is already a production server, so we can't upgrade
>> > > > > > > > virtualbox to the
>> > > > > > > > latest version until we obtain a test server.
>> > > > > > > >
>> > > > > > > > Would you advise:
>> > > > > > > >
>> > > > > > > > 1. is there any other way to kill that new guest instead of
>> > > > > > > > rebooting?
>> > > > > > > > 2. what might cause the problem?
>> > > > > > > > 3. what setting and test I can do to analyze this problem?
>> > > > > > > > ______________________________****_________________
>> > > > > > > >
>> > > > > > > >
>> > > > > > > I haven't seen any comments on this and don't want you to
>> > > > > > > think you are being ignored but I haven't seen this but also,
>> > > > > > > the 4.0 branch was buggier
>> > > > > > > for me than the 4.1 releases so yeah, upgrading is probably
>> > > > > > > what you are looking at.
>> > > > > > >
>> > > > > > > Rusty Nejdl
>> > > > > > > ______________________________****_________________
>> > > > > > >
>> > > > > > >
>> > > > > > >  sorry, just realize my reply yesterday didn't go to the list,
>> > > > > > > so am
>> > > > > > re-sending with some updates.
>> > > > > >
>> > > > > > Yes, we upgraded all ports and fortunately everything went back
>> > > > > > and especially all vms has run peacefully for two days now.  So
>> > > > > > upgrading to the latest virtualbox 4.1.16 solved that problem.
>> > > > > >
>> > > > > > But now we got a new problem with this new version of
>> virtualbox:
>> > > > > > whenever
>> > > > > > we try to vnc to any vm, that vm will go to Aborted state
>> > > > > > immediately. Actually, merely telnet from within the host to the
>> > > > > > vnc port of that vm will immediately Abort that vm.  This
>> > > > > > prevents us from adding new vms. Also, when starting vm with vnc
>> > > > > > port, we got this message:
>> > > > > >
>> > > > > > rfbListenOnTCP6Port: error in bind IPv6 socket: Address already in use
>> > > > > >
>> > > > > > , which we found someone else provided a patch at
>> > > > > >
>> > > > > > http://permalink.gmane.org/gmane.os.freebsd.devel.emulation/10237
>> > > > > >
>> > > > > > So looks like when there are multiple vms on a ipv6 system (we
>> > > > > > have 64bit FreeBSD 9.0) will get this problem.
>> > > > > >
>> > > > >
>> > > > > Glad to hear that 4.1.16 helps for the networking problem. The VNC
>> > > > > problem is also a known one but the mentioned patch does not work
>> > > > > at least for a few people. It seems the bug is somewhere in
>> > > > > libvncserver so downgrading net/libvncserver to an earlier version
>> > > > > (and rebuilding virtualbox) should help until we come up with a
>> > > > > proper fix.
>> > > > >
>> > > >
>> > > > You are right about the "Address already in use" problem and the
>> > > > patch for it so I will commit the fix in a few moments.
>> > > >
>> > > > I have also tried to reproduce the VNC crash but I couldn't. Probably
>> > > > because my system is IPv6 enabled. flo@ has seen the same crash and has
>> > > > no IPv6 in his kernel which led him to find this commit in
>> > > > libvncserver:
>> > > >
>> > > >
>> > > > commit 66282f58000c8863e104666c30cb67b1d5cbdee3
>> > > > Author: Kyle J. McKay <mackyle@gmail.com>
>> > > > Date:   Fri May 18 00:30:11 2012 -0700
>> > > >     libvncserver/sockets.c: do not segfault when listenSock/listen6Sock == -1
>> > > >
>> > > > http://libvncserver.git.sourceforge.net/git/gitweb.cgi?p=libvncserver/libvncserver;a=commit;h=66282f5
>> > > >
>> > > >
>> > > > It looks promising so please test this patch if you can reproduce the
>> > > > crash.
>> > > >
>> > > >
>> > > > --
>> > > > Bernhard Froehlich
>> > > > http://www.bluelife.at/
>> > > >
>> > >
>> > > Sorry, I tried to try this patch, but couldn't figure out how to do
>> > > that. I use ports to compile everything, and can see the file is at
>> > >
>> > > /usr/ports/net/libvncserver/work/LibVNCServer-0.9.9/libvncserver/sockets.c
>> > > .  However, if I edit this file and do make clean, this patch is wiped
>> > > out before I can do "make" out of it.  How to apply this patch in the
>> > > ports?
>> >
>> > To apply patches to ports:
>> > # make clean
>> > # make patch
>> > <Apply patch>
>> > # make
>> > # make deinstall
>> > # make reinstall
>> >
>> > Note that the final two steps assume a version of the port is already
>> > installed. If not: 'make install'
>> > If you use portmaster, after applying the patch: 'portmaster -C
>> > net/libvncserver' --
>>
>> flo has already committed the patch to net/libvncserver so I guess it
>> fixes the problem. Please update your portstree and verify that it works
>> fine.
>>
>
> I confirmed after upgrading all ports and noticing libvncserver upgraded
> to 0.9.9_1 and reboot, then I can vnc to the vms now.  Also, starting vms
> with vnc doesn't have that error now, instead it issues the following info,
> so all problems are solved.
>
> 07/06/2012 03:49:14 Listening for VNC connections on TCP port 5903
> 07/06/2012 03:49:14 Listening for VNC connections on TCP6 port 5903
>
> Thanks everyone for your great help!
>

Unfortunately, the original problem of one VM disrupting all VMs and the
entire network appears to remain, albeit to a lesser extent.  After running
on virtualbox-ose-4.1.16_1 and libvncserver-0.9.9_1 for 12 hours, all VMs
lost connection again.  Also, phpvirtualbox stopped responding, and
attempts to restart vboxwebsrv hung.  Trying to kill (-9) the vboxwebsrv
process did not work either.  The following was the output of "ps
aux|grep -i box" at that time:

root 3322  78.7 16.9 4482936 4248180  ??  Is    3:42AM   126:00.53 /usr/local/bin/VBoxHeadless --startvm vm1
root 3377   0.2  4.3 1286200 1078728  ??  Is    3:42AM    15:39.40 /usr/local/bin/VBoxHeadless --startvm vm2
root 3388   0.1  4.3 1297592 1084676  ??  Is    3:42AM    15:06.97 /usr/local/bin/VBoxHeadless --startvm vm7 -n -m 5907 -o jtlgjkrfyh9tpgjklfds
root 2453   0.0  0.0 141684   7156  ??  Ts    3:38AM     4:14.09 /usr/local/bin/vboxwebsrv
root 2478   0.0  0.0  45288   2528  ??  S     3:38AM     1:29.99 /usr/local/lib/virtualbox/VBoxXPCOMIPCD
root 2494   0.0  0.0 121848   5380  ??  S     3:38AM     3:13.96 /usr/local/lib/virtualbox/VBoxSVC --auto-shutdown
root 3333   0.0  4.3 1294712 1079608  ??  Is    3:42AM    19:35.09 /usr/local/bin/VBoxHeadless --startvm vm3
root 3355   0.0  4.3 1290424 1079332  ??  Is    3:42AM    16:43.05 /usr/local/bin/VBoxHeadless --startvm vm5
root 3366   0.0  8.5 2351436 2140076  ??  Is    3:42AM    17:32.35 /usr/local/bin/VBoxHeadless --startvm vm6
root 3598   0.0  4.3 1294520 1078664  ??  Ds    3:50AM    15:01.04 /usr/local/bin/VBoxHeadless --startvm vm4 -n -m 5904 -o u679y0uojlkdfsgkjtfds

You can see that the vboxwebsrv process has the "T" flag, and the
VBoxHeadless process for vm4 has the "D" flag.  I could not kill either of
these processes, not even with "kill -9".  So on the host I disabled
the interface bridged to vm4 and restarted the network, and fortunately both
the vm4 and the vboxwebsrv processes disappeared.  At that point all
other VMs regained network connectivity.
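
The check for "T" and "D" flags described above could be automated. What
follows is a minimal sketch, not a tested monitoring tool. It assumes the
column order of the "ps aux" output shown in this message, with the state
flags in the eighth column. The name find_stuck and the "box" filter are
hypothetical and only for illustration. The three sample lines are copied from
the output in this message.

```python
# Per FreeBSD ps(1), the first character of the state column is the run state:
# "D" = uninterruptible (disk) wait, "T" = stopped.
SUSPECT_STATES = {"D", "T"}


def find_stuck(ps_lines, needle="box"):
    """Return (pid, state, command) for matching processes in state D or T.

    ps_lines: iterable of lines in `ps aux` layout
              (USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND).
    needle:   case-insensitive substring the command must contain.
    """
    hits = []
    for line in ps_lines:
        fields = line.split(None, 10)  # keep the full command as one field
        if len(fields) < 11:
            continue  # blank or truncated line
        pid, stat, command = fields[1], fields[7], fields[10]
        if not pid.isdigit():
            continue  # header line
        if needle.lower() in command.lower() and stat[0] in SUSPECT_STATES:
            hits.append((int(pid), stat, command))
    return hits


# Three lines taken from the ps output in this message.
SAMPLE = [
    "root 3322  78.7 16.9 4482936 4248180  ??  Is    3:42AM   126:00.53 "
    "/usr/local/bin/VBoxHeadless --startvm vm1",
    "root 2453   0.0  0.0 141684   7156  ??  Ts    3:38AM     4:14.09 "
    "/usr/local/bin/vboxwebsrv",
    "root 3598   0.0  4.3 1294520 1078664  ??  Ds    3:50AM    15:01.04 "
    "/usr/local/bin/VBoxHeadless --startvm vm4 -n -m 5904 -o u679y0uojlkdfsgkjtfds",
]

for pid, stat, command in find_stuck(SAMPLE):
    print(pid, stat, command)
```

To use it on a live host, the lines could come from the real "ps aux" output
instead of SAMPLE, for example run from cron every few minutes.  A process
that briefly shows "D" is normal, so it is probably only worth acting when the
same PID stays flagged across several runs.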

One hope is that the "troublemaker" is limited to one of the VMs started
with VNC, although there was no VNC connection at that time and the other
VM with VNC was fine.  This is just a hopeful guess.

Also, I found no log or error message related to VirtualBox in any log
file.  VBoxSVC.log only had some information from startup and nothing
since.


