Date:      Thu, 19 Jul 2012 21:30:22 -0400
From:      Steve Tuts <yiz5hwi@gmail.com>
To:        freebsd-emulation@freebsd.org
Subject:   Re: become worse now - Re: one virtualbox vm disrupts all vms and entire network
Message-ID:  <CAEXKtDpTu1EaOm_UyNQxb6QeMekKs=zMxM21MrvGt+fZcxOs6g@mail.gmail.com>
In-Reply-To: <CAEXKtDqX6_yQ3r-yBrwWFsWB2wg2SSyE2eLAPuwN1sT4gY2wag@mail.gmail.com>
References:  <CAEXKtDpKxACxbVQYYM9V2FUCt37PzwXx5ZzcqF_c1zFqynkeyw@mail.gmail.com> <CAEXKtDqX6_yQ3r-yBrwWFsWB2wg2SSyE2eLAPuwN1sT4gY2wag@mail.gmail.com>

On Thu, Jul 19, 2012 at 9:13 PM, Steve Tuts <yiz5hwi@gmail.com> wrote:

>
>
> On Fri, Jul 13, 2012 at 4:34 PM, Steve Tuts <yiz5hwi@gmail.com> wrote:
>
>>
>>
>> On Mon, Jul 9, 2012 at 9:11 AM, Steve Tuts <yiz5hwi@gmail.com> wrote:
>>
>>>
>>>
>>> On Tue, Jun 12, 2012 at 6:24 PM, Gary Palmer <gpalmer@freebsd.org> wrote:
>>>
>>>> On Thu, Jun 07, 2012 at 03:56:22PM -0400, Steve Tuts wrote:
>>>> > On Thu, Jun 7, 2012 at 3:54 AM, Steve Tuts <yiz5hwi@gmail.com> wrote:
>>>> >
>>>> > >
>>>> > >
>>>> > > On Thu, Jun 7, 2012 at 2:58 AM, Bernhard Fröhlich <
>>>> decke@bluelife.at> wrote:
>>>> > >
>>>> > >> On Thu, 7 Jun 2012 01:07:52 CEST, Kevin Oberman <
>>>> kob6558@gmail.com>
>>>> > >> wrote:
>>>> > >>
>>>> > >> > On Wed, Jun 6, 2012 at 3:46 PM, Steve Tuts <yiz5hwi@gmail.com>
>>>> wrote:
>>>> > >> > > On Wed, Jun 6, 2012 at 3:50 AM, Bernhard Froehlich
>>>> > >> > > <decke@freebsd.org> wrote:
>>>> > >> > >
>>>> > >> > > > On 05.06.2012 20:16, Bernhard Froehlich wrote:
>>>> > >> > > >
>>>> > >> > > > > On 05.06.2012 19:05, Steve Tuts wrote:
>>>> > >> > > > >
>>>> > >> > > > > > On Mon, Jun 4, 2012 at 4:11 PM, Rusty Nejdl
>>>> > >> > > > > > <rnejdl@ringofsaturn.com> wrote:
>>>> > >> > > > > >
>>>> > >> > > > > >  On 2012-06-02 12:16, Steve Tuts wrote:
>>>> > >> > > > > > >
>>>> > >> > > > > > >  Hi, we have a Dell PowerEdge server with a dozen
>>>> interfaces.
>>>> > >> > > > > > >  It hosts
>>>> > >> > > > > > > > a
>>>> > >> > > > > > > > few guests of web app and email servers with
>>>> > >> > > > > > > > VirtualBox-4.0.14.  The host
>>>> > >> > > > > > > > and all guests are FreeBSD 9.0 64bit.  Each guest is
>>>> bridged
>>>> > >> > > > > > > > to a distinct
>>>> > >> > > > > > > > interface.  The host and all guests are set to
>>>> 10.0.0.0
>>>> > >> > > > > > > > network NAT'ed to
>>>> > >> > > > > > > > a
>>>> > >> > > > > > > > Cisco router.
>>>> > >> > > > > > > >
>>>> > >> > > > > > > > This runs well for a couple months, until we added a
>>>> new
>>>> > >> > > > > > > > guest recently.
>>>> > >> > > > > > > > Every few hours, none of the guests can be
>>>> connected.  We
>>>> > >> > > > > > > > can only connect
>>>> > >> > > > > > > > to the host from outside the router.  We can also go
>>>> to the
>>>> > >> > > > > > > > console of the
>>>> > >> > > > > > > > guests (except the new guest), but from there we
>>>> can't ping
>>>> > >> > > > > > > > the gateway 10.0.0.1 any more.  The new guest just
>>>> froze.
>>>> > >> > > > > > > >
>>>> > >> > > > > > > > Furthermore, on the host we can see a vboxheadless
>>>> process
>>>> > >> > > > > > > > for each guest,
>>>> > >> > > > > > > > including the new guest.  But we can not kill it,
>>>> not even
>>>> > >> > > > > > > > with "kill -9".
>>>> > >> > > > > > > > We looked around the web and someone suggested we
>>>> should use
>>>> > >> > > > > > > > "kill -SIGCONT" first since the "ps" output has the
>>>> "T" flag
>>>> > >> > > > > > > > for that vboxheadless process for that new guest,
>>>> but that
>>>> > >> > > > > > > > doesn't help.  We also
>>>> > >> > > > > > > > tried all the VBoxManage commands to poweroff/reset
>>>> etc
>>>> > >> > > > > > > > that new guest,
>>>> > >> > > > > > > > but they all failed complaining that vm is in
>>>> Aborted state.
>>>> > >> > > > > > > >  We also tried
>>>> > >> > > > > > > > VBoxManage commands to disconnect the network cable
>>>> for
>>>> > >> > > > > > > > that new guest,
>>>> > >> > > > > > > > it
>>>> > >> > > > > > > > didn't complain, but there was no effect.
>>>> > >> > > > > > > >
>>>> > >> > > > > > > > For a couple times, on the host we disabled the
>>>> interface
>>>> > >> > > > > > > > bridging that new
>>>> > >> > > > > > > > guest, then that vboxheadless process for that new
>>>> guest
>>>> > >> > > > > > > > disappeared (we
>>>> > >> > > > > > > > attempted to kill it before that).  And immediately
>>>> all
>>>> > >> > > > > > > > other vms regained
>>>> > >> > > > > > > > connection back to normal.
>>>> > >> > > > > > > >
>>>> > >> > > > > > > > But there is one time even the above didn't help -
>>>> the
>>>> > >> > > > > > > > vboxheadless process
>>>> > >> > > > > > > > for that new guest stubbornly remains, and we had to
>>>> reboot
>>>> > >> > > > > > > > the host.
>>>> > >> > > > > > > >
>>>> > >> > > > > > > > This is already a production server, so we can't
>>>> upgrade
>>>> > >> > > > > > > > virtualbox to the
>>>> > >> > > > > > > > latest version until we obtain a test server.
>>>> > >> > > > > > > >
>>>> > >> > > > > > > > Would you advise:
>>>> > >> > > > > > > >
>>>> > >> > > > > > > > 1. is there any other way to kill that new guest
>>>> instead of
>>>> > >> > > > > > > > rebooting? 2. what might cause the problem?
>>>> > >> > > > > > > > 3. what setting and test I can do to analyze this
>>>> problem?
>>>> > >> > > > > > > > _______________________________________________
>>>> > >> > > > > > > >
>>>> > >> > > > > > > >
>>>> > >> > > > > > > I haven't seen any comments on this and don't want you
>>>> to
>>>> > >> > > > > > > think you are being ignored but I haven't seen this
>>>> but also,
>>>> > >> > > > > > > the 4.0 branch was buggier
>>>> > >> > > > > > > for me than the 4.1 releases so yeah, upgrading is
>>>> probably
>>>> > >> > > > > > > what you are looking at.
>>>> > >> > > > > > >
>>>> > >> > > > > > > Rusty Nejdl
>>>> > >> > > > > > > _______________________________________________
>>>> > >> > > > > > >
>>>> > >> > > > > > >
>>>> > >> > > > > > >  sorry, just realize my reply yesterday didn't go to
>>>> the list,
>>>> > >> > > > > > > so am
>>>> > >> > > > > > re-sending with some updates.
>>>> > >> > > > > >
>>>> > >> > > > > > Yes, we upgraded all ports and fortunately everything went
>>>> > >> > > > > > back to normal; in particular, all vms have run peacefully
>>>> > >> > > > > > for two days now.  So
>>>> > >> > > > > > upgrading to the latest virtualbox 4.1.16 solved that
>>>> problem.
>>>> > >> > > > > >
>>>> > >> > > > > > But now we got a new problem with this new version of
>>>> > >> virtualbox:
>>>> > >> > > > > > whenever
>>>> > >> > > > > > we try to vnc to any vm, that vm will go to Aborted state
>>>> > >> > > > > > immediately. Actually, merely telnet from within the
>>>> host to the
>>>> > >> > > > > > vnc port of that vm will immediately Abort that vm.  This
>>>> > >> > > > > > prevents us from adding new vms.  Also, when starting a vm
>>>> > >> > > > > > with a vnc port, we get this message:
>>>> > >> > > > > >
>>>> > >> > > > > > rfbListenOnTCP6Port: error in bind IPv6 socket: Address
>>>> already
>>>> > >> > > > > > in use
>>>> > >> > > > > >
>>>> > >> > > > > > , which we found someone else provided a patch at
>>>> > >> > > > > >
>>>> > >> > > > > > http://permalink.gmane.org/gmane.os.freebsd.devel.emulation/10237
>>>> > >> > > > > >
>>>> > >> > > > > > So it looks like multiple vms on an IPv6 system (we have
>>>> > >> > > > > > 64-bit FreeBSD 9.0) will hit this problem.
>>>> > >> > > > > >
>>>> > >> > > > >
>>>> > >> > > > > Glad to hear that 4.1.16 helps for the networking problem.
>>>> The VNC
>>>> > >> > > > > problem is also a known one but the mentioned patch does
>>>> not work
>>>> > >> > > > > at least for a few people. It seems the bug is somewhere in
>>>> > >> > > > > libvncserver so downgrading net/libvncserver to an earlier
>>>> version
>>>> > >> > > > > (and rebuilding virtualbox) should help until we come up
>>>> with a
>>>> > >> > > > > proper fix.
>>>> > >> > > > >
>>>> > >> > > >
>>>> > >> > > > You are right about the "Address already in use" problem and
>>>> the
>>>> > >> > > > patch for it so I will commit the fix in a few moments.
>>>> > >> > > >
>>>> > >> > > > I have also tried to reproduce the VNC crash but I couldn't.
>>>> > >> Probably
>>>> > >> > > > because
>>>> > >> > > > my system is IPv6 enabled. flo@ has seen the same crash and
>>>> has no
>>>> > >> > > > IPv6 in his kernel which led him to find this commit in
>>>> > >> > > > libvncserver:
>>>> > >> > > >
>>>> > >> > > >
>>>> > >> > > > commit 66282f58000c8863e104666c30cb67b1d5cbdee3
>>>> > >> > > > Author: Kyle J. McKay <mackyle@gmail.com>
>>>> > >> > > > Date:   Fri May 18 00:30:11 2012 -0700
>>>> > >> > > >     libvncserver/sockets.c: do not segfault when
>>>> > >> > > > listenSock/listen6Sock == -1
>>>> > >> > > >
>>>> > >> > > > http://libvncserver.git.sourceforge.net/git/gitweb.cgi?p=libvncserver/libvncserver;a=commit;h=66282f5
>>>> > >> > > >
>>>> > >> > > >
>>>> > >> > > > It looks promising so please test this patch if you can
>>>> reproduce
>>>> > >> the
>>>> > >> > > > crash.
>>>> > >> > > >
>>>> > >> > > >
>>>> > >> > > > --
>>>> > >> > > > Bernhard Froehlich
>>>> > >> > > > http://www.bluelife.at/
>>>> > >> > > >
>>>> > >> > >
>>>> > >> > > Sorry, I tried to apply this patch, but couldn't figure out how
>>>> to do
>>>> > >> > > that. I use ports to compile everything, and can see the file
>>>> is at
>>>> > >> > >
>>>> > >>
>>>> /usr/ports/net/libvncserver/work/LibVNCServer-0.9.9/libvncserver/sockets.c
>>>> > >> > > .  However, if I edit this file and do make clean, this patch
>>>> is wiped
>>>> > >> > > out before I can do "make" out of it.  How to apply this patch
>>>> in the
>>>> > >> > > ports?
>>>> > >> >
>>>> > >> > To apply patches to ports:
>>>> > >> > # make clean
>>>> > >> > # make patch
>>>> > >> > <Apply patch>
>>>> > >> > # make
>>>> > >> > # make deinstall
>>>> > >> > # make reinstall
>>>> > >> >
>>>> > >> > Note that the final two steps assume a version of the port is
>>>> already
>>>> > >> > installed. If not: 'make install'
>>>> > >> > If you use portmaster, after applying the patch: 'portmaster -C
>>>> > >> > net/libvncserver' --
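For concreteness, the sequence above might look like this for net/libvncserver (a sketch; the patch file name and its location here are hypothetical):

```shell
cd /usr/ports/net/libvncserver
make clean
make patch        # fetch, extract, and apply the port's own stock patches
# apply the extra libvncserver/sockets.c fix by hand
# (the patch file name below is hypothetical)
patch -d work/LibVNCServer-0.9.9 -p1 < /tmp/sockets-fix.diff
make
make deinstall && make reinstall   # or 'make install' if not yet installed
```

Running the manual patch after "make patch" but before "make" is the key: anything edited in the work/ directory before "make patch" gets wiped by the port's own extract/patch steps.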
>>>> > >>
>>>> > >> flo has already committed the patch to net/libvncserver so I guess
>>>> it
>>>> > >> fixes the problem. Please update your portstree and verify that it
>>>> works
>>>> > >> fine.
>>>> > >>
>>>> > >
>>>> > > I confirmed after upgrading all ports and noticing libvncserver
>>>> upgraded
>>>> > > to 0.9.9_1 and rebooting, then I can vnc to the vms now.  Also,
>>>> starting vms
>>>> > > with vnc doesn't have that error now, instead it issues the
>>>> following info,
>>>> > > so all problems are solved.
>>>> > >
>>>> > > 07/06/2012 03:49:14 Listening for VNC connections on TCP port 5903
>>>> > > 07/06/2012 03:49:14 Listening for VNC connections on TCP6 port 5903
>>>> > >
>>>> > > Thanks everyone for your great help!
>>>> > >
>>>> >
>>>> > Unfortunately, it seems the original problem of one vm disrupting all
>>>> > vms and the entire network remains, albeit with lesser scope.  After
>>>> running
>>>> > on virtualbox-ose-4.1.16_1 and libvncserver-0.9.9_1 for 12 hours, all
>>>> vms
>>>> > lost connection again.  Also, phpvirtualbox stopped responding, and
>>>> > attempts to restart vboxwebsrv hung.  And trying to kill (-9) the
>>>> > vboxwebsrv process wouldn't work.  The following was the output of "ps
>>>> > aux|grep -i box" at that time:
>>>> >
>>>> > root 3322  78.7 16.9 4482936 4248180  ??  Is    3:42AM   126:00.53
>>>> > /usr/local/bin/VBoxHeadless --startvm vm1
>>>> > root 3377   0.2  4.3 1286200 1078728  ??  Is    3:42AM    15:39.40
>>>> > /usr/local/bin/VBoxHeadless --startvm vm2
>>>> > root 3388   0.1  4.3 1297592 1084676  ??  Is    3:42AM    15:06.97
>>>> > /usr/local/bin/VBoxHeadless --startvm vm7 -n -m 5907 -o jtlgjkrfyh9tpgjklfds
>>>> > root 2453   0.0  0.0 141684   7156  ??  Ts    3:38AM     4:14.09
>>>> > /usr/local/bin/vboxwebsrv
>>>> > root 2478   0.0  0.0  45288   2528  ??  S     3:38AM     1:29.99
>>>> > /usr/local/lib/virtualbox/VBoxXPCOMIPCD
>>>> > root 2494   0.0  0.0 121848   5380  ??  S     3:38AM     3:13.96
>>>> > /usr/local/lib/virtualbox/VBoxSVC --auto-shutdown
>>>> > root 3333   0.0  4.3 1294712 1079608  ??  Is    3:42AM    19:35.09
>>>> > /usr/local/bin/VBoxHeadless --startvm vm3
>>>> > root 3355   0.0  4.3 1290424 1079332  ??  Is    3:42AM    16:43.05
>>>> > /usr/local/bin/VBoxHeadless --startvm vm5
>>>> > root 3366   0.0  8.5 2351436 2140076  ??  Is    3:42AM    17:32.35
>>>> > /usr/local/bin/VBoxHeadless --startvm vm6
>>>> > root 3598   0.0  4.3 1294520 1078664  ??  Ds    3:50AM    15:01.04
>>>> > /usr/local/bin/VBoxHeadless --startvm vm4 -n -m 5904 -o u679y0uojlkdfsgkjtfds
>>>> >
>>>> > You can see the vboxwebsrv process has the "T" flag there, and the
>>>> > vboxheadless process for vm4 has the "D" flag there.  Neither of these
>>>> > processes could we ever kill, not even with "kill -9".  So on the host I
>>>> disabled
>>>> > the interface bridged to vm4 and restarted network, and fortunately
>>>> both
>>>> > the vm4 and the vboxwebsrv processes disappeared.  And at that point
>>>> all
>>>> > other vms regained network.
>>>> >
>>>> > There is some hope that the "troublemaker" is limited to one of the
>>>> > vms started with vnc, although there was no vnc connection at that
>>>> > time, and the other vm with vnc was fine.  This is just a hopeful guess.
>>>> >
>>>> > Also I found no log or error message related to virtualbox in any log
>>>> > file.  The VBoxSVC.log only had some information when started but
>>>> never
>>>> > since.
>>>>
>>>> If this is still a problem then
>>>>
>>>> ps alxww | grep -i box
>>>>
>>>> may be more helpful as it will show the wait channel of processes stuck
>>>> in the kernel.
>>>>
>>>> Gary
>>>>
>>>
>>> We avoided this problem by running all vms without vnc.  But we forgot
>>> about this and left one vm running with vnc, together with the other few
>>> vms yesterday, and hit the problem again on virtualbox 4.1.16.  Only the
>>> old trick of turning off the host interface corresponding to the vm with
>>> vnc and then restarting the host network got us out of the problem.
>>>
>>> We then upgraded virtualbox to 4.1.18, turned off all vms, waited until
>>> "ps aux|grep -i box" reported nothing, then started all vms, leaving no vm
>>> with vnc running.
>>>
>>> Still the problem hit us again.  Here is the output of " ps alxww | grep
>>> -i box" as you suggested:
>>>
>>> 1011  42725  1  0  20  0  1289796  1081064  IPRT S  Is  ??  30:53.24  VBoxHeadless --startvm vm5
>>>
>>> after "kill -9 42725", the line changed to
>>>
>>> 1011  42725  1  0  20  0  1289796  1081064  keglim  Ts  ??  30:53.24  VBoxHeadless --startvm vm5
>>>
>>> after "kill -9" for another vm, the line changed to something like
>>>
>>> 1011  42754  1  0  20  0  1289796  1081064  -  Ts  ??  30:53.24  VBoxHeadless --startvm vm7
>>>
>>> and the controlvm commands don't work; the commands themselves got stuck.
>>> The following are their ps entries:
>>>
>>> 0  89572  79180  0  21  0  44708  1644  select  I+  v6  0:00.01  VBoxManage controlvm projects_outside acpipowerbutton
>>> 0  89605  89586  0  21  0  44708  2196  select  I+  v7  0:00.01  VBoxManage controlvm projects_outside poweroff
>>>
>>> We now rebooted the host, and left no vm with vnc running.
>>>
>>
>> The problem has become more rampant now.  After rebooting and running
>> virtualbox-ose-4.1.18, with no vm started with a console, the roughly 10
>> vms, each bridged to its own dedicated interface, lose network connectivity
>> a couple of times a day.  Most times it recovers by itself after about 10
>> minutes; sometimes we have to restart the host network, which immediately
>> restores all connections.
>>
>
> It looks like multiple people have had a similar problem; see the threads at
>
> http://lists.freebsd.org/pipermail/freebsd-emulation/2011-July/008957.html
> http://lists.freebsd.org/pipermail/freebsd-stable/2011-July/063197.html
>
> and
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2011-July/063172.html
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/058708.html
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2011-July/thread.html#63221
>
> http://lists.freebsd.org/pipermail/freebsd-emulation/2011-July/thread.html#8957
>
> In all cases (including mine), the hardware is a beefy Dell, such as a Dell
> R710, with multiple cores and multiple interfaces.  Virtualbox is running
> on all machines.  Some of them may also have ZFS running (like mine).
>
> My problem is that, after some time, the entire network on the host stops
> working, so that one can't ssh to any guest or other machine, not even to
> itself.  At that point, restarting the network on the host restores
> everything.  My guests are bridged to their respective interfaces on the host.
>
> Those other people also reported that they can't scp large files from host
> to the guest.  (Un)fortunately I also confirmed the same problem.  When
> scp'ing a 10G file from host to guest, it gets stalled some time in.  And at
> that time, the entire network is broken, as described in the paragraph above.
>
> I tried everything they described, including:
>
> 1. setting net.graph.maxdata and net.graph.maxalloc to 65536 in
> /boot/loader.conf
> 2. setting kern.ipc.maxsockbuf, net.graph.maxdgram, and net.graph.recvspace
> to 8388608 in /etc/sysctl.conf
> 3. setting kern.maxdsiz in /boot/loader.conf, although I don't know what
> value to use, since the default is somehow 34G on my 24G-memory server.
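In file form, the tunables tried above would look like this (a sketch using the values from this thread; whether they actually help with this problem is unverified):

```shell
# /boot/loader.conf (applied at boot via the loader)
net.graph.maxdata=65536
net.graph.maxalloc=65536
# kern.maxdsiz="..."   # value unclear; the default was reported as ~34G here

# /etc/sysctl.conf (applied at boot; or set live with sysctl(8))
kern.ipc.maxsockbuf=8388608
net.graph.maxdgram=8388608
net.graph.recvspace=8388608
```

The net.graph.* tunables matter because VirtualBox bridged networking on FreeBSD goes through netgraph; kern.ipc.maxsockbuf caps the socket-buffer sizes the other two request.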
>
> Unfortunately, none of them works, i.e., scp from host to guest always
> stalls after a variable amount of data (from dozens of MB to 7GB).  Scp from
> the host to another machine appears to have no problem, though.
>
> So it appears that something breaks after scp'ing from host to guest for a
> while, but I don't know what.  Still, this offers a reliable way to
> reproduce the problem.  If you have any suggestion on how to check what is
> broken or what to change to fix it, I'd be happy to comply.  This is already
> in production, so I can't leave the vms down for more than a few minutes.
>

One more observation: the worsening may be related to my having installed
more vms lately.  After shutting down a few vms, I do notice this network
problem occurs less frequently, albeit still about once a day.

Also, do you think this might not be a virtualbox problem, and that people on
another freebsd list might have better expertise, since this may turn out to
be a network problem?  I just want to post where people would most relate to
it, so your advice would be appreciated.


