From owner-freebsd-emulation@FreeBSD.ORG Fri Jul 20 01:30:24 2012
Date: Thu, 19 Jul 2012 21:30:22 -0400
From: Steve Tuts
To: freebsd-emulation@freebsd.org
Subject: Re: become worse now - Re: one virtualbox vm disrupts all vms and entire network
List-Id: Development of Emulators of other operating systems

On Thu, Jul 19, 2012 at 9:13 PM, Steve Tuts
wrote: > > > On Fri, Jul 13, 2012 at 4:34 PM, Steve Tuts wrote: > >> >> >> On Mon, Jul 9, 2012 at 9:11 AM, Steve Tuts wrote: >> >>> >>> >>> On Tue, Jun 12, 2012 at 6:24 PM, Gary Palmer wrote: >>> >>>> On Thu, Jun 07, 2012 at 03:56:22PM -0400, Steve Tuts wrote: >>>> > On Thu, Jun 7, 2012 at 3:54 AM, Steve Tuts wrote: >>>> > >>>> > > >>>> > > >>>> > > On Thu, Jun 7, 2012 at 2:58 AM, Bernhard Fr?hlich < >>>> decke@bluelife.at>wrote: >>>> > > >>>> > >> On Do., 7. Jun. 2012 01:07:52 CEST, Kevin Oberman < >>>> kob6558@gmail.com> >>>> > >> wrote: >>>> > >> >>>> > >> > On Wed, Jun 6, 2012 at 3:46 PM, Steve Tuts >>>> wrote: >>>> > >> > > On Wed, Jun 6, 2012 at 3:50 AM, Bernhard Froehlich >>>> > >> > > wrote: >>>> > >> > > >>>> > >> > > > On 05.06.2012 20:16, Bernhard Froehlich wrote: >>>> > >> > > > >>>> > >> > > > > On 05.06.2012 19:05, Steve Tuts wrote: >>>> > >> > > > > >>>> > >> > > > > > On Mon, Jun 4, 2012 at 4:11 PM, Rusty Nejdl >>>> > >> > > > > > wrote: >>>> > >> > > > > > >>>> > >> > > > > > On 2012-06-02 12:16, Steve Tuts wrote: >>>> > >> > > > > > > >>>> > >> > > > > > > Hi, we have a Dell poweredge server with a dozen >>>> interfaces. >>>> > >> > > > > > > It hosts >>>> > >> > > > > > > > a >>>> > >> > > > > > > > few guests of web app and email servers with >>>> > >> > > > > > > > VirtualBox-4.0.14. The host >>>> > >> > > > > > > > and all guests are FreeBSD 9.0 64bit. Each guest is >>>> bridged >>>> > >> > > > > > > > to a distinct >>>> > >> > > > > > > > interface. The host and all guests are set to >>>> 10.0.0.0 >>>> > >> > > > > > > > network NAT'ed to >>>> > >> > > > > > > > a >>>> > >> > > > > > > > cicso router. >>>> > >> > > > > > > > >>>> > >> > > > > > > > This runs well for a couple months, until we added a >>>> new >>>> > >> > > > > > > > guest recently. >>>> > >> > > > > > > > Every few hours, none of the guests can be >>>> connected. We >>>> > >> > > > > > > > can only connect >>>> > >> > > > > > > > to the host from outside the router. 
We can also go >>>> to the >>>> > >> > > > > > > > console of the >>>> > >> > > > > > > > guests (except the new guest), but from there we >>>> can't ping >>>> > >> > > > > > > > the gateway 10.0.0.1 any more. The new guest just >>>> froze. >>>> > >> > > > > > > > >>>> > >> > > > > > > > Furthermore, on the host we can see a vboxheadless >>>> process >>>> > >> > > > > > > > for each guest, >>>> > >> > > > > > > > including the new guest. But we can not kill it, >>>> not even >>>> > >> > > > > > > > with "kill -9". >>>> > >> > > > > > > > We looked around the web and someone suggested we >>>> should use >>>> > >> > > > > > > > "kill -SIGCONT" first since the "ps" output has the >>>> "T" flag >>>> > >> > > > > > > > for that vboxheadless process for that new guest, >>>> but that >>>> > >> > > > > > > > doesn't help. We also >>>> > >> > > > > > > > tried all the VBoxManager commands to poweroff/reset >>>> etc >>>> > >> > > > > > > > that new guest, >>>> > >> > > > > > > > but they all failed complaining that vm is in >>>> Aborted state. >>>> > >> > > > > > > > We also tried >>>> > >> > > > > > > > VBoxManager commands to disconnect the network cable >>>> for >>>> > >> > > > > > > > that new guest, >>>> > >> > > > > > > > it >>>> > >> > > > > > > > didn't complain, but there was no effect. >>>> > >> > > > > > > > >>>> > >> > > > > > > > For a couple times, on the host we disabled the >>>> interface >>>> > >> > > > > > > > bridging that new >>>> > >> > > > > > > > guest, then that vboxheadless process for that new >>>> guest >>>> > >> > > > > > > > disappeared (we >>>> > >> > > > > > > > attempted to kill it before that). And immediately >>>> all >>>> > >> > > > > > > > other vms regained >>>> > >> > > > > > > > connection back to normal. 
>>>> > >> > > > > > > > >>>> > >> > > > > > > > But there is one time even the above didn't help - >>>> the >>>> > >> > > > > > > > vboxheadless process >>>> > >> > > > > > > > for that new guest stubbonly remains, and we had to >>>> reboot >>>> > >> > > > > > > > the host. >>>> > >> > > > > > > > >>>> > >> > > > > > > > This is already a production server, so we can't >>>> upgrade >>>> > >> > > > > > > > virtualbox to the >>>> > >> > > > > > > > latest version until we obtain a test server. >>>> > >> > > > > > > > >>>> > >> > > > > > > > Would you advise: >>>> > >> > > > > > > > >>>> > >> > > > > > > > 1. is there any other way to kill that new guest >>>> instead of >>>> > >> > > > > > > > rebooting? 2. what might cause the problem? >>>> > >> > > > > > > > 3. what setting and test I can do to analyze this >>>> problem? >>>> > >> > > > > > > > ______________________________****_________________ >>>> > >> > > > > > > > >>>> > >> > > > > > > > >>>> > >> > > > > > > I haven't seen any comments on this and don't want you >>>> to >>>> > >> > > > > > > think you are being ignored but I haven't seen this >>>> but also, >>>> > >> > > > > > > the 4.0 branch was buggier >>>> > >> > > > > > > for me than the 4.1 releases so yeah, upgrading is >>>> probably >>>> > >> > > > > > > what you are looking at. >>>> > >> > > > > > > >>>> > >> > > > > > > Rusty Nejdl >>>> > >> > > > > > > ______________________________****_________________ >>>> > >> > > > > > > >>>> > >> > > > > > > >>>> > >> > > > > > > sorry, just realize my reply yesterday didn't go to >>>> the list, >>>> > >> > > > > > > so am >>>> > >> > > > > > re-sending with some updates. >>>> > >> > > > > > >>>> > >> > > > > > Yes, we upgraded all ports and fortunately everything >>>> went back >>>> > >> > > > > > and especially all vms has run peacefully for two days >>>> now. So >>>> > >> > > > > > upgrading to the latest virtualbox 4.1.16 solved that >>>> problem. 
>>>> > >> > > > > > >>>> > >> > > > > > But now we got a new problem with this new version of >>>> > >> virtualbox: >>>> > >> > > > > > whenever >>>> > >> > > > > > we try to vnc to any vm, that vm will go to Aborted state >>>> > >> > > > > > immediately. Actually, merely telnet from within the >>>> host to the >>>> > >> > > > > > vnc port of that vm will immediately Abort that vm. This >>>> > >> > > > > > prevents us from adding new vms. Also, when starting vm >>>> with vnc >>>> > >> > > > > > port, we got this message: >>>> > >> > > > > > >>>> > >> > > > > > rfbListenOnTCP6Port: error in bind IPv6 socket: Address >>>> already >>>> > >> > > > > > in use >>>> > >> > > > > > >>>> > >> > > > > > , which we found someone else provided a patch at >>>> > >> > > > > > >>>> > >> >>>> http://permalink.gmane.org/**gmane.os.freebsd.devel.**emulation/10237< >>>> > >> http://permalink.gmane.org/gmane.os.freebsd.devel.emulation/10237> >>>> > >> > > > > > >>>> > >> > > > > > So looks like when there are multiple vms on a ipv6 >>>> system (we >>>> > >> > > > > > have 64bit FreeBSD 9.0) will get this problem. >>>> > >> > > > > > >>>> > >> > > > > >>>> > >> > > > > Glad to hear that 4.1.16 helps for the networking problem. >>>> The VNC >>>> > >> > > > > problem is also a known one but the mentioned patch does >>>> not work >>>> > >> > > > > at least for a few people. It seems the bug is somewhere in >>>> > >> > > > > libvncserver so downgrading net/libvncserver to an earlier >>>> version >>>> > >> > > > > (and rebuilding virtualbox) should help until we come up >>>> with a >>>> > >> > > > > proper fix. >>>> > >> > > > > >>>> > >> > > > >>>> > >> > > > You are right about the "Address already in use" problem and >>>> the >>>> > >> > > > patch for it so I will commit the fix in a few moments. >>>> > >> > > > >>>> > >> > > > I have also tried to reproduce the VNC crash but I couldn't. >>>> > >> Probably >>>> > >> > > > because >>>> > >> > > > my system is IPv6 enabled. 
flo@ has seen the same crash and >>>> has no >>>> > >> > > > IPv6 in his kernel which lead him to find this commit in >>>> > >> > > > libvncserver: >>>> > >> > > > >>>> > >> > > > >>>> > >> > > > commit 66282f58000c8863e104666c30cb67**b1d5cbdee3 >>>> > >> > > > Author: Kyle J. McKay >>>> > >> > > > Date: Fri May 18 00:30:11 2012 -0700 >>>> > >> > > > libvncserver/sockets.c: do not segfault when >>>> > >> > > > listenSock/listen6Sock == -1 >>>> > >> > > > >>>> > >> > > > http://libvncserver.git.** >>>> > >> sourceforge.net/git/gitweb.**cgi?p=libvncserver/ >>>> > >> > > > **libvncserver;a=commit;h=**66282f5< >>>> > >> >>>> http://libvncserver.git.sourceforge.net/git/gitweb.cgi?p=libvncserver/libvncserver;a=commit;h=66282f5 >>>> > >> > >>>> > >> > > > >>>> > >> > > > >>>> > >> > > > It looks promising so please test this patch if you can >>>> reproduce >>>> > >> the >>>> > >> > > > crash. >>>> > >> > > > >>>> > >> > > > >>>> > >> > > > -- >>>> > >> > > > Bernhard Froehlich >>>> > >> > > > http://www.bluelife.at/ >>>> > >> > > > >>>> > >> > > >>>> > >> > > Sorry, I tried to try this patch, but couldn't figure out how >>>> to do >>>> > >> > > that. I use ports to compile everything, and can see the file >>>> is at >>>> > >> > > >>>> > >> >>>> /usr/ports/net/libvncserver/work/LibVNCServer-0.9.9/libvncserver/sockets.c >>>> > >> > > . However, if I edit this file and do make clean, this patch >>>> is wiped >>>> > >> > > out before I can do "make" out of it. How to apply this patch >>>> in the >>>> > >> > > ports? >>>> > >> > >>>> > >> > To apply patches to ports: >>>> > >> > # make clean >>>> > >> > # make patch >>>> > >> > >>>> > >> > # make >>>> > >> > # make deinstall >>>> > >> > # make reinstall >>>> > >> > >>>> > >> > Note that the final two steps assume a version of the port is >>>> already >>>> > >> > installed. 
If not: 'make install' >>>> > >> > I you use portmaster, after applying the patch: 'portmaster -C >>>> > >> > net/libvncserver' -- >>>> > >> >>>> > >> flo has already committed the patch to net/libvncserver so I guess >>>> it >>>> > >> fixes the problem. Please update your portstree and verify that it >>>> works >>>> > >> fine. >>>> > >> >>>> > > >>>> > > I confirmed after upgrading all ports and noticing libvncserver >>>> upgraded >>>> > > to 0.99_1 and reboot, then I can vnc to the vms now. Also, >>>> starting vms >>>> > > with vnc doesn't have that error now, instead it issues the >>>> following info, >>>> > > so all problem are solved. >>>> > > >>>> > > 07/06/2012 03:49:14 Listening for VNC connections on TCP port 5903 >>>> > > 07/06/2012 03:49:14 Listening for VNC connections on TCP6 port 5903 >>>> > > >>>> > > Thanks everyone for your great help! >>>> > > >>>> > >>>> > Unfortunately, seems that the original problem of one vm disrupts all >>>> vms >>>> > and entire network appears to remain, albeit to less scope. After >>>> running >>>> > on virtualbox-ose-4.1.16_1 and libvncserver-0.9.9_1 for 12 hours, all >>>> vms >>>> > lost connection again. Also, phpvirtualbox stopped responding, and >>>> > attempts to restart vboxwebsrv hanged. And trying to kill (-9) the >>>> > vboxwebsrv process won't work. The following was the output of "ps >>>> > aux|grep -i box" at that time: >>>> > >>>> > root 3322 78.7 16.9 4482936 4248180 ?? Is 3:42AM 126:00.53 >>>> > /usr/local/bin/VBoxHeadless --startvm vm1 >>>> > root 3377 0.2 4.3 1286200 1078728 ?? Is 3:42AM 15:39.40 >>>> > /usr/local/bin/VBoxHeadless --startvm vm2 >>>> > root 3388 0.1 4.3 1297592 1084676 ?? Is 3:42AM 15:06.97 >>>> > /usr/local/bin/VBoxHeadless --startvm vm7 -n -m 5907 -o >>>> jtlgjkrfyh9tpgjklfds >>>> > root 2453 0.0 0.0 141684 7156 ?? Ts 3:38AM 4:14.09 >>>> > /usr/local/bin/vboxwebsrv >>>> > root 2478 0.0 0.0 45288 2528 ?? 
S 3:38AM 1:29.99 >>>> > /usr/local/lib/virtualbox/VBoxXPCOMIPCD >>>> > root 2494 0.0 0.0 121848 5380 ?? S 3:38AM 3:13.96 >>>> > /usr/local/lib/virtualbox/VBoxSVC --auto-shutdown >>>> > root 3333 0.0 4.3 1294712 1079608 ?? Is 3:42AM 19:35.09 >>>> > /usr/local/bin/VBoxHeadless --startvm vm3 >>>> > root 3355 0.0 4.3 1290424 1079332 ?? Is 3:42AM 16:43.05 >>>> > /usr/local/bin/VBoxHeadless --startvm vm5 >>>> > root 3366 0.0 8.5 2351436 2140076 ?? Is 3:42AM 17:32.35 >>>> > /usr/local/bin/VBoxHeadless --startvm vm6 >>>> > root 3598 0.0 4.3 1294520 1078664 ?? Ds 3:50AM 15:01.04 >>>> > /usr/local/bin/VBoxHeadless --startvm vm4 -n -m 5904 -o >>>> > u679y0uojlkdfsgkjtfds >>>> > >>>> > You can see the vboxwebsrv process has the "T" flag there, and the >>>> > vboxheadless process for vm4 has "D" flag there. Both of such >>>> processes I >>>> > can never kill them, not even with "kill -9". So on the host I >>>> disabled >>>> > the interface bridged to vm4 and restarted network, and fortunately >>>> both >>>> > the vm4 and the vboxwebsrv processed disappeared. And at that point >>>> all >>>> > other vms regained network. >>>> > >>>> > There may be one hope that the "troublemaker" may be limited to one >>>> of the >>>> > vms that started with vnc, although there was no vnc connection at >>>> that >>>> > time, and the other vm with vnc was fine. And this is just a hopeful >>>> guess. >>>> > >>>> > Also I found no log or error message related to virtualbox in any log >>>> > file. The VBoxSVC.log only had some information when started but >>>> never >>>> > since. >>>> >>>> If this is still a problem then >>>> >>>> ps alxww | grep -i box >>>> >>>> may be more helpful as it will show the wait channel of processes stuck >>>> in the kernel. >>>> >>>> Gary >>>> >>> >>> We avoided this problem by running all vms without vnc. But forgot this >>> problem and left one vm on with vnc, together with the other few running >>> vms yesterday, and hit this problem again on virtualbox 4.1.16. 
>>> Only the old trick of turning off the host interface corresponding to
>>> the vm with vnc and then restarting the host network got us out of the
>>> problem.
>>>
>>> We then upgraded virtualbox to 4.1.18, turned off all vms, waited until
>>> "ps aux|grep -i box" reported nothing, then started all vms, and left
>>> no vm running with vnc.
>>>
>>> Still the problem hit us again. Here is the output of
>>> "ps alxww | grep -i box" as you suggested:
>>>
>>> 1011 42725 1 0 20 0 1289796 1081064 IPRT S Is ?? 30:53.24 VBoxHeadless --startvm vm5
>>>
>>> after "kill -9 42725", the line changed to
>>>
>>> 1011 42725 1 0 20 0 1289796 1081064 keglim Ts ?? 30:53.24 VBoxHeadless --startvm vm5
>>>
>>> after "kill -9" for another vm, the line changed to something like
>>>
>>> 1011 42754 1 0 20 0 1289796 1081064 - Ts ?? 30:53.24 VBoxHeadless --startvm vm7
>>>
>>> and the controlvm commands don't work; the commands themselves hang.
>>> The following are their entries in the ps output:
>>>
>>> 0 89572 79180 0 21 0 44708 1644 select I+ v6 0:00.01 VBoxManage controlvm projects_outside acpipowerbutton
>>> 0 89605 89586 0 21 0 44708 2196 select I+ v7 0:00.01 VBoxManage controlvm projects_outside poweroff
>>>
>>> We have now rebooted the host, and left no vm running with vnc.
>>>
>>
>> The problem has become more rampant now. After rebooting and running
>> virtualbox-ose-4.1.18, with no vm started with console, the roughly 10
>> vms, each bridged to its own dedicated interface, lose network
>> connection a couple of times a day. Most times it recovers by itself
>> after about 10 minutes; sometimes we have to restart the host network,
>> which immediately restores all connections.
>>
>
> Looks like multiple people have had similar problems; see the threads at
>
> http://lists.freebsd.org/pipermail/freebsd-emulation/2011-July/008957.html
> http://lists.freebsd.org/pipermail/freebsd-stable/2011-July/063197.html
>
> and
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2011-July/063172.html
> http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/058708.html
> http://lists.freebsd.org/pipermail/freebsd-stable/2011-July/thread.html#63221
> http://lists.freebsd.org/pipermail/freebsd-emulation/2011-July/thread.html#8957
>
> In all cases (including mine), the hardware is a beefy Dell, such as a
> Dell R710, with multiple cores and multiple interfaces. Virtualbox is
> running on all machines. Some of them may also have ZFS running (like
> mine).
>
> My problem is that, after some time, the entire network on the host
> stops working, so that one can't ssh to any guest or other machine, not
> even to the host itself. At that point, restarting the network on the
> host restores everything. Each of my guests is bridged to its own
> interface on the host.
>
> Those other people also reported that they can't scp a large file from
> the host to a guest. (Un)fortunately I confirmed the same problem. When
> I scp a 10G file from host to guest, it stalls some time afterwards.
> And at that time, the entire network is broken, as described in the
> paragraph above.
>
> I tried everything they described, including:
>
> 1. setting net.graph.maxdata and net.graph.maxalloc to 65536 in
>    /boot/loader.conf
> 2. setting kern.ipc.maxsockbuf, net.graph.maxdgram and
>    net.graph.recvspace to 8388608 in /etc/sysctl.conf
> 3. setting kern.maxdsiz in /boot/loader.conf, although I don't know
>    what value to use instead, since the default is somehow 34G on my
>    24G-memory server.
>
> Unfortunately, none of them works, i.e., scp from host to guest always
> stalls after a variable amount of data (anywhere from dozens of MBs to
> 7GB).
> Scp from host to another machine appears to have no problem though.
>
> So it appears that something breaks after scp from host to guest has
> run for some time, but I don't know what. Still, this offers a reliable
> way to reproduce the problem. If you have any suggestions for checking
> what is broken or what to change to fix it, I'd be happy to comply.
> This is already in production, though, so I can't leave the vms down
> for more than a few minutes.
>

One more observation: the worsening may be related to my having
installed more vms lately. After shutting down a few vms, I do notice
this network problem occurring less frequently, albeit still about once
a day.

Also, do you think this may not be a virtualbox problem, and that people
in another freebsd subforum may have better expertise, as this may turn
out to be a network problem? I just want to post on the forum where
people would most relate to it, so your advice would be appreciated.
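
For reference, items 1 and 2 in the list quoted above correspond to
roughly the following entries. This is only a sketch reconstructed from
the values given there, not a recommendation (they did not fix the stall
here). kern.maxdsiz is left out because no value was given for it.

```
# /boot/loader.conf (boot-time tunables, take effect after a reboot)
net.graph.maxdata="65536"
net.graph.maxalloc="65536"

# /etc/sysctl.conf (applied at boot)
kern.ipc.maxsockbuf=8388608
net.graph.maxdgram=8388608
net.graph.recvspace=8388608
```

The current value of any of these can be checked by passing its name to
sysctl, for example "sysctl kern.ipc.maxsockbuf".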