Date: Fri, 27 Jul 2012 22:05:27 -0400
From: Steve Tuts <yiz5hwi@gmail.com>
To: freebsd-emulation@freebsd.org
Subject: Re: become worse now - Re: one virtualbox vm disrupts all vms and entire network - finally fixed now
Message-ID: <CAEXKtDo--vefivg6AJij7d0V2KHACR6f6ff9BLnGTbeO9YEBeQ@mail.gmail.com>

Finally fixed now; see the full reply at the end.

On Fri, Jul 13, 2012 at 4:34 PM, Steve Tuts <yiz5hwi@gmail.com> wrote:
> On Mon, Jul 9, 2012 at 9:11 AM, Steve Tuts <yiz5hwi@gmail.com> wrote:
> > On Tue, Jun 12, 2012 at 6:24 PM, Gary Palmer <gpalmer@freebsd.org> wrote:
> > > On Thu, Jun 07, 2012 at 03:56:22PM -0400, Steve Tuts wrote:
> > > > On Thu, Jun 7, 2012 at 3:54 AM, Steve Tuts <yiz5hwi@gmail.com> wrote:
> > > > > On Thu, Jun 7, 2012 at 2:58 AM, Bernhard Fröhlich <decke@bluelife.at> wrote:
> > > > > > On Do., 7. Jun. 2012 01:07:52 CEST, Kevin Oberman <kob6558@gmail.com> wrote:
> > > > > > > On Wed, Jun 6, 2012 at 3:46 PM, Steve Tuts <yiz5hwi@gmail.com> wrote:
> > > > > > > > On Wed, Jun 6, 2012 at 3:50 AM, Bernhard Froehlich <decke@freebsd.org> wrote:
> > > > > > > > > On 05.06.2012 20:16, Bernhard Froehlich wrote:
> > > > > > > > > > On 05.06.2012 19:05, Steve Tuts wrote:
> > > > > > > > > > > On Mon, Jun 4, 2012 at 4:11 PM, Rusty Nejdl <rnejdl@ringofsaturn.com> wrote:
> > > > > > > > > > > > On 2012-06-02 12:16, Steve Tuts wrote:
> > > > > > > > > > > > > Hi, we have a Dell PowerEdge server with a dozen interfaces. It
> > > > > > > > > > > > > hosts a few guests running web applications and email servers on
> > > > > > > > > > > > > VirtualBox-4.0.14. The host and all guests are FreeBSD 9.0 64-bit.
> > > > > > > > > > > > > Each guest is bridged to a distinct interface. The host and all
> > > > > > > > > > > > > guests are on the 10.0.0.0 network, NAT'ed to a Cisco router.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This ran well for a couple of months, until we added a new guest
> > > > > > > > > > > > > recently. Every few hours, none of the guests can be reached. We
> > > > > > > > > > > > > can only connect to the host from outside the router. We can
> > > > > > > > > > > > > also go to the console of the guests (except the new guest), but
> > > > > > > > > > > > > from there we can't ping the gateway 10.0.0.1 any more. The new
> > > > > > > > > > > > > guest just froze.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Furthermore, on the host we can see a VBoxHeadless process for
> > > > > > > > > > > > > each guest, including the new guest, but we cannot kill it, not
> > > > > > > > > > > > > even with "kill -9". We looked around the web and someone
> > > > > > > > > > > > > suggested we use "kill -SIGCONT" first, since the "ps" output
> > > > > > > > > > > > > shows the "T" flag for the new guest's VBoxHeadless process, but
> > > > > > > > > > > > > that doesn't help. We also tried all the VBoxManage commands to
> > > > > > > > > > > > > poweroff/reset etc. that new guest, but they all failed,
> > > > > > > > > > > > > complaining that the vm is in the Aborted state. We also tried
> > > > > > > > > > > > > the VBoxManage command to disconnect the network cable for that
> > > > > > > > > > > > > new guest; it didn't complain, but there was no effect.
> > > > > > > > > > > > >
> > > > > > > > > > > > > A couple of times, on the host we disabled the interface
> > > > > > > > > > > > > bridging that new guest, and then the VBoxHeadless process for
> > > > > > > > > > > > > that guest disappeared (we had attempted to kill it before
> > > > > > > > > > > > > that), and immediately all the other vms regained their
> > > > > > > > > > > > > connections.
> > > > > > > > > > > > >
> > > > > > > > > > > > > But one time even that didn't help - the VBoxHeadless process
> > > > > > > > > > > > > for that new guest stubbornly remained, and we had to reboot the
> > > > > > > > > > > > > host.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is already a production server, so we can't upgrade
> > > > > > > > > > > > > VirtualBox to the latest version until we obtain a test server.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Would you advise:
> > > > > > > > > > > > > 1. is there any other way to kill that new guest instead of
> > > > > > > > > > > > >    rebooting?
> > > > > > > > > > > > > 2. what might cause the problem?
> > > > > > > > > > > > > 3. what settings and tests can I use to analyze this problem?
> > > > > > > > > > > >
> > > > > > > > > > > > I haven't seen any comments on this and don't want you to think
> > > > > > > > > > > > you are being ignored, but I haven't seen this either. Also, the
> > > > > > > > > > > > 4.0 branch was buggier for me than the 4.1 releases, so yeah,
> > > > > > > > > > > > upgrading is probably what you are looking at.
> > > > > > > > > > > >
> > > > > > > > > > > > Rusty Nejdl
> > > > > > > > > > >
> > > > > > > > > > > Sorry, I just realized my reply yesterday didn't go to the list, so
> > > > > > > > > > > I am re-sending it with some updates.
> > > > > > > > > > >
> > > > > > > > > > > Yes, we upgraded all ports, and fortunately everything came back;
> > > > > > > > > > > in particular, all vms have run peacefully for two days now. So
> > > > > > > > > > > upgrading to the latest VirtualBox 4.1.16 solved that problem.
> > > > > > > > > > >
> > > > > > > > > > > But now we have a new problem with this version of VirtualBox:
> > > > > > > > > > > whenever we try to vnc to any vm, that vm goes to the Aborted
> > > > > > > > > > > state immediately. Actually, merely telnetting from within the
> > > > > > > > > > > host to the vnc port of a vm will immediately abort that vm. This
> > > > > > > > > > > prevents us from adding new vms. Also, when starting a vm with a
> > > > > > > > > > > vnc port, we get this message:
> > > > > > > > > > >
> > > > > > > > > > > rfbListenOnTCP6Port: error in bind IPv6 socket: Address already in use
> > > > > > > > > > >
> > > > > > > > > > > for which we found someone else provided a patch at
> > > > > > > > > > > http://permalink.gmane.org/gmane.os.freebsd.devel.emulation/10237
> > > > > > > > > > >
> > > > > > > > > > > So it looks like multiple vms on an IPv6 system (we have 64-bit
> > > > > > > > > > > FreeBSD 9.0) will hit this problem.
> > > > > > > > > >
> > > > > > > > > > Glad to hear that 4.1.16 helps for the networking problem.
> > > > > > > > > > The VNC problem is also a known one, but the mentioned patch does
> > > > > > > > > > not work, at least for a few people. It seems the bug is somewhere
> > > > > > > > > > in libvncserver, so downgrading net/libvncserver to an earlier
> > > > > > > > > > version (and rebuilding virtualbox) should help until we come up
> > > > > > > > > > with a proper fix.
> > > > > > > > >
> > > > > > > > > You are right about the "Address already in use" problem and the
> > > > > > > > > patch for it, so I will commit the fix in a few moments.
> > > > > > > > >
> > > > > > > > > I have also tried to reproduce the VNC crash but couldn't, probably
> > > > > > > > > because my system is IPv6 enabled. flo@ has seen the same crash and
> > > > > > > > > has no IPv6 in his kernel, which led him to find this commit in
> > > > > > > > > libvncserver:
> > > > > > > > >
> > > > > > > > > commit 66282f58000c8863e104666c30cb67b1d5cbdee3
> > > > > > > > > Author: Kyle J. McKay <mackyle@gmail.com>
> > > > > > > > > Date:   Fri May 18 00:30:11 2012 -0700
> > > > > > > > >
> > > > > > > > >     libvncserver/sockets.c: do not segfault when listenSock/listen6Sock == -1
> > > > > > > > >
> > > > > > > > > http://libvncserver.git.sourceforge.net/git/gitweb.cgi?p=libvncserver/libvncserver;a=commit;h=66282f5
> > > > > > > > >
> > > > > > > > > It looks promising, so please test this patch if you can reproduce
> > > > > > > > > the crash.
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Bernhard Froehlich
> > > > > > > > > http://www.bluelife.at/
> > > > > > > >
> > > > > > > > Sorry, I tried to apply this patch but couldn't figure out how. I
> > > > > > > > use ports to compile everything, and I can see the file at
> > > > > > > > /usr/ports/net/libvncserver/work/LibVNCServer-0.9.9/libvncserver/sockets.c
> > > > > > > > However, if I edit this file and then do "make clean", the patch is
> > > > > > > > wiped out before I can "make" with it. How do I apply this patch in
> > > > > > > > the ports tree?
> > > > > > >
> > > > > > > To apply patches to ports:
> > > > > > > # make clean
> > > > > > > # make patch
> > > > > > > <Apply patch>
> > > > > > > # make
> > > > > > > # make deinstall
> > > > > > > # make reinstall
> > > > > > >
> > > > > > > Note that the final two steps assume a version of the port is already
> > > > > > > installed. If not: 'make install'
> > > > > > > If you use portmaster, after applying the patch:
> > > > > > > 'portmaster -C net/libvncserver'
> > > > > >
> > > > > > flo has already committed the patch to net/libvncserver, so I guess it
> > > > > > fixes the problem. Please update your ports tree and verify that it
> > > > > > works fine.
> > > > >
> > > > > Confirmed: after upgrading all ports, noticing that libvncserver was
> > > > > upgraded to 0.9.9_1, and rebooting, I can vnc to the vms now. Also,
> > > > > starting vms with vnc no longer produces that error; instead it prints
> > > > > the following info, so all problems are solved:
> > > > >
> > > > > 07/06/2012 03:49:14 Listening for VNC connections on TCP port 5903
> > > > > 07/06/2012 03:49:14 Listening for VNC connections on TCP6 port 5903
> > > > >
> > > > > Thanks everyone for your great help!
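(For anyone replaying the recipe quoted above on this particular port, here is a minimal sketch. The sockets.c path is the one Steve quoted earlier in the thread; "vi" stands in for whatever editor or patch(1) invocation you use to apply the fix by hand.)

  # cd /usr/ports/net/libvncserver
  # make clean
  # make patch
  # vi work/LibVNCServer-0.9.9/libvncserver/sockets.c   (apply the fix here)
  # make
  # make deinstall
  # make reinstall

("make patch" extracts the sources and applies the port's own patches, so hand edits made after it survive until the next "make clean".)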
> > > > Unfortunately, the original problem of one vm disrupting all vms and
> > > > the entire network appears to remain, albeit with smaller scope. After
> > > > running on virtualbox-ose-4.1.16_1 and libvncserver-0.9.9_1 for 12
> > > > hours, all vms lost their connections again. Also, phpvirtualbox
> > > > stopped responding, attempts to restart vboxwebsrv hung, and trying to
> > > > kill (-9) the vboxwebsrv process didn't work. The following was the
> > > > output of "ps aux | grep -i box" at that time:
> > > >
> > > > root 3322 78.7 16.9 4482936 4248180 ?? Is 3:42AM 126:00.53 /usr/local/bin/VBoxHeadless --startvm vm1
> > > > root 3377 0.2 4.3 1286200 1078728 ?? Is 3:42AM 15:39.40 /usr/local/bin/VBoxHeadless --startvm vm2
> > > > root 3388 0.1 4.3 1297592 1084676 ?? Is 3:42AM 15:06.97 /usr/local/bin/VBoxHeadless --startvm vm7 -n -m 5907 -o jtlgjkrfyh9tpgjklfds
> > > > root 2453 0.0 0.0 141684 7156 ?? Ts 3:38AM 4:14.09 /usr/local/bin/vboxwebsrv
> > > > root 2478 0.0 0.0 45288 2528 ?? S 3:38AM 1:29.99 /usr/local/lib/virtualbox/VBoxXPCOMIPCD
> > > > root 2494 0.0 0.0 121848 5380 ?? S 3:38AM 3:13.96 /usr/local/lib/virtualbox/VBoxSVC --auto-shutdown
> > > > root 3333 0.0 4.3 1294712 1079608 ?? Is 3:42AM 19:35.09 /usr/local/bin/VBoxHeadless --startvm vm3
> > > > root 3355 0.0 4.3 1290424 1079332 ?? Is 3:42AM 16:43.05 /usr/local/bin/VBoxHeadless --startvm vm5
> > > > root 3366 0.0 8.5 2351436 2140076 ?? Is 3:42AM 17:32.35 /usr/local/bin/VBoxHeadless --startvm vm6
> > > > root 3598 0.0 4.3 1294520 1078664 ?? Ds 3:50AM 15:01.04 /usr/local/bin/VBoxHeadless --startvm vm4 -n -m 5904 -o u679y0uojlkdfsgkjtfds
> > > >
> > > > You can see that the vboxwebsrv process has the "T" flag there, and
> > > > the VBoxHeadless process for vm4 has the "D" flag. Neither of these
> > > > processes could be killed, not even with "kill -9". So on the host I
> > > > disabled the interface bridged to vm4 and restarted the network, and
> > > > fortunately both the vm4 and the vboxwebsrv processes disappeared. At
> > > > that point all the other vms regained network.
> > > >
> > > > There is some hope that the "troublemaker" may be limited to one of
> > > > the vms started with vnc, although there was no vnc connection at the
> > > > time, and the other vm with vnc was fine. But this is just a hopeful
> > > > guess.
> > > >
> > > > Also, I found no log or error message related to virtualbox in any log
> > > > file. VBoxSVC.log only had some information from startup, and nothing
> > > > since.
> > >
> > > If this is still a problem, then
> > >
> > > ps alxww | grep -i box
> > >
> > > may be more helpful, as it will show the wait channel of processes stuck
> > > in the kernel.
> > >
> > > Gary
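(A side note on Gary's suggestion: the wait channel he refers to appears in the MWCHAN column of ps output. A minimal way to pull out just the relevant columns for the VirtualBox processes, assuming nothing beyond stock FreeBSD ps, might be:)

  # ps -ax -o pid,state,mwchan,command | grep -i box

(A process in state D is in an uninterruptible kernel sleep on that wait channel; a kill -9 is queued but cannot take effect until the process leaves the sleep, which is why such processes appear unkillable.)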
> > We avoided this problem by running all vms without vnc. But we forgot
> > about it and left one vm running with vnc yesterday, along with the few
> > other running vms, and hit this problem again on VirtualBox 4.1.16. Only
> > the old trick of turning off the host interface corresponding to the vm
> > with vnc and then restarting the host network got us out of it.
> >
> > We then upgraded VirtualBox to 4.1.18, turned off all vms, waited until
> > "ps aux | grep -i box" reported nothing, then started all vms, leaving
> > no vm with vnc running.
> >
> > Still the problem hit us again. Here is the output of
> > "ps alxww | grep -i box" as you suggested:
> >
> > 1011 42725 1 0 20 0 1289796 1081064 IPRT S Is ?? 30:53.24 VBoxHeadless --startvm vm5
> >
> > After "kill -9 42725", the line changed to
> >
> > 1011 42725 1 0 20 0 1289796 1081064 keglim Ts ?? 30:53.24 VBoxHeadless --startvm vm5
> >
> > After "kill -9" for another vm, its line changed to something like
> >
> > 1011 42754 1 0 20 0 1289796 1081064 - Ts ?? 30:53.24 VBoxHeadless --startvm vm7
> >
> > The controlvm commands don't work either; they got stuck themselves.
> > The following are their entries in ps:
> >
> > 0 89572 79180 0 21 0 44708 1644 select I+ v6 0:00.01 VBoxManage controlvm projects_outside acpipowerbutton
> > 0 89605 89586 0 21 0 44708 2196 select I+ v7 0:00.01 VBoxManage controlvm projects_outside poweroff
> >
> > We have now rebooted the host and left no vm with vnc running.
>
> The problem has become more rampant now. After rebooting and running
> virtualbox-ose-4.1.18, with no vm started with a console, the roughly 10
> vms, each bridged to its own dedicated interface, lose network
> connectivity a couple of times a day. Most times it recovers by itself
> after about 10 minutes; sometimes we have to restart the host network,
> which immediately restores all connections.

It turned out this might not be a VirtualBox problem at all: searching for
"freebsd network problem" turns up quite a few similar reports, and one
page in particular describes the Broadcom bce card problem (which is the
card we have) and its solution. After adding

kern.ipc.nmbclusters="131072"
hw.bce.tso_enable=0
hw.pci.enable_msix=0

to /boot/loader.conf.local and rebooting, no network problem has occurred
for 3.5 days, whereas before there were 3-7 occurrences of the network
problem every day. So consider this problem finally solved. It appears to
have been a Broadcom driver issue, and probably a system tuning issue.
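(A minimal sketch of this fix for anyone replaying it. The three tunables and their values are exactly those listed above, written in the usual loader.conf(5) quoted style; the verification commands afterwards are standard FreeBSD tools and an editorial addition, not something from the thread. As a hedged aside, the "keglim" wait channel seen earlier is what a process shows when it is blocked on a full kernel memory (UMA) zone, such as exhausted mbuf clusters, which fits raising kern.ipc.nmbclusters being part of the cure.)

  # cat /boot/loader.conf.local
  kern.ipc.nmbclusters="131072"
  hw.bce.tso_enable="0"
  hw.pci.enable_msix="0"
  # shutdown -r now

(After the reboot, confirm the settings took effect; loader tunables show up in the kernel environment:)

  # sysctl kern.ipc.nmbclusters
  # kenv hw.bce.tso_enable
  # kenv hw.pci.enable_msix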