Date: Fri, 15 Jan 2010 22:46:50 +0100
From: Floris Bos <info@je-eigen-domein.nl>
To: pyunyh@gmail.com
Cc: freebsd-net@freebsd.org
Subject: Re: kern/92090: [bge] bge: watchdog timeout -- resetting
Message-ID: <201001152246.50315.info@je-eigen-domein.nl>
In-Reply-To: <20100115185424.GG1228@michelle.cdnetworks.com>
References: <201001140140.o0E1e5hr072464@freefall.freebsd.org> <201001150333.59107.info@je-eigen-domein.nl> <20100115185424.GG1228@michelle.cdnetworks.com>
--Boundary-00=_KJOUL+qsQkRfsLe
Content-Type: Text/Plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

On Friday 15 January 2010 07:54:24 pm Pyun YongHyeon wrote:
> On Fri, Jan 15, 2010 at 03:33:58AM +0100, Floris Bos wrote:
> > On Friday 15 January 2010 01:53:16 am Pyun YongHyeon wrote:
> > > On Thu, Jan 14, 2010 at 09:48:56PM +0100, Floris Bos wrote:
> > > > On Thursday 14 January 2010 09:11:44 pm Pyun YongHyeon wrote:
> > > > > On Thu, Jan 14, 2010 at 09:08:02PM +0100, Floris Bos wrote:
> > > > > > On Thursday 14 January 2010 06:56:03 pm Pyun YongHyeon wrote:
> > > > > > > On Thu, Jan 14, 2010 at 04:33:19AM +0100, Floris Bos wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > On Thursday 14 January 2010 03:54:52 am Pyun YongHyeon wrote:
> > > > > > > > > > ==
> > > > > > > > > > bge0: <HP NC107i PCIe Gigabit Server Adapter, ASIC rev. 0x5784100> mem 0xdf900000-0xdf90ffff irq 16 at device 0.0 on pci32
> > > > > > > > > > ==
> > > > > > > > > >
> > > > > > > > > > After boot, the network works for about 5 seconds, barely enough time to get an IP by DHCP, and send a ping or 2.
> > > > > > > > > > Then network connectivity goes down, and after some time there is a "bge0: watchdog timeout -- resetting" message.
> > > > > > > > > >
> > > > > > > > > > Then network works again for 5 seconds, and goes down again. All the time, repeatedly.
> > > > > > > > > >
> > > > > > > > > > The system works fine under Ubuntu. So I assume the hardware is ok.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm not sure but it looks like you have a BCM5784 controller. What is
> > > > > > > > > the output of "devinfo -rv | grep phy"?
> > > > > > > >
> > > > > > > > ==
> > > > > > > > ukphy0 pnpinfo oui=0x50ef model=0x3a rev=0x4 at phyno=1
> > > > > > > > ukphy1 pnpinfo oui=0x50ef model=0x3a rev=0x4 at phyno=1
> > > > > > > > ==
> > > > > > >
> > > > > > > Support for the PHY was added in r202269.
> > > > > > > Please try again after applying the change. Or you can download
> > > > > > > sys/dev/mii/miidevs and sys/dev/mii/brgphy.c from HEAD and rebuild
> > > > > > > kernel.
> > > > > >
> > > > > > Fetched the latest source using CVS on another computer, and transferred it to the system concerned by USB stick.
> > > > > > Rebuilt the kernel, but the problem is still there.
> > > > > >
> > > > > Would you show me full dmesg output including "watchdog timeout"
> > > > > messages?
> > > >
> > > > ===
> > > > Copyright (c) 1992-2010 The FreeBSD Project.
> > > > Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
> > > > The Regents of the University of California. All rights reserved.
> > >
> > > [...]
> > >
> > > > bge0: <HP NC107i PCIe Gigabit Server Adapter, ASIC rev. 0x5784100> mem 0xdf900000-0xdf90ffff irq 16 at device 0.0 on pci32
> > > > miibus0: <MII bus> on bge0
> > > > brgphy0: <BCM5784 10/100/1000baseTX PHY> PHY 1 on miibus0
> > > > brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
> > > > bge0: Ethernet address: f4:ce:46:0f:2a:2c
> > > > bge0: [FILTER]
> > > > pcib4: <ACPI PCI-PCI bridge> irq 16 at device 28.5 on pci0
> > > > pci34: <ACPI PCI bus> on pcib4
> > > > bge1: <HP NC107i PCIe Gigabit Server Adapter, ASIC rev. 0x5784100> mem 0xdfa00000-0xdfa0ffff irq 17 at device 0.0 on pci34
> > > > miibus1: <MII bus> on bge1
> > > > brgphy1: <BCM5784 10/100/1000baseTX PHY> PHY 1 on miibus1
> > > > brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
> > > > bge1: Ethernet address: f4:ce:46:0f:2a:2d
> > > > bge1: [FILTER]
> > >
> > > [...]
> > >
> > > Would you give attached patch try? I don't know whether it help
> > > or not though. I couldn't find any related information for possible
> > > clue of the issue in publicly available datasheet.
> >
> > The patch did not make any difference.
> >
> >
> > However I did notice something else odd.
> > The problem only occurs on bge0, the second interface bge1 does work.
> >
> > I grabbed the U57DIAG diagnostic boot CD from the Broadcom site, and noticed that the first interface has ASF enabled, while the second one has not.
> > I disabled ASF by doing:
> >
> > =
> > b57udiag -cmd
> > setasf -d
> > ==
> >
> > And now the first interface also works properly.
> >
>
> Glad to hear you solved the issue. I totally forgot CURRENT enabled
> ASF support by default (hw.bge.allow_asf).
>
> > So there is something with the ASF stuff that conflicts with FreeBSD.
> > The IPMI card of the system is configured to use a dedicated 3rd LAN port, and is NOT sharing bge0.
> > But perhaps the NIC is initialized differently nevertheless when ASF firmware is enabled, and that is causing issues?
> >
>
> Yes, I remember there were a couple of issues related with ASF.
> Linux seems to have very complex logic to coexist with ASF/IPMI
> firmware which I don't still understand its implications at this
> time. bge(4) may need more robust code to handle that but datasheet
> seems to show very limited information. Lack of ASF/IPMI capable
> bge(4) controller also make me hard to experiment some code.

I can understand the difficulty of debugging such things without having the hardware.
So I did some more research myself, and found the bug.

You said Linux was complicated, so I took a look at the OpenSolaris bge source instead, to see how they do ASF things, and I noticed the following comment
( http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/bge/bge_chip2.c )

==
	/*
	 * The driver is supposed to notify ASF that the OS is still running
	 * every three seconds, otherwise the management server may attempt
	 * to reboot the machine. If it hasn't actually failed, this is
	 * not a desirable result. However, this isn't running as a real-time
	 * thread, and even if it were, it might not be able to generate the
	 * heartbeat in a timely manner due to system load. As it isn't a
	 * significant strain on the machine, we will set the interval to half
	 * of the required value.
	 */
==

What a coincidence: although the entire system is not rebooted, my network link went up & down every 3 seconds according to the switch.
It seems FreeBSD only notifies ASF every 5 seconds.

Attached is a patch that reduces it to 2 seconds, and it solves the problem for me, with ASF enabled.


Yours sincerely,

Floris Bos

--Boundary-00=_KJOUL+qsQkRfsLe
Content-Type: text/x-patch; charset="UTF-8"; name="bge_asf_driver_up.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="bge_asf_driver_up.patch"

--- if_bge.orig	2010-01-15 22:16:08.325626860 +0100
+++ if_bge.c	2010-01-15 22:16:58.724265514 +0100
@@ -3677,7 +3677,7 @@
 		if (sc->bge_asf_count)
 			sc->bge_asf_count --;
 		else {
-			sc->bge_asf_count = 5;
+			sc->bge_asf_count = 2;
 			bge_writemem_ind(sc, BGE_SOFTWARE_GENCOMM_FW,
 			    BGE_FW_DRV_ALIVE);
 			bge_writemem_ind(sc, BGE_SOFTWARE_GENNCOMM_FW_LEN, 4);

--Boundary-00=_KJOUL+qsQkRfsLe--
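
For readers who want to see how the reload value in the patch relates to the heartbeat spacing, the countdown can be modeled outside the driver. This is only a rough sketch, not the bge(4) code: `struct asf_model`, `asf_model_tick()` and `asf_model_interval()` are hypothetical names made up for illustration, the firmware write is replaced by a counter, and the model assumes the routine is called at a fixed rate, which the message above does not state.

```c
#include <assert.h>

/* Hypothetical stand-in for the driver softc; only the counter is modeled. */
struct asf_model {
	int asf_count;	/* calls remaining until the next heartbeat */
	int heartbeats;	/* heartbeats "sent" so far */
};

/*
 * Same countdown shape as the hunk in the patch: returns 1 when a
 * heartbeat would be written to the firmware, 0 otherwise.
 */
static int
asf_model_tick(struct asf_model *m, int reload)
{
	if (m->asf_count) {
		m->asf_count--;
		return (0);
	}
	m->asf_count = reload;
	m->heartbeats++;	/* the driver calls bge_writemem_ind() here */
	return (1);
}

/* Number of calls from one heartbeat to the next for a given reload value. */
static int
asf_model_interval(int reload)
{
	struct asf_model m = { 0, 0 };
	int calls = 0;

	assert(reload >= 0);
	(void)asf_model_tick(&m, reload);	/* first heartbeat, counter reloaded */
	do {
		calls++;
	} while (!asf_model_tick(&m, reload));
	return (calls);
}
```

In this model a reload value of N gives one heartbeat every N + 1 calls, so the wall-clock interval also depends on how often the driver invokes the routine. Under that assumption, lowering the reload value from 5 to 2 shortens the gap from six calls to three.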