Date: Mon, 27 May 2013 13:41:08 +0000
From: "Teske, Devin" <Devin.Teske@fisglobal.com>
To: Daniel Braniss <danny@cs.huji.ac.il>
Cc: "<pyunyh@gmail.com>" <pyunyh@gmail.com>, Devin Teske <dteske@freebsd.org>, FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>
Subject: Re: SunFire X2200 ilo's bge1 DOWN/UP
Message-ID: <13CA24D6AB415D428143D44749F57D7201F62C26@ltcfiswmsgmb21>
In-Reply-To: <E1UgsL2-000DBa-El@kabab.cs.huji.ac.il>
References: <E1UgsL2-000DBa-El@kabab.cs.huji.ac.il>

On May 27, 2013, at 12:59 AM, Daniel Braniss wrote:

>> On Fri, May 24, 2013 at 05:31:13PM +0300, Daniel Braniss wrote:
>>> hi, after upgrading to 9.1-stable, this particular hardware - SunFire X2200,

If you're truly running stable/9, and it's up-to-date, you should already have
SVN revisions 248858 and 250650. Both have significant impact: (a) on the
SunFire X2200 (r248858) and (b) on the DOWN/UP problem (r250650).

>> Show me dmesg(bge(4) and brgphy(4) only) and 'ifconfig bge1' output.
>
> bge0: <Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x009003> mem 0xfdff0000-0xfdffffff,0xfdfe0000-0xfdfeffff irq 17 at device 4.0 on pci6
> bge0: CHIP ID 0x00009003; ASIC REV 0x09; CHIP REV 0x90; PCI-X 133 MHz
> miibus2: <MII bus> on bge0
> brgphy0: <BCM5714 1000BASE-T media interface> PHY 1 on miibus2
> brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
> bge0: Ethernet address: 00:1b:24:5d:5b:bd
> bge1: <Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x009003> mem 0xfdfc0000-0xfdfcffff,0xfdfb0000-0xfdfbffff irq 18 at device 4.1 on pci6
> bge1: CHIP ID 0x00009003; ASIC REV 0x09; CHIP REV 0x90; PCI-X 133 MHz
> miibus3: <MII bus> on bge1
> brgphy1: <BCM5714 1000BASE-T media interface> PHY 1 on miibus3
> brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
> bge1: Ethernet address: 00:1b:24:5d:5b:be
>
> sf-10> ifconfig bge1
> bge1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
>         options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
>         ether 00:1b:24:5d:5b:be
>         nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
>         media: Ethernet autoselect (100baseTX <full-duplex>)
>         status: active

Saw similar things happening over here with a different Broadcom chipset, and
the above revisions helped significantly (URLs below):

http://svnweb.freebsd.org/base?view=revision&revision=248858
http://svnweb.freebsd.org/base?view=revision&revision=250650

>>> is toggling bge1 DOWN/UP every few hours, this port is being used by the ILO.
>>> To check, I upgraded another identical host, and the same problem appears.
>>
>> What is the last known working revision?
>
> I have no idea, but I have older versions, and I'll start from the oldest
> (9.1-prerelease), but it will take time, since it takes hours till it happens.

There are ways you can speed up the replication time. I tend to flood a server
with TCP, though I've heard of it happening under UDP flood too.

Here's a nice way to flood a server with TCP (assuming you have SSH access to
the system via keys):

    sh -c 'while :;do dd if=/dev/urandom of=/dev/stdout bs=1m count=1024 | ssh HOST2KILL /sbin/md5; done'

Run that about 16 times in separate screen sessions from various other hosts on
your network, taking care to replace "HOST2KILL" with the hostname or IP of the
box with the SunFire X2200.

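If juggling 16 screen sessions is a nuisance, the same loop can be started several times in the background from one shell. This is only a rough sketch: flood_cmd and start_floods are made-up names, the default of 16 workers simply mirrors the number above, and it assumes the same key-based SSH access and a FreeBSD dd (bs=1m). Spreading the load across several source hosts, as described above, is still the better test.

```sh
#!/bin/sh
# Illustrative helper only; flood_cmd and start_floods are hypothetical names.

# Print the flood loop from the message with the target host substituted in.
flood_cmd() {
    printf 'while :; do dd if=/dev/urandom of=/dev/stdout bs=1m count=1024 | ssh %s /sbin/md5; done\n' "$1"
}

# Start N copies of the loop in the background (default 16) and print
# one PID per line so the loops can be killed later.
# Assumes key-based SSH access to the target.
start_floods() {
    target=$1
    workers=${2:-16}
    i=0
    while [ "$i" -lt "$workers" ]; do
        sh -c "$(flood_cmd "$target")" >/dev/null 2>&1 &
        echo "$!"
        i=$((i + 1))
    done
}

# Usage (not run here): start_floods HOST2KILL 16 > /tmp/flood.pids
```

The PIDs written out are those of the looping shells, so passing them to kill stops new transfers from starting; a dd/ssh pair already in flight will finish its current 1 GB first.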
Let that run for a while, and then when you think you've had a reset (if you
weren't standing there watching for one)…

    grep 'bge.*DOWN' /var/log/messages

On a system that has booted and stayed up-and-running, there shouldn't be any
messages like this:

    bge0: link state changed to DOWN

When you actually get this message (if your experience is like ours), you'll be
down for 90 seconds while the NIC resets.

However, since you say you have some older 9.1 releases… I'd start by first
trying to bring the replication time of the problem down by using TCP and/or
UDP floods. That way you'll be able to test for resolution of the problem as
you progress up to stable/9 (where the problem should be fixed by the
aforementioned SVN revisions -- specific to your hardware).

>>> There is no correlation with time, since they happened at totally different times.
>>> I rebooted both hosts at almost the same time.
>>> one host:
>>> uptime: 5:24PM up 6:15, 0 users, load averages: 0.00, 0.00, 0.00
>>> May 24 12:53:52 sf-04 kernel: bge1: link state changed to DOWN
>>> May 24 12:53:55 sf-04 kernel: bge1: link state changed to UP
>>> May 24 15:34:25 sf-04 kernel: bge1: link state changed to DOWN
>>> May 24 15:34:28 sf-04 kernel: bge1: link state changed to UP
>>> and
>>> uptime: 5:24PM up 6:14, 0 users, load averages: 0.00, 0.00, 0.00
>>> May 24 16:30:44 sf-10 kernel: bge1: link state changed to DOWN
>>> May 24 16:30:44 sf-10 kernel: bge1: link state changed to UP
>>>
>>> this is not serious, the ilo (ssh) connection is ok, but it's annoying, we
>>> have more than 10 of these hosts, and if I upgrade all of them, the logs
>>> will fill up with this :-)
>>>
>>> any ideas?

Well, you say the connection is OK… so it doesn't sound like a full reset as it
was in our case (we have a different chipset). But I agree that a log full of
those would be annoying.

Try getting up to stable/9 in its current state (note: stable/8 also has all
the aforementioned revisions).

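For the log check mentioned earlier (grep 'bge.*DOWN' /var/log/messages), a per-interface count may make it easier to compare runs as you move between revisions. A minimal sketch, assuming the syslog line format shown in the quoted output; count_link_downs is a made-up name:

```sh
#!/bin/sh
# Illustrative helper only; count_link_downs is a hypothetical name.
# Counts "link state changed to DOWN" kernel messages per bge interface.
count_link_downs() {
    logfile=${1:-/var/log/messages}
    awk '/kernel: bge[0-9]+: link state changed to DOWN/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^bge[0-9]+:$/) {
                name = $i
                sub(/:$/, "", name)   # strip the trailing colon
                count[name]++
                break
            }
        }
    }
    END { for (name in count) print name, count[name] }' "$logfile" | sort
}

# Usage: count_link_downs            (reads /var/log/messages)
#        count_link_downs /path/to/saved.log
```

Run it after each flood session; an empty result means no DOWN events were logged for any bge interface.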
-- 
Devin

_____________
The information contained in this message is proprietary and/or confidential. If you are not the intended recipient, please: (i) delete the message and all copies; (ii) do not disclose, distribute or use the message in any manner; and (iii) notify the sender immediately. In addition, please be aware that any message addressed to our domain is subject to archiving and review by persons other than the intended recipient. Thank you.