Date:      Mon, 27 May 2013 13:41:08 +0000
From:      "Teske, Devin" <Devin.Teske@fisglobal.com>
To:        Daniel Braniss <danny@cs.huji.ac.il>
Cc:        "<pyunyh@gmail.com>" <pyunyh@gmail.com>, Devin Teske <dteske@freebsd.org>, FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>
Subject:   Re: SunFire X2200 ilo's bge1 DOWN/UP
Message-ID:  <13CA24D6AB415D428143D44749F57D7201F62C26@ltcfiswmsgmb21>
In-Reply-To: <E1UgsL2-000DBa-El@kabab.cs.huji.ac.il>
References:  <E1UgsL2-000DBa-El@kabab.cs.huji.ac.il>


On May 27, 2013, at 12:59 AM, Daniel Braniss wrote:

On Fri, May 24, 2013 at 05:31:13PM +0300, Daniel Braniss wrote:
hi, after upgrading to 9.1-stable, this particular hardware - SunFire X2200,


If you're truly running stable/9, and it's up-to-date, you should already have
SVN revisions 248858 and 250650, both of which have significant impact for
(a) the SunFire X2200 (r248858) and (b) the DOWN/UP problem (r250650).


Show me dmesg (bge(4) and brgphy(4) only) and 'ifconfig bge1' output.


bge0: <Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x009003> mem 0xfdff0000-0xfdffffff,0xfdfe0000-0xfdfeffff irq 17 at device 4.0 on pci6
bge0: CHIP ID 0x00009003; ASIC REV 0x09; CHIP REV 0x90; PCI-X 133 MHz
miibus2: <MII bus> on bge0
brgphy0: <BCM5714 1000BASE-T media interface> PHY 1 on miibus2
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
bge0: Ethernet address: 00:1b:24:5d:5b:bd
bge1: <Broadcom NetXtreme Gigabit Ethernet Controller, ASIC rev. 0x009003> mem 0xfdfc0000-0xfdfcffff,0xfdfb0000-0xfdfbffff irq 18 at device 4.1 on pci6
bge1: CHIP ID 0x00009003; ASIC REV 0x09; CHIP REV 0x90; PCI-X 133 MHz
miibus3: <MII bus> on bge1
brgphy1: <BCM5714 1000BASE-T media interface> PHY 1 on miibus3
brgphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-master, 1000baseT-FDX, 1000baseT-FDX-master, auto, auto-flow
bge1: Ethernet address: 00:1b:24:5d:5b:be

sf-10> ifconfig bge1
bge1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
       options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
       ether 00:1b:24:5d:5b:be
       nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
       media: Ethernet autoselect (100baseTX <full-duplex>)
       status: active


I saw similar things happening over here with a different Broadcom chipset, and
the above revisions helped significantly (URLs below):

http://svnweb.freebsd.org/base?view=revision&revision=248858
http://svnweb.freebsd.org/base?view=revision&revision=250650



is toggling bge1 DOWN/UP every few hours; this port is being used by the ILO.
To check, I upgraded another identical host, and the same problem appears.

What is the last known working revision?

I have no idea, but I have older versions, and I'll start from the oldest
(9.1-prerelease), but
it will take time, since it takes hours until it happens.


There are ways you can speed up the replication time. I tend to flood a server with
TCP, though I've heard of it happening under a UDP flood too.

Here's a nice way to flood a server with TCP (assuming you have SSH access to the
system via keys):

sh -c 'while :;do dd if=/dev/urandom of=/dev/stdout bs=1m count=1024 | ssh HOST2KILL /sbin/md5; done'

Run that about 16 times in separate screen sessions from various other hosts on your network,
taking care to replace "HOST2KILL" with the hostname or IP of the box with the SunFire X2200.
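
If juggling that many screen sessions is a nuisance, the same loop can be started several times in the background on each sending host. This is only a sketch under assumptions: the function name start_floods, its argument order, and the DRY_RUN switch are made up for illustration, and bs=1m is the FreeBSD dd spelling (GNU dd wants bs=1M).

```sh
#!/bin/sh
# Sketch: start COUNT background copies of the dd | ssh | md5 loop.
# start_floods and DRY_RUN are hypothetical names, not part of any tool.

start_floods() {
    target=$1          # host to flood, e.g. HOST2KILL
    count=${2:-16}     # number of parallel loops; 16 as suggested above
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ -n "${DRY_RUN:-}" ]; then
            # Dry run: only report what would be started.
            echo "would start flood loop $i against $target"
        else
            # Each subshell pushes 1 GiB of random data per iteration
            # through ssh and lets the remote md5 consume it.
            ( while :; do
                  dd if=/dev/urandom bs=1m count=1024 2>/dev/null |
                      ssh "$target" /sbin/md5 >/dev/null
              done ) &
        fi
        i=$((i + 1))
    done
}
```

For example, `DRY_RUN=1 start_floods HOST2KILL 4` prints four lines without sending anything, while `start_floods HOST2KILL 4` starts four loops that keep running until they are killed.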

Let that run for a while, and then when you think you've had a reset (if you weren't standing
there watching for one)...

grep 'bge.*DOWN' /var/log/messages

On a system that has booted and stayed up-and-running, there shouldn't be any messages like this:

bge0: link state changed to DOWN

When you actually get this message (if your experience is like ours), you'll be down for 90 seconds
while the NIC resets.
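
To compare runs across kernels, it can help to count the DOWN events per interface rather than eyeball the grep output. A minimal sketch, assuming the log lines contain "bgeN: link state changed to DOWN" as shown above; the function name count_link_downs is hypothetical:

```sh
#!/bin/sh
# Sketch: count "link state changed to DOWN" events per bge interface.
# count_link_downs is an illustrative name; the log path defaults to
# /var/log/messages as used in the grep above.

count_link_downs() {
    awk '
    /bge[0-9]+: link state changed to DOWN/ {
        # Find the "bgeN:" field and strip the trailing colon.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^bge[0-9]+:$/) {
                name = $i
                sub(/:$/, "", name)
                n[name]++
            }
    }
    END { for (k in n) print k, n[k] }
    ' "${1:-/var/log/messages}" | sort
}
```

Calling `count_link_downs` with no argument reads /var/log/messages and prints one "interface count" line per interface that went down at least once.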

However, since you say you have some older 9.1 releases... I'd start by first trying to bring the
replication time of the problem down by using TCP and/or UDP floods. That way you'll be able to
test for resolution of the problem as you progress up to stable/9 (where the problem should be fixed
by the aforementioned SVN revisions, specific to your hardware).




There
is no correlation with time, since they happened at totally different times.
I rebooted both hosts at almost the same time.
one host :
uptime: 5:24PM  up  6:15, 0 users, load averages: 0.00, 0.00, 0.00
May 24 12:53:52 sf-04 kernel: bge1: link state changed to DOWN
May 24 12:53:55 sf-04 kernel: bge1: link state changed to UP
May 24 15:34:25 sf-04 kernel: bge1: link state changed to DOWN
May 24 15:34:28 sf-04 kernel: bge1: link state changed to UP

and
uptime: 5:24PM  up  6:14, 0 users, load averages: 0.00, 0.00, 0.00

May 24 16:30:44 sf-10 kernel: bge1: link state changed to DOWN
May 24 16:30:44 sf-10 kernel: bge1: link state changed to UP

this is not serious, the ilo (ssh) connection is ok, but it's annoying: we have
more than 10 of these hosts, and if I upgrade all of them, the logs will fill up
with this :-)

any ideas?


Well, you say the connection is OK... so it doesn't sound like a full reset as it
was in our case (we have a different chipset).
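
One way to tell a brief flap from a long reset is to measure the gap between each DOWN and the following UP. A rough sketch, assuming the traditional syslog layout where the third field is HH:MM:SS; the function name link_outages is hypothetical, and a DOWN/UP pair that straddles midnight would come out wrong:

```sh
#!/bin/sh
# Sketch: print "interface seconds" for each DOWN -> UP pair in a log.
# link_outages is an illustrative name. Assumes field 3 is HH:MM:SS.

link_outages() {
    awk '
    /bge[0-9]+: link state changed to (DOWN|UP)/ {
        # Convert the timestamp to seconds since midnight.
        split($3, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        # Find the "bgeN:" field and strip the trailing colon.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^bge[0-9]+:$/) {
                ifname = $i
                sub(/:$/, "", ifname)
            }
        if ($NF == "DOWN")
            down[ifname] = now
        else if (ifname in down) {
            printf "%s %d\n", ifname, now - down[ifname]
            delete down[ifname]
        }
    }
    ' "${1:-/var/log/messages}"
}
```

Run against the sf-04 lines quoted earlier in this message, each DOWN/UP pair is 3 seconds apart, well short of the 90 seconds we saw.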

But I agree that a log full of those would be annoying.

Try getting up to stable/9 in its current state (note: stable/8 also has all the
aforementioned revisions).
--
Devin



