Date:      Mon, 5 Oct 2009 09:57:49 -0700
From:      Jack Vogel <jfvogel@gmail.com>
To:        Daniel Bond <db@danielbond.org>
Cc:        freebsd-stable@freebsd.org, Rudy <crapsh@monkeybrains.net>
Subject:   Re: em0 watchdog timeouts
Message-ID:  <2a41acea0910050957x2d085e90w2ebea7f9eb87c3e4@mail.gmail.com>
In-Reply-To: <6194E9BC-3A3D-4941-A777-88C7411905B0@danielbond.org>
References:  <4AB9638B.8040607@monkeybrains.net> <4AC3DB8F.7010602@monkeybrains.net> <2a41acea0909301556g1df7dbafv813f5924553c8bfb@mail.gmail.com> <4AC5198E.7030609@monkeybrains.net> <4AC51B4C.7080905@monkeybrains.net> <2a41acea0910011450v41590f3dn112f367f26faed2d@mail.gmail.com> <4AC64835.3060107@monkeybrains.net> <2a41acea0910021237w415efa2cs4354a0f99aef8f6@mail.gmail.com> <4AC66437.4040704@monkeybrains.net> <6194E9BC-3A3D-4941-A777-88C7411905B0@danielbond.org>

This posting just muddies the issue. First you talk about having a problem
that involves Broadcom; OK, so post about that on something other than em :)

Then you make some references to hardware that you "might have bought"
but didn't. I'm not about debugging 'possible worlds problems', though, so
I can't help you there either :)

Finally, you never say what the actual hardware is, other than that a person
I do not know told you it was the best performer... so, what exactly is it?

You have a problem once every 10 days, and at a specific time no less.
This almost always means something in your environment: a cron job run
amok, a piece of hardware that resets, I dunno. But the last thing I would
suspect given this description is the driver.
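
One way to check whether the timeouts really cluster at a particular time of
day is to bucket the kernel log messages by hour and compare the result with
the cron schedule. The following is a minimal sketch, not a tested tool: it
assumes standard syslog-style lines and the "watchdog timeout -- resetting"
message quoted in this thread. The function name and the sample log lines are
hypothetical, made up purely for illustration.

```python
import re
from collections import Counter

# Assumed syslog layout, e.g.:
#   Oct  3 11:02:14 mango kernel: em0: watchdog timeout -- resetting
WATCHDOG_RE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+\S+\s+kernel:\s+"
    r"(?P<dev>[a-z]+\d+): watchdog timeout"
)


def watchdog_hours(lines):
    """Count watchdog timeout messages per (device, hour of day)."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG_RE.match(line)
        if match:
            counts[(match.group("dev"), int(match.group("hour")))] += 1
    return counts


# Hypothetical sample data; real input would come from the system log.
SAMPLE = [
    "Oct  3 11:02:14 mango kernel: em0: watchdog timeout -- resetting",
    "Oct  3 11:02:44 mango kernel: em0: watchdog timeout -- resetting",
    "Oct  4 12:40:09 mango kernel: em0: watchdog timeout -- resetting",
    "Oct  2 13:26:07 mango kernel: em4: tx_int_delay = 1, tx_abs_int_delay = 0",
]

if __name__ == "__main__":
    for (dev, hour), n in sorted(watchdog_hours(SAMPLE).items()):
        print(f"{dev} {hour:02d}:00 {n}")
```

To use it on a real machine, pass the lines of the system log (for example
an open file object for /var/log/messages) instead of SAMPLE. A tight
cluster in one or two hours would point toward something scheduled in the
environment rather than the driver.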

You need a good sysadmin for this debugging, I would venture, not a driver
developer.

Jack


On Mon, Oct 5, 2009 at 7:19 AM, Daniel Bond <db@danielbond.org> wrote:

> Hi,
>
> I've been struggling with watchdog timeouts in 7.1/7.2-RELEASE for the past
> 6 months too. It looks related.
>
> I've tried to replace the hardware 3 times (2 different IBM x3755 chassis,
> one IBM x3650 chassis).
> I first tried the onboard Broadcom NICs (bce-based, PCI-X), until I
> had issues with "watchdog timeout".
>
> I tried replacing it with a 4-port PCI-X Intel NIC, which gave me the same
> problems. I was told that the 4-port Intel NICs have an onboard bus
> controller that could cause trouble, so I replaced it with a 2-port PCI-e
> Intel NIC, which Sepherosa Ziehau told me was the best-performing gig-e
> NIC (rx/tx).
>
> Still getting watchdog timeouts, I tried adjusting all sorts of sysctls I
> found in mailing-list threads (disabling MSI/MSI-X interrupts, adjusting
> rx/tx processing, etc.).
> I tried upgrading the BIOS and the firmware on all kinds of stuff (disks,
> BMC, etc.) to the newest versions. I also tried using a different QLogic
> isp(4) FC controller (PCI-e).
>
> No matter what I tried, I could not diagnose this problem, or at least fix
> it. It also happened rarely enough that it was not easy to debug. I would
> get a series of "watchdog timeout -- resetting" messages until the NIC
> went completely offline, at which point I'd reboot it from the console.
>
> This happened about once every 1-10 days, usually between 11:00 and 13:00.
> This machine has now been replaced with Linux, unfortunately, just to avoid
> more customer complaints and downtime. The IBM x3755 with FreeBSD 7.2, which
> was replaced with Linux, is still online and can be put at the disposal of
> any developers who would like to debug this further.
>
> Like Stefan Krueger mentioned, this machine is also running as an NFS
> server, with a mix of BSD and Linux clients, and it's getting hit pretty
> hard by clients.
>
>
> Hope we can iron this bug out in the future.
>
>
> Best regards,
>
>
> Daniel Bond.
>
>
>
>
> On Oct 2, 2009, at 10:36 PM, Rudy wrote:
>
>
>> Ah, I'll stop messing with them.
>>
>>
>> I just set them all to 0 to see if that will help and noticed the card
>> was leaving tx_int_delay=1.
>>
>> # sysctl dev.em.4.debug=1
>> Oct  2 13:26:07 mango kernel: em4: tx_int_delay = 1, tx_abs_int_delay = 0
>> Oct  2 13:26:07 mango kernel: em4: rx_int_delay = 0, rx_abs_int_delay = 0
>>
>> # sysctl dev.em.4
>> dev.em.4.%desc: Intel(R) PRO/1000 Network Connection 6.9.12
>> dev.em.4.rx_int_delay: 0
>> dev.em.4.tx_int_delay: 0
>> dev.em.4.rx_abs_int_delay: 0
>> dev.em.4.tx_abs_int_delay: 0
>>
>> Splitting traffic to different ports has brought down the watchdog
>> events to once a day.  ... essentially, I have a quad 30Mbps (not quad
>> 1Gbps) card.  heheh.
>> Would turning off net.inet.ip.fastforwarding or any other setting help?
>>
>> Today, I set net.inet.ip.fw.enable=0 and I'll see if that helps.  I have
>> a feeling that isn't related to the NIC at all, but I'm not sure what
>> else to try.
>>
>> Rudy
>>
>>
>>
>> Jack Vogel wrote:
>>
>>> Watchdog resets the adapter. Messing with these values is of dubious
>>> value anyway.
>>>
>>> Jack
>>>
>>>
>>> On Fri, Oct 2, 2009 at 11:36 AM, Rudy <crapsh@monkeybrains.net> wrote:
>>>
>>>
>>>> I noticed something interesting.
>>>>
>>>> I set the rx_int_delay to 0:
>>>> sysctl dev.em.5.rx_int_delay=0
>>>>
>>>> Checking via sysctl dev.em.5.debug=1 shows rx_int_delay is indeed 0:
>>>> Oct  1 17:32:41 mango kernel: em5: rx_int_delay = 0, rx_abs_int_delay = 66
>>>>
>>>> After a watchdog event, sysctl dev.em.5.debug=1 shows rx_int_delay is
>>>> now 32:
>>>> Oct  2 11:29:49 mango kernel: em5: rx_int_delay = 32, rx_abs_int_delay = 66
>>>>
>>>> However, running sysctl dev.em.5 shows it as 0:
>>>> dev.em.5.rx_int_delay: 0
>>>> dev.em.5.tx_int_delay: 66
>>>>
>>>> Seems like the adapter and the kernel don't agree on the rx_int_delay
>>>> value.
>>>>
>>>> Rudy
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>


