Date: Mon, 5 Oct 2009 09:57:49 -0700
From: Jack Vogel <jfvogel@gmail.com>
To: Daniel Bond <db@danielbond.org>
Cc: freebsd-stable@freebsd.org, Rudy <crapsh@monkeybrains.net>
Subject: Re: em0 watchdog timeouts
Message-ID: <2a41acea0910050957x2d085e90w2ebea7f9eb87c3e4@mail.gmail.com>
In-Reply-To: <6194E9BC-3A3D-4941-A777-88C7411905B0@danielbond.org>
References: <4AB9638B.8040607@monkeybrains.net>
 <4AC3DB8F.7010602@monkeybrains.net>
 <2a41acea0909301556g1df7dbafv813f5924553c8bfb@mail.gmail.com>
 <4AC5198E.7030609@monkeybrains.net>
 <4AC51B4C.7080905@monkeybrains.net>
 <2a41acea0910011450v41590f3dn112f367f26faed2d@mail.gmail.com>
 <4AC64835.3060107@monkeybrains.net>
 <2a41acea0910021237w415efa2cs4354a0f99aef8f6@mail.gmail.com>
 <4AC66437.4040704@monkeybrains.net>
 <6194E9BC-3A3D-4941-A777-88C7411905B0@danielbond.org>
This posting just muddies the issue, first you talk about having a problem
that involves Broadcom, ok, so post about that on something other than em :)

Then you make some references to hardware that you "might have bought" but
didn't, I'm not about debugging 'possible worlds problems' though so can't
help you there either :)

Finally you never say what the actual hardware is, other than a person who I
do not know told you it was the best performer... so, what exactly is it?

You have a problem once every 10 days, and at a specific time no less, this
almost always means something in your environment, a cron job run amok, a
piece of hardware that resets, I dunno, but the last thing I would suspect
given this description is the driver.

You need a good sysadmin for this debugging I would venture, not a driver
developer.

Jack

On Mon, Oct 5, 2009 at 7:19 AM, Daniel Bond <db@danielbond.org> wrote:

> Hi,
>
> I've been struggling with watchdog timeouts in 7.1/7.2-RELEASE for the
> past 6 months too. It looks related.
>
> I've tried to replace the hardware 3 times (2 different IBM x3755 chassis,
> one IBM x3650 chassis).
> I tried first with onboard broadcom NICs (bce-based) PCIx-based, until I
> had issues with "watchdog timeout".
>
> I tried replacing it with a 4-port pci-x Intel NIC, which gave me the same
> problems. I was told that the 4-port intel NICs had an onboard
> bus-controller that could cause trouble, so I replaced this with a 2-port
> PCI-e intel, which I was told by a Sepherosa Ziehau was the best
> performing gig-e NIC (rx/tx).
>
> Still getting watchdog timeouts, I tried changing all sorts of sysctls I
> found in mailing-list threads (disable msi/msix interrupts, adjust rx/tx
> processing, etc, etc).
> I tried upgrading BIOS, firmware on all kinds of stuff (disks, BMC, etc,
> etc) to newest version. I also tried using a different qlogic isp(4)
> FC-controller (PCI-e).
>
> No matter what I tried, I could not diagnose this problem, or at least fix
> it. Also it happened rarely enough to not be easy to debug. I would get
> a series of "watchdog timeout -- resetting", until the NIC would go
> completely offline, at which point I'd reboot it from console.
>
> This happened about once every 1-10 days, usually about 11-13:00. This
> machine has now been replaced with Linux, unfortunately, just to avoid
> more customer complaints and downtime. The IBM x3755 with FreeBSD 7.2
> which was replaced with Linux, is still online, and can be put at the
> disposal of any developers who would like to debug this further.
>
> Like Stefan Krueger mentioned, this machine is also running as NFS server,
> with a mix of BSD and Linux clients, and it's getting hit pretty hard by
> clients.
>
> Hope we can iron this bug out in the future.
>
> Best regards,
>
> Daniel Bond.
>
> On Oct 2, 2009, at 10:36 PM, Rudy wrote:
>
>> Ah, I'll stop messing with them.
>>
>> I just set them all to 0 to see if that will help and noticed the card
>> was leaving tx_int_delay=1.
>>
>> # sysctl dev.em.4.debug=1
>> Oct 2 13:26:07 mango kernel: em4: tx_int_delay = 1, tx_abs_int_delay = 0
>> Oct 2 13:26:07 mango kernel: em4: rx_int_delay = 0, rx_abs_int_delay = 0
>>
>> # sysctl dev.em.4
>> dev.em.4.%desc: Intel(R) PRO/1000 Network Connection 6.9.12
>> dev.em.4.rx_int_delay: 0
>> dev.em.4.tx_int_delay: 0
>> dev.em.4.rx_abs_int_delay: 0
>> dev.em.4.tx_abs_int_delay: 0
>>
>> Splitting traffic to different ports has brought down the watchdog
>> events to once a day. ... essentially, I have a quad 30Mbps (not quad
>> 1Gbps) card. heheh.
>> Would turning off net.inet.ip.fastforwarding or any other setting help?
>>
>> Today, I set net.inet.ip.fw.enable=0 and I'll see if that helps. I have
>> a feeling that isn't related to the NIC at all, but I'm not sure what
>> else to try.
>>
>> Rudy
>>
>> Jack Vogel wrote:
>>
>>> Watchdog resets the adapter. Messing with these values is of dubious
>>> value anyway.
>>>
>>> Jack
>>>
>>> On Fri, Oct 2, 2009 at 11:36 AM, Rudy <crapsh@monkeybrains.net> wrote:
>>>
>>>> I noticed something interesting.
>>>>
>>>> I set the rx_int_delay to 0:
>>>> sysctl dev.em.5.rx_int_delay=0
>>>>
>>>> Checking via sysctl dev.em.5.debug=1 shows rx_int_delay is indeed 0:
>>>> Oct 1 17:32:41 mango kernel: em5: rx_int_delay = 0, rx_abs_int_delay = 66
>>>>
>>>> After a watchdog event, sysctl dev.em.5.debug=1 shows rx_int_delay is
>>>> now 32:
>>>> Oct 2 11:29:49 mango kernel: em5: rx_int_delay = 32, rx_abs_int_delay = 66
>>>>
>>>> However, running sysctl dev.em.5 shows it as 0:
>>>> dev.em.5.rx_int_delay: 0
>>>> dev.em.5.tx_int_delay: 66
>>>>
>>>> Seems like the adapter and the kernel don't agree on the rx_int_delay
>>>> value.
>>>>
>>>> Rudy
>>>>
>> _______________________________________________
>> freebsd-stable@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
>> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
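The point made in the reply, that failures clustering at a specific time of day suggest the environment rather than the driver, can be checked against the logs. Below is a minimal sketch in Python. It assumes syslog lines shaped like the kernel messages quoted in this thread, carrying the "watchdog timeout -- resetting" text Daniel reports. The function name watchdog_hours and the sample lines are made up for illustration and do not come from anyone's machine.

```python
import re
from collections import Counter

# Assumed line shape, modeled on the kernel messages quoted in the thread:
#   "Oct 2 11:29:40 mango kernel: em5: watchdog timeout -- resetting"
WATCHDOG_RE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+\S+\s+kernel:\s+"
    r"(?P<nic>[a-z]+\d+): watchdog timeout"
)


def watchdog_hours(lines):
    """Count watchdog timeout messages per (interface, hour of day)."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG_RE.match(line)
        if match:
            counts[(match.group("nic"), int(match.group("hour")))] += 1
    return counts


# Hypothetical sample lines; real input would come from the system log.
sample = [
    "Oct 2 11:29:40 mango kernel: em5: watchdog timeout -- resetting",
    "Oct 2 11:29:49 mango kernel: em5: rx_int_delay = 32, rx_abs_int_delay = 66",
    "Oct 3 11:02:13 mango kernel: em5: watchdog timeout -- resetting",
    "Oct 4 12:45:01 mango kernel: em4: watchdog timeout -- resetting",
]

counts = watchdog_hours(sample)
for (nic, hour), n in sorted(counts.items()):
    print(f"{nic} {hour:02d}:00 {n}")
```

If most of the counts fall into one or two hours, those hours are worth comparing against crontab entries and periodic(8) schedules before suspecting the driver.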