From: Greg Byshenk
To: freebsd-stable@freebsd.org
Date: Wed, 13 May 2009 18:42:07 +0200
Subject: Re: em0 watchdog timeout 7-stable
Message-ID: <20090513164207.GD67116@core.byshenk.net>
In-Reply-To: <20090426125008.GK1550@core.byshenk.net>

As a followup to my own previous message, I continue to have annoying
problems with "em?: watchdog timeout" on one of my machines (now running
7.2-STABLE as of 2009-05-08).

I have discontinued using the on-board (em, copper) NICs, and replaced
the original fibre NIC with a newer model, but the problem persists.
I've also set

   hw.pci.enable_msix=0
   hw.pci.enable_msi=0
   hw.em.rxd=1024
   hw.em.txd=1024
   net.inet.tcp.tso=0

...as suggested in some discussions of this problem, and set the em1
interface to 'polling', all to no avail.

Frequently, though irregularly (once or twice a day), the console begins
to display

   em1: watchdog timeout -- resetting
   em1: watchdog timeout -- resetting
   em1: watchdog timeout -- resetting

the network is down, and the machine locks up.

[Note: I am getting 'em1' now instead of 'em0' as previously, but this
is due to changing all of the NICs, which led to a different numbering;
the timeout is still occurring on the (main) interface, the fibre
gigabit connection.]

What is particularly perverse (IMO) is that, since changing the NIC to
the newer model (and updating the kernel), I can no longer break to the
debugger when the lockup occurs (there is no response to the break) --
but I _can_ shut the machine down cleanly via hardware (a touch of the
power switch sends 'shutdown', and the machine shuts down cleanly --
after killing off processes waiting on network i/o).

The machine is running nfs and samba (3.2.10, from ports), and pretty
much nothing else.

Anyone have any ideas about this...? I'm going mad with this.

-greg byshenk


# pciconf -lvb
[...]
em1@pci0:7:1:0: class=0x020000 card=0x10028086 chip=0x10118086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82545EM Gigabit Ethernet Controller (Fiber)'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 64, base 0xda300000, size 131072, enabled
    bar   [20] = type I/O Port, range 32, base 0x5000, size 64, enabled
[...]
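
For reference, a minimal sketch of where settings like the ones listed
above are usually placed on a 7.x system. The file placement is an
assumption, not a copy of this machine's configuration: to my knowledge
the hw.pci.* and hw.em.* values are read at boot and so belong in
/boot/loader.conf, while net.inet.tcp.tso is a runtime sysctl. Polling
on em1 additionally needs a kernel built with "options DEVICE_POLLING".

```
# /boot/loader.conf (assumed location; boot-time tunables)
hw.pci.enable_msix=0    # do not use MSI-X interrupts
hw.pci.enable_msi=0     # do not use MSI interrupts
hw.em.rxd=1024          # receive descriptors per em(4) interface
hw.em.txd=1024          # transmit descriptors per em(4) interface

# /etc/sysctl.conf (assumed location; runtime sysctl)
net.inet.tcp.tso=0      # disable TCP segmentation offload
```

Changes to loader.conf only take effect after a reboot, so a setting
added there does nothing for the running kernel.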

# vmstat -i
interrupt                          total       rate
irq4: sio0                          1666          0
irq6: fdc0                            10          0
irq14: ata0                           58          0
irq16: skc0 em0                  1437801         98
irq18: twa0                       846981         57
irq24: em1                       4378650        299
cpu0: timer                     29258004       1999
cpu1: timer                     29249758       1999
cpu3: timer                     29249816       1999
cpu7: timer                     29249779       1999
cpu2: timer                     29249729       1999
cpu4: timer                     29249852       1999
cpu6: timer                     29249851       1999
cpu5: timer                     29249814       1999
Total                          240671769      16450


On Sun, Apr 26, 2009 at 02:50:08PM +0200, Greg Byshenk wrote:
> I have one machine that is seeing watchdog timeouts on em0, running 7-STABLE
> amd64 as of 2009.04.19, and also some other more perverse errors.
>
> Twice now in the last 48 hours, this machine has become unreachable via the
> network, and connecting to the console shows an endless string of
>
>    [...]
>    em0: watchdog timeout -- resetting
>    em0: watchdog timeout -- resetting
>    em0: watchdog timeout -- resetting
>
> messages. The machine is almost locked up. That is, I can get a login
> prompt, but can go no further than typing in a username; after the
> username, no password prompt, and nothing further. The only option is
> to hard reset the machine or to drop to debugger and reboot.
>
> Now the "perverse" part. After restarting, the system partition is no
> more.
>
> Background detail: the machine is a fileserver, with a 3Ware 9650SE-16ML
> SATA controller, connected to 16 1TB SATA drives, this configured as
> a 14-drive RAID10 array (+ 2 hot spares), with a 50GB system partition
> and 6.5TB data partition. The system partition is configured as da1,
> with one slice and more or less standard partitions for / /var /tmp, etc.
> (the data partition of the array is sliced with gpt).
>
> The issue here is that, upon restart, all partition information on da0
> seems to have disappeared, and restarting results in a "no operating
> system found" message, and a failure to boot (obviously).
>
> But all of the data is still present.
> If I boot into rescue mode,
> recreate da0s1, mark it bootable, and restore the bsdlabel, then
> everything works again. I can restart the machine, and it comes back
> up normally (it requires an fsck of everything on da0, but after that
> everything is back to normal).
>
> I don't know if this is two unrelated problems, or one problem with
> two symptoms, or something else. I think that I can safely say that
> it is not a problem with the 3Ware controller itself, as I replaced
> the controller with a spare (identical model), and the problem
> recurred. Additionally, I have an almost-identical configuration on
> four other machines, none of which are experiencing any problems.
> One thing that is different is that the other machines use
> Intel PRO/1000 PF (pci-e) NICs.
>
> Is there some known problem with the Intel 2572 fibre NIC? Or some
> potential interaction of it with the 3ware RAID controller?
>
> For the moment, I've set hw.pci.enable_msi=0 (as discussed in the
> threads on 7.2/bge), and am building a new kernel/world from sources
> csup'd one hour ago, but I'd really like to hear any ideas about this
> -- particularly the wiping of the label.
>
> Some information about the system:
>
>
> # /dev/da0s1:
> 8 partitions:
> #          size    offset    fstype   [fsize bsize bps/cpg]
>   a:    2097152         0    4.2BSD        0     0     0
>   b:    8388608   2097152      swap
>   c:  104856192         0    unused        0     0         # "raw" part, don't edit
>   d:    8388608  10485760    4.2BSD        0     0     0
>   e:    2097152  18874368    4.2BSD        0     0     0
>   f:   41943040  20971520    4.2BSD        0     0     0
>   g:   41941632  62914560    4.2BSD        0     0     0
>
>
> em0@pci0:4:1:0: class=0x020000 card=0x10038086 chip=0x10018086 rev=0x02 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '2572 10/100/1000 Ethernet Controller (Fiber)'
>     class      = network
>     subclass   = ethernet
>     bar   [10] = type Memory, range 32, base 0xda000000, size 131072, enabled
>     bar   [14] = type Memory, range 32, base 0xda020000, size 65536, enabled
>
> twa0@pci0:9:0:0: class=0x010400 card=0x100413c1 chip=0x100413c1 rev=0x01 hdr=0x00
>     device     = '9650SE Series PCI-Express SATA2 Raid Controller'
>     class      = mass storage
>     subclass   = RAID
>     bar   [10] = type Prefetchable Memory, range 64, base 0xd8000000, size 33554432, enabled
>     bar   [18] = type Memory, range 64, base 0xda300000, size 4096, enabled
>     bar   [20] = type I/O Port, range 32, base 0x3000, size 256, enabled
>     cap 01[40] = powerspec 2  supports D0 D1 D2 D3  current D0
>     cap 05[50] = MSI supports 32 messages, 64 bit
>     cap 10[70] = PCI-Express 1 legacy endpoint
>

-- 
greg byshenk  -  gbyshenk@byshenk.net  -  Leiden, NL