From owner-freebsd-stable@FreeBSD.ORG  Wed May 13 16:42:11 2009
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 56D3D106566C
	for <freebsd-stable@freebsd.org>; Wed, 13 May 2009 16:42:11 +0000 (UTC)
	(envelope-from byshenknet@byshenk.net)
Received: from core.byshenk.net (core.byshenk.net [62.58.73.230])
	by mx1.freebsd.org (Postfix) with ESMTP id B69478FC18
	for <freebsd-stable@freebsd.org>; Wed, 13 May 2009 16:42:10 +0000 (UTC)
	(envelope-from byshenknet@byshenk.net)
Received: from core.byshenk.net (localhost.aoes.com [127.0.0.1])
	by core.byshenk.net (8.14.3/8.14.3) with ESMTP id n4DGg8B3084494
	for <freebsd-stable@freebsd.org>; Wed, 13 May 2009 18:42:08 +0200 (CEST)
	(envelope-from byshenknet@core.byshenk.net)
Received: (from byshenknet@localhost)
	by core.byshenk.net (8.14.3/8.14.3/Submit) id n4DGg83t084493
	for freebsd-stable@freebsd.org; Wed, 13 May 2009 18:42:08 +0200 (CEST)
	(envelope-from byshenknet)
Date: Wed, 13 May 2009 18:42:07 +0200
From: Greg Byshenk <freebsd@byshenk.net>
To: freebsd-stable@freebsd.org
Message-ID: <20090513164207.GD67116@core.byshenk.net>
References: <20090426125008.GK1550@core.byshenk.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20090426125008.GK1550@core.byshenk.net>
User-Agent: Mutt/1.4.2.3i
X-Spam-Status: No, score=-1.4 required=5.0 tests=ALL_TRUSTED autolearn=failed
	version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on core.byshenk.net
Subject: Re: em0 watchdog timeout 7-stable
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 13 May 2009 16:42:11 -0000

As a followup to my own previous message, I continue to have annoying 
problems with "em?: watchdog timeout" on one of my machines (now running
7.2-STABLE as of 2009-05-08).

I have discontinued using the on-board (em, copper) NICs, and replaced
the original fibre NIC with a newer model, but the problem persists.
I've also set

   hw.pci.enable_msix=0
   hw.pci.enable_msi=0
   hw.em.rxd=1024
   hw.em.txd=1024
   net.inet.tcp.tso=0

...as suggested in some discussions of this problem, and set the em1
interface to 'polling', all to no avail.  Frequently, though irregularly
(once or twice a day), the console begins to display

   em1: watchdog timeout -- resetting
   em1: watchdog timeout -- resetting
   em1: watchdog timeout -- resetting

the network is down, and the machine locks up.

[Note: I am getting 'em1' now instead of 'em0' as previously, but this
is due to changing all of the NICs, which led to a different numbering;
the timeout is still occurring on the (main) interface, the fibre 
gigabit connection.]

What is particularly perverse (IMO) is that, since changing the NIC to
the newer model (and updating the kernel), I can no longer break to the
debugger when the lockup occurs (there is no response to the break) --
but I _can_ shut the machine down cleanly via hardware (a touch of the
power switch sends 'shutdown', and the machine shuts down cleanly --
after killing off processes waiting on network i/o).

The machine is running nfs and samba (3.2.10, from ports), and pretty
much nothing else.


Anyone have any ideas about this...?  I'm going mad with this.

-greg byshenk


# pciconf -lvb
[...]
em1@pci0:7:1:0: class=0x020000 card=0x10028086 chip=0x10118086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82545EM Gigabit Ethernet Controller (Fiber)'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 64, base 0xda300000, size 131072, enabled
    bar   [20] = type I/O Port, range 32, base 0x5000, size 64, enabled
[...]

# vmstat -i
interrupt                          total       rate
irq4: sio0                          1666          0
irq6: fdc0                            10          0
irq14: ata0                           58          0
irq16: skc0 em0                  1437801         98
irq18: twa0                       846981         57
irq24: em1                       4378650        299
cpu0: timer                     29258004       1999
cpu1: timer                     29249758       1999
cpu3: timer                     29249816       1999
cpu7: timer                     29249779       1999
cpu2: timer                     29249729       1999
cpu4: timer                     29249852       1999
cpu6: timer                     29249851       1999
cpu5: timer                     29249814       1999
Total                          240671769      16450
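
One way to see whether the em1 interrupt counter is still advancing
before a lockup is to compare two snapshots of 'vmstat -i'.  The
following is only a sketch, assuming the column layout shown above; the
function names are made up for illustration, and with polling enabled
the counter may not reflect actual traffic.

```python
import re

# Rows copied from the `vmstat -i` output above.
SAMPLE = """\
interrupt                          total       rate
irq4: sio0                          1666          0
irq6: fdc0                            10          0
irq14: ata0                           58          0
irq16: skc0 em0                  1437801         98
irq18: twa0                       846981         57
irq24: em1                       4378650        299
cpu0: timer                     29258004       1999
cpu1: timer                     29249758       1999
cpu3: timer                     29249816       1999
cpu7: timer                     29249779       1999
cpu2: timer                     29249729       1999
cpu4: timer                     29249852       1999
cpu6: timer                     29249851       1999
cpu5: timer                     29249814       1999
Total                          240671769      16450
"""

# "<source>: <device names>   <total>   <rate>"; header and Total lines have no colon.
ROW_RE = re.compile(r"^(\S+): (.+?)\s+(\d+)\s+(\d+)$")


def parse_vmstat_i(text):
    """Map each interrupt source (e.g. 'irq24') to (device names, total, rate)."""
    rows = {}
    for line in text.splitlines():
        m = ROW_RE.match(line.strip())
        if m:
            source, names, total, rate = m.groups()
            rows[source] = (names.split(), int(total), int(rate))
    return rows


def total_for(rows, device):
    """Interrupt count of the line that lists `device`, or None if absent."""
    for names, total, _rate in rows.values():
        if device in names:
            return total
    return None


def advanced(before, after, device):
    """True if the device's counter grew between two snapshots."""
    old, new = total_for(before, device), total_for(after, device)
    return old is not None and new is not None and new > old


rows = parse_vmstat_i(SAMPLE)
print(total_for(rows, "em1"))                                # 4378650
print(sum(total for _names, total, _rate in rows.values()))  # 240671769, the reported Total
```

A counter that stops advancing only shows that interrupts are no longer
being delivered or counted; it does not by itself identify the cause.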


On Sun, Apr 26, 2009 at 02:50:08PM +0200, Greg Byshenk wrote:
> I have one machine that is seeing watchdog timeouts on em0, running 7-STABLE
> amd64 as of 2009.04.19, and also some other more perverse errors.
> 
> Twice now in the last 48 hours, this machine has become unreachable via the
> network, and connecting to the console shows an endless string of 
> 
>    [...]
>    em0: watchdog timeout -- resetting
>    em0: watchdog timeout -- resetting
>    em0: watchdog timeout -- resetting
> 
> messages. The machine is almost locked up.  That is, I can get a login
> prompt, but can go no further than typing in a username; after the
> username, no password prompt, and nothing further.  The only option is
> to hard reset the machine or to drop to debugger and reboot.
> 
> Now the "perverse" part.  After restarting, the system partition is no
> more.
> 
> Background detail:  the machine is a fileserver, with a 3Ware 9650SE-16ML
> SATA controller, connected to 16 1TB SATA drives, this configured as
> a 14-drive RAID10 array (+ 2 hot spares), with a 50GB system partition
> and 6.5TB data partition.  The system partition is configured as da0,
> with one slice and more or less standard partitions for / /var /tmp, etc.
> (the data partition of the array is sliced with gpt).
> 
> The issue here is that, upon restart, all partition information on da0
> seems to have disappeared, and restarting results in a "no operating
> system found" message, and a failure to boot (obviously).
> 
> But all of the data is still present.  If I boot into rescue mode,
> recreate da0s1, mark it bootable, and restore the bsdlabel, then
> everything works again.  I can restart the machine, and it comes back
> up normally (it requires an fsck of everything on da0, but after that
> everything is back to normal).
> 
> I don't know if this is two unrelated problems, or one problem with
> two symptoms, or something else.  I think that I can safely say that
> it is not a problem with the 3Ware controller itself, as I replaced
> the controller with a spare (identical model), and the problem
> recurred.  Additionally, I have an almost-identical configuration on
> four other machines, none of which are experiencing any problems.
> One thing that is different is that the other machines use
> Intel PRO/1000 PF (pci-e) NICs.
> 
> Is there some known problem with the Intel 2572 fibre NIC?  Or some
> potential interaction of it with the 3ware RAID controller?
> 
> For the moment, I've set hw.pci.enable_msi=0 (as discussed in the
> threads on 7.2/bge), and am building a new kernel/world from sources
> csup'd one hour ago, but I'd really like to hear any ideas about this
> -- particularly the wiping of the label.
> 
> Some information about the system:
> 
> 
> # /dev/da0s1:
> 8 partitions:
> #        size   offset    fstype   [fsize bsize bps/cpg]
>   a:  2097152        0    4.2BSD        0     0     0 
>   b:  8388608  2097152      swap                    
>   c: 104856192        0    unused        0     0         # "raw" part, don't edit
>   d:  8388608 10485760    4.2BSD        0     0     0 
>   e:  2097152 18874368    4.2BSD        0     0     0 
>   f: 41943040 20971520    4.2BSD        0     0     0 
>   g: 41941632 62914560    4.2BSD        0     0     0 
> 
> 
> em0@pci0:4:1:0: class=0x020000 card=0x10038086 chip=0x10018086 rev=0x02 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '2572 10/100/1000 Ethernet Controller (Fiber)'
>     class      = network
>     subclass   = ethernet
>     bar   [10] = type Memory, range 32, base 0xda000000, size 131072, enabled
>     bar   [14] = type Memory, range 32, base 0xda020000, size 65536, enabled
>  
> twa0@pci0:9:0:0:        class=0x010400 card=0x100413c1 chip=0x100413c1 rev=0x01 hdr=0x00
>     device     = '9650SE Series PCI-Express SATA2 Raid Controller'
>     class      = mass storage
>     subclass   = RAID
>     bar   [10] = type Prefetchable Memory, range 64, base 0xd8000000, size 33554432, enabled
>     bar   [18] = type Memory, range 64, base 0xda300000, size 4096, enabled
>     bar   [20] = type I/O Port, range 32, base 0x3000, size 256, enabled
>     cap 01[40] = powerspec 2  supports D0 D1 D2 D3  current D0
>     cap 05[50] = MSI supports 32 messages, 64 bit
>     cap 10[70] = PCI-Express 1 legacy endpoint
> 
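
When a bsdlabel has to be restored by hand, as described in the quoted
message, it can help to confirm that the partitions are contiguous and
fit inside the 'c' partition.  The following is a minimal sketch using
the label quoted above; it assumes 512-byte sectors and the column
layout shown, and the function names are hypothetical.

```python
SECTOR_BYTES = 512  # assumption: sizes and offsets are in 512-byte sectors

# Partition rows copied from the bsdlabel output quoted above.
LABEL = """\
  a:  2097152        0    4.2BSD        0     0     0
  b:  8388608  2097152      swap
  c: 104856192        0    unused        0     0
  d:  8388608 10485760    4.2BSD        0     0     0
  e:  2097152 18874368    4.2BSD        0     0     0
  f: 41943040 20971520    4.2BSD        0     0     0
  g: 41941632 62914560    4.2BSD        0     0     0
"""


def parse_label(text):
    """Map partition letter to (size, offset, fstype), all from the first four columns."""
    parts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and len(fields[0]) == 2 and fields[0].endswith(":"):
            parts[fields[0][0]] = (int(fields[1]), int(fields[2]), fields[3])
    return parts


def check_layout(parts):
    """Return (problems, end): gaps/overlaps found and the last sector used + 1."""
    whole = parts["c"][0]  # 'c' conventionally covers the whole slice
    others = sorted(
        (item for item in parts.items() if item[0] != "c"),
        key=lambda item: item[1][1],  # order by offset
    )
    problems = []
    cursor = 0
    for name, (size, offset, _fstype) in others:
        if offset > cursor:
            problems.append(f"gap of {offset - cursor} sectors before {name}")
        elif offset < cursor:
            problems.append(f"{name} overlaps the previous partition")
        cursor = max(cursor, offset + size)
    if cursor > whole:
        problems.append("partitions extend past the end of c")
    return problems, cursor


parts = parse_label(LABEL)
problems, end = check_layout(parts)
print(problems)                          # [] for the label above
print(end == parts["c"][0])              # True: a..g exactly fill c
print(end * SECTOR_BYTES / 2**30)        # slice size in GiB, just under 50
```

This checks only the arithmetic of the label; it says nothing about
whether the filesystems inside the partitions are intact.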

-- 
greg byshenk  -  gbyshenk@byshenk.net  -  Leiden, NL