From: "John Marshall"
Date: Wed, 4 Oct 2006 14:34:16 +1000
Subject: Watchdog Timeout - bge devices
List-Id: Production branch of FreeBSD source code

$ dmesg | grep bge
bge0: mem 0xe8200000-0xe820ffff irq 17 at device 4.0 on pci4
miibus1: on bge0
bge0: Ethernet address: 00:0b:cd:e7:51:ba
bge0: watchdog timeout -- resetting
bge0: link state changed to DOWN
bge0: link state changed to UP

I initially pronounced the network cable dead and replaced it. Then I
suspected the FastEthernet switch port and relocated to a different
port. Watchdog timeouts persisted.

I concluded that the bge hardware must be flaky until I read a recent
thread on em device watchdog timeouts which led me to wonder about CPU
scheduling. The server experiencing the bge timeouts was using
SCHED_ULE. I built 6.2-PRERELEASE on a spare disk and booted the
problem server from that disk - bge problem persisted.

We have a second (identical) problem-free server configured with
SCHED_4BSD. I reconfigured both machines so that the first machine
(now 6.2-PRERELEASE) uses SCHED_4BSD and the second machine
(6.1-RELEASE) uses SCHED_ULE. Both machines are configured with
PREEMPTION.

+-----------------------------------------------+
| THE PROBLEM FOLLOWS SCHED_ULE ACROSS MACHINES |
+-----------------------------------------------+

The machines are HP ProLiant ML110 servers. There is nothing sharing
the interrupt with the bge device. No USB drivers are loaded.

$ vmstat -i
interrupt                          total       rate
irq1: atkbd0                          70          0
irq6: fdc0                             9          0
irq14: ata0                      1234430          6
irq15: ata1                           47          0
irq17: bge0                     17543591         93
irq26: fxp0                        70832          0
cpu0: timer                    376381765       1999
Total                          395230744       2099

$ sysctl kern.version kern.sched kern.smp hw.machine hw.model dev.bge
kern.version: FreeBSD 6.1-RELEASE-p10 #1: Mon Oct 2 08:36:56 AEST 2006
kern.sched.name: ule
kern.sched.slice_min: 10
kern.sched.slice_max: 142
kern.sched.preemption: 1
kern.smp.maxcpus: 1
kern.smp.active: 0
kern.smp.disabled: 0
kern.smp.cpus: 1
hw.machine: i386
hw.model: Intel(R) Pentium(R) 4 CPU 2.80GHz
dev.bge.0.%desc: Broadcom BCM5705K Gigabit Ethernet, ASIC rev. 0x3003
dev.bge.0.%driver: bge
dev.bge.0.%location: slot=4 function=0
dev.bge.0.%pnpinfo: vendor=0x14e4 device=0x1654 subvendor=0x103c subdevice=0x1654 class=0x020000
dev.bge.0.%parent: pci4

Is there any other information I ought to post to help with diagnosis -
or is this a known problem?
(I've only subscribed recently)

John Marshall.
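
The statement that nothing shares the interrupt with the bge device can be checked mechanically against `vmstat -i` output. What follows is only a minimal Python sketch, not part of the original report. It assumes the `source: device(s)  total  rate` line layout quoted in this message, and it assumes that a shared interrupt line lists more than one device name after the source (for example `irq17: bge0 uhci0`), which should be verified on the release in use. The names `parse_vmstat_i` and `shared_irqs` are hypothetical helpers invented for illustration.

```python
import re

# vmstat -i output as quoted in this message.
SAMPLE = """\
interrupt                          total       rate
irq1: atkbd0                          70          0
irq6: fdc0                             9          0
irq14: ata0                      1234430          6
irq15: ata1                           47          0
irq17: bge0                     17543591         93
irq26: fxp0                        70832          0
cpu0: timer                    376381765       1999
Total                          395230744       2099
"""

# Assumed layout: "<source>: <device> [<device> ...] <total> <rate>".
# The header line and the "Total" line have no "<source>:" prefix and are skipped.
LINE = re.compile(
    r"^(?P<source>\S+):\s+(?P<devices>.+?)\s+(?P<total>\d+)\s+(?P<rate>\d+)$"
)


def parse_vmstat_i(text):
    """Return a list of (source, [devices], total, rate) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            rows.append(
                (m["source"], m["devices"].split(), int(m["total"]), int(m["rate"]))
            )
    return rows


def shared_irqs(rows):
    """Map each irq source to its devices when more than one device is listed."""
    return {
        source: devices
        for source, devices, _total, _rate in rows
        if source.startswith("irq") and len(devices) > 1
    }


if __name__ == "__main__":
    rows = parse_vmstat_i(SAMPLE)
    print("shared interrupt lines:", shared_irqs(rows))
    # Cross-check: the per-source totals should add up to the reported Total.
    print("sum of totals:", sum(total for _s, _d, total, _r in rows))
```

On the sample quoted here, `shared_irqs` returns an empty dictionary, which agrees with the statement that irq17 is used by bge0 alone, and the per-source totals add up to the reported Total of 395230744.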