From: Stijn Hoop
Date: Thu, 6 Oct 2005 09:53:12 +0200
To: freebsd-current@freebsd.org
Subject: interrupt throttling stepping in too soon?
Message-ID: <20051006075312.GM86136@pcwin002.win.tue.nl>

Hi,

On 6.0-BETA5 I was initializing a gvinum RAID-5 array of 565GB. This
took about 28 hours, and completed without errors. Due to a rocky start
while creating this volume, and the fact that I've been bitten by parity
errors in the past, I decided to do a complete 'checkparity' right after
the array came up. This was yesterday afternoon.

This morning I came in to find the system unresponsive. On the console
there were multiple DMA_TIMEOUT messages for the disks of the array, and
just above those was a line about an 'interrupt storm for atapci0,
throttling'. Sure enough, this is the controller that all the disks are
on.

Doubly unfortunate, however, was that upon attempting to reboot
(CTRL+ALT+DEL still worked), the system dropped into the debugger after
having 'synced all disks'. Being naive, I thought 'oh well, I'll debug
after reboot' and rebooted. Apparently this was a stupid thing to do,
because my whole / had gone missing. While this has no real bearing on
the problem at hand, it had an unfortunate side effect: because I had to
reinstall, I have no actual log messages anymore :(

What I do have are the dmesg lines for the setup, see below.

Is it possible that the interrupt storm detection and subsequent
throttling is triggering too early, leading to lost interrupts, and in
this case DMA_TIMEOUTs?

As far as I can tell the system worked beautifully while initializing,
and for about 8 hours of checking parity. It is now rebuilding again
(*twiddle thumbs*) and it still seems to work perfectly.
I therefore do not primarily suspect the hardware (although it is of
course possible). It could be that the daily periodic job triggered more
I/O, so much so that the system overloaded (although the relevant
controller has no mounted drives of course).

Relevant dmesg:

atapci0: port 0xd400-0xd407,0xd000-0xd003,0xb800-0xb807,0xb400-0xb403,0xb000-0xb00f mem 0xed800000-0xed803fff irq 11 at device 13.0 on pci0
ata2: on atapci0
ata3: on atapci0
atapci1: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xa400-0xa40f at device 17.1 on pci0
ata0: on atapci1
ata1: on atapci1
ad0: 38166MB at ata0-master UDMA100
ad1: 117246MB at ata0-slave UDMA133
ad2: 117246MB at ata1-master UDMA133
ad3: 117246MB at ata1-slave UDMA133
ad4: 194481MB at ata2-master UDMA133
ad5: 194481MB at ata2-slave UDMA133
ad6: 194481MB at ata3-master UDMA133
ad7: 239372MB at ata3-slave UDMA133

The array is on disks ad4, ad5, ad6 and ad7 (yes, I know that having
only 1 disk per channel is more efficient. Speed is not really an issue
here).

gvinum setup (now reinitializing, as you can see):

V data                  State: down     Plexes:       1 Size:        565 GB

P data.p0            R5 State: down     Subdisks:     4 Size:        565 GB

S data.p0.s0            State: I 2%     D: pluto        Size:        188 GB
S data.p0.s1            State: I 2%     D: donald       Size:        188 GB
S data.p0.s2            State: I 2%     D: goofy        Size:        188 GB
S data.p0.s3            State: I 2%     D: mickey       Size:        188 GB

Thanks for any answers.

--Stijn

-- 
A "No" uttered from deepest conviction is better and greater than a
"Yes" merely uttered to please, or what is worse, to avoid trouble.
		-- Mahatma Gandhi
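
For reference, here is a minimal sketch of what a RAID-5 parity check verifies conceptually, and why four 188 GB subdisks give a volume of roughly 565 GB. It is not gvinum's implementation: the function names are hypothetical, the blocks are toy two-byte values, and the small gap between 564 and the reported 565 GB is assumed to come from rounding in the displayed subdisk sizes.

```python
from functools import reduce


def usable_capacity_gb(subdisk_gb: int, subdisks: int) -> int:
    """RAID-5 spends one subdisk's worth of space on parity,
    so only (subdisks - 1) subdisks hold data."""
    if subdisks < 3:
        raise ValueError("RAID-5 needs at least three subdisks")
    return subdisk_gb * (subdisks - 1)


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_parity_ok(blocks) -> bool:
    """A stripe is consistent when its data blocks and parity block
    XOR to all zero bytes."""
    return not any(xor_blocks(blocks))


# Toy stripe: three data blocks plus the parity block computed from them.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

print(usable_capacity_gb(188, 4))          # 564
print(stripe_parity_ok(data + [parity]))   # True
```

A check of this kind has to read every stripe on all four disks, which is consistent with the parity check keeping the controller busy for hours.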