Message-ID: <8eea04080501191506237fc762@mail.gmail.com>
Date: Wed, 19 Jan 2005 15:06:04 -0800
From: Jon Simola
To: freebsd-stable@freebsd.org
In-Reply-To: <8eea040805011913334b140af6@mail.gmail.com>
References: <20050119151301.A22310@Denninger.Net> <8eea040805011913334b140af6@mail.gmail.com>
Subject: Re: Bad disk or kernel (ATA Driver) problem?
Reply-To: jon@abccomm.com

On Wed, 19 Jan 2005 13:33:12 -0800, Jon Simola wrote:
> I've got a few 1U Supermicro boxes running dual SATA drives:
> I've run into all sorts of problems with every one, and changing the
> IDE channel settings in the BIOS always fixes it. Which really annoys
> me, because I setup a new box, run it for a couple weeks, then the
> drives start getting flaky under load. Then I go change the setting in
> the BIOS (that I always forget to do on initial setup) and it's dead
> stable for months at a time.

I was politely asked to actually dig up the settings, which cut through
my lack of sleep. I should have done this earlier :)

On this one box (Supermicro SuperServer 5013C-T, P4SCE BIOS v1.2c):
5.2.1-RELEASE-p4
atapci0: port 0xf000-0xf00f,0-0x3,0-0x7,0-0x3,0-0x7 irq 16 at device 31.2 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata0: [MPSAFE]
ata1: at 0x170 irq 15 on atapci0
ata1: [MPSAFE]
GEOM: create disk ad0 dp=0xc671a560
ad0: 70911MB [144073/16/63] at ata0-master UDMA100
GEOM: create disk ad1 dp=0xc671a460
ad1: 70911MB [144073/16/63] at ata0-slave UDMA100
acd0: CDROM at ata1-master PIO4

That's a pair of SATA 74GB WD Raptors. The BIOS IDE setting is for
"Combined" - SATA drives will appear on the Primary IDE channel.
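
To make the effect of the BIOS setting easier to see, here is a rough sketch that pulls the channel, master/slave role, and transfer mode out of disk attach lines like the ones above. The regex and the names (ATTACH_RE, parse_attach_lines) are my own assumptions for illustration, not part of any FreeBSD tool, and the pattern is fitted only to the line shapes shown in this message.

```python
import re

# Assumed pattern for ata(4) disk attach lines such as:
#   ad0: 70911MB [144073/16/63] at ata0-master UDMA100
# Anything between the size and " at " (geometry, model string) is skipped.
ATTACH_RE = re.compile(
    r"^(?P<dev>ad\d+): (?P<size_mb>\d+)MB .*? at "
    r"(?P<channel>ata\d+)-(?P<role>master|slave) (?P<mode>\S+)$"
)


def parse_attach_lines(dmesg_text):
    """Return one dict per disk attach line found in dmesg_text."""
    disks = []
    for line in dmesg_text.splitlines():
        match = ATTACH_RE.match(line.strip())
        if match:
            disk = match.groupdict()
            disk["size_mb"] = int(disk["size_mb"])
            disks.append(disk)
    return disks


# Attach lines copied from the "Combined" box above.
SAMPLE = """\
ad0: 70911MB [144073/16/63] at ata0-master UDMA100
ad1: 70911MB [144073/16/63] at ata0-slave UDMA100
acd0: CDROM at ata1-master PIO4
"""

if __name__ == "__main__":
    for disk in parse_attach_lines(SAMPLE):
        print(disk["dev"], disk["channel"], disk["role"], disk["mode"])
```

In "Combined" mode both SATA disks show up as master and slave on ata0 at UDMA100, which is what the parsed output makes explicit. The CDROM line is deliberately ignored, since only ad devices are matched.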

On a different box (Supermicro SuperServer 5013C-T, P4SCE BIOS v1.2c):
5.3-STABLE-20050107
atapci0: port 0xf000-0xf00f,0x376,0x170-0x177,0x3f6,0x1f0-0x1f7 at device 31.1 on pci0
ata0: channel #0 on atapci0
ata1: channel #1 on atapci0
atapci1: port 0xd000-0xd00f,0xcc00-0xcc03,0xc800-0xc807,0xc400-0xc403,0xc000-0xc007 irq 18 at device 31.2 on pci0
ata2: channel #0 on atapci1
ata3: channel #1 on atapci1
acd0: CDROM at ata1-master UDMA33
ad4: 78167MB [158816/16/63] at ata2-master SATA150
ad6: 78167MB [158816/16/63] at ata3-master SATA150

That's a pair of Maxtor 80GBs. The BIOS is set for "Enhanced", which
allows up to 6 drives (4 IDE + 2 SATA).

Crazy as it seems, I wasn't kidding about changing the BIOS. The other
2 settings are "SATA only" and "Auto". When the drives started flaking
out (timeouts on reads) I would go into the BIOS and cycle through the
IDE settings. After changing it once or twice, things would be fine for
months at a time.

My best suspicion is that "something" makes the ICH5 a little flaky, and
twiddling the BIOS clears it somehow. My only evidence supporting that
is that twice the BIOS stalled on probing the drives once this error had
happened, and I had to physically remove the drives, twiddle the BIOS
settings, and replace the drives before it would work again.

On OpenBSD, this problem on the same hardware manifests as a read
timeout failure during the initial boot probes. Same fix: play with the
BIOS and it suddenly works.

There's a term in the Jargon file for this, but I can't recall it at the
moment.