From: Matt Dawson
To: freebsd-stable@freebsd.org
Cc: Miroslav Lachman <000.fbsd@quip.cz>
Date: Mon, 21 Aug 2006 14:14:16 +0100
Subject: Re: ATA problems again ... general problem of ICH7 or ATA?
Message-Id: <200608211414.16731.matt@chronos.org.uk>
In-Reply-To: <20060821120052.0B25816A526@hub.freebsd.org>
References: <20060821120052.0B25816A526@hub.freebsd.org>

On Monday 21 August 2006 13:00, freebsd-stable-request@freebsd.org wrote:
> > I can confirm the same behaviour with a ULi M1689/Newcastle Athlon64
> > based system running 6.1-RELEASE-p3 (i386). ad6 just detaches without
> > warning and it takes a reboot to bring it back. atacontrol reinit has no
> > effect. Tried the following to resolve the problems:
>
> I don't know what is supposed to be the canonical way to
> reattach a disconnected SATA drive, but while testing our
> new hardware and hot-pulling a drive while the system
> was running, atacontrol reinit didn't find the reinserted drive
> here, either.
>
> atacontrol detach ata3; atacontrol attach ata3 did.

Yes, that is the method for a controlled remove and reattach, a la hotplug
SATA. AIUI, though, if the drive goes AWOL on its own you need to reinit the
channel before issuing an atacontrol attach foo. In theory... (man 8
atacontrol)

In practice, the drive disappears, never to be probed again. A warm reboot
without power down makes it appear again, so the drive itself isn't confused.

FWIW, the problem takes *far* longer to rear its head when the SATA
controller has a PCI INT and IRQ to itself. Put a NIC into a slot that shares
its INT with the controller and it happens at least every couple of days.
(This is a very Bad Thing [TM]: the BIOS simply maps the INT to a single IRQ
and both devices end up sharing it. Now transfer a large file over the
network and watch the ensuing hilarity.) Now, with the slot shared with the
SATA controller empty, I have six days uptime since the last event, which
means I'm probably due one any time now.
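
For anyone wanting to check the IRQ-sharing theory on their own box, here is
a minimal sketch, not a definitive diagnostic. It assumes the PCI probe lines
in the boot messages look like "name0: <description> ... irq N at device X.Y
on pciZ"; the function name shared_irqs is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every IRQ that more than one device claims.
# Reads boot-message lines on stdin. Assumes probe lines of the form
#   name0: <description> ... irq N at device X.Y on pciZ
shared_irqs() {
    awk '
        / irq [0-9]+ at device / {
            dev = $1
            sub(/:$/, "", dev)              # "atapci1:" -> "atapci1"
            for (i = 2; i < NF; i++)
                if ($i == "irq") { irq = $(i + 1); break }
            key = irq SUBSEP dev
            if (!(key in seen)) {           # count each device once per IRQ
                seen[key] = 1
                n[irq]++
                devs[irq] = (n[irq] == 1) ? dev : (devs[irq] " " dev)
            }
        }
        END {
            for (irq in n)
                if (n[irq] > 1)
                    print "irq " irq ": " devs[irq]
        }
    '
}
```

Something like "shared_irqs < /var/run/dmesg.boot" would feed it the saved
boot messages. If the SATA controller (an atapci device) turns up on the same
line as a NIC, that is the sharing situation described above. No output means
no sharing was found in the lines it recognised.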

At least gmirror rebuilds the array after a simple reboot, but I would expect
the dd operation to throw a wobbly if it's a timing issue/fight for interrupt
between the two drives/channels. It doesn't, which makes me wonder if I'm
barking up the wrong tree, but I can't help noticing that SATA channels have
one interrupt between them whereas PATA channels have one each, and all of
these reports are from SATA users...

I wonder what pciconf -lv shows on Miroslav's system? Is the SATA controller
sharing an INT/IRQ with something else? Does moving that device to another
slot alleviate the problem at all?

Please note that Miroslav and I are using totally different drives, chipsets
and processors. He's using, IIRC, an Intel chip with an ICH7 southbridge and
Samsung drives. I'm using an AMD Athlon 64 Newcastle (running the i386 port)
on a ULi M1689 chipset with WD RE2 drives. So, although I'd be more than happy
to be the numpty that is wrong and to have ata(4) vindicated by someone else,
I suspect it is ata(4) that is the problem. However, finger pointing isn't
productive and is certainly not fair given that ata(4) has been progressing
so well.

Anything else I can try to nail this irksome beast? Any suggestions for where
I've been an idiot (easy, tiger!) and missed something obvious?

BTW, this is a production server (DLT backed up nightly, so the data is safe)
so I can't just pull it to bits. I do have an identical (CPU/mobo) box in the
workshop as a workstation, however, which I could buy/borrow another drive
for and set up gmirror to try things out.

-- 
Matt Dawson.
matt@chronos.org.uk
MTD15-RIPE OpenNIC M_D9 MD51-6BONE