Message-ID: <44EA09EC.5000605@quip.cz>
Date: Mon, 21 Aug 2006 21:30:52 +0200
From: Miroslav Lachman <000.fbsd@quip.cz>
To: Matt Dawson
Cc: freebsd-stable@freebsd.org
Subject: Re: ATA problems again ... general problem of ICH7 or ATA?
In-Reply-To: <200608211414.16731.matt@chronos.org.uk>

Matt Dawson wrote:
> On Monday 21 August 2006 13:00, freebsd-stable-request@freebsd.org wrote:
>
>>>I can confirm the same behaviour with a ULi M1689/Newcastle Athlon64
>>>based system running 6.1-RELEASE-p3 (i386). ad6 just detaches without
>>>warning and it takes a reboot to bring it back. atacontrol reinit has no
>>>effect.
>>>Tried the following to resolve the problems:
>>
>>I don't know what is supposed to be the canonical way to
>>reattach a disconnected SATA drive, but while testing our
>>new hardware and hot-pulling a drive while the system
>>was running, atacontrol reinit didn't find the reinserted drive
>>here, either.
>>
>>atacontrol detach ata3; atacontrol attach ata3 did.
>
>
> Yes, that is the method for a controlled remove and reattach, a la hotplug
> SATA. AIUI, though, if the drive goes AWOL on its own you need to reinit the
> channel before issuing an atacontrol attach foo. In theory... (man 8
> atacontrol) In practice, the drive disappears, never to be probed again. A
> warm reboot without power down makes it appear again, so the drive itself
> isn't confused.

This is the same in my case.

> FWIW, the problem takes *far* longer to rear its head when the SATA controller
> has a PCI INT and IRQ to itself. Put a NIC onto a shared slot (a very Bad
> Thing [TM] as the BIOS simply maps the INT to a single IRQ and both devices
> end up sharing it. Now transfer a large file over the network and watch the
> ensuing hilarity) and it happens at least every couple of days. Now, with the
> slot shared with the SATA controller empty, I have six days uptime since the
> last event, which means I'm probably due one any time now.

I thought so too, but it did not solve my problem. I had UHCI sharing the
same IRQ with SATA (both on irq 19). Instead of playing with
device.hints(5), I disabled all unused peripherals in the BIOS (USB ports,
LPT port, FDD...). After a few days, the system reported another lost disk.

> At least gmirror rebuilds the array after a simple reboot, but I would expect
> the dd operation to throw a wobbly if it's a timing issue/fight for interrupt
> between the two drives/channels.
> It doesn't, which makes me wonder if I'm
> barking up the wrong tree, but I can't help noticing that SATA channels have
> one interrupt between them whereas PATA channels have one each and all of
> these reports are from SATA users...

Maybe you are right; I haven't seen any report from a single-disk machine.
All the problems come from machines with 2 or more SATA disks.

> I wonder what pciconf -lv shows on Miroslav's system? Is the SATA controller
> sharing an INT/IRQ with something else? Does moving that device to another
> slot alleviate the problem at all?

SATA is no longer sharing IRQs, but the problem persists.

system dmesg after verbose boot
http://www.quip.cz/1/freebsd/asus_rs120-e3/track_dmesg_verbose_2006-08-21.txt

pciconf -lv
http://www.quip.cz/1/freebsd/asus_rs120-e3/track_pciconf_2006-08-21.txt

The problem appeared only under heavy disk load (e.g. copying the ports
tree). I have a third system with minimal disk load that has been running
for 10 days without a problem (FreeBSD 6.0, in production for those 10
days; the machine is a "quick replacement" for a failed server, with the
system mirrored from the old disks to the new ones by dump & restore).

> Please note that Miroslav and I are using totally different drives, chipsets
> and processors. He's using, IIRC, an Intel chip with an ICH7 southbridge and
> Samsung drives. I'm using an AMD Athlon 64 Newcastle (running the i386 port)
> on a ULi M1689 chipset with WD RE2 drives so, although I'd be more than happy
> to be the numpty that is wrong and to have ata(4) vindicated by someone else,
> I suspect it is ata(4) that is the problem. However, finger pointing isn't
> productive and is certainly not fair given that ata(4) has been progressing
> so well. Anything else I can try to nail this irksome beast? Any suggestions
> for where I've been an idiot (easy, tiger!) and missed something obvious?
>
> BTW, this is a production server (DLT backed up nightly, so the data is safe)
> so I can't just pull it to bits.
> I do have an identical (CPU/mobo) box in the
> workshop as a workstation, however, which I could buy/borrow another drive
> for and set up gmirror to try things out.
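
For anyone who wants to check the IRQ-sharing theory discussed above on
their own box, here is a minimal sh/awk sketch. It assumes that vmstat -i
prints one line per interrupt in the form "irqN: dev1 dev2 ... total rate",
with several device names on the same line when the interrupt is shared.
That output format is an assumption and is not shown anywhere in this
thread, and the function name shared_irqs is made up for illustration.

```sh
#!/bin/sh
# shared_irqs (hypothetical helper): read "vmstat -i"-style text on stdin
# and print every interrupt line that lists more than one device.
# Assumed input format (not taken from this thread):
#   irq19: uhci0 atapci1      5000     10
shared_irqs() {
    awk '
        /^irq[0-9]+:/ {
            n = 0
            # Fields after the "irqN:" label are treated as device names
            # until the first purely numeric field (the interrupt total).
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^[0-9]+$/) break
                n++
            }
            # More than one device name means the IRQ is shared.
            if (n > 1) {
                line = $1
                for (j = 2; j < i; j++) line = line " " $j
                print line
            }
        }
    '
}
```

On a live system this would be run as "vmstat -i | shared_irqs". An empty
result would only say that no interrupt line lists two devices; as noted
above, the disk loss was still seen after SATA stopped sharing its IRQ, so
this is a way to rule one cause in or out rather than a fix.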