Date: Mon, 21 Aug 2006 21:30:52 +0200
From: Miroslav Lachman <000.fbsd@quip.cz>
To: Matt Dawson <matt@chronos.org.uk>
Cc: freebsd-stable@freebsd.org
Subject: Re: ATA problems again ... general problem of ICH7 or ATA?
Message-ID: <44EA09EC.5000605@quip.cz>
In-Reply-To: <200608211414.16731.matt@chronos.org.uk>
References: <20060821120052.0B25816A526@hub.freebsd.org> <200608211414.16731.matt@chronos.org.uk>
Matt Dawson wrote:
> On Monday 21 August 2006 13:00, freebsd-stable-request@freebsd.org wrote:
>
>>>I can confirm the same behaviour with a ULi M1689/Newcastle Athlon64
>>>based system running 6.1-RELEASE-p3 (i386). ad6 just detaches without
>>>warning and it takes a reboot to bring it back. atacontrol reinit has no
>>>effect. Tried the following to resolve the problems:
>>
>>I don't know what is supposed to be the canonical way to
>>reattach a disconnected SATA drive, but while testing our
>>new hardware and hot-pulling a drive while the system
>>was running, atacontrol reinit didn't find the reinserted drive
>>here, either.
>>
>>atacontrol detach ata3; atacontrol attach ata3 did.
>
>
> Yes, that is the method for a controlled remove and reattach, a la hotplug
> SATA. AIUI, though, if the drive goes AWOL on its own you need to reinit the
> channel before issuing an atacontrol attach foo. In theory... (man 8
> atacontrol) In practice, the drive disappears, never to be probed again. A
> warm reboot without power down makes it appear again, so the drive itself
> isn't confused.

This is the same in my case.

> FWIW, the problem takes *far* longer to rear its head when the SATA controller
> has a PCI INT and IRQ to itself. Put a NIC onto a shared slot (a very Bad
> Thing [TM] as the BIOS simply maps the INT to a single IRQ and both devices
> end up sharing it. Now transfer a large file over the network and watch the
> ensuing hilarity) and it happens at least every couple of days. Now, with the
> slot shared with the SATA controller empty, I have six days uptime since the
> last event, which means I'm probably due one any time now.

I thought so too, but it did not solve my problem. I had UHCI sharing the
same IRQ with SATA (both on irq 19). Instead of playing with
device.hints(5), I disabled all unused peripherals in the BIOS (USB ports,
LPT port, FDD...). After a few days, the system reported another lost disk.

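One way to check for the kind of IRQ sharing discussed above is to scan the boot messages for devices that report the same "irq N". The following is only a minimal sketch, not a tool used in this thread: the function name shared_irqs is hypothetical, and it assumes probe lines of the usual dmesg form "name0: <description> ... irq N at device ... on pciX".

```shell
# shared_irqs (hypothetical helper): read dmesg-style text on stdin and
# print every IRQ number that more than one device reports.
# Assumes probe lines of the form "name0: <desc> ... irq N at device ...".
shared_irqs() {
    awk '
    $1 ~ /:$/ {
        for (i = 1; i < NF; i++) {
            if ($i == "irq" && $(i + 1) ~ /^[0-9]+$/) {
                dev = $1
                sub(/:$/, "", dev)            # "atapci1:" -> "atapci1"
                irq = $(i + 1)
                if (!((dev, irq) in seen)) {  # count each device once per IRQ
                    seen[dev, irq] = 1
                    count[irq]++
                    devs[irq] = devs[irq] " " dev
                }
            }
        }
    }
    END {
        # Report only IRQs claimed by two or more devices.
        for (irq in count)
            if (count[irq] > 1)
                print "irq " irq ":" devs[irq]
    }'
}
```

It could be fed from the saved boot log, for example "shared_irqs < /var/run/dmesg.boot". An empty result only means that no two probe lines named the same IRQ; it says nothing about whether the disk drop-outs are interrupt related.
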
> At least gmirror rebuilds the array after a simple reboot, but I would expect
> the dd operation to throw a wobbly if it's a timing issue/fight for interrupt
> between the two drives/channels. It doesn't, which makes me wonder if I'm
> barking up the wrong tree, but I can't help noticing that SATA channels have
> one interrupt between them whereas PATA channels have one each and all of
> these reports are from SATA users...

Maybe you are right; I haven't seen any report from a single-disk machine.
All the problems come from machines with two or more SATA disks.

> I wonder what pciconf -lv shows on Miroslav's system? Is the SATA controller
> sharing an INT/IRQ with something else? Does moving that device to another
> slot alleviate the problem at all?

SATA is no longer sharing IRQs, but the problem persists.

System dmesg after a verbose boot:
http://www.quip.cz/1/freebsd/asus_rs120-e3/track_dmesg_verbose_2006-08-21.txt

pciconf -lv:
http://www.quip.cz/1/freebsd/asus_rs120-e3/track_pciconf_2006-08-21.txt

The problem appeared only under heavy disk load (e.g. copying the ports
tree). I have a third system with minimal disk load that has been running
for 10 days without a problem (FreeBSD 6.0, in production for those 10
days; the machine is a "quick replacement" for a failed server, with the
system mirrored from the old disks to the new ones by dump & restore).

> Please note that Miroslav and I are using totally different drives, chipsets
> and processors. He's using, IIRC, an Intel chip with an ICH7 southbridge and
> Samsung drives. I'm using an AMD Athlon 64 Newcastle (running the i386 port)
> on a ULi M1689 chipset with WD RE2 drives so, although I'd be more than happy
> to be the numpty that is wrong and to have ata(4) vindicated by someone else,
> I suspect it is ata(4) that is the problem. However, finger pointing isn't
> productive and is certainly not fair given that ata(4) has been progressing
> so well. Anything else I can try to nail this irksome beast? Any suggestions
> for where I've been an idiot (easy, tiger!) and missed something obvious?
>
> BTW, this is a production server (DLT backed up nightly, so the data is safe)
> so I can't just pull it to bits. I do have an identical (CPU/mobo) box in the
> workshop as a workstation, however, which I could buy/borrow another drive
> for and set up gmirror to try things out.

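For anyone testing on a spare box, the controlled remove-and-reattach sequence quoted earlier in this message (atacontrol detach, then atacontrol attach) can be wrapped in a small shell function. This is a minimal sketch only: the name reattach_channel and the ATACONTROL override variable are hypothetical, the latter added so the sequence can be dry-run with echo. As reported in this thread, the sequence worked for a deliberately hot-pulled drive but may not bring back a drive that dropped off on its own.

```shell
# reattach_channel (hypothetical helper): detach and re-attach one ATA
# channel, e.g. "ata3", using the two atacontrol commands from the thread.
# ATACONTROL is an assumed override so the sequence can be dry-run.
reattach_channel() {
    chan=$1
    if [ -z "$chan" ]; then
        echo "usage: reattach_channel ataN" >&2
        return 2
    fi
    cmd=${ATACONTROL:-atacontrol}
    # Stop if the detach step fails.
    $cmd detach "$chan" || return 1
    $cmd attach "$chan"
}
```

Setting ATACONTROL=echo before calling "reattach_channel ata3" prints the arguments of the two commands instead of running them, which is a safe way to check the sequence on a production server.
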