Date:      Mon, 21 Aug 2006 21:30:52 +0200
From:      Miroslav Lachman <000.fbsd@quip.cz>
To:        Matt Dawson <matt@chronos.org.uk>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: ATA problems again ... general problem of ICH7 or ATA?
Message-ID:  <44EA09EC.5000605@quip.cz>
In-Reply-To: <200608211414.16731.matt@chronos.org.uk>
References:  <20060821120052.0B25816A526@hub.freebsd.org> <200608211414.16731.matt@chronos.org.uk>

Matt Dawson wrote:

> On Monday 21 August 2006 13:00, freebsd-stable-request@freebsd.org wrote:
> 
>>>I can confirm the same behaviour with a ULi M1689/Newcastle Athlon64
>>>based system running 6.1-RELEASE-p3 (i386). ad6 just detaches without
>>>warning and it takes a reboot to bring it back. atacontrol reinit has no
>>>effect. Tried the following to resolve the problems:
>>
>>I don't know what is supposed to be the canonical way to
>>reattach a disconnected SATA drive, but while testing our
>>new hardware and hot-pulling a drive while the system
>>was running, atacontrol reinit didn't find the reinserted drive
>>here, either.
>>
>>atacontrol detach ata3; atacontrol attach ata3 did.
> 
> 
> Yes, that is the method for a controlled remove and reattach, a la hotplug 
> SATA. AIUI, though, if the drive goes AWOL on its own you need to reinit the 
> channel before issuing an atacontrol attach foo. In theory... (man 8 
> atacontrol) In practice, the drive disappears, never to be probed again. A 
> warm reboot without power down makes it appear again, so the drive itself 
> isn't confused.

It is the same in my case.
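
For reference, the sequence discussed above (reinit the channel, then
detach and attach it) can be written as a small sh sketch. The function
names and the DRY_RUN switch are only illustrative, and ata3 is just the
example channel from the quoted text. As said above, in our failure case
this sequence did not bring the drive back, so treat it as a record of
what was tried, not as a fix.

```sh
#!/bin/sh
# Sketch only: reinit, detach and re-attach one ATA channel with atacontrol(8).
# The helper names and DRY_RUN are assumptions made for this example.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: reattach_channel ata3
reattach_channel() {
    chan=$1
    # Reset the channel first; needed in theory when the drive vanished on its own.
    run atacontrol reinit "$chan" || echo "reinit of $chan failed" >&2
    # Controlled remove and reattach, as for a hot-plugged SATA drive.
    run atacontrol detach "$chan" || echo "detach of $chan failed" >&2
    run atacontrol attach "$chan" || echo "attach of $chan failed" >&2
    # List the channels to see whether the drive was probed again.
    run atacontrol list
}
```

Running "DRY_RUN=1 reattach_channel ata3" only prints the four commands,
which is a safe way to check the order before trying it on a real box.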

> FWIW, the problem takes *far* longer to rear its head when the SATA controller 
> has a PCI INT and IRQ to itself. Put a NIC onto a shared slot (a very Bad 
> Thing [TM] as the BIOS simply maps the INT to a single IRQ and both devices 
> end up sharing it. Now transfer a large file over the network and watch the 
> ensuing hilarity) and it happens at least every couple of days. Now, with the 
> slot shared with the SATA controller empty, I have six days uptime since the 
> last event, which means I'm probably due one any time now. 

I thought so too, but it did not solve my problem. I had UHCI sharing the 
same IRQ with SATA (both on irq 19). Instead of playing with 
device.hints(5), I disabled all unused peripherals in the BIOS (USB 
ports, LPT port, FDD...). After a few days, the system reported another 
lost disk.

> At least gmirror rebuilds the array after a simple reboot, but I would expect 
> the dd operation to throw a wobbly if it's a timing issue/fight for interrupt 
> between the two drives/channels. It doesn't, which makes me wonder if I'm 
> barking up the wrong tree, but I can't help noticing that SATA channels have 
> one interrupt between them whereas PATA channels have one each and all of 
> these reports are from SATA users...

Maybe you are right; I have not seen any report from a single-disk 
machine. All the problems come from machines with 2 or more SATA disks.

> I wonder what pciconf -lv shows on Miroslav's system? Is the SATA controller 
> sharing an INT/IRQ with something else? Does moving that device to another 
> slot alleviate the problem at all?

SATA is no longer sharing IRQs, but the problem persists.

System dmesg after a verbose boot:
http://www.quip.cz/1/freebsd/asus_rs120-e3/track_dmesg_verbose_2006-08-21.txt
Output of pciconf -lv:
http://www.quip.cz/1/freebsd/asus_rs120-e3/track_pciconf_2006-08-21.txt
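
If somebody wants to check for shared IRQs on their own machine, here is
a rough sh/awk sketch that groups devices by IRQ from dmesg output. It
assumes the usual probe line format ("name: <description> ... irq N at
device X.Y on pciZ"); the function name shared_irqs is made up for this
example, and devices whose probe lines look different will be missed.

```sh
#!/bin/sh
# Sketch only: report IRQs that more than one device claims in dmesg output.
# Assumes probe lines of the form
#   name: <description> ... irq N at device X.Y on pciZ
shared_irqs() {
    awk '
        / irq [0-9]+ at device / {
            # The device name is the first field without its trailing colon.
            dev = $1
            sub(/:$/, "", dev)
            # The IRQ number is the field right after the word "irq".
            irq = ""
            for (i = 2; i < NF; i++) {
                if ($i == "irq") { irq = $(i + 1); break }
            }
            if (irq == "" || ((irq, dev) in seen))
                next
            seen[irq, dev] = 1
            if (count[irq]++)
                names[irq] = names[irq] " " dev
            else
                names[irq] = dev
        }
        END {
            # Print only the IRQs that have two or more devices on them.
            for (irq in count)
                if (count[irq] > 1)
                    printf "irq %s: %s\n", irq, names[irq]
        }
    '
}
```

It reads from standard input, so "dmesg | shared_irqs" or feeding it a
saved dmesg file both work. No output means no sharing was found among
the lines it could parse.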

The problem appeared only under heavy disk load (e.g. copying the ports 
tree). I have a third system with minimal disk load that has been running 
for 10 days without a problem (FreeBSD 6.0, in production for those 10 
days; the machine is a "quick replacement" for a failed server, with the 
system mirrored from the old disks to the new ones by dump & restore).

> Please note that Miroslav and I are using totally different drives, chipsets 
> and processors. He's using, IIRC, an Intel chip with an ICH7 southbridge and 
> Samsung drives. I'm using an AMD Athlon 64 Newcastle (running the i386 port) 
> on a ULi M1689 chipset with WD RE2 drives so, although I'd be more than happy 
> to be the numpty that is wrong and to have ata(4) vindicated by someone else, 
> I suspect it is ata(4) that is the problem. However, finger pointing isn't 
> productive and is certainly not fair given that ata(4) has been progressing 
> so well. Anything else I can try to nail this irksome beast? Any suggestions 
> for where I've been an idiot (easy, tiger!) and missed something obvious?
> 
> BTW, this is a production server (DLT backed up nightly, so the data is safe) 
> so I can't just pull it to bits. I do have an identical (CPU/mobo) box in the 
> workshop as a workstation, however, which I could buy/borrow another drive 
> for and set up gmirror to try things out.


