FreeBSD Mail Archives

Date:      Tue, 23 Feb 2010 17:08:02 +0100
From:      Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
To:        Alexander Motin <mav@FreeBSD.org>
Cc:        freebsd-stable@FreeBSD.org
Subject:   Re: ahcich timeouts, only with ahci, not with ataahci
Message-ID:  <4B83FD62.2020407@omnilan.de>
In-Reply-To: <4B83EFD4.8050403@FreeBSD.org>
References:  <1266934981.00222684.1266922202@10.7.7.3> <4B83EFD4.8050403@FreeBSD.org>



[-- Attachment #1 --]
Alexander Motin schrieb am 23.02.2010 16:10 (localtime):
> Harald Schmalzbauer wrote:
>> I'm frequently getting my machine locked with ahcichX timeouts:
>> ahcich2: Timeout on slot 0
>> ahcich2: is 00000000 cs 00000001 ss 00000000 rs 00000001 tfd c0 serr 00000000
>> ahcich2: Timeout on slot 8
>> ahcich2: is 00000000 cs 00000100 ss 00000000 rs 00000100 tfd c0 serr 00000000
>> ahcich2: Timeout on slot 8
>> ahcich2: is 00000000 cs fffff07f ss ffffff7f rs ffffff7f tfd c0 serr 00000000
>> ...
> 
> Since `is` (interrupt status) is zero and `rs == cs | ss` (running
> command bitmasks in driver and hardware), the controller doesn't report
> command completion. Given the TFD status 0xc0 with the BUSY bit set, I
> would suppose that either the disk is stuck in command processing for
> some reason, or the controller missed the command completion status.
> 
> Have you noticed a 30 second (default ATA timeout) pause before the
> timeout message is printed? I just want to be sure that the driver
> waited long enough before giving up.

Yes, there is some pause between the occurrence of the hang and the first 
timeout message. But I can't tell you exactly whether it's 30 seconds. My 
guess is that it is rather more than 30 seconds.

>> This happens when backup over GbE overloads ZFS/HDD capabilities.
>> I reduced vfs.zfs.txg.timeout to 1 to prevent the machine from locking
>> up almost immediately, but it still happens.
>> When I don't use ahci but ataahci (the old driver, if I understand things
>> correctly) I also see the ZFS burst write congestion, but this doesn't
>> lead to controller timeouts that block the machine.
>>
>> Sometimes the machine recovers from the disk lock, but most often I have
>> to reboot.
> 
> How does it look when it doesn't? Can you send me the full log messages?

Unfortunately not. That happened only once (that I noticed), 3 days 
ago, and the message logs have been rotated 5 times since then...
But I have some messages from 02/15, with a kernel from January. Usually 
the messages continue to pop up until I reset the machine. This time 
there were only the three above, even after waiting half an hour (I had to 
go on site). The old messages:

ahcich2: Timeout on slot 20
ahcich2: is 00000000 cs ff07ffff ss fff7ffff rs fff7ffff tfd c0 serr 00000000
ahcich4: Timeout on slot 24
ahcich4: is 00000000 cs f07fffff ss ff7fffff rs ff7fffff tfd c0 serr 00000000
ahcich2: Timeout on slot 17
ahcich2: is 00000000 cs fff9ffff ss ffffffff rs ffffffff tfd c0 serr 00000000
ahcich4: Timeout on slot 20
ahcich4: is 00000000 cs 00300000 ss 00000000 rs 00300000 tfd c0 serr 00000000
ahcich2: Timeout on slot 15
ahcich2: is 00000000 cs fff87fff ss ffffffff rs ffffffff tfd c0 serr 00000000
ahcich4: Timeout on slot 22
ahcich4: is 00000000 cs fc0fffff ss ffcfffff rs ffcfffff tfd c0 serr 00000000
ahcich2: Timeout on slot 13
ahcich2: is 00000000 cs ffff1fff ss ffffffff rs ffffffff tfd c0 serr 00000000
ahcich4: Timeout on slot 16
ahcich4: is 00000000 cs 00010000 ss 00000000 rs 00010000 tfd c0 serr 00000000
ahcich2: Timeout on slot 11
ahcich2: is 00000000 cs ffffc7ff ss ffffffff rs ffffffff tfd c0 serr 00000000
ahcich4: Timeout on slot 16
ahcich4: is 00000000 cs 00000000 ss 00010000 rs 00010000 tfd 40 serr 00000000
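
For anyone reading such status lines by hand, the checks discussed in this thread (interrupt status zero, `rs == cs | ss`, BUSY bit in TFD) can be applied mechanically. The following is only a rough Python sketch, not part of the driver. It assumes each status report sits on one line (rejoin the `serr` value first if the mail client wrapped it), that the fields appear in the order shown in the logs, and that BUSY is the standard ATA status bit 0x80. The names `STATUS_RE` and `decode_status` are made up for this illustration.

```python
import re

# Matches the second line of each timeout report, in the field order
# seen in the logs of this thread (assumed stable for this illustration).
STATUS_RE = re.compile(
    r"(?P<chan>ahcich\d+): is (?P<is>[0-9a-f]{8}) cs (?P<cs>[0-9a-f]{8}) "
    r"ss (?P<ss>[0-9a-f]{8}) rs (?P<rs>[0-9a-f]{8}) "
    r"tfd (?P<tfd>[0-9a-f]+) serr (?P<serr>[0-9a-f]{8})"
)

ATA_STATUS_BSY = 0x80  # standard ATA status register BSY bit


def decode_status(line):
    """Decode one status line; returns None if the line does not match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    # Convert every hex field except the channel name to an integer.
    f = {k: int(v, 16) for k, v in m.groupdict().items() if k != "chan"}
    return {
        "channel": m.group("chan"),
        # Non-zero interrupt status would mean the controller signalled something.
        "irq_pending": f["is"] != 0,
        # BUSY set means the device has not finished the current command.
        "busy": bool(f["tfd"] & ATA_STATUS_BSY),
        # The check quoted earlier in the thread: rs == cs | ss.
        "masks_consistent": f["rs"] == (f["cs"] | f["ss"]),
        # Slot numbers whose bit is set in the rs mask.
        "running_slots": [b for b in range(32) if (f["rs"] >> b) & 1],
    }


if __name__ == "__main__":
    sample = ("ahcich4: is 00000000 cs 00300000 ss 00000000 "
              "rs 00300000 tfd c0 serr 00000000")
    print(decode_status(sample))
```

Feeding it the last report above (tfd 40) shows the BUSY flag cleared while slot 16 is still marked as running, which is the one line that differs from the rest.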

Maybe they are helpful to you. Since I hadn't seen the hang after 
upgrading, despite doing extensive network transfer tests, I thought it 
had vanished and didn't keep the logs...

>> Kernel is from Feb. 19, so recent ahci improvements are active.
>> Controller is ICH9R with 3 Samsung F3 SpinPoints.
>>
>> Any ideas how to work around the hangs other than using the old ahci
>> driver?
> 
> The old ataahci driver wasn't using NCQ. NCQ may trigger some bugs in drive
> firmware or expose some protocol inconsistencies. I would recommend that you
> search for errata for your drive and possibly a firmware update.

Sounds reasonable.
How can I disable NCQ with the new ahci driver?
If it's an HDD firmware issue with NCQ, I guess the hang shouldn't happen 
when NCQ is disabled.
Btw, I found this command for disabling APM:

    camcontrol cmd ada0 -a "EF 85 00 00 00 00 00 00 00 00 00 00"

There is another one for disabling AAM. I did that for my drives. Is there 
a wiki where we can place such valuable commands?
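
To make the twelve bytes in that camcontrol argument less opaque, here is a small Python sketch that labels them. It is an illustration only: the field order is my reading of the `cmd -a` description in camcontrol(8) and should be checked against the man page on your system, and the names `ATA_FIELDS` and `parse_ata_args` are hypothetical. In the ATA command set, 0xEF is SET FEATURES and feature code 0x85 disables APM.

```python
# Field order ASSUMED from the camcontrol(8) description of "cmd -a";
# verify against the man page before relying on it.
ATA_FIELDS = (
    "command", "features", "lba_low", "lba_mid", "lba_high", "device",
    "lba_low_exp", "lba_mid_exp", "lba_high_exp", "features_exp",
    "sector_count", "sector_count_exp",
)


def parse_ata_args(arg):
    """Split a 'cmd -a' hex string into named register values."""
    values = [int(tok, 16) for tok in arg.split()]
    if len(values) != len(ATA_FIELDS):
        raise ValueError("expected %d bytes, got %d"
                         % (len(ATA_FIELDS), len(values)))
    return dict(zip(ATA_FIELDS, values))


if __name__ == "__main__":
    regs = parse_ata_args("EF 85 00 00 00 00 00 00 00 00 00 00")
    # Print only the registers that are actually set.
    for name, value in regs.items():
        if value:
            print("%-16s 0x%02X" % (name, value))
```

Only the command and features registers are non-zero here; all LBA and count fields stay at zero for this request.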

Thanks,

-Harry


[-- Attachment #2 --]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.13 (FreeBSD)

iEYEARECAAYFAkuD/WMACgkQLDqVQ9VXb8hgRgCeJo/dUvVw3mzgwXf/JPjh245g
230An31KgZM6DP+Jy95EgfvnkhXOAm0F
=b+YU
-----END PGP SIGNATURE-----

Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4B83FD62.2020407>