Date:      Tue, 14 Feb 2012 16:42:47 -0700
From:      Scott Long <scottl@samsco.org>
To:        Victor Balada Diaz <victor@bsdes.net>
Cc:        Harald Schmalzbauer <h.schmalzbauer@omnilan.de>, Alexander Motin <mav@freebsd.org>, freebsd-stable@freebsd.org, Jeremy Chadwick <freebsd@jdc.parodius.com>, Claudius Herder <claudius@ambtec.de>
Subject:   Re: problems with AHCI on FreeBSD 8.2
Message-ID:  <6D5E973B-6D98-41D7-B5E9-64A497F0F9F5@samsco.org>
In-Reply-To: <20120214233420.GU2010@equilibrium.bsdes.net>
References:  <20120214091909.GP2010@equilibrium.bsdes.net> <20120214100513.GA94501@icarus.home.lan> <20120214135435.GQ2010@equilibrium.bsdes.net> <20120214141601.GA98986@icarus.home.lan> <4F3A83DE.3000200@ambtec.de> <20120214165029.GA1852@icarus.home.lan> <4F3A971F.9040407@omnilan.de> <20120214221527.GT2010@equilibrium.bsdes.net> <20120214230958.GA8434@icarus.home.lan> <20120214233420.GU2010@equilibrium.bsdes.net>

On Feb 14, 2012, at 4:34 PM, Victor Balada Diaz wrote:

> On Tue, Feb 14, 2012 at 03:09:58PM -0800, Jeremy Chadwick wrote:
>> On Tue, Feb 14, 2012 at 11:15:27PM +0100, Victor Balada Diaz wrote:
>>> On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote:
>>>> Jeremy Chadwick wrote on 14.02.2012 17:50 (localtime):
>>>>> On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still
>>>>>> persists on FreeBSD 9.0 release.
>>>>>>
>>>>>> Switching from ahci to ataahci resolved the problem for me too.
>>>>>>
>>>>>> I'm using gmirror for swap, system is on a zpool and the problem first
>>>>>> occurred during a zpool scrub, but it is easily reproducible with dd.
>>>>>>
>>>>>> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1}
>>>>>> of=/dev/null is not an issue.
>>>>>> Sometimes I need to power off the server because after a reboot one disk
>>>>>> is still missing.
>>>>>>
>>>>>> I really would like to help in this issue, so let me know if you need
>>>>>> any more information.
>>>>> I find it interesting that, at least so far, the only people reporting
>>>>> problems of this type with the ahci.ko driver are people using Samsung
>>>>> disks.  The only difference is that your models are F1s while the OP's
>>>>> are F2s.
>>>>
>>>> I saw such timeouts long ago and mav@ had a look at my postings and he
>>>> mentioned it could be an NCQ problem.
>>>> I suspected the disks' firmware.
>>>> I never tracked it down further, because replacing the Samsung (F3
>>>> in that case) disks with Hitachi ones solved all my problems and gave a
>>>> big performance kick as well (with zfs).
>>>> You can find the discussion here:
>>>> http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html
>>>>
>>>
>>> You gave me a good idea: try to disable NCQ and see if that's the fault. So
>>> I went and applied the attached patch. After it, I can no longer reproduce
>>> the issue with the ahci driver.
>>>
>>> I know this is not a solution because it disables NCQ at controller level
>>> instead of disk level, but at least we know for sure where the problem is.
>>>
>>> I think the solution would be to add a new quirk ADA_Q_NONCQ in
>>> sys/cam/ata/ata_da.c. The quirks infrastructure is already built, so
>>> adding a new quirk for this seems easy.
>>>
>>> Is someone interested? Do you think there is a better solution?
>>>
>>> If someone is interested I can build a patch to add an ADA_Q_NONCQ quirk
>>> and add my drives to it.
>>
>> I took a stab at this, but I don't feel confident this is the proper
>> solution/method.  I worry there's some sort of chicken-or-the-egg
>> condition here (quirk setup/matching comes *after* SATA capabilities
>> detection), or that it makes the code messier.  Need mav@'s
>> recommendations on this.
>>
>> Below is for RELENG_8.  I should note I haven't tested if this works, or
>> even compiles -- normally I don't provide such patches without testing
>> so I apologise in advance / user beware.
>
> You're amazingly fast. Thanks for all your help :)
>
> You start applying the quirks before
>
>        snprintf(announce_buf, sizeof(announce_buf),
>            "kern.cam.ada.%d.quirks", periph->unit_number);
>        quirks = softc->quirks;
>        TUNABLE_INT_FETCH(announce_buf, &quirks);
>
> So you're breaking quirk setting at boot time.
>
> See my attached patch. I can confirm it works for me.
>
> Regards.
>

I don't think that disabling NCQ entirely is the right solution.  It's a
tag starvation issue in the firmware, not a complete failure, and it can
be dealt with in the CAM XPT scheduler fairly efficiently.  Alexander
and I talked about this recently, and though we differ on the details, a
tag hack is not in order, IMHO.  In the short term, try just using
"camcontrol tags ada0 -N 1" to limit the concurrent commands to 1.
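
A minimal sketch of that short-term workaround, assuming the affected
disk shows up as ada0 (substitute your own device names). The helper
name limit_tags and the DRY_RUN switch are hypothetical conveniences,
not part of FreeBSD; only the camcontrol(8) "tags" subcommand with -N
and -v is used. The limit is a runtime setting, so it would likely need
to be reapplied after a reboot.

```shell
#!/bin/sh
# Limit the number of concurrently queued commands (tags) on one disk.
# limit_tags and DRY_RUN are illustrative names, not FreeBSD features.
# With DRY_RUN=1 the command is only printed, not executed.
limit_tags() {
    dev="$1"      # CAM device name, e.g. ada0 (assumption)
    ntags="$2"    # maximum number of outstanding commands
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "camcontrol tags $dev -N $ntags"
    else
        camcontrol tags "$dev" -N "$ntags"   # apply the new tag limit
        camcontrol tags "$dev" -v            # print the resulting queue settings
    fi
}

# Example (run as root on the affected machine):
#   limit_tags ada0 1
#   limit_tags ada1 1
```

Running it with DRY_RUN=1 first shows exactly which command would be
issued for each disk before anything is changed.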

Scott




