Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 14 Feb 2012 16:10:20 -0800
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Victor Balada Diaz <victor@bsdes.net>
Cc:        Harald Schmalzbauer <h.schmalzbauer@omnilan.de>, Alexander Motin <mav@freebsd.org>, freebsd-stable@freebsd.org, Claudius Herder <claudius@ambtec.de>
Subject:   Re: problems with AHCI on FreeBSD 8.2
Message-ID:  <20120215001020.GA9386@icarus.home.lan>
In-Reply-To: <20120214233420.GU2010@equilibrium.bsdes.net>
References:  <20120214091909.GP2010@equilibrium.bsdes.net> <20120214100513.GA94501@icarus.home.lan> <20120214135435.GQ2010@equilibrium.bsdes.net> <20120214141601.GA98986@icarus.home.lan> <4F3A83DE.3000200@ambtec.de> <20120214165029.GA1852@icarus.home.lan> <4F3A971F.9040407@omnilan.de> <20120214221527.GT2010@equilibrium.bsdes.net> <20120214230958.GA8434@icarus.home.lan> <20120214233420.GU2010@equilibrium.bsdes.net>

next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, Feb 15, 2012 at 12:34:20AM +0100, Victor Balada Diaz wrote:
> On Tue, Feb 14, 2012 at 03:09:58PM -0800, Jeremy Chadwick wrote:
> > On Tue, Feb 14, 2012 at 11:15:27PM +0100, Victor Balada Diaz wrote:
> > > On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote:
> > > >  schrieb Jeremy Chadwick am 14.02.2012 17:50 (localtime):
> > > > > On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote:
> > > > >> Hello,
> > > > >>
> > > > >> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still
> > > > >> persists on FreeBSD 9.0 release.
> > > > >>
> > > > >> Switching from ahci to ataahci resolved the problem for me too.
> > > > >>
> > > > >> I'm using gmirror for swap, system is on a zpool and the problem first
> > > > >> occurred during a zpool scrub, but it is easily reproducible with dd.
> > > > >>
> > > > >> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1}
> > > > >> of=/dev/null is not an issue.
> > > > >> Sometimes I need to power off the server because after a reboot one disk
> > > > >> is still missing.
> > > > >>
> > > > >> I really would like to help in this issue, so let me know if you need
> > > > >> any more information.
> > > > > I find it interesting that, at least so far, the only people reporting
> > > > > problems of this type with the ahci.ko driver are people using Samsung
> > > > > disks.  The only difference is that your models are F1s while the OPs
> > > > > are F2s.
> > > > 
> > > > I saw such timeouts long ago and mav@ had a look at my postings and he
> > > > mentioned it could be a NCQ problem.
> > > > I suspected the disks firmware.
> > > > I never tracked it down further, because after replacing the Samsung (F3
> > > > in that case) disks with hitachi ones solved all my problems and gave a
> > > > big performance kick as well (with zfs).
> > > > You can find the discussion here:
> > > > http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html
> > > > 
> > > 
> > > You gave me a good idea: try to disable NCQ and see if that's the fault. So
> > > i went and applied the attached patch. After it, i can no longer reproduce
> > > the issue with ahci driver.
> > > 
> > > I know this is not a solution because it disables NCQ at controller level
> > > instead of disk level, but at least we know for sure where the problem is.
> > > 
> > > I think the solution would be to add a new quirk ADA_Q_NONCQ in sys/cam/ata/ata_da.c.
> > > Quirks infraestructure is already built, so adding a new quirk for this seems
> > > easy.
> > > 
> > > Is someone interested? Do you think there is a better solution?
> > > 
> > > If someone is interested i can build a patch to add ADA_Q_NONCQ quirk and add my drives
> > > to it.
> > 
> > I took a stab at this, but I don't feel confident this is the proper
> > solution/method.  I worry there's some sort of chicken-or-the-egg
> > condition here (quirk setup/matching comes *after* SATA capabilities
> > detection), or that it makes the code messier.  Need mav@'s
> > recommendations on this.
> > 
> > Below is for RELENG_8.  I should note I haven't tested if this works, or
> > even compiles -- normally I don't provide such patches without testing
> > so I apologise in advance / user beware.
> 
> You're amazingly fast. Thanks for all your help :)
> 
> You start applying the quirks before 
> 
>         snprintf(announce_buf, sizeof(announce_buf),
>             "kern.cam.ada.%d.quirks", periph->unit_number);
>         quirks = softc->quirks;
>         TUNABLE_INT_FETCH(announce_buf, &quirks);
> 
> So you're breaking quirk setting at boot time.

I'm too tired to quite understand (in full) what's wrong with my patch,
but I think you're referring to situations where someone would have
kern.cam.ada.X.quirks set in loader.conf?

If so, I believe that same situation would happen presently if someone
set kern.cam.ada.X.quirks in their loader.conf to a value that did not
contain bit #0 set to 1, and used one of the 4K sector disks listed in
ada_quirk_table -- what's in loader.conf looks like it would overwrite
whatever the kernel code bits chose automatically:

 910         match = cam_quirkmatch((caddr_t)&cgd->ident_data,
 911                                (caddr_t)ada_quirk_table,
 912                                sizeof(ada_quirk_table)/sizeof(*ada_quirk_table),
 913                                sizeof(*ada_quirk_table), ata_identify_match);
 914         if (match != NULL)
 915                 softc->quirks = ((struct ada_quirk_entry *)match)->quirks;
 916         else
 917                 softc->quirks = ADA_Q_NONE;
 ...
 931         snprintf(announce_buf, sizeof(announce_buf),
 932             "kern.cam.ada.%d.quirks", periph->unit_number);
 933         quirks = softc->quirks;
 934         TUNABLE_INT_FETCH(announce_buf, &quirks);
 935         softc->quirks = quirks;

I read this to mean:

Lines 910-917 -- if there's a device ID string match in ada_quirk_table,
set softc->quirks to the content of that struct entry.  So, for example,
4K sector disks would set softc->quirks to 0x01.

Lines 931-933 -- assign the "quirks" variable to what softc->quirks
currently contains.  Thus, for 4K sector disks, quirks = 0x01.

Line 934 -- load into "quirks" variable the contents of loader.conf
entries pertaining to kern.cam.ada.N.quirks, if set.  If someone had an
entry in loader.conf saying kern.cam.ada.N.quirks=0 then yes, this would
overwrite what the kernel "auto-chose".

Line 935 -- assign softc->quirks = quirks, thus softc->quirks = 0x00,
assuming someone set it to such in loader.conf.

> See my attached patch. I can confirm it works for me.

I thought of taking that approach, but for me it's "dirty".  Here's what
I mean by that:

ADA_FLAG_CAN_NCQ gets set in softc->flags around line 892, but then
roughly a hundred lines later, you clear the exact same flag you just
set (based on quirk conditionals).

I dunno how people feel about that, but my impression is that it's
dirty/confusing.  My opinion is to only set the bit once and not mess
about with repeated if() statements that set it, then clear it, etc...

-- 
| Jeremy Chadwick                              jdc at parodius.com |
| Parodius Networking                     http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, US |
| Making life hard for others since 1977.             PGP 4BD6C0CB |




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120215001020.GA9386>