Date:      Fri, 22 May 2015 04:11:46 +0300
From:      Alexander Motin <mav@FreeBSD.org>
To:        Warner Losh <imp@bsdimp.com>, Neffi <nefftd@gmail.com>
Cc:        hackers@freebsd.org, imp@freebsd.org
Subject:   Re: Botched NCQ on SSD - cannot disable?
Message-ID:  <555E8252.2060307@FreeBSD.org>
In-Reply-To: <8EDE2E6C-FED8-498B-9211-E3534A28D2FC@bsdimp.com>
References:  <CA%2BK1YHFrxHt5rVU%2BLsH9UN37dr_7or1C7rEB0eHfJisU7sPE0Q@mail.gmail.com> <8EDE2E6C-FED8-498B-9211-E3534A28D2FC@bsdimp.com>

On 21.05.2015 21:54, Warner Losh wrote:
> 
>> On May 21, 2015, at 12:42 PM, Neffi <nefftd@gmail.com> wrote:
>> 
>> I was discussing this issue in freenode/#freebsd and I was
>> recommended to shoot an email to you fellows about it.
>> 
>> I've got a Samsung 840 EVO SSD (model MZ-7TE250BW), which uses
>> Samsung's own controller from what I can gather. I had issues of
>> mass data corruption when used under Linux, and several programs
>> crashing unexpectedly when used under FreeBSD. I've gone through
>> 2 drives under warranty with the same issue before customer
>> service suggested to disable drive queuing.
>> 
>> After some research it seems as though this drive (and several
>> other common SSDs) report that they support NCQ, but in fact are
>> botched and will have all sorts of problems with NCQ enabled
>> ranging from poor performance, to I/O stalls to data corruption.
>> 
>> Sure enough the logs on Linux spit out something along the lines
>> of:
>> 
>>> ata1: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x10 frozen 
>>> ata1.00: failed command: READ FPDMA QUEUED
>> 
>> This happens several times when used on Linux, in the few hours
>> leading up to total filesystem corruption.
>> 
>> The recommendation in the Linux world is to disable NCQ on these
>> drives, for which there is an easy boot-time tunable. This
>> fixes the issue. No more data corruption.
>> 
>> There doesn't seem to be a tunable for this anywhere on FreeBSD.
>> camcontrol(8) mentions setting the tags used, but only between
>> some hardcoded limits, with a default of 2 -- not sufficient to
>> disable NCQ on the drive. It looks like presently the only option
>> is to manually patch the quirks for this drive in the kernel and
>> recompile before I can even install the system to the drive.
> 
> One option is to use drives that don’t suck so bad.
> 
> If you are using the AHCI controller, it has quirks for some cards
> that don’t properly fill in the NCQ tags, but so far that’s a tiny
> list of mostly older gear. What’s the host controller you are
> using?
> 
> Also, just because the command that hung on the drive is an NCQ
> command, that doesn’t mean disabling NCQ commands will keep you
> safe. That’s just the first one that’s issued after the firmware
> wedges (or could be: that’s a very common scenario for this kind of
> failure mode).
> 
> There’s a quirk for the 840 EVO, but that’s just to force 4k sector
> size.
> 
> While I haven’t used this generation of Samsung SSDs, I’d be highly
> surprised if this issue was really a problem in the drive instead
> of some cabling issue, or other environmental issue leading to the
> wedge.
> 
> It’s true there’s no way to totally disable NCQ, but if the drive
> is hanging with NCQ depth of 2, I’d be highly surprised if it is
> actually NCQ causing this...

IIRC camcontrol can disable NCQ, even though it is not very intuitive:
`camcontrol negotiate adaX -T disable ; camcontrol reset <CAM bus
number where adaX is connected>`
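
The bus number for the reset step can be read from `camcontrol devlist`, which prints lines of the form `<model> at scbusN target T lun L (adaX,passY)`. A minimal sketch of the two steps follows, assuming that output format. The helper names `bus_for_dev` and `ncq_disable_cmds` are hypothetical and not part of camcontrol, and the second helper only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper: read `camcontrol devlist` output on stdin and print
# the CAM bus number (the N in scbusN) that the given device is attached to.
bus_for_dev() {
    dev="$1"
    # Expected line shape (assumption about the devlist format):
    #   <Samsung SSD 840 EVO 250GB ...>  at scbus0 target 0 lun 0 (ada0,pass0)
    sed -n "s/.* at scbus\([0-9][0-9]*\) target .*[(,]${dev}[,)].*/\1/p" | head -n 1
}

# Hypothetical helper: dry run only. Prints the two commands from the
# message above instead of executing them.
ncq_disable_cmds() {
    dev="$1"
    bus="$2"
    echo "camcontrol negotiate ${dev} -T disable"
    echo "camcontrol reset ${bus}"
}

# Example use on a FreeBSD host (left commented out; ada0 is a placeholder):
#   bus=$(camcontrol devlist | bus_for_dev ada0)
#   ncq_disable_cmds ada0 "$bus"
```

Since the setting is negotiated at run time, it presumably would not survive a reboot and would have to be reapplied, for example from a startup script.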

-- 
Alexander Motin


