From owner-freebsd-stable@FreeBSD.ORG  Mon Apr 22 16:52:25 2013
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mandree.no-ip.org (freefall.freebsd.org
 [IPv6:2001:1900:2254:206c::16:87])
 by hub.freebsd.org (Postfix) with ESMTP id 354C9507;
 Mon, 22 Apr 2013 16:52:25 +0000 (UTC)
 (envelope-from mandree@FreeBSD.org)
Received: from [127.0.0.1] (localhost.localdomain [127.0.0.1])
 by apollo.emma.line.org (Postfix) with ESMTP id 1F71F23CE96;
 Mon, 22 Apr 2013 08:26:32 +0200 (CEST)
Message-ID: <5174D817.9070405@FreeBSD.org>
Date: Mon, 22 Apr 2013 08:26:31 +0200
From: Matthias Andree <mandree@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:17.0) Gecko/20130329 Thunderbird/17.0.5
MIME-Version: 1.0
To: Jeremy Chadwick <jdc@koitsu.org>
Subject: Re: Any objections/comments on axing out old ATA stack?
References: <51536306.5030907@FreeBSD.org>
 <20130331130409.GO3178@equilibrium.bsdes.net>
 <C699FE76-B456-49C7-8D3A-DD54F98DAFC1@samsco.org>
 <515B25D8.7050902@FreeBSD.org> <515BF5AE.4050804@FreeBSD.org>
 <515CAA04.1050108@FreeBSD.org> <20130403233815.GA65719@icarus.home.lan>
 <515CC704.90302@FreeBSD.org> <20130404010526.GA66858@icarus.home.lan>
 <515D3312.3010109@FreeBSD.org> <20130420212957.GA19158@icarus.home.lan>
In-Reply-To: <20130420212957.GA19158@icarus.home.lan>
X-Enigmail-Version: 1.4.6
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: Alexander Motin <mav@FreeBSD.org>, freebsd-current@freebsd.org,
 freebsd-stable@freebsd.org
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 22 Apr 2013 16:52:25 -0000

On 20.04.2013 23:29, Jeremy Chadwick wrote:

>> My feeling is that the stalls are mostly from the error handler and the
>> overall time the drive is "frozen" gets shorter. If it had not _felt_
>> faster, I'd not have left that in sysctl.conf in the first place.
> 
> Your understanding of what that sysctl does is wrong, or I'm
> misunderstanding what you're saying (very possible!).

What I am saying is a high-level view of the situation.

If I leave the default slot timeout set, whenever the computer gets into
an episode of stalls, it becomes unusable (all I/O stalled so anything
that needs disk I/O will hang) for so long that it is much faster to
depress the reset button, reboot, force fsck, and retry.

This usually entails hand-holding and manually cleaning up debris, such
as b0rked .o files from a buildworld, or similar.

These stalls happen in the middle of a buildworld, under heavy
I/O, so I'd dispute that excessive head unloading or drive spindown is
the issue -- the computer (and its fans in particular) is generally very
quiet, with no VGA board (just fanless onboard Radeon HD 3300), so I
could hear re-spinups or parking heads.  I don't hear anything like it.

I don't know how commands that timed out get rescheduled overall.

> How I interpret what you're saying: that the sysctl somehow "decreases
> stall times" during I/O operations that fail.  This is incorrect.

That may not be the intention of the sysctl, but it is the high-level
outcome.

> What that sysctl does is define the number of seconds that transpire
> ***before*** the CAM layer says "Okay, I didn't get a response to the
> ATA CDB I sent the disk", and then re-submits the same CDB to the disk.

The other question (to Alexander Motin) then is why I see the
timeouts for the related slots roughly $timeout seconds apart.

Alexander, is there any way we can make the kernel dump the entire set
of pending NCQ queue entries including submitted timestamp, or timeout
values, so that we can see how much workload is queued?

Note also that the CRC count has not increased since I put the
smartctl output online; it's still at 14.  If CRC errors were the cause,
I would have to see them and their consequences in Linux or Windows, too.

Linux's smartd 5.41 never mailed about an increase of the CRC value, and
I told it not to mail temperature changes.

> Rephrased: in the case of a disk stalling on an I/O request, you will
> experience the effects of that stall no matter what that sysctl is set
> to.  A lower value in that sysctl will result in CAM spitting out
> nasties on the console + hitting the CDB retry submission scenario
> sooner, which if the drive is awake/responsive by that time will go
> smoothly.
> 
> That's all it does.

That's how you have explained it, and how I have understood it, on the
queue-slot level (microscopic), but at a larger scale, I do not observe
that the shorter timeout sysctl value leads to these stall episodes
happening more often (as should be the consequence if spindown were the
cause of the stalls); only recovery is faster.
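
To make the two views concrete, here is a toy model in Python.  It is
only a sketch under assumptions that are mine, not taken from the driver:
the command submitted at t=0 is lost by the drive, resubmissions happen
at multiples of the timeout, and a resubmission completes instantly once
the drive is responsive again.  The function name recovery_time and the
7-second stall are made up for illustration.

```python
import math


def recovery_time(stall_s: float, timeout_s: float) -> float:
    """Seconds until the command completes in this toy model.

    Assumptions (hypothetical, not the real CAM/ahci logic):
    - the drive is unresponsive for stall_s seconds starting at t=0,
    - the command issued at t=0 is lost and never answered,
    - the command is resubmitted every timeout_s seconds,
    - a resubmission issued once the drive is awake completes at once.
    """
    if stall_s <= 0:
        return 0.0
    # First multiple of timeout_s at or after the end of the stall.
    retries = math.ceil(stall_s / timeout_s)
    return float(retries * timeout_s)


if __name__ == "__main__":
    for timeout in (30, 15, 5):
        print(f"timeout={timeout:2d} s -> I/O resumes after "
              f"{recovery_time(7, timeout):.0f} s")
```

In this model the drive stall itself is the same in every case, which
matches the queue-slot explanation, yet the time until I/O resumes
shrinks with the timeout, which matches what I observe.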

> Thus a value of 5 indicates a device/drive did not respond to a CDB
> within 5 seconds, and a value of 30 indicates a device/drive did not
> respond to a CDB within 30 seconds.  Regardless, those lengths of time
> are VERY long for an I/O operation on a mechanical HDD.

Indeed they are, and because /usr is on the offending drive, I lowered
the value to 5 s, which I still deem conservative.  I know that an older
ATA standard edition permitted longer completion times for flushing HDD
internal write caches to platters (15 s IIRC).

> Oh look, it's the Samsung SpinPoint series, especially the EcoGreen
> ("EG") series.  No joke: ~60% of the "problem reports" I deal with when
> it comes to "weird wonky problems" stem from this drive series.  I have
> no idea why, but they're a common pain point for me.

I know they are, especially the larger siblings from 1.5 TB up.

> Politely, your analysis of the drive ("looks sane to me") is an
> indicator of why SMART output needs to be interpreted by a person who is
> familiar with the information.  That drive *does not* look sane to me.
> :-)

14 CRC errors with a drive that moved through computers that got
modified over time, that does not run the whole day, and that was first
attached to a computer whose controller (VIA garbage) could only talk to
1.5 Gb/s ATA drives but not 3 Gb/s is not something I care about.

> Key points about these errors:
> 
[...]

> - These are conditions that short, long, select (LBA range scan), and
>   conveyance SMART tests would probably not detect.  Like I said: it
>   seems to be all over the board.

I agree that it is more likely to be a communications issue between
FreeBSD and the drive's logic, with all components, hard- and software
involved.

> Bernd Walter responded indicating that his experience indicated that the
> issue related to NCQ compatibility.  This would not surprise me.

Neither would it surprise me, but then Linux should suffer, too, as it
also uses NCQ.  FreeBSD can be booted either on bare metal, or
through a Linux-hosted VirtualBox, where it uses raw partition
access.  It should see the same 4k sector misalignment pattern - if we
go by your earlier assumption - or NCQ incompatibility because the
hardware is the same.  The board uses an AMD 750 south bridge.

> What doesn't help is that SpinPoint drives have a history of pretty
> awful firmware bugs, such as this one, which still blows my mind to this
> day:
> 
> http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
> 
> Your drive is using firmware version 1AG01118, but I can't easily find a
> newer firmware because of the whole Seagate/Samsung buyout (Seagate
> buying out Samsung's MHDD division).

According to the Samsung tools I could find before as well as after the
buyout this is the latest-and-greatest firmware; but I did not contact
their support to confirm that.

> - The "EG" series are known to park their heads excessively, and much to
>   my annoyance, do not track this behaviour in SMART (normally it's
>   tracked in attribute 193, which the drive lacks (probably
>   intentionally)).  This head-parking nonsense is known to cause
>   problems in certain situations, reported by the OS as timeouts and
>   I/O errors as the drive is trying to wake up and respond to the CDB.
>   There are many drives on the markets that do this now, and I
>   generally boycott them all (it's only useful for laptops).  I can
>   talk at length about that some other time, or you can find/read my
>   blog (I wrote an article about the WD30EFRX doing this -- at least
>   on WD drives you can inhibit the behaviour, while on Seagate you
>   can't).

Unlikely as stated above, unless the drive starts to park heads during
NCQ timeout error recovery.

> My suggestions to you at this point in time:
> 
> - Remove the sysctl and leave it at its default (30 seconds).  Or if
>   you really must adjust it, set it to 15.  YMMV with this.

I really must either use Linux or at least lower this sysctl, else the
system is unusable.

> - Replace the drive and/or choose another drive vendor.

I am inclined to do that, and I am very much inclined to buy high-grade
WD stuff again, either Velociraptor or the TLER Black series. (I don't
care for drives to retry for half an hour and entirely scratch the
surface where they attempt to read a b0rked block, even if not in a
RAID -- this is the 10^-15 errors...).

> My suggestions for FreeBSD at this time:
> 
> - Regardless of what the root cause of the above is, we really do need a
>   no-NCQ quirk, and we also need to print the quirks used (in a similar
>   fashion to how CPU features are shown) during boot.

That, and when entering the ATA driver's error handler with NCQ active,
dump the entire queue contents, or at least its pending requests, with
rescheduling.  We may need a loader tunable or sysctl to enable such
excessive debug logging.

Please check the stall episode timing in the logs - it would appear that
the actual slots timing out were filed well after one another; with NCQ
being put to good use, I'd expect slots to time out more or less at
the same time, but I see that the entire timeout elapses before the next
slot times out.  I find that suspicious.  Alexander may, however, expect
that from the way the driver reschedules failing requests, or how the
driver inserts a non-NCQ command as some sort of write barrier to
enforce an NCQ flush.  (Not sure if that's any good on these Samsung
drives; I'd like to disable it just to experiment.)
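
The timing check I describe above could be sketched roughly like this in
Python.  It takes the times (in seconds) at which slot timeouts were
logged and tells whether they are spaced about one timeout apart
("serial") or arrive together ("clustered").  The function name, the
tolerance, and the sample timestamps are hypothetical; they are not
taken from my logs or from any existing tool.

```python
def classify_timeouts(times_s, timeout_s, tolerance_s=1.0):
    """Classify logged slot timeout events by the gaps between them.

    times_s     -- timestamps of the timeout messages, in seconds
    timeout_s   -- configured slot timeout, in seconds
    tolerance_s -- allowed deviation (hypothetical default)

    Returns (label, gaps), where label is one of "serial", "clustered",
    "mixed", or "insufficient data".
    """
    ordered = sorted(times_s)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return "insufficient data", gaps
    # Each slot times out roughly one full timeout after the previous one.
    if all(abs(gap - timeout_s) <= tolerance_s for gap in gaps):
        return "serial", gaps
    # All slots time out at more or less the same moment.
    if all(gap <= tolerance_s for gap in gaps):
        return "clustered", gaps
    return "mixed", gaps


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    print(classify_timeouts([100.0, 105.1, 110.0, 115.2], timeout_s=5)[0])
    print(classify_timeouts([100.0, 100.1, 100.3], timeout_s=5)[0])
```

With commands queued in parallel I would expect "clustered"; what I
believe I am seeing corresponds to "serial".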

I can move FreeBSD to the SSD any time and leave the FreeBSD partitions
on the offending Samsung drive for experimentation; I just did not want
to change any of the setup, so that we would not lose any analysis chances.