From: Matthias Andree
Date: Mon, 22 Apr 2013 08:26:31 +0200
To: Jeremy Chadwick
Cc: Alexander Motin, freebsd-current@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: Any objections/comments on axing out old ATA stack?
Message-ID: <5174D817.9070405@FreeBSD.org>
In-Reply-To: <20130420212957.GA19158@icarus.home.lan>

On 20.04.2013 23:29, Jeremy Chadwick wrote:
>> My feeling is that the stalls are mostly from the error handler and the
>> overall time the drive is "frozen" gets shorter. If it had not _felt_
>> faster, I'd not have left that in sysctl.conf in the first place.
>
> Your understanding of what that sysctl does is wrong, or I'm
> misunderstanding what you're saying (very possible!).

What I am saying is a high-level view of the situation. If I leave the
default slot timeout set, whenever the computer gets into an episode of
stalls, it becomes unusable (all I/O is stalled, so anything that needs
disk I/O will hang) for so long that it is much faster to press the
reset button, reboot, force fsck, and retry. This usually entails
hand-holding and manually cleaning up debris, such as b0rked .o files
from a buildworld, or similar.

These stalls happen in the middle of the buildworld, under heavy I/O,
so I'd dispute that excessive head unloading or drive spindown is the
issue. The computer (and its fans in particular) is generally very
quiet, with no VGA board (just the fanless onboard Radeon HD 3300), so I
could hear re-spinups or parking heads. I don't hear anything like that.

I don't know how the rescheduling of commands that timed out works
overall.

> How I interpret what you're saying: that the sysctl somehow "decreases
> stall times" during I/O operations that fail. This is incorrect.

That may not be the intention of the sysctl, but it is the high-level
outcome.

> What that sysctl does is define the number of seconds that transpire
> ***before*** the CAM layer says "Okay, I didn't get a response to the
> ATA CDB I sent the disk", and then re-submits the same CDB to the disk.

The other question (to Alexander Motin) then is why I see the timeouts
for the related slots roughly $timeout seconds apart.

Alexander, is there any way we can make the kernel dump the entire set
of pending NCQ queue entries, including the submission timestamp or
timeout values, so that we can see how much workload is queued?

Note also that the CRC count has not increased since I put the smartctl
output online; it's still at 14 -- and I would have to see CRC errors
and their consequences in Linux or Windows, too.
Linux's smartd 5.41 never mailed about an increase of the CRC value, and
I told it not to mail temperature changes.

> Rephrased: in the case of a disk stalling on an I/O request, you will
> experience the effects of that stall no matter what that sysctl is set
> to. A lower value in that sysctl will result in CAM spitting out
> nasties on the console + hitting the CDB retry submission scenario
> sooner, which if the drive is awake/responsive by that time will go
> smoothly.
>
> That's all it does.

That's how you have explained it, and how I have understood it, on the
queue-slot (microscopic) level. At a larger scale, however, I do not
observe that the shorter timeout value makes these stall episodes happen
more often (as should be the consequence if spindown were the cause of
the stalls); only the recovery is faster.

> Thus a value of 5 indicates a device/drive did not respond to a CDB
> within 5 seconds, and a value of 30 indicates a device/drive did not
> respond to a CDB within 30 seconds. Regardless, those lengths of time
> are VERY long for an I/O operation on a mechanical HDD.

Indeed they are, and because /usr is on the offending drive, I lowered
the value to 5 s, which I still deem conservative. I know that an older
edition of the ATA standard permitted longer completion times for
flushing HDD-internal write caches to the platters (15 s IIRC).

> Oh look, it's the Samsung SpinPoint series, especially the EcoGreen
> ("EG") series. No joke: ~60% of the "problem reports" I deal with when
> it comes to "weird wonky problems" stem from this drive series. I have
> no idea why, but they're a common pain point for me.

I know they are, especially the larger siblings 1.5 G up.

> Politely, your analysis of the drive ("looks sane to me") is an
> indicator of why SMART output needs to be interpreted by a person who is
> familiar with the information. That drive *does not* look sane to me.
> :-)

14 CRC errors are not something I care about on a drive that moved
through computers that got modified over time, that does not run the
whole day, and that was first attached to a computer whose controller
(VIA garbage) could only talk to 1.5 Gb/s ATA drives but not 3 Gb/s.

> Key points about these errors:
> [...]
> - These are conditions that short, long, select (LBA range scan), and
>   conveyance SMART tests would probably not detect. Like I said: it
>   seems to be all over the board.

I agree that it is more likely to be a communications issue between
FreeBSD and the drive's logic, with all components, hardware and
software, involved.

> Bernd Walter responded indicating that his experience indicated that the
> issue related to NCQ compatibility. This would not surprise me.

Neither would it surprise me, but then Linux should suffer, too. It uses
NCQ as well. FreeBSD can be booted either on bare metal or inside a
Linux-hosted VirtualBox, in which case it uses raw partition access.
Because the hardware is the same, Linux should see the same 4k sector
misalignment pattern (if we go by your earlier assumption) or the same
NCQ incompatibility. The board uses an AMD 750 south bridge.

> What doesn't help is that SpinPoint drives have a history of pretty
> awful firmware bugs, such as this one, which still blows my mind to this
> day:
>
> http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
>
> Your drive is using firmware version 1AG01118, but I can't easily find a
> newer firmware because of the whole Seagate/Samsung buyout (Seagate
> buying out Samsung's MHDD division).

According to the Samsung tools I could find before as well as after the
buyout, this is the latest-and-greatest firmware; but I did not contact
their support to confirm that.
> - The "EG" series are known to park their heads excessively, and much to
>   my annoyance, do not track this behaviour in SMART (normally it's
>   tracked in attribute 193, which the drive lacks (probably
>   intentionally)). This head-parking nonsense is known to cause
>   problems in certain situations, reported by the OS as timeouts and
>   I/O errors as the drive is trying to wake up and respond to the CDB.
>   There are many drives on the markets that do this now, and I
>   generally boycott them all (it's only useful for laptops). I can
>   talk at length about that some other time, or you can find/read my
>   blog (I wrote an article about the WD30EFRX doing this -- at least
>   on WD drives you can inhibit the behaviour, while on Seagate you
>   can't).

Unlikely, as stated above, unless the drive starts to park its heads
during NCQ timeout error recovery.

> My suggestions to you at this point in time:
>
> - Remove the sysctl and leave it at its default (30 seconds). Or if
>   you really must adjust it, set it to 15. YMMV with this.

I really must either use Linux or at least lower this sysctl, else the
system is unusable.

> - Replace the drive and/or choose another drive vendor.

I am inclined to do that, and I am very much inclined to buy high-grade
WD stuff again, either Velociraptor or the TLER Black series. (I don't
care for drives that retry for half an hour and entirely scratch the
surface where they attempt to read a b0rked block, even if not in a
RAID -- this is the 10^-15 errors...).

> My suggestions for FreeBSD at this time:
>
> - Regardless of what the root cause of the above is, we really do need a
>   no-NCQ quirk, and we also need to print the quirks used (in a similar
>   fashion to how CPU features are shown) during boot.

That, and when entering the ATA driver's error handler with NCQ active,
we should dump the entire queue contents, or at least the pending
requests, along with the rescheduling. We may need a loader tunable or
sysctl to enable such excessive debug logging.
Please check the stall episode timing in the logs: it would appear that
the slots timing out were filed well after one another. With NCQ being
put to good use, I'd expect the slots to time out more or less at the
same time, but I see that the entire timeout elapses before the next
slot times out. I find that suspicious. Alexander may, however, expect
that from the way the driver reschedules failing requests, or from how
the driver inserts a non-NCQ command as some sort of write barrier to
enforce an NCQ flush. (Not sure if that's any good on these Samsung
drives; I'd like to disable it just to experiment.)

I can move FreeBSD to the SSD any time and leave the FreeBSD partitions
on the offending Samsung drive for experimentation; I just did not want
to change any of the setup, so that we would not lose any chances for
analysis.
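
To make the timing check above concrete, here is a minimal sketch in
Python of how one might test whether slot timeouts arrive roughly one
full timeout apart. It is an illustration under assumptions, not taken
from the actual logs: the timestamps are made up, the HH:MM:SS format is
assumed, and the helper names timeout_gaps and looks_serialized are
hypothetical.

```python
from datetime import datetime


def timeout_gaps(timestamps):
    """Return gaps in seconds between consecutive timeout events.

    Assumes HH:MM:SS timestamps from the same day (no midnight wrap).
    """
    times = sorted(datetime.strptime(t, "%H:%M:%S") for t in timestamps)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def looks_serialized(gaps, timeout_s, tolerance_s=1.0):
    """True if every gap is about one full timeout, i.e. the slots
    expire one after another rather than all at the same time."""
    return bool(gaps) and all(abs(g - timeout_s) <= tolerance_s for g in gaps)


# Hypothetical timestamps of four slot timeouts with a 5 s timeout set.
events = ["08:00:05", "08:00:10", "08:00:15", "08:00:20"]
gaps = timeout_gaps(events)
print(gaps, looks_serialized(gaps, timeout_s=5))
```

If the slots expired one after another like this, four of them would
account for 4 x 5 s = 20 s of stall at a 5 s timeout, versus 4 x 30 s =
120 s at the 30 s default, which would fit the observation that recovery
merely feels faster with the lower value.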