From: Jeremy Chadwick
To: Matthias Andree
Date: Sat, 20 Apr 2013 14:29:58 -0700
Subject: Re: Any objections/comments on axing out old ATA stack?
Cc: Alexander Motin, freebsd-current@freebsd.org, freebsd-stable@freebsd.org
Message-ID: <20130420212957.GA19158@icarus.home.lan>
In-Reply-To: <515D3312.3010109@FreeBSD.org>

On Thu, Apr 04, 2013 at 10:00:18AM +0200, Matthias Andree wrote:
> On 04.04.2013 03:05, Jeremy Chadwick wrote:
>
> { snipping stuff I have no comment on.
> reference thread: }
> { http://lists.freebsd.org/pipermail/freebsd-stable/2013-April/073036.html }
>
> > One piece of evidence that refutes my theory is that if Windows and/or
> > Linux partition are something you boot into and use often, I would
> > imagine NCQ would be used in both of those environments and would suffer
> > from the same issue. Although Windows tends to hide all sorts of
> > transient errors from the user (sigh), Linux tends to be like FreeBSD
> > with regards to such issues (on the console anyway; you wouldn't see
> > such messages normally inside of X).
>
> Now, the FreeBSD slice is the only partition on that disk that would
> likely see concurrent write accesses (think "make -j8" on a quadcore
> computer) which is more prone to ferret out such alignment contention.
>
> The NTFS partition is aligned on a multi-MB boundary, so wouldn't hit
> the problem anyways.
>
> The Linux partition is in ext4 format for mostly sequential access to
> files usually in excess of 10 MB each.
>
> Linux's ext4 jumps through several hoops to end up with bulk writes,
> like extents, delayed allocations (to avoid fragmentation), reordering
> of data and metadata writes, serialized log writes and all that stuff,
> and it would appear I am permitting it to cache writes -- Linux uses
> write barriers to enforce proper ordering of journal/meta-data writes.
>
> It would be rather hard to hit ATA taskfile timeouts, the expected rate
> with which the drive needs to do a partial write is orders of magnitude
> lower.
>
> Any good "concurrent write" exercise tools for Unix that I could run on
> the Linux ext4 partition that you would propose?

The only tool I'm familiar with is bonnie++. But I don't think this
(partition alignment) is what matters now. Your smartctl output has
shed some light on your situation.

> >> - I am running with kern.cam.ada.default_timeout=5 which makes the
> >> computer recover faster
> >
> > I can definitely imagine cases where a drive using NCQ but doing writes
> > to a non-aligned partition could take longer than 5 seconds to respond
> > to an ATA CDB (this is different than a SATA or AHCI layer timeout). I am
> > not telling you "change this back to 30", but it might not be helping
> > your situation at all given my above theory.
>
> My feeling is that the stalls are mostly from the error handler and the
> overall time the drive is "frozen" gets shorter. If it had not _felt_
> faster, I'd not have left that in sysctl.conf in the first place.

Your understanding of what that sysctl does is wrong, or I'm
misunderstanding what you're saying (very possible!).

How I interpret what you're saying: that the sysctl somehow "decreases
stall times" during I/O operations that fail. This is incorrect.

What that sysctl does is define the number of seconds that transpire
***before*** the CAM layer says "Okay, I didn't get a response to the
ATA CDB I sent the disk", and then re-submits the same CDB to the disk.

Rephrased: in the case of a disk stalling on an I/O request, you will
experience the effects of that stall no matter what that sysctl is set
to. A lower value in that sysctl will result in CAM spitting out
nasties on the console sooner and hitting the CDB retry submission
scenario sooner, which will go smoothly if the drive is awake/responsive
by that time. That's all it does.

Thus with a value of 5, a timeout message means the device/drive did not
respond to a CDB within 5 seconds, and with a value of 30 it means the
device/drive did not respond to a CDB within 30 seconds. Regardless,
those lengths of time are VERY long for an I/O operation on a
mechanical HDD.

When you get to the bottom of my Email, you'll understand why I screamed
at you about adjusting that sysctl.

> > Finally: could you please provide output from "smartctl -x /dev/ada1"?
> > I would like to rule out any possibility of your drive having some other
> > kind of issue that might cause it to go catatonic. Thanks.
>
> I have fetched the data with Linux this time (should not make a
> difference as it's all drive internal data, not host OS stuff).
>
> Looks sane to me, .
> I'll be happy to refetch this data with a more current smartctl version
> under FreeBSD if required.

Oh look, it's the Samsung SpinPoint series, specifically the EcoGreen
("EG") series. No joke: ~60% of the "problem reports" I deal with when
it comes to "weird wonky problems" stem from this drive series. I have
no idea why, but they're a common pain point for me.

First, about the shown sector size: smartmontools 5.41 was the first
release to show the sector sizes per ATA IDENTIFY. I assume they got
this right from the get-go. So as of this moment I'm going to assume
that this drive really is a 512-byte sector drive.

Politely, your analysis of the drive ("looks sane to me") is an
indicator of why SMART output needs to be interpreted by a person who
is familiar with the information. That drive *does not* look sane to
me. :-)

The first thing that comes to my attention is attribute 199, indicating
that the drive has experienced a total of 14 CRC errors during its
lifetime (10779 hours as of that moment). Usually this attribute is
zeroed at the factory (other attributes are often not).

Just yesterday I wrote a very long/detailed analysis about what this
attribute means, so I'll just link you to that post. Please focus on
just the part about CRC errors:

http://www.dslreports.com/forum/r28219261-

The next thing I see are 14 errors in your SMART error log. It's worth
noting that this number matches the CRC error count above (though
depending on drive firmware the two counters may not be directly
related).

Your SMART error log consists of entries indicating the drive itself
sent back error conditions to the controller/OS (which FreeBSD or Linux
would show on the console).
The timestamps of these events are based on power-on hour count, so the
most recent event was at 7747 hours, but there are others going back
all the way to 6528 hours. Sadly, the SMART error log is very small
(2 sectors / 1024 bytes), so only the last 8 errors can be shown.

Key points about these errors:

- The LBAs being accessed vary and are all over the board, indicating
  that it's very unlikely this anomaly is being caused by physical
  defects on the platters (the drive also shows no remapped LBAs or
  pending/suspect LBAs, which further supports that theory),

- The ATA commands which lead up to the error also vary. Many are for
  write requests, and from some entries I can see that the OS was doing
  NCQ writes (WRITE FPDMA QUEUED) and then suddenly decided to do a
  classic 28-bit LBA write (WRITE DMA). I'm not sure why an OS would do
  this (there's nothing optimal about it) unless there were conditions
  occurring where the OS/ATA driver said "this NCQ write isn't working
  (timeout, etc.), let me retry with a classic 28-bit LBA write". There
  is one entry (the last) which shows a similar situation happening but
  with NCQ reads.

- These are conditions that short, long, select (LBA range scan), and
  conveyance SMART tests would probably not detect.

Like I said: it seems to be all over the board.

This is not the first time I have seen this behaviour with SpinPoint
drives.

Bernd Walter responded that, in his experience, the issue was related
to NCQ compatibility. This would not surprise me. NCQ
incompatibilities have happened in the past; the most notable (to me)
was between Maxtor drives and nVidia SATA controllers. Both companies
blamed the other, yet both came out with "fixes" (Maxtor with a firmware
update, nVidia with a driver update). Neither company stated anything
concrete/useful publicly (oh America, so stock-focused you are).
My personal opinion is that the bug was in Maxtor's firmware, and nVidia
ceased use of NCQ requests to drives matching specific model numbers
(similar to what we do in FreeBSD, re: 4KB quirks).

What doesn't help is that SpinPoint drives have a history of pretty
awful firmware bugs, such as this one, which still blows my mind to this
day:

http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

Your drive is using firmware version 1AG01118, but I can't easily find
a newer firmware because of the whole Seagate/Samsung buyout (Seagate
buying out Samsung's MHDD division).

Because of the "random" nature of this issue, my opinion is that what
you're experiencing is caused by one of the following:

- The "EG" series are known to park their heads excessively, and much
  to my annoyance, do not track this behaviour in SMART (normally it's
  tracked in attribute 193, which the drive lacks (probably
  intentionally)). This head-parking nonsense is known to cause
  problems in certain situations, reported by the OS as timeouts and
  I/O errors as the drive is trying to wake up and respond to the CDB.
  There are many drives on the market that do this now, and I generally
  boycott them all (it's only useful for laptops). I can talk at length
  about that some other time, or you can find/read my blog (I wrote an
  article about the WD30EFRX doing this -- at least on WD drives you can
  inhibit the behaviour, while on Seagate you can't).

  I noticed that SMART attribute 3 on your drive indicates it takes
  roughly 6.2 seconds to spin up. This may change over time as well
  (often getting worse as the drive gets older (spindle motors do wear
  down over time)). Now take into consideration the sysctl you changed,
  and what I said earlier about me knowing some conditions where a drive
  may take >5 seconds to handle certain I/O ops.

- NCQ bugs in the drive's firmware.
  You can try to talk to Samsung about this, but you'll probably get
  nowhere due to how deep within companies actual engineers live.

My suggestions to you at this point in time:

- Remove the sysctl and leave it at its default (30 seconds). Or if you
  really must adjust it, set it to 15. YMMV with this.

- Replace the drive and/or choose another drive vendor.

My suggestions for FreeBSD at this time:

- Regardless of what the root cause of the above is, we really do need
  a no-NCQ quirk, and we also need to print the quirks used (in a
  similar fashion to how CPU features are shown) during boot.

I can try to write the code for this, but I am going to need help.
Kernel land is not something I'm generally good at.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
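
To make the kern.cam.ada.default_timeout point above concrete, here is a
toy model in Python. It is only a sketch and not CAM code: the function
name simulate_stall and its parameters are made up for illustration, and
it assumes unlimited retries and a drive that simply answers once it
wakes up. It shows that the time you wait is set by the drive's stall,
while the timeout only decides how many timeouts get logged on the way.

```python
def simulate_stall(stall_s, timeout_s):
    """Toy model of one command sent to a drive that is unresponsive for stall_s seconds.

    Assumption (a simplification, not real CAM behaviour): every full
    timeout_s window that passes without a response logs one timeout and the
    command is resubmitted; the command completes as soon as the drive
    becomes responsive again.

    Returns (timeouts_logged, completion_time_s).
    """
    if stall_s < 0 or timeout_s <= 0:
        raise ValueError("stall_s must be >= 0 and timeout_s must be > 0")
    timeouts_logged = 0
    elapsed = 0.0
    # Count every timeout window that expires while the drive is still stalled.
    while elapsed + timeout_s <= stall_s:
        elapsed += timeout_s
        timeouts_logged += 1
    # The caller waits out the whole stall no matter how small timeout_s is.
    return timeouts_logged, stall_s


if __name__ == "__main__":
    stall = 6.2  # roughly the spin-up time from SMART attribute 3 mentioned above
    for timeout in (5, 15, 30):
        logged, done = simulate_stall(stall, timeout)
        print(f"timeout={timeout}s timeouts_logged={logged} completes_at={done}s")
```

Under these assumptions a 6.2 second stall trips a 5 second timeout once
but never trips a 15 or 30 second one, and the I/O finishes at the same
moment in all three cases.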