Date: Sun, 31 Mar 2013 15:00:15 -0700
From: Jeremy Chadwick <jdc@koitsu.org>
To: Scott Long <scottl@samsco.org>
Cc: Victor Balada Diaz <victor@bsdes.net>,
    Alexander Motin <mav@freebsd.org>,
    "freebsd-current@freebsd.org FreeBSD" <freebsd-current@freebsd.org>,
    "freebsd-stable@freebsd.org Stable" <freebsd-stable@freebsd.org>
Subject: Re: Any objections/comments on axing out old ATA stack?
Message-ID: <20130331220015.GA93163@icarus.home.lan>
In-Reply-To: <C699FE76-B456-49C7-8D3A-DD54F98DAFC1@samsco.org>
References: <51536306.5030907@FreeBSD.org>
    <20130331130409.GO3178@equilibrium.bsdes.net>
    <C699FE76-B456-49C7-8D3A-DD54F98DAFC1@samsco.org>

On Sun, Mar 31, 2013 at 03:02:09PM -0600, Scott Long wrote:
> On Mar 31, 2013, at 7:04 AM, Victor Balada Diaz <victor@bsdes.net> wrote:
> > On Wed, Mar 27, 2013 at 11:22:14PM +0200, Alexander Motin wrote:
> >> Hi.
> >>
> >> Since FreeBSD 9.0 we are successfully running on the new CAM-based ATA
> >> stack, using only some controller drivers of old ata(4) by having
> >> `options ATA_CAM` enabled in all kernels by default. I have a wish to
> >> drop non-ATA_CAM ata(4) code, unused since that time from the head
> >> branch to allow further ATA code cleanup.
> >>
> >> Does any one here still uses legacy ATA stack (kernel explicitly built
> >> without `options ATA_CAM`) for some reason, for example as workaround
> >> for some regression? Does anybody have good ideas why we should not drop
> >> it now?
> >
> > Hello,
> >
> > At my previous job we had troubles with NCQ on some controllers. It caused
> > failures and silent data corruption. As old ata code didn't use NCQ we just used
> > it.
> >
> > I reported some of the problems on 8.2[1] but the problem existed with 8.3.
> >
> > I no longer have access to those systems, so i don't know if the problem
> > still exists or have been fixed on newer versions.
>
> So what I hear you and Matthias saying, I believe, is that it should be easier to
> force disks to fall back to non-NCQ mode, and/or have a more responsive
> black-list for problematic controllers. Would this help the situation? It's hard to
> justify holding back overall forward progress because of some bad controllers;
> we do several Tbps off of AHCI controllers with NCQ enabled on FreeBSD 9.x,
> enough to make up a sizable percentage of the internet's traffic, and we see no
> problems. How can we move forward but also take care of you guys with
> problematic hardware?

I've read the referenced PR (157397), but there really isn't enough
technical troubleshooting/detail to determine what the root cause is.
That isn't the fault of the reporter either -- the reporter needs to be
told what information they need to provide / how to troubleshoot it.
Meaning: kernel folks who are in-the-know need to step up and help.

That PR is soon-to-be 2 years old and is missing tons of information
that, even as a non-kernel guy, *I* would find useful:

1. Output from:

   - camcontrol tags ada1 -v
   - camcontrol identify ada1
   - What sorts of filesystems are on ada1; if UFS, tunefs -p output
     would be greatly appreciated
   - If the timeouts happen during heavy I/O load, and if so, during
     what kinds of I/O load (reads or writes).

2. Does "camcontrol tags ada1 -N 31" help? I mention this because
   stated here:

   http://lists.freebsd.org/pipermail/freebsd-stable/2013-March/072985.html

   ...there are statements which imply decreasing queue length may
   solve the issue.

   What confuses me, however, is that the queue lengths on my own
   systems (with different models of disks, as well as an SSD) all
   have a limit of 32. I dug through the kernel source for a while
   but could not easily find where this number comes from. (I have
   very little familiarity with command queuing at the protocol level)

3. Why not find out why Linux (probably libata) has a 32 (or 31?) queue
   limit? They have commit logs, and there is the LKML where you could
   ask. While I understand reluctance to add something "just because
   Linux does it", it doesn't appear anyone's stepped up to the plate
   to ask them why; I pray this is not caused by anti-Linux sentiment.

4. The ada1 device in the PR is a Samsung Spinpoint EcoGreen F2 hard
   drive (1TB, 5400rpm, 32MB cache). Possibly the drive has firmware
   bugs relating to its NCQ implementation, or possibly it's going into
   some power-saving mode (it is an EcoGreen model).
   I've always been wary of the EcoGreen disks since reading about the
   F4 EcoGreen firmware fiasco (even though the same page says the F1
   and F3 EcoGreen had no issue):

   http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

5. We really need to have some way to print "active quirks" for
   devices, even if it's only at boot-up, e.g.:

   ada3: quirks=0x0003<4K,NO_NCQ>

   I'd be happy to write the code for this (basing it on how we do CPU
   flags), but as I've said in the past, kernel-land is scary to me.

6. The controller referenced is an ATI IXP700. I cannot tell you how
   many times on the mailing lists I've seen "weird issues" reported by
   people using that controller. I am in no way/shape/form saying the
   issue is with the controller or with AHCI compatibility (FreeBSD vs.
   ATI), because I have no proof. I just find it very unnerving that so
   many issues have been reported where that controller is involved,
   and often across all sorts of different device/disk models.

All that said:

I agree a loader tunable to inhibit command queueing would be nice.
sysctl would be even more convenient (easier for real-time testing) but
I don't know the implications of turning CQ off in the middle of any
pending I/O requests.

-- 
| Jeremy Chadwick                                   jdc@koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |
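
For illustration only, here is a minimal userland sketch of the "active quirks" line proposed in item 5. It is not kernel code and not the ada(4) driver's actual quirk table: the bit values, the names Q_4K and Q_NO_NCQ, and the function format_quirks() are hypothetical, chosen to match the example line in the message. In the kernel, the printf(9) "%b" format does this kind of bit decoding and would likely be the more natural tool.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical quirk bits, named after the example line in the message. */
#define Q_4K		0x0001u
#define Q_NO_NCQ	0x0002u

struct quirk_name {
	unsigned int	 bit;
	const char	*name;
};

/* Hypothetical table mapping each quirk bit to its printable name. */
static const struct quirk_name quirk_names[] = {
	{ Q_4K,		"4K" },
	{ Q_NO_NCQ,	"NO_NCQ" },
};

/* Append s to the NUL-terminated string in buf, truncating if needed. */
static void
append(char *buf, size_t len, const char *s)
{
	size_t used = strlen(buf);

	if (used < len - 1)
		snprintf(buf + used, len - used, "%s", s);
}

/*
 * Build a line such as "ada3: quirks=0x0003<4K,NO_NCQ>" in buf.
 * Bits without an entry in quirk_names show up only in the hex value.
 */
static void
format_quirks(char *buf, size_t len, const char *dev, unsigned int quirks)
{
	size_t i;
	int first = 1;

	if (len == 0)
		return;
	snprintf(buf, len, "%s: quirks=0x%04x<", dev, quirks);
	for (i = 0; i < sizeof(quirk_names) / sizeof(quirk_names[0]); i++) {
		if ((quirks & quirk_names[i].bit) == 0)
			continue;
		if (!first)
			append(buf, len, ",");
		append(buf, len, quirk_names[i].name);
		first = 0;
	}
	append(buf, len, ">");
}
```

Calling format_quirks(buf, sizeof(buf), "ada3", Q_4K | Q_NO_NCQ) with a 64-byte buffer fills buf with the example line from item 5; a device with no quirks set would print an empty pair of angle brackets.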