From owner-freebsd-stable@FreeBSD.ORG Wed Aug 10 23:47:01 2005
Return-Path:
X-Original-To: freebsd-stable@freebsd.org
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
    by hub.freebsd.org (Postfix) with ESMTP id B8B9116A420
    for ; Wed, 10 Aug 2005 23:47:01 +0000 (GMT)
    (envelope-from karl@FS.denninger.net)
Received: from FS.denninger.net (wsip-68-15-213-52.at.at.cox.net [68.15.213.52])
    by mx1.FreeBSD.org (Postfix) with ESMTP id 7921B43D48
    for ; Wed, 10 Aug 2005 23:47:00 +0000 (GMT)
    (envelope-from karl@FS.denninger.net)
Received: from fs.denninger.net (localhost [127.0.0.1])
    by FS.denninger.net (8.13.3/8.13.1) with SMTP id j7ANkxC5020548
    for ; Wed, 10 Aug 2005 18:46:59 -0500 (CDT)
    (envelope-from karl@FS.denninger.net)
Received: from fs.denninger.net [127.0.0.1] by Spamblock-sys (LOCAL);
    Wed Aug 10 18:46:59 2005
Received: (from karl@localhost)
    by FS.denninger.net (8.13.3/8.13.1/Submit) id j7ANkxo4020546
    for freebsd-stable@freebsd.org; Wed, 10 Aug 2005 18:46:59 -0500 (CDT)
    (envelope-from karl)
Date: Wed, 10 Aug 2005 18:46:59 -0500
From: Karl Denninger
To: freebsd-stable@freebsd.org
Message-ID: <20050810234659.GA19768@FS.denninger.net>
Mail-Followup-To: freebsd-stable@freebsd.org
References: <20050810023111.GA2913@FS.denninger.net>
    <20050810024618.GA8198@drjekyll.mkbuelow.net>
    <6.2.1.2.0.20050810081251.05298ff0@64.7.153.2>
    <20050810133159.GA10150@FS.denninger.net>
    <6.2.1.2.0.20050810094204.06c46098@64.7.153.2>
    <20050810144148.GB10150@FS.denninger.net>
    <790a9fff0508100844a7e5435@mail.gmail.com>
    <4A1BF8DF-EC50-4067-A69B-84D9BE5B22C7@FreeBSD.ORG>
    <20050810205101.GA17483@FS.denninger.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.4.2.1i
Organization: Karl's Sushi and Packet Smashers
X-Die-Spammers: Spammers cheerfully broiled for supper and served with ketchup!
Subject: Re: ad10: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=11441599
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
X-List-Received-Date: Wed, 10 Aug 2005 23:47:01 -0000

On Thu, Aug 11, 2005 at 12:46:04AM +0200, Søren Schmidt wrote:
>
> On 10/08/2005, at 22:51, Karl Denninger wrote:
> >
> >This is the subject of the PR I filed back in February.
> >
> >Again, if you want either a controller shipped to you OR access to a
> >development machine (e.g. ssh in and play) which has the suspect
> >configuration on it, the latter of which is probably the best option (since
> >making it fail is simple) I'm willing to provide either - my only caveat is
> >that if I send hardware I want it back when you're done, and I believe its
> >reasonable to expect that 6.0 will get HELD in its release cycle until this
> >is resolved.
>
> I have plenty of the sii3112's around, so thats not needed, however
> I've not managed to get ahold of a machine in which it fails reliably
> with ATA as is in 6.0.

I have two which reliably fail if you put TWO disks on them in a gmirror
config within minutes of starting a "make buildworld".

With one disk it takes a bit longer and more effort, but can still be
forced to fail. It appears to require a mix of read and write operations
and a fairly heavy - but not horrifically so - I/O load to make it blow up.

All reads or all writes do NOT fail. For example, you can do a gmirror
rebuild and it will succeed. That's all writes (to the new disks) until
complete. Seconds to minutes after the rebuilds complete, if the system is
under heavy random I/O load, it will fail.

From this and other tests I've concluded that a MIX of read and write
operations is required, and the total load must be substantial.
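A load with that shape (concurrent random reads and writes against the same disk) can be approximated with a small script. What follows is only a minimal sketch under stated assumptions, not the workload from the report, which was "make buildworld" on a gmirror pair. The function names, the 64 KiB block size, and the worker count are all hypothetical choices for illustration. It also assumes a Unix-like system, since it relies on os.pread and os.pwrite.

```python
import os
import random
import tempfile
import threading

BLOCK = 64 * 1024  # bytes per request; an illustrative value, not from the report


def make_scratch_file(directory, file_size):
    """Create a scratch file of file_size bytes in directory and return its path."""
    fd, path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.truncate(file_size)  # extends the (possibly sparse) file with zeros
    return path


def mixed_io_load(path, file_size, ops, write_ratio=0.5, seed=0):
    """Issue ops block-sized requests at random offsets, mixing reads and writes."""
    rng = random.Random(seed)
    blocks = file_size // BLOCK
    payload = os.urandom(BLOCK)
    counts = {"read": 0, "write": 0}
    fd = os.open(path, os.O_RDWR)
    try:
        for _ in range(ops):
            offset = rng.randrange(blocks) * BLOCK
            if rng.random() < write_ratio:
                os.pwrite(fd, payload, offset)
                counts["write"] += 1
            else:
                os.pread(fd, BLOCK, offset)
                counts["read"] += 1
        os.fsync(fd)  # push buffered writes toward the device
    finally:
        os.close(fd)
    return counts


def run_workers(path, file_size, ops_per_worker, workers=4, write_ratio=0.5):
    """Run several mixed_io_load loops concurrently against the same file."""
    results = [None] * workers

    def work(i):
        results[i] = mixed_io_load(path, file_size, ops_per_worker,
                                   write_ratio, seed=i)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

To use it, create the scratch file on a filesystem that lives on the suspect controller, call run_workers on it, and watch the console for the ICRC warnings. Setting write_ratio to 0.0 or 1.0 gives the all-read and all-write cases for comparison. Note that the buffer cache will absorb most reads unless the file is considerably larger than RAM, so a small scratch file mainly exercises the write path.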
Either reads alone or writes alone do not appear to provoke it, even with
100% disk utilization.

> >The latter offer (ssh access) has been on the table for several months. The
> >former I just put on the table as I threw up my hands and bought a 3ware
> >card - which means I now have TWO of the suspect cards and need only one
> >for my own testing (in the sandbox)
> >
> >I'm willing to go WELL out of my way to make it possible for this to get
> >fixed, since there appears to be an issue with access to hardware that
> >breaks reliably. However, I, and others, would like to know that we're
> >going to see the problem get resolved.
>
> I've already gone WAY out of my way to try to support the sii3112,
> and I'm not inclined to waste more of my precious spare time on it.
> However, if it really is that important to enough people to try to
> workaround the silicon bugs (which very likely isn't possible), get
> together and get me failing HW on my desk and time to work on it.

Ok, then do the RIGHT THING and document that the SiI chips are declared
BROKEN by FreeBSD and likely to cause people trouble - including
irrevocable data corruption.

This would have saved me COUNTLESS hours when I first ran into this issue.
Indeed, it was not until someone else started posting excerpts from commit
logs (months after I filed the PR originally!) that I was aware FreeBSD
developers considered these chipsets "damaged goods". Where is fair warning
in the hardware compatibility guide?

Second, your requirement for hardware simply can't be met. It is not
possible for anyone to manufacture or deliver time. Is it thus necessary
for us "mere users" to consider this an issue that will simply not be
addressed? If so, then just say so up front.

> >Again - this is hardware that is STABLE and works under 4.x - in the case of
> >my specific configuration I ran under 4.x for over a year without a single
> >incident.
> >With 5.4 and 6.0-BETA I can kill it inside of 2 minutes with
> >nothing more complicated than a "make -j4 buildworld".
>
> First. you cannot by any degree of the word call the sii3112 for
> STABLE hardware, its broken beyond repair or workarounds, and even
> the supplier acknowledges that fact.

Well then how about if FreeBSD officially DECLARES this hardware to be
broken beyond repair and workaround, and simply says "if this doesn't work
for you, don't bitch or complain, because we have nothing further we can do
about it"?

That is acceptable, although I bet it costs ya a fair number of users,
particularly in the small server and workstation markets. Of course since
it's not "money lost", that may be perfectly OK to the FreeBSD team.

It definitely will change MY focus as a developer of software often run on
small office and home network machines though. It HAS TO, Soren.

This isn't a matter of me not wanting to be a FreeBSD evangelist - but if
I try to tell people that half of the machines out there that they might
run FreeBSD on are likely to fail, and if they do my only recommendation is
"sorry, I can't do anything about it other than sell you this hardware",
the obvious next reply is that they will want the software to be made
available on an operating system that DOESN'T blow up like this. Linux
ends up being something I have to support of necessity down that road......
(a thing I've studiously avoided now for five years, by the way.)

I have a 3ware card in my production machine now and the "allegedly broken"
disks are magically just fine. Guess the disks are fine, eh?

Of course I lost the functionality that I thought I was getting with the
newer ATA code anyway, since the 3ware software doesn't support hot plug,
and I also lost access to the disk statistics and self-test capabilities
that smartmontools has, since 3ware's board doesn't pass that through
cleanly either.
But all this begs the question - why did it work on 4.x, and how come the
same timing constraints and code paths that worked on 4.x can't be / weren't
incorporated into what's there now?

> Second, you cannot possibly have used gmirror (as in the PR) on 4.x
> so what was the config back then ?

I didn't NEED gmirror back then.

Attempting to use these disks on a SiI controller WITHOUT gmirror in 5.4 or
even 6.0 is asking to have to reload the machine, as the errors cause
irrevocable data corruption. I'm not about to subject myself to having to
reload a machine a few hundred times while troubleshooting it, and I
suspect you know that is a completely unreasonable request.

Gmirror was added to my config in an attempt to stop the crashes during
testing - with at least one disk in the mirror on the ICH5 adapter the
system (and data) survives. It turns out that on 5.x this is much more
"reasonable" to use than vinum, which was severely broken in 5.x (may be
fixed now as "gvinum", I didn't give it another crack after pulling my hair
out for quite a long time with THAT one.)

I assure you that the load profiles that generate BOOMs on 5.4 and 6.0-BETA
do NOT under 4.x with the IDENTICAL hardware in use. Over a year of heavy
production use of 4.x with ZERO trouble is my evidence for this.

> Third, please get gmirror out of the loop (use atacontrol to create a
> mirror if need be) and let me know if that changes anything.

Uh, if the abstraction done by GEOM is hardware-independent, and the error
comes from the DRIVER, how can GEOM be involved?

GEOM (gmirror in this case) prevents me from having to reload the machine
every time it blows up due to data corruption that cannot be fixed. Never
mind that others are reporting irrevocable data loss and crashes - they
aren't mirrored..... I've managed to keep my data intact....

"atacontrol" doesn't help me as there is no rebuild mechanism available for
"garden variety" controllers (at least the last time I tried it that did
nothing.)
So you can build the array but after the first crash you had no way to
recover. That's only marginally better than having the crash wipe the
sidewalk with the data on your drive, in terms of troubleshooting effort.

> Forth, another thing to try is fumbling with BIOS settings, some
> setups has been reported to start working when PCI timings is changed
> YMMV..
>
> - Søren

I can play with this.... but if the hardware is the cause and requires
tweaking timing in the PCI BIOS config, how come 4.x works without any
tweaking on the same hardware?

In short, what's changed in the DRIVER timing that provokes this sort of
thing, and does it NEED to have changed?

Again, I can easily set up ssh access to a machine that has problems with
this, and the "BOOM"s are VERY repeatable.

From the other postings here, I am by no means an isolated user with an
isolated problem - the issue is fairly widespread. I suspect, but of
course cannot prove, that if you find the issue with my machine, you will
likely fix a lot of other people's issues with similar problems......

I could be wrong, but I bet not......

In any event the ATA code changes have hurt a LOT of people, Soren, and led
to a huge amount of wasted time. If it was known that the SiI chipsets
simply were never going to get full support (because they are considered
"unsupportable") then it is only right for the development team to DOCUMENT
THIS rather than letting people find out for themselves the hard way,
pulling their hair out looking for phantom bad disk drives and phantom
problems with cables - neither of which has anything to do with it.

If there is going to be no path out of this mess then just say so and we'll
realign our expectations of where FreeBSD fits in terms of what
environments it is reasonable to consider it for.

--
--
Karl Denninger (karl@denninger.net) Internet Consultant & Kids Rights Activist
http://www.denninger.net        My home on the net - links to everything I do!
http://scubaforum.org           Your UNCENSORED place to talk about DIVING!
http://genesis3.blogspot.com    Musings Of A Sentient Mind