Date: Mon, 1 Dec 2008 12:11:50 -0800
From: Jo Rhett <jrhett@netconsonance.com>
To: Ken Smith <kensmith@cse.buffalo.edu>
Cc: freebsd-stable Stable <freebsd-stable@freebsd.org>
Subject: Re: Can I get a committer to mark this bug as blocking 6.4-RELEASE ?
Message-ID: <C6BDF7BE-52D2-41EB-BD9B-0371B0DD0962@netconsonance.com>
In-Reply-To: <1228159822.15856.45.camel@bauer.cse.buffalo.edu>
References: <A5A9A4D4-CD16-45FA-A2AC-62C4B5AE976D@netconsonance.com>
 <BEBF7B15-DECE-4872-9687-4AD4BE65DB05@netconsonance.com>
 <84E1EC10-5323-4A8C-AD60-31142621DB32@netconsonance.com>
 <200810271151.47366.jhb@freebsd.org>
 <C6DC3DB1-40FF-4896-81DB-EF37874428AF@netconsonance.com>
 <280616DD-A58F-4AE5-AB03-92C5F2C244EC@netconsonance.com>
 <1227733967.83059.1.camel@neo.cse.buffalo.edu>
 <EC872352-4A50-404E-A93E-DBA5FCAA1431@netconsonance.com>
 <1228159822.15856.45.camel@bauer.cse.buffalo.edu>

On Dec 1, 2008, at 11:30 AM, Ken Smith wrote:
> Both John and Xin Li have chimed in on the two threads I've seen that
> are related to this specific topic. John diagnosed it as a issue with
> the BIOS. That's what makes it a nebulous problem. When working on
> those sorts of things most people liken it to "Whack-a-mole".

Diagnosed without testing. John never asked for any more information than the page fault description from me. When I asked what else to test and offered to supply systems for testing he stopped responding.

Xin Li proposed a work-around that would have castrated the systems. It might work, but it wasn't a useful workaround so I deferred testing and focused on trying to get someone to address the real problem.

>> This is very big problem that will affect thousands of freebsd
>> servers.
>
> Its still not clear it will affect thousands of servers.

Um... Rackable. Rackable ships cabinets full of systems to people that run FreeBSD. They don't sell to home or small corporate users, period. Any problem that affects a standard Rackable build will by definition affect thousands of systems. (much like any standard Dell or HP server build)

> This all left me with a decision. My choices were to back out the BTX
> changes that were known to fix boot issues with certain motherboards and
> enabled booting from USB devices or leave things as they are.

Or do some more testing and determine the problem and fix it. I had a stack of systems demonstrating the problem. I could have shipped one to each FreeBSD developer you wanted to work on it. If you were willing to identify the affected source code and relevant gdb traps I would have happily worked on the source directly if that is what it took. I would test. I would supply console access and build systems. I would ship them to anyone who wanted one in their hot little hands. I would investigate the source code myself with a mere hour of "here's the relevant bits you need to consider" training.

You could have done *anything* that suited your needs for testing. Instead you did nothing.

> The
> motherboards that didn't boot with the older code had no work-around.
> The motherboards that did boot with the older code but not the newer
> code do have a work-around (use the old loader).

Not true. I tested this by installing the old loader, and it did not change the problem. As reported.

> Decisions like that
> suck, no matter which choice I make it's wrong. Holding the release
> until all bios issues get resolved isn't a viable option because of the
> "Whack-a-mole" thing mentioned above. Fix it for one and two break. It
> takes a lot of time/work to settle into what seems to work for the
> widest set of machines.

Break the boot loader for a very wide variety of systems rather than spend EVEN A SINGLE HOUR trying to diagnose the boot problem?

Ken, your diagnosis here would make sense if ANY diagnosis had been attempted. This could be a trivial problem. It could be solved with 5 minutes of actually looking at it. What happened here is that you proceeded WITHOUT EVEN TRYING.

> So you're saying John and Xin Li's responses (Xin Li's questions still
> un-answered) to you show a complete lack to even consider investigating
> it?

No actual diagnosis was done. I'm sorry, but if I pull my car up to my mechanic's garage and he makes a diagnosis of "no idea what's wrong" without even popping the hood, yeah that counts as "didn't even consider investigating".

Worse yet, I would happily have done all of the grunt work for the investigation. But I'm not going to start by reading the source tree and making guesses where to look. If someone had given me some useful tests to do, I would have done them.

> I know from past email threads your preference is for 6.X right now

Not my preference, my ability to justify the evaluation and testing costs based on the support available for a given release. 7.0 doesn't work on this hardware at all.
No, I haven't tested 7.1 because 6.4 was the easier testing target and I had thought that the security team was working on fixing the support model.

So now we have the brilliant strategy of a long-term support -REL that we will never be able to use. The same stupid stunt that gave us 6.1 which was unusable and 6.2 which worked great but expired at the same time as 6.1. And so forth. 6.5 will likely be short term support again, but the first release we can consider for deployment.

> but as a test point if you aren't totally fried over this whole thing it
> would still be useful to know for sure if the issue exists with 7.1 test
> builds. If yes it eliminates a variety of possibilities and helps focus
> on the exact problem.

I'm not burnt, but testing 7.1 has no meaningful relevance to my day job until we have a reasonable and working support mechanism.

And given that I really pulled out the stops to make sure we had hardware for testing 6.4 (I went and bought a whole stack of systems *JUST FOR THIS*) and filed PRs and followed up, and couldn't get much more than "it sounds like this" kind of response ... seriously, would you invest a lot of time testing a very unstable release under those conditions?

I mean jesus, 6.4 is supposed to be truly stable and yet you're willing to ship it not working with dozens of nearly identical reports of the same symptoms for both 6.4 and 7.1? Think seriously about what happened here, and how exactly I'm supposed to convince any executive of the logic of trying to test 7.1, when we're stuck on 6.3 until/if 6.5, which will be screwed for support? I mean seriously?

The problem BTW is *EXACTLY* why I proposed the revisions to the support policy I did. Now you're stuck supporting 6.4 for 2 years, and you won't want to release 6.5 because you'll end up supporting three 6.x releases at the same time. Which would suck. Which is exactly what my proposed change to the policy would have fixed.

FreeBSD has usually been a solid product on the more stable releases. It's really unfortunate that the release management is so willing to ignore the evidence which leads to major releases with serious flaws, and on top of that seems to take delight in marking the known flawed releases as the long support releases.

Brilliance. Just plain brilliant, top to bottom.

-- 
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source and other randomness