Date: Fri, 14 Aug 2009 06:08:54 -0400
From: Michael Powell <nightrecon@hotmail.com>
To: freebsd-questions@freebsd.org
Subject: Re: boot sector f*ed
Message-ID: <h63d34$rqp$1@ger.gmane.org>
References: <20090811173211.6FE4D106567B@hub.freebsd.org>
 <20090812193008.F19821@sola.nimnet.asn.au>
 <4A82A8D9.30406@videotron.ca>
 <20090812172704.GA27066@slackbox.xs4all.nl>
 <4A831DF7.9090506@videotron.ca>
 <20090812232810.GA37833@slackbox.xs4all.nl>
 <4A841AC2.1050809@videotron.ca>
 <20090814012551.H19821@sola.nimnet.asn.au>

Ian Smith wrote:
[snip]
>
> Smells like flakey hardware .. intermittent, inexplicable glitches. It
> might survive hours on one workload, minutes on another, no sense to it?
>
> > All that I am seeing is that there is either a problem with the bios
> > (which I even reinstalled and that changed nothing in the functioning)
> > or something is going on with the OS.
>
> After you've thoroughly proven the hardware is AOK under sustained and
> varied pressure, then you can suspect software issues - which tend to be
> far more consistent and repeatable - but if the hardware's acting flakey
> then you likely won't see any consistency in software issues, which does
> seem to concur with your descriptions to date.
>

In my experience, hardware problems often show little pattern as to where and when, in the usage of the machine, they cause the box to flake. A part that malfunctions all the time is relatively easy to find. The intermittent is the bane of all troubleshooting. I hate the intermittent more than I hate anything.

One pattern an intermittent will show is that, as the bad part gets worse, the period between flakes gets shorter, and ultimately at some point the part dies completely. Initially the period can be quite long, so proper troubleshooting is difficult: you can't troubleshoot during the 'in between' when it's not malfunctioning.

I also have an 80/20 rule about hardware as to whether it is a hot or a cold failure. The 80% part is that most hardware problems occur when very dense VLSI chips heat up, so a machine may not show any problem until it's been powered up for a while. The other 20% is the cold start. Turn the box on and there is immediately some kind of problem early in the course of booting. Leave it powered on, walk away for 20 minutes to get a coffee, reset it after it's had a chance to warm up, and now it works fine the rest of the day. These patterns are typical of hardware trouble.

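To make the two hardware patterns above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names, the hand-kept log of power-on and failure times, and the 20-minute warm-up threshold (borrowed from the coffee-break example, not a measured figure). It only labels a failure as 'cold' or 'hot' and checks whether the gaps between failures keep getting shorter; it proves nothing about the hardware by itself.

```python
from datetime import datetime


def classify(power_on, failed_at, warmup_minutes=20):
    """Label a failure 'cold' if it hit within warmup_minutes of power-on, else 'hot'.

    The 20-minute default is an assumed threshold, not a rule.
    """
    minutes_up = (failed_at - power_on).total_seconds() / 60
    return "cold" if minutes_up < warmup_minutes else "hot"


def gaps_shrinking(failure_times):
    """Return True if every gap between consecutive failures is shorter than the previous gap.

    Needs at least three failures (two gaps) before it will say anything.
    """
    gaps = [(b - a).total_seconds() for a, b in zip(failure_times, failure_times[1:])]
    return len(gaps) >= 2 and all(later < earlier for earlier, later in zip(gaps, gaps[1:]))


if __name__ == "__main__":
    # Hypothetical, made-up log entries: (power-on time, failure time).
    log = [
        (datetime(2009, 8, 1, 9, 0), datetime(2009, 8, 1, 13, 30)),
        (datetime(2009, 8, 11, 9, 0), datetime(2009, 8, 11, 11, 0)),
        (datetime(2009, 8, 15, 9, 0), datetime(2009, 8, 15, 10, 15)),
    ]
    for power_on, failed_at in log:
        print(failed_at, classify(power_on, failed_at))
    print("gaps shrinking:", gaps_shrinking([failed_at for _, failed_at in log]))
```

Keeping even a rough log like this is mostly useful because an intermittent gives you so few data points; a shrinking gap or a consistent hot/cold label is a hint toward hardware, not a diagnosis.
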
A software error, on the other hand, most of the time shows itself as a well-defined, repeatable sequence of steps that causes the problem every time the sequence is executed. It can usually be reproduced by others running the same, or a similar enough, platform by executing the same sequence. This can get quite sticky, as even the BIOS code is software! Buggy BIOS code reacting badly to the compiled boot loader binary, even though probably quite rare, is not totally outside the realm of possibility.

Somewhere very near the root of the geometric logic tree of troubleshooting you need to be able to drive a wedge between hardware and software in a divide-and-conquer kind of way. Making arbitrary assumptions early on as to which side is the problem will blind the troubleshooter to avenues of 'hypothesize this and test that'. Assuming without proof that the hardware is 100% OK, so it must be a software problem, is a mistake, and vice versa.

And it might be as simple as installing another OS, such as a Linux distro or Windows, on the box. If it is truly a hardware problem it may continue to malfunction and cause trouble no matter what the choice of OS. Or it may not, as hardware design flaws are sometimes compensated for with workarounds in drivers, thus hiding the flaw. It's the old 'have a <insert brand name> box with xyz hardware' with a known problem, where the fix is to download and install <insert brand name> driver revision such and such from the OEM. Since these kinds of things are not generally propagated far and wide, the developers of an OS such as FreeBSD may not be privy to such bad hardware details.

Sometimes the developers do incorporate hacks for hardware. If you can accurately identify such a situation, the most likely way to get it fixed for the long run is to file a proper PR. If it is done well enough and catches the eye of a dev who is interested and actually possesses the piece of hardware, a workaround may get coded and become a part of FreeBSD.

Just a lot of generalizations here. As always, there is the YMMV clause. :-)

[snip]

-Mike