Date: Tue, 26 Sep 2000 01:27:46 -0700
From: "Crist J. Clark" <cjclark@reflexnet.net>
To: "Jason C. Wells" <jcwells@nwlink.com>
Cc: cjclark@alum.mit.edu, Haikal Saadh <wyldephyre2@yahoo.com>, chat@FreeBSD.ORG
Subject: Re: So what do (unix) sysadmins do anyway?
Message-ID: <20000926012746.Q59015@149.211.6.64.reflexcom.com>
In-Reply-To: <Pine.SOL.3.96.1000925093125.2335A-100000@utah>; from jcwells@nwlink.com on Mon, Sep 25, 2000 at 10:11:15AM -0700
References: <20000924224054.H59015@149.211.6.64.reflexcom.com> <Pine.SOL.3.96.1000925093125.2335A-100000@utah>
On Mon, Sep 25, 2000 at 10:11:15AM -0700, Jason C. Wells wrote:
> On Sun, 24 Sep 2000, Crist J. Clark wrote:
>
> > > Coming from the environment that I do, I state that there is no
> > > such thing as a system failure.
> >
> > Huh. NOT! The admins at our office spend a _lot_ of time fixing
> > systems that die from hardware failures. Sure, you have backups (data
>
> Yes, things break. I agree with everything you have said except the
> "NOT" part. Allow me to explain my position. My position is: We must
> hold _people_ culpable for systems failures and not fall into the trap
> that the system is to blame. (I use system here as an abstraction.)
>
> Did the hardware fail due to a design flaw? Did the hardware fail due
> to some error on the installer's part? Did the software have a bug?
> Does the hardware have an advertised mean time between failures but
> wasn't replaced because an organization didn't implement a scheduled
> obsolescence plan? There is, or could be, (should even?) a human
> factor behind each one of these questions.
>
> The environment that I came from consistently placed the burden of
> success or failure on humans. We always discussed incidents. We
> examined lessons learned. Almost invariably, the conclusion one ended
> up drawing was that some person or persons made error(s) which led to
> an incident.
>
> Yes, spurious system failures occurred. After they did, though,
> efforts were made to ensure that it never happened again. This
> mentality made safe work of a dangerous business. All of the lessons
> that produced this mentality were written in blood.

My scholastic training is in chemical engineering. One of the things we
had to study was safety. Actually, one of the first things you learn is
that there is no such thing as absolute safety, only relative levels of
risk. If you think you can make _anything_ failure-free, you are
reaching for a fool's dream. It also makes life _more_ dangerous, since
in such an environment people end up putting effort into concealing
risk rather than making open efforts to minimize and quantify it.

It is also often not fruitful to try to assign blame for problems.
People are not prescient, and they cannot see every possible failure
mode and plan for it. Fear of blame can again lead to hiding the risks
rather than dealing with them in a more productive manner.
(Unfortunately, the legal establishment is also often
counter-productive here, the ol' Risk-Free Society fallacy.) There is a
difference between a mistake made in good faith and negligence.

Going back to some of your examples, modern computer hardware fails,
and often no good reason will be found (or the effort to determine the
cause is not worth the cost; replace, clean up, move on). The mean
failure time is just what the 'mean' implies: an average, a statistical
quantity. A particular item may fail in half the mean time, one quarter
of the mean time, or it may run without problems for ten times the
design life. You have to accept the risk with which you are comfortable
and live with it.
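Just to put rough numbers on that (a back-of-the-envelope illustration
of my own, assuming an exponential failure model and a made-up MTBF,
not anything off a vendor's data sheet):

    import math

    def prob_failure_by(hours, mtbf):
        """P(a unit dies before 'hours') for an exponential lifetime
        with mean 'mtbf'."""
        return 1.0 - math.exp(-hours / mtbf)

    mtbf = 300000.0  # hypothetical advertised MTBF, in hours
    for frac in (0.25, 0.5, 1.0):
        p = prob_failure_by(frac * mtbf, mtbf)
        print("P(failure before %.2f x MTBF) = %2.0f%%" % (frac, 100 * p))
    # Prints roughly 22%, 39%, and 63%. The mean says nothing about
    # whether the one card in _your_ box dies early.

Under that (simplified) model, more than a third of the units are
already dead before they reach half the advertised mean lifetime, and
that is nobody's fault.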
[snip]

> My point is that a human is ultimately responsible. (At a minimum we
> end up cleaning up the mess.) This must be the way it is. If we get to
> the point where the machine is the excuse, then why bother?

This is simply a matter of definition. For the above piece of hardware,
if the manufacturer gave you an accurate mean design life, but one
particular card fails in half the time, who gets this "ultimate
responsibility?"

In your model, it is whoever decided that using that item for its
design life was acceptable, and everyone else who then agreed with that
determination. Well, now, we had a failure. We accepted a level of
risk, and shucks, we lost the crap shoot. But we had accepted the risk;
the hardware may fail before the mean lifetime. No need to make
excuses. No need for blame. C'est la vie.

> Now I come back to our original poster's question. A system
> administrator is needed for all the reasons you described. A system
> administrator should also be making those needed value judgements to
> prevent system failure. I hope that we have both provided a variety of
> reasons that a system administrator is important. I hope we have
> answered the question, "What does a system administrator do anyway?"

A system administrator has finite resources, finite time, finite money,
finite knowledge, and even a finite set of tools (hardware or software)
to choose from. You always need to accept a level of risk, and given
the above constraints, you need to decide what level is acceptable
given the resources you have and make sure everyone is comfortable.
(And if you have anyone that says they want it risk-free... well, there
are some out there who just don't get it. When they go outside to smoke
a cigarette, talk some sense into them.) Then, when failure does come,
have your plans ready, but always assume there are failure modes that
you missed.

> OBTW. There is a double standard in this respect regarding computers.
> We do not accept the failure of the Corvair or the Audi TT or the
> Boeing 737. When the Tacoma Narrows falls we don't just say,
> "Sometimes bridges crash. Get over it."

But planes _do_ crash. Bridges _do_ collapse. And there will be more of
both in the future. Trying to build a plane with no risk of crashing or
a bridge that could never ever collapse is not the mindset of the
engineers really working on these things. Thanks to RMS Titanic, the
folk wisdom of this culture has come to realize there is no such beast
as an unsinkable ship. It still needs to sink in that the same goes for
airplanes, cars, nuclear power plants, medical treatments, computers,
and anything else you care to name.

> People accept computer failures as a matter of course.

That was not a good analogy. If that Boeing 737 failed because of a
computer error, you bet it would not be considered acceptable. Remember
the Ariane that failed a few years back due to bad flight software? The
computers on the space shuttle fail regularly, but they have backups.
When a desktop computer or a web server fails, there generally are not
lives in the balance. There can be money at stake, but if you lose
$2000 of productivity to fix a server during its lifetime, you are
better off living with that risk than shelling out an extra $5000 to
put fancier and redundant parts in, or $20000 to build a dedicated
backup. Yes, there are often zero- or near-zero-cost ways to reduce
risk, but only to a certain level, and those "zero cost" fixes
frequently have hidden costs (you're better off spending $10000 a year
fixing problems that arise because you don't have a guru-level sysadmin
if it's gonna cost you $30000 a year more to have said guru as opposed
to a capable-but-non-guru sysadmin).
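To make that trade-off concrete (the dollar figures are the
hypothetical ones above; the "expected loss" numbers for the fancier
options are just my guesses for the sake of illustration):

    # name:             (spent up front, expected loss to failures anyway)
    plans = {
        "live with it":      (0,     2000),
        "redundant parts":   (5000,   500),  # guess: redundancy avoids most loss
        "dedicated backup":  (20000,  100),  # guess: near-total protection
    }

    for name, (upfront, expected_loss) in sorted(plans.items(),
                                                 key=lambda kv: sum(kv[1])):
        print("%-18s total ~$%d" % (name, upfront + expected_loss))
    # live with it       total ~$2000
    # redundant parts    total ~$5500
    # dedicated backup   total ~$20100

Unless the downtime itself gets far more expensive (lives or serious
revenue on the line), living with the risk is the cheapest rational
choice.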
> It doesn't have to be that way. A human value judgement somewhere
> along the line leads to failure.

Yes, I am afraid it does have to be this way. It is undoubtedly
possible to greatly reduce the failures that we see in today's computer
systems, but there will always be failures.

The human value judgments should not be looked at as "leading to
failure," but rather one should make value judgments about what levels
of risk (rates of failure) are acceptable for the case at hand.
--
Crist J. Clark                          cjclark@alum.mit.edu


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message