Date: Wed, 27 Sep 2000 15:31:53 -0400
From: 1mazda1 <1mazda1@phoenixdsl.com>
To: cjclark@alum.mit.edu, cjclark@reflexnet.net
Cc: "Jason C. Wells", Haikal Saadh, chat@FreeBSD.ORG
Subject: Re: So what do (unix) sysadmins do anyway?

cjclark@reflexnet.net, you are tapped! What, did you wake up on the
wrong side of the bed or what??? Wow.

1mazda1

"Crist J. Clark" wrote:
> On Mon, Sep 25, 2000 at 10:11:15AM -0700, Jason C. Wells wrote:
> > On Sun, 24 Sep 2000, Crist J. Clark wrote:
> >
> > > > Coming from the environment that I do, I state that there is no
> > > > such thing as a system failure.
> > >
> > > Huh. NOT! The admins at our office spend a _lot_ of time fixing
> > > systems that die from hardware failures. Sure, you have backups (data
> >
> > Yes, things break. I agree with everything you have said except the
> > "NOT" part. Allow me to explain my position. My position is: we must
> > hold _people_ culpable for system failures and not fall into the trap
> > of blaming the system. (I use "system" here as an abstraction.)
> >
> > Did the hardware fail due to a design flaw? Did the hardware fail due
> > to some error on the installer's part? Did the software have a bug?
> > Does the hardware have an advertised mean time between failures but
> > wasn't replaced because the organization didn't implement a scheduled
> > obsolescence plan? There is, or could be (should be, even?), a human
> > factor behind each one of these questions.
> >
> > The environment I came from consistently placed the burden of success
> > or failure on humans. We always discussed incidents. We examined
> > lessons learned. Almost invariably, the conclusion one ended up
> > drawing was that some person or persons made errors which led to an
> > incident.
> >
> > Yes, spurious system failures occurred. After they did, though,
> > efforts were made to ensure they never happened again. This mentality
> > made safe work of a dangerous business. All of the lessons that
> > produced this mentality were written in blood.
>
> My scholastic training is in chemical engineering. One of the things
> we had to study was safety. Actually, one of the first things you
> learn is that there is no such thing as absolute safety, only relative
> levels of risk. If you think you can make _anything_ failure-free, you
> are reaching for a fool's dream. It also makes life _more_ dangerous,
> since in such an environment people end up putting effort into
> concealing risk rather than openly minimizing and quantifying it.
>
> It is also often not fruitful to try to assign blame for problems.
> People are not prescient, and they cannot see every possible failure
> mode and plan for it. Fear of blame can again lead to hiding the risks
> rather than dealing with them in a more productive manner.
> (Unfortunately, the legal establishment is also often
> counterproductive; the ol' Risk-Free Society fallacy.) There is a
> difference between a mistake made in good faith and negligence.
>
> Going back to some of your examples, modern computer hardware fails,
> and often no good reason will be found (or the effort to determine the
> cause is not worth the cost; replace, clean up, move on). The mean
> failure time is just what the "mean" implies: an average, a
> statistical quantity. A particular item may fail in half the mean
> time, or a quarter of it, or it may run without problems for ten times
> the design life. You have to accept the risk with which you are
> comfortable and live with it.
>
> [snip]
>
> > My point is that a human is ultimately responsible. (At a minimum,
> > we end up cleaning up the mess.) This must be the way it is. If we
> > get to the point where the machine is the excuse, then why bother?
>
> This is simply a matter of definition. For the above piece of
> hardware, if the manufacturer gave you an accurate mean design life,
> but one particular card fails in half that time, who gets this
> "ultimate responsibility"? In your model, it is whoever decided that
> using that item for its design life was acceptable, and everyone else
> who then agreed with that determination.
>
> Well, now, we had a failure. We accepted a level of risk, and shucks,
> we lost the crap shoot. But we had accepted the risk that the hardware
> might fail before the mean lifetime. No need to make excuses. No need
> for blame. C'est la vie.
>
> > Now I come back to our original poster's question. A system
> > administrator is needed for all the reasons you described. A system
> > administrator should also be making those needed value judgments to
> > prevent system failure. I hope that we have both provided a variety
> > of reasons that a system administrator is important. I hope we have
> > answered the question, "What does a system administrator do anyway?"
>
> A system administrator has finite resources, finite time, finite
> money, finite knowledge, and even a finite set of tools (hardware or
> software) to choose from. You always need to accept a level of risk,
> and given the above constraints, you need to decide what level is
> acceptable given the resources you have and make sure everyone is
> comfortable. (And if you have anyone who says they want it
> risk-free... well, there are some out there who just don't get it.
> When they go outside to smoke a cigarette, talk some sense into
> them.) Then, when failure does come, have your plans ready, but always
> assume there are failure modes that you missed.
>
> > OBTW, there is a double standard in this respect regarding computers.
> > We do not accept the failure of the Corvair or the Audi TT or the
> > Boeing 737. When the Tacoma Narrows bridge falls, we don't just say,
> > "Sometimes bridges crash. Get over it."
>
> But planes _do_ crash. Bridges _do_ collapse. And there will be more
> of both in the future. Trying to build a plane with no risk of
> crashing, or a bridge that could never ever collapse, is not the
> mindset of the engineers really working on these things. Thanks to
> RMS Titanic, the folk wisdom of this culture has come to realize there
> is no such beast as an unsinkable ship. It still needs to sink in that
> the same goes for airplanes, cars, nuclear power plants, medical
> treatments, computers, and anything else you care to name.
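[To put a number on the mean-failure-time point above: a minimal
Python sketch. It assumes an exponential failure model and a made-up
50,000-hour MTBF; neither the distribution nor the figure comes from
the thread, and real hardware often follows a bathtub curve instead.

    # Under an exponential failure model, the probability that a part
    # fails before time t is 1 - exp(-t / MTBF).
    import math

    MTBF = 50_000.0  # hypothetical mean time between failures, hours

    for fraction in (0.25, 0.5, 1.0, 2.0):
        t = fraction * MTBF
        p_fail = 1.0 - math.exp(-t / MTBF)
        print(f"P(failure before {fraction:4} x MTBF) = {p_fail:.2f}")

Under that model, roughly 39% of parts die before half the MTBF, and
about 63% die before the MTBF itself, which is exactly the point: the
mean says nothing about whether *your* particular card fails early.]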
> > People accept computer failures as a matter of course.
>
> That was not a good analogy. If that Boeing 737 failed because of a
> computer error, you bet it would not be considered acceptable.
> Remember the Ariane that failed a few years back due to bad flight
> software? The computers on the space shuttle fail regularly, but they
> have backups. When a desktop computer or a web server fails, there
> generally are not lives in the balance. There can be money at stake,
> but if you lose $2000 of productivity to fix a server during its
> lifetime, you are better off living with that risk than shelling out
> an extra $5000 to put fancier and redundant parts in, or $20000 to
> build a dedicated backup. Yes, there are often zero- or near-zero-cost
> ways to reduce risk, but only to a certain level, and those "zero
> cost" fixes frequently have hidden costs (you're better off spending
> $10000 a year fixing problems that arise because you don't have a
> guru-level sysadmin if it's gonna cost you $30000 a year more to have
> said guru as opposed to a capable-but-non-guru sysadmin).
>
> > It doesn't have to be that way. A human value judgment somewhere
> > along the line leads to failure.
>
> Yes, I am afraid it does have to be this way. It is undoubtedly
> possible to greatly reduce the failures that we see in today's
> computer systems, but there will always be failures. The human value
> judgments should not be looked at as "leading to failure"; rather, one
> should make value judgments about what levels of risk (rates of
> failure) are acceptable for the case at hand.
> --
> Crist J. Clark                                cjclark@alum.mit.edu
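[The $2000-vs-$5000-vs-$20000 trade-off above is just an expected-cost
comparison. A minimal Python sketch of that arithmetic, with the
dollar figures taken from the post but the failure probabilities
invented purely for illustration:

    # Compare expected lifetime cost of each option: money spent up
    # front plus expected cleanup cost. Probabilities are hypothetical.
    options = {
        "live with it":     {"upfront": 0,     "p_fail": 1.00},
        "redundant parts":  {"upfront": 5000,  "p_fail": 0.20},
        "dedicated backup": {"upfront": 20000, "p_fail": 0.05},
    }
    CLEANUP = 2000  # productivity lost per failure, from the post

    for name, o in options.items():
        expected = o["upfront"] + o["p_fail"] * CLEANUP
        print(f"{name:>16}: expected cost ${expected:,.0f}")

With those made-up probabilities, "live with it" costs an expected
$2,000 against $5,400 and $20,100 for the alternatives, which is the
post's point: past a certain level, buying risk down costs more than
the risk is worth.]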