Date:      Wed, 27 Sep 2000 15:31:53 -0400
From:      1mazda1 <1mazda1@phoenixdsl.com>
To:        cjclark@alum.mit.edu, cjclark@reflexnet.net
Cc:        "Jason C. Wells" <jcwells@nwlink.com>, Haikal Saadh <wyldephyre2@yahoo.com>, chat@FreeBSD.ORG
Subject:   Re: So what do (unix) sysadmins do anyway?
Message-ID:  <39D24B29.EC4EF07C@phoenixdsl.com>
References:  <20000924224054.H59015@149.211.6.64.reflexcom.com> <Pine.SOL.3.96.1000925093125.2335A-100000@utah> <20000926012746.Q59015@149.211.6.64.reflexcom.com>

cjclark@reflexnet.net
You are tapped! What, did you wake up on the wrong side of the bed or what???

wow

1mazda1

"Crist J . Clark" wrote:

> On Mon, Sep 25, 2000 at 10:11:15AM -0700, Jason C. Wells wrote:
> > On Sun, 24 Sep 2000, Crist J . Clark wrote:
> >
> > > > Coming from the environment that I do, I state that there is no such thing
> > > > as a system failure.
> > >
> > > Huh. NOT! The admins at our office spend a _lot_ of time fixing
> > > systems that die from hardware failures. Sure, you have backups (data
> >
> > Yes, things break.  I agree with everything you have said except the "NOT"
> > part. Allow me to explain my position.  My position is:  We must hold
> > _people_ culpable for systems failures and not fall into the trap that the
> > system is to blame. (I use system here as an abstraction.)
> >
> > Did the hardware fail due to a design flaw?  Did the hardware fail due to
> > some error on the installer's part?  Did the software have a bug?  Did the
> > hardware have an advertised mean time between failures but go unreplaced
> > because an organization didn't implement a scheduled obsolescence plan?
> > There is, or could be, (should even?) a human factor behind each one of
> > these questions.
> >
> > The environment that I came from consistently placed the burden of success
> > or failure on humans.  We always discussed incidents.  We examined lessons
> > learned.  Almost invariably, the conclusion one ended up drawing was that
> > some person or persons made error(s) which led to an incident.
> >
> > Yes, spurious system failures occurred.  After they did though, efforts were
> > made to ensure that it never happened again.  This mentality made safe
> > work of a dangerous business.  All of the lessons that produced this
> > mentality were written in blood.
>
> My scholastic training is in chemical engineering. One of the things
> we had to study was safety. Actually, one of the first things you
> learn is that there is no such thing as absolute safety, only relative
> levels of risk. If you think you can make _anything_ failure-free, you
> are reaching for a fool's dream. It also makes life _more_ dangerous,
> since in such an environment people end up putting effort into
> concealing risk rather than openly working to minimize and quantify
> it.
>
> It is also often not fruitful to try to assign blame for
> problems. People are not prescient, and they cannot see every possible
> failure mode and plan for it. Fear of blame can again lead to hiding
> the risks rather than dealing with them in a more productive
> manner. (Unfortunately, the legal establishment is also often
> counter-productive; the ol' Risk-Free Society fallacy.) There is a
> difference between a mistake made in good faith and negligence.
>
> Going back to some of your examples, modern computer hardware fails,
> and often no good reason will be found (or the effort to determine the
> cause is not worth the cost; replace, clean up, move on). The mean
> failure time is just what the 'mean' implies: an average, a
> statistical quantity. A particular item may fail in half the mean
> time, or one quarter of the mean time, or it may run without problems
> for ten times the design life. You have to accept the risk with which
> you are comfortable and live with it.
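[To put a rough number on that point, a minimal sketch assuming a simple
exponential failure model; the model choice and the MTBF figure are
illustrative assumptions, not anything stated in the message above.]

    import random

    # Sketch: under an exponential failure model, a surprising fraction of
    # individual parts fail well before the advertised mean time between
    # failures, even when the advertised mean is perfectly accurate.
    MTBF_HOURS = 50_000          # hypothetical advertised MTBF
    TRIALS = 100_000

    failures = [random.expovariate(1.0 / MTBF_HOURS) for _ in range(TRIALS)]
    before_half = sum(1 for t in failures if t < MTBF_HOURS / 2) / TRIALS
    past_double = sum(1 for t in failures if t > 2 * MTBF_HOURS) / TRIALS

    print(f"Failed before half the MTBF: {before_half:.1%}")   # roughly 39%
    print(f"Lasted past twice the MTBF:  {past_double:.1%}")   # roughly 14%
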
>
> [snip]
>
> > My point is that a human is ultimately responsible.  (At a minimum
> > we end up cleaning up the mess.)  This must be the way it is.  If we get
> > to the point where the machine is the excuse, then why bother?
>
> This is simply a matter of definition. For the above piece of
> hardware, if the manufacturer gave you an accurate mean design life,
> but one particular card fails in half the time, who gets this
> "ultimate responsibility?" In your model, it is whoever decided that
> using that item for its design life was acceptable and everyone else
> who then agreed with that determination.
>
> Well, now, we had a failure. We accepted a level of risk, and shucks,
> we lost the crap shoot. But we had accepted the risk, the hardware may
> fail before the mean lifetime. No need to make excuses. No need for
> blame. C'est la vie.
>
> > Now I come back to our original poster's question.  A system administrator
> > is needed for all the reasons you described.  A system administrator
> > should also be making those needed value judgements to prevent system
> > failure.  I hope that we have both provided a variety of reasons that a
> > system administrator is important.  I hope we have answered the question,
> > "What does a system administrator do anyway?"
>
> A system administrator has finite resources, finite time, finite
> money, finite knowledge, and even a finite set of tools (hardware or
> software) to choose from. You always need to accept a level of risk,
> and given the above constraints, you need to decide what level is
> acceptable given the resources you have and make sure everyone is
> comfortable. (And if you have anyone that says they want it
> risk-free... well, there are some out there who just don't get
> it. When they go outside to smoke a cigarette, talk some sense into
> them.) Then, when failure does come, have your plans ready, but always
> assume there are failure modes that you missed.
>
> > OBTW. There is a double standard in this respect regarding computers.  We
> > do not accept the failure of the Corvair or the Audi TT or the Boeing 737.
> > When the Tacoma Narrows falls we don't just say, "Sometimes bridges fall.
> > Get over it."
>
> But planes _do_ crash. Bridges _do_ collapse. And there will be more
> of both in the future. Trying to build a plane with no risk of
> crashing or a bridge that could never ever collapse is not the mindset
> of the engineers really working on these things. Thanks to RMS Titanic,
> the folk wisdom of this culture has come to realize there is no such
> beast as an unsinkable ship. It still needs to sink in that the same
> goes for airplanes, cars, nuclear power plants, medical treatments,
> computers, and anything else you care to name.
>
> > People accept computer failures as a matter of course.
>
> That was not a good analogy. If that Boeing 737 failed because of a
> computer error, you bet it would not be considered
> acceptable. Remember the Ariane that failed a few years back due to
> bad flight software? The computers on the space shuttle fail
> regularly, but they have backups. When a desktop computer or a web
> server fails, there generally are not lives in the balance. There can
> be money at stake, but if you lose $2000 of productivity to fix a
> server during its lifetime, you are better off living with that risk
> than shelling out an extra $5000 to put fancier and redundant parts
> in or $20000 to build a dedicated backup. Yes, there are often zero-
> or near-zero-cost ways to reduce risk, but only to a certain level, and
> those "zero cost" fixes frequently have hidden costs (you're better off
> spending $10000 a year fixing problems that arise because you don't
> have a guru-level sysadmin if it's gonna cost you $30000 a year more
> to have said guru as opposed to a capable-but-non-guru sysadmin).
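[A minimal sketch of that expected-cost comparison, reusing the
illustrative dollar figures from the paragraph above; the failure
probability is a made-up assumption.]

    # Compare accepting the risk against paying up front to avoid it.
    p_failure = 0.8            # hypothetical chance the server needs fixing in its lifetime
    lost_productivity = 2_000  # cost of living with the risk when it bites
    redundant_parts = 5_000    # extra up-front cost of fancier, redundant hardware
    dedicated_backup = 20_000  # cost of building a dedicated backup machine

    expected_cost_accept = p_failure * lost_productivity
    print(f"Accept the risk:  expected ${expected_cost_accept:,.0f}")
    print(f"Redundant parts:  ${redundant_parts:,} up front")
    print(f"Dedicated backup: ${dedicated_backup:,} up front")
    # Even with a high failure probability, accepting the risk is the cheaper bet here.
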
>
> > It doesn't have to
> > be that way.  A human value judgement somewhere along the line leads to
> > failure.
>
> Yes, I am afraid it does have to be this way. It is undoubtedly
> possible to greatly reduce the failures that we see in today's computer
> systems, but there will always be failures. The human value judgments
> should not be looked at as "leading to failure," but rather one should
> make value judgments about what levels of risk (rates of failure) are
> acceptable for the case at hand.
> --
> Crist J. Clark                           cjclark@alum.mit.edu



To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message