Date:      Tue, 26 Sep 2000 01:27:46 -0700
From:      "Crist J . Clark" <cjclark@reflexnet.net>
To:        "Jason C. Wells" <jcwells@nwlink.com>
Cc:        cjclark@alum.mit.edu, Haikal Saadh <wyldephyre2@yahoo.com>, chat@FreeBSD.ORG
Subject:   Re: So what do (unix) sysadmins do anyway?
Message-ID:  <20000926012746.Q59015@149.211.6.64.reflexcom.com>
In-Reply-To: <Pine.SOL.3.96.1000925093125.2335A-100000@utah>; from jcwells@nwlink.com on Mon, Sep 25, 2000 at 10:11:15AM -0700
References:  <20000924224054.H59015@149.211.6.64.reflexcom.com> <Pine.SOL.3.96.1000925093125.2335A-100000@utah>

On Mon, Sep 25, 2000 at 10:11:15AM -0700, Jason C. Wells wrote:
> On Sun, 24 Sep 2000, Crist J . Clark wrote:
> 
> > > Coming from the environment that I do, I state that there is no such thing
> > > as a system failure.
> > 
> > Huh. NOT! The admins at our office spend a _lot_ of time fixing
> > systems that die from hardware failures. Sure, you have backups (data
> 
> Yes, things break.  I agree with everything you have said except the "NOT"
> part. Allow me to explain my position.  My position is:  We must hold
> _people_ culpable for systems failures and not fall into the trap that the
> system is to blame. (I use system here as an abstraction.)
> 
> Did the hardware fail due to a design flaw?  Did the hardware fail due to
> some error on the installer's part?  Did the software have a bug?  Did the
> hardware have an advertised mean time between failures but wasn't replaced
> because the organization didn't implement a scheduled obsolescence plan?
> There is, or could be, (should even?) a human factor behind each one of
> these questions.
> 
> The environment that I came from consistently placed the burden of success
> or failure on humans.  We always discussed incidents.  We examined lessons
> learned.  Almost invariably, the conclusion one ended up drawing was that
> some person or persons made error(s) which led to an incident.
> 
> Yes, spurious system failures occurred.  After they did, though, efforts were
> made to ensure that it never happened again.  This mentality made safe
> work of a dangerous business.  All of the lessons that produced this
> mentality were written in blood. 

My scholastic training is in chemical engineering. One of the things
we had to study was safety. Actually, one of the first things you
learn is that there is no such thing as absolute safety, only relative
levels of risk. If you think you can make _anything_ failure free, you
are reaching for a fool's dream. It also makes life _more_ dangerous,
since in such an environment people end up putting their efforts into
concealing risk rather than into openly minimizing and quantifying it.

It is also often not fruitful to try to assign blame for
problems. People are not prescient; they cannot see every possible
failure mode and plan for it. Fear of blame can again lead to hiding
risks rather than dealing with them in a more productive manner.
(Unfortunately, the legal establishment is often counter-productive
here too, the ol' Risk Free Society fallacy.) There is a difference
between a mistake made in good faith and negligence.

Going back to some of your examples, modern computer hardware fails,
and often no good reason will be found (or the effort to determine the
cause is not worth the cost; replace, clean up, move on). The mean
failure time is just what 'mean' implies: an average, a statistical
quantity. A particular item may fail in half the mean time, in one
quarter of the mean time, or it may run without problems for ten times
the design life. You have to accept the level of risk with which you
are comfortable and live with it.
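To put a rough number on that, here is a back-of-the-envelope sketch
in Python. It assumes an exponential lifetime model (a common textbook
simplification, not a claim about any particular piece of hardware)
and an invented MTBF figure, and shows how large a fraction of units
you should expect to lose well before the advertised mean:

    import math

    def prob_failure_by(t_hours, mtbf_hours):
        # P(failure before time t) for an exponential lifetime with the given MTBF.
        return 1.0 - math.exp(-t_hours / mtbf_hours)

    mtbf = 50000.0  # hypothetical advertised MTBF, in hours
    for frac in (0.25, 0.5, 1.0):
        p = prob_failure_by(frac * mtbf, mtbf)
        print("P(failure before %.2f x MTBF) = %.0f%%" % (frac, 100 * p))
    # Prints roughly 22%, 39%, and 63% -- plenty of units die before the "mean" life.

In other words, under that model nearly a quarter of the units are
already gone at a quarter of the MTBF; the mean tells you almost
nothing about any single box.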

[snip]

> My point is that a human is ultimately responsible.  (At a minimum
> we end up cleaning up the mess.)  This must be the way it is.  If we get
> to the point where the machine is the excuse, then why bother?

This is simply a matter of definition. For the above piece of
hardware, if the manufacturer gave you an accurate mean design life,
but one particular card fails in half that time, who gets this
"ultimate responsibility?" In your model, it is whoever decided that
using the item for its design life was acceptable, and everyone else
who then agreed with that determination.

Well, now, we had a failure. We accepted a level of risk, and shucks,
we lost the crap shoot. But we had accepted the risk that the hardware
might fail before the mean lifetime. No need to make excuses. No need
for blame. C'est la vie.

> Now I come back to our original poster's question.  A system administrator
> is needed for all the reasons you described.  A system administrator
> should also be making those needed value judgements to prevent system
> failure.  I hope that we have both provided a variety of reasons that a
> system administrator is important.  I hope we have answered the question,
> "What does a system administrator do anyway?" 

A system administrator has finite resources, finite time, finite
money, finite knowledge, and even a finite set of tools (hardware or
software) to choose from. You always need to accept a level of risk;
given those constraints, you need to decide what level is acceptable
with the resources you have and make sure everyone is comfortable with
it. (And if you have anyone who says they want it risk-free... well,
there are some out there who just don't get it. When they go outside
to smoke a cigarette, talk some sense into them.) Then, when failure
does come, have your plans ready, but always assume there are failure
modes you missed.

> OBTW. There is a double standard in this respect regarding computers.  We
> do not accept the failure of the Corvair or the Audi TT or the Boeing 737.
> When the Tacoma Narrows falls we don't just say, "Sometimes bridges crash.
> Get over it."

But planes _do_ crash. Bridges _do_ collapse. And there will be more
of both in the future. Trying to build a plane with no risk of
crashing, or a bridge that could never ever collapse, is not the
mindset of the engineers actually working on these things. Thanks to
RMS Titanic, the folk wisdom of this culture has come to realize there
is no such beast as an unsinkable ship. It still needs to sink in that
the same goes for airplanes, cars, nuclear power plants, medical
treatments, computers, and anything else you care to name.

> People accept computer failures as a matter of course.

That is not a good analogy. If a Boeing 737 failed because of a
computer error, you bet it would not be considered acceptable.
Remember the Ariane that failed a few years back due to bad flight
software? The computers on the space shuttle fail regularly, but they
have backups. When a desktop computer or a web server fails, there
generally are no lives in the balance. There can be money at stake,
but if you lose $2000 of productivity fixing a server over its
lifetime, you are better off living with that risk than shelling out
an extra $5000 to put fancier, redundant parts in, or $20000 to build
a dedicated backup. Yes, there are often zero or near-zero cost ways
to reduce risk, but only to a certain level, and those "zero cost"
fixes frequently have hidden costs (you're better off spending $10000
a year fixing problems that arise because you don't have a guru-level
sysadmin if it's going to cost you $30000 a year more to have said
guru as opposed to a capable-but-non-guru sysadmin).
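To make that arithmetic concrete, here is a toy expected-cost
comparison in Python. The dollar figures are the made-up ones from the
paragraph above, and the outage probabilities are pure guesses for
illustration, not measurements:

    def expected_cost(upfront_spend, prob_outage, cost_per_outage):
        # Rough total cost of ownership: what you spend now plus what
        # you expect to lose later.
        return upfront_spend + prob_outage * cost_per_outage

    plain     = expected_cost(0,     1.00, 2000)  # expect to eat one $2000 outage
    fancier   = expected_cost(5000,  0.20, 2000)  # redundant parts cut the odds
    dedicated = expected_cost(20000, 0.02, 2000)  # hot spare, almost never down

    print("live with it      ~ $%d" % plain)      # ~ $2000
    print("redundant parts   ~ $%d" % fancier)    # ~ $5400
    print("dedicated backup  ~ $%d" % dedicated)  # ~ $20040

Under those (invented) numbers, paying to shrink a cheap risk costs
far more than the risk itself; the calculation only flips when an
outage costs a great deal more than $2000.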

> It doesn't have to
> be that way.  A human value judgement somewhere along the line leads to
> failure. 

Yes, I am afraid it does have to be this way. It is undoubtedly
possible to greatly reduce the failures we see in today's computer
systems, but there will always be failures. The human value judgments
should not be looked at as "leading to failure;" rather, one should
make value judgments about what levels of risk (rates of failure) are
acceptable for the case at hand.
-- 
Crist J. Clark                           cjclark@alum.mit.edu

