From owner-freebsd-questions@FreeBSD.ORG  Fri Jan 21 02:19:10 2005
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id C927D16A4CE
	for <freebsd-questions@freebsd.org>;
	Fri, 21 Jan 2005 02:19:10 +0000 (GMT)
Received: from dexter.starfire.mn.org (starfire.skypoint.net [66.93.17.236])
	by mx1.FreeBSD.org (Postfix) with ESMTP id EFFDE43D1F
	for <freebsd-questions@freebsd.org>;
	Fri, 21 Jan 2005 02:19:09 +0000 (GMT)
	(envelope-from john@dexter.starfire.mn.org)
Received: (from john@localhost)
	by dexter.starfire.mn.org (8.11.3/8.11.3) id j0L2IuC00643;
	Thu, 20 Jan 2005 20:18:56 -0600 (CST)
	(envelope-from john)
Date: Thu, 20 Jan 2005 20:18:56 -0600
From: John <john@starfire.mn.org>
To: David Bear <David.Bear@asu.edu>
Message-ID: <20050120201856.A572@starfire.mn.org>
References: <20050121002113.GH6843@asu.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5.1i
In-Reply-To: <20050121002113.GH6843@asu.edu>;
	from David.Bear@asu.edu on Thu, Jan 20, 2005 at 05:21:13PM -0700
cc: freebsd-questions@freebsd.org
Subject: Re: hard drive errors
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>,
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>,
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 21 Jan 2005 02:19:11 -0000

On Thu, Jan 20, 2005 at 05:21:13PM -0700, David Bear wrote:
> I am receiving the following errors on my hard drive. This appears to
> affect some file in /var/log. My question is twofold. 1) shouldn't ufs
> notice this sector as being unusable and mark it off-limits? 2) if
> not, is there a way to mark it so manually?
> 
> 
> ad0s1g: hard error reading fsbn 19674311 of 6765124-6765135 (ad0s1 bn
> 19674311; cn 1618 tn 16 sn 41) status=59 error=40
> ad0s1g: hard error reading fsbn 6765124 (ad0s1 bn 6765124; cn 556 tn
> 74 sn 58) status=59 error=40
> ad0s1h: hard error reading fsbn 88412159 of 35809248-35809251 (ad0s1
> bn 88412159; cn 7271 tn 64 sn 38) status=59 error=40
> ad0s1h: hard error reading fsbn 35809251 (ad0s1 bn 35809251; cn 2945
> tn 15 sn 51) status=59 error=40
> ad0s1g: hard error reading fsbn 19674303 of 6765120-6765133 (ad0s1 bn
> 19674303; cn 1618 tn 16 sn 33) status=59 error=40
> ad0s1g: hard error reading fsbn 6765124 (ad0s1 bn 6765124; cn 556 tn
> 74 sn 58) status=59 error=40

Modern disk drives do a lot to manage errors, but things can still
happen that they cannot protect against - this is part of the reason
various RAID schemes are used.

When the drive reports a recoverable (soft) error, it was still able
to reconstruct the data, even though the sector was damaged.  Having
reconstructed the data, it can remap the sector.

A hard error means that, by the time the problem was noticed, data
were already unrecoverable.  It can't simply remap the sector
somewhere else, because the data are already gone!  If it were to
map it somewhere else - what would it put there?  It doesn't know,
and neither do I.

You really, really need to back up your data somewhere.  You may
already have lost data which are valuable to you, but that's no
reason to lose more.  After that, go into the BIOS and do a surface
scan of the drive.  That will cause it to remap all the sectors
that are unrecoverable.  Then, remake the affected filesystem, and
restore your data.  If the drive is basically a good drive, you
should be fine again.  If the drive is failing, more hard (and
soft) errors will pop up, and your data are at greater risk.

Fortunately, you say the errors seem to be in /var/log.  Maybe
remaking the /var filesystem and losing some log files won't
really cause you any pain.  I hope that that is the case.
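
To see which partitions the errors actually hit, the kernel messages
can be tallied.  Here is a minimal Python sketch, not anything from
the original report: the function name tally_hard_errors and the
regular expression are my own assumptions, based only on the message
format quoted above.

```python
import re
from collections import Counter

# Assumed pattern, derived from the quoted messages, e.g.
# "ad0s1g: hard error reading fsbn 19674311 ..."
HARD_ERROR = re.compile(r"\b(ad\d+s\d+[a-h]): hard error reading fsbn (\d+)")


def tally_hard_errors(log_text):
    """Count hard read errors per partition in (possibly quoted) log text."""
    # Strip mail quote markers ("> ") from the start of each line.
    lines = [re.sub(r"^\s*(>\s?)+", "", line) for line in log_text.splitlines()]
    # Rejoin wrapped messages and collapse runs of whitespace.
    flat = " ".join(" ".join(lines).split())
    counts = Counter()
    for match in HARD_ERROR.finditer(flat):
        counts[match.group(1)] += 1
    return counts
```

Run over the six messages you quoted, this should count four errors
on ad0s1g and two on ad0s1h.  Which of those holds /var is something
mount or /etc/fstab would tell you.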

There used to be filesystem-level code to manage bad sectors.  This
was bad, because when you went to do unit copies (rarely done
anymore), you'd still hit the bad spots.  The ability to manage
disk defects was then pushed down into the driver (bad144 disk
defect management), and then down into the drives themselves.

NONE of those methods can protect you from the sudden and seemingly
spontaneous loss of data!  If you move your system, or it is subject
to shock and vibration, and the heads go bouncing across the surface
- data may be lost.  Sometimes I swear cosmic rays just blast out
some bits (well, it SEEMS like it), and, ultimately, thermodynamics
cannot be beaten - any image, magnetic or otherwise, fades with
time.  The signal-to-noise ratio of the heads and electronics also
changes over the life of the product, and tiny flecks can come off,
be deposited on, or moved around the disk surface.  All of this
can cause data problems.

Though almost no-one does it, back up your data.  Back up your
data.  Back up your data.  As in the old joke about real estate, where
the three most important features are location, location, and
location, the three most important steps in preserving and protecting
data (short of hardware RAID protection and remote and local
subsystem based replication) are backup, backup, and backup.

I actually have an arrangement with a friend of mine: the most
important data on my system are rolled up into a tarball, and an
expect script FTPs it to one of his servers every night.  A little
kludgy, but it works as a poor man's remote data replication.
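
For anyone who wants to try the same idea, here is a rough sketch of
the nightly job.  It is not my actual script: it swaps the expect/ftp
combination for Python's standard tarfile and ftplib modules, and the
function names, archive naming scheme, host, and credentials are all
hypothetical placeholders.

```python
import ftplib
import os
import tarfile
import time


def make_tarball(paths, dest_dir):
    """Roll the given files or directories into a dated .tar.gz archive."""
    # Hypothetical naming scheme: one archive per calendar day.
    name = time.strftime("backup-%Y%m%d.tar.gz")
    archive = os.path.join(dest_dir, name)
    with tarfile.open(archive, "w:gz") as tar:
        for path in paths:
            tar.add(path)  # directories are added recursively
    return archive


def upload(archive, host, user, password):
    """Send the archive to the remote FTP server in binary mode."""
    with ftplib.FTP(host, user, password) as ftp:
        with open(archive, "rb") as fh:
            ftp.storbinary("STOR " + os.path.basename(archive), fh)


# Example nightly run (placeholder values, not real hosts or accounts):
#   archive = make_tarball(["/home/john/important"], "/var/tmp")
#   upload(archive, "backup.example.org", "john", "secret")
```

Run it from cron once a night.  Bear in mind that plain FTP sends the
password and the data in the clear, so this only suits a link and a
friend you trust.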
-- 

John Lind
john@starfire.MN.ORG