From owner-freebsd-stable@FreeBSD.ORG  Tue Sep 21 15:10:30 2004
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A7F9416A4CE
	for <freebsd-stable@freebsd.org>;
	Tue, 21 Sep 2004 15:10:30 +0000 (GMT)
Received: from pcwin002.win.tue.nl (pcwin002.win.tue.nl [131.155.71.72])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 175CD43D54
	for <freebsd-stable@freebsd.org>;
	Tue, 21 Sep 2004 15:10:30 +0000 (GMT)
	(envelope-from stijn@pcwin002.win.tue.nl)
Received: from pcwin002.win.tue.nl (localhost [127.0.0.1])
	by pcwin002.win.tue.nl (8.13.1/8.13.1) with ESMTP id i8LFASAW025849;
	Tue, 21 Sep 2004 17:10:28 +0200 (CEST)
	(envelope-from stijn@pcwin002.win.tue.nl)
Received: (from stijn@localhost)
	by pcwin002.win.tue.nl (8.13.1/8.13.1/Submit) id i8LFAS6T025848;
	Tue, 21 Sep 2004 17:10:28 +0200 (CEST)
	(envelope-from stijn)
Date: Tue, 21 Sep 2004 17:10:28 +0200
From: Stijn Hoop <stijn@win.tue.nl>
To: Paul Mather <paul@gromit.dlib.vt.edu>
Message-ID: <20040921151028.GA839@pcwin002.win.tue.nl>
References: <20040920130304.GK827@pcwin002.win.tue.nl>
	<1095694550.99333.20.camel@zappa.Chelsea-Ct.Org>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="opJtzjQTFsWo+cga"
Content-Disposition: inline
In-Reply-To: <1095694550.99333.20.camel@zappa.Chelsea-Ct.Org>
User-Agent: Mutt/1.4.2.1i
X-Bright-Idea: Let's abolish HTML mail!
cc: freebsd-stable@freebsd.org
Subject: Re: [long] ATA timeout problems on -STABLE
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Production branch of FreeBSD source code
	<freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 21 Sep 2004 15:10:30 -0000


--opJtzjQTFsWo+cga
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hi,

Thanks for your response.

On Mon, Sep 20, 2004 at 11:35:52AM -0400, Paul Mather wrote:
> FWIW, when I would get those errors on my 4-STABLE system (fallback to
> PIO mode; hard error reading fsbn) it did turn out to be a drive problem
> (and with a Maxtor drive, too).  I was none the wiser until I happened
> to reboot the machine after a security advisory upgrade and was
> surprised to see the boot halted because the S.M.A.R.T. status of the
> drive indicated it was failing!  (Prior to that I'd just been assuming
> it was some kind of OS/peak load problem and had been using atacontrol
> to change the mode back to UDMA100 when it fell back to PIO.)

Interesting. I have also done the same in the few rare cases where the drive
would indeed read/write the block in PIO mode. Most of the time the ata
subsystem would just give up on the drive.

> So, I would suggest running smartctl from the sysutils/smartmontools
> port to see what the SMART status of the drives looks like; in
> particular, whether any of the "worst" values have dropped anywhere
> close to the failure threshold value.  (I have noticed with smartctl
> that some attributes go down and then back up.  I have a system, in
> particular, where the Raw_Read_Error_Rate attribute sometimes drops down
> a few points under heavy disk load [e.g., during the nightly backup or
> cvsup], but increases again after the load has lifted.)
>
> Unfortunately, you're running 4.x, so you might have to make a 5.x
> FreeSBIE CD with the smartmontools port included because it requires
> ATAng from 5.x to run.

That's a great suggestion that hadn't crossed my mind.

As the box had another error just this morning I took some time when I had to
take it offline to rebuild the RAID array, and put the 4 120G disks (which
definitely generate the most errors) in a 5.x system with the smartmontools
port installed.

Logs of smartctl -a are up at

http://sandcat.nl/~stijn/freebsd/ataproblem/

I don't have a clue how to interpret all these numbers, though. A little
googling turns up posts saying that UNC (uncorrectable) errors are Bad(TM),
but that would mean I do indeed have 3(!) failing drives on my hands...
Although that is certainly possible (they have been in continuous use for
about 1-2 years), it does sound improbable.
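
For what it's worth, the check Paul describes (is any "worst" value close to
its failure threshold?) could be scripted roughly as below. This is only a
sketch: it assumes the usual attribute table layout printed by 'smartctl -A'
(ID, name, hex flag, then VALUE, WORST and THRESH), and the function names,
the margin of 10 and the sample rows are all made up for illustration; they
are not taken from the logs linked above.

```python
import re

# One row of the attribute table as printed by `smartctl -A` (assumed layout):
# ID, attribute name, hex flag, normalized VALUE, WORST, THRESH, then the type.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)"
)


def parse_attributes(text):
    """Return one dict per attribute row found in the smartctl output."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "type": m.group(6),
            })
    return rows


def near_threshold(rows, margin=10):
    """Names of attributes whose WORST value is within `margin` of THRESH.

    Rows with a threshold of 0 are skipped here, on the assumption that
    they are informational only. The default margin is an arbitrary choice.
    """
    return [r["name"] for r in rows
            if r["thresh"] > 0 and r["worst"] - r["thresh"] <= margin]


# Made-up sample rows for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   253   252   000    Old_age   Always       -       12
  3 Spin_Up_Time            0x0027   208   206   063    Pre-fail  Always       -       17540
  5 Reallocated_Sector_Ct   0x0033   070   068   063    Pre-fail  Always       -       310
"""

if __name__ == "__main__":
    for name in near_threshold(parse_attributes(SAMPLE)):
        print(name)
```

Feeding it the saved output of 'smartctl -A' for each drive instead of SAMPLE
would list the attributes worth a closer look; it says nothing about the
error log, which needs reading separately.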

> You can also use smartctl to run online and offline self-tests.

I didn't have time to run the long tests, but all 4 drives indicated a
'passed' status for the online 'smartctl -t short' test. I take it the
long tests give better results? If so I'll take the time to run them
on the next rebuild downtime.

But anyway, if the drives are dying, I'll accept that; I just don't know how
to determine that for sure. Do you have any pointers to where I can read more
about SMART statistics?

--Stijn

--=20
An Orb is for life, not just for Christmas.

--opJtzjQTFsWo+cga
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (FreeBSD)

iD8DBQFBUERkY3r/tLQmfWcRAgw5AJ9sutBGhKadkSDSkYfR/mBZHgJPvACdEbxr
Yi864wrwLhgT72zZDvtrzjw=
=GTct
-----END PGP SIGNATURE-----

--opJtzjQTFsWo+cga--