Date: Tue, 21 Sep 2004 17:10:28 +0200
From: Stijn Hoop
To: Paul Mather
cc: freebsd-stable@freebsd.org
Subject: Re: [long] ATA timeout problems on -STABLE
Message-ID: <20040921151028.GA839@pcwin002.win.tue.nl>
In-Reply-To: <1095694550.99333.20.camel@zappa.Chelsea-Ct.Org>
References: <20040920130304.GK827@pcwin002.win.tue.nl> <1095694550.99333.20.camel@zappa.Chelsea-Ct.Org>

Hi, thanks for your response.

On Mon, Sep 20, 2004 at 11:35:52AM -0400, Paul Mather wrote:
> FWIW, when I would get those errors on my 4-STABLE system (fallback to
> PIO mode; hard error reading fsbn) it did turn out to be a drive problem
> (and with a Maxtor drive, too). I was none the wiser until I happened
> to reboot the machine after a security advisory upgrade and was
> surprised to see the boot halted because the S.M.A.R.T. status of the
> drive indicated it was failing! (Prior to that I'd just been assuming
> it was some kind of OS/peak load problem and had been using atacontrol
> to change the mode back to UDMA100 when it fell back to PIO.)

Interesting. I have also done the same in the few rare cases where the drive
would indeed read/write the block in PIO mode. Most of the time the ata
subsystem would just give up on the drive.

> So, I would suggest running smartctl from the sysutils/smartmontools
> port to see what the SMART status of the drives looks like; in
> particular, whether any of the "worst" values have dropped anywhere
> close to the failure threshold value. (I have noticed with smartctl
> that some attributes go down and then back up. I have a system, in
> particular, where the Raw_Read_Error_Rate attribute sometimes drops down
> a few points under heavy disk load [e.g., during the nightly backup or
> cvsup], but increases again after the load has lifted.)
>
> Unfortunately, you're running 4.x, so you might have to make a 5.x
> FreeSBIE CD with the smartmontools port included because it requires
> ATAng from 5.x to run.

That's a great suggestion that hadn't crossed my mind.

As the box had another error just this morning I took some time when I had
to take it offline to rebuild the RAID array, and put the 4 120G disks (which
definitely generate the most errors) in a 5.x system with the smartmontools
port installed.

Logs of smartctl -a are up at

http://sandcat.nl/~stijn/freebsd/ataproblem/

I don't have a clue how to interpret all these numbers though.
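
For reference, the check Paul describes (seeing whether any "worst" value
has dropped close to its failure threshold) could be sketched roughly as
below. This is only a minimal Python sketch, assuming the usual attribute
table layout printed by smartctl (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST,
THRESH, ...). The function names, the margin of 10 points and the sample
rows are invented for illustration; they are not taken from the logs at the
URL above.

```python
import re
from typing import NamedTuple


class Attribute(NamedTuple):
    ident: int
    name: str
    value: int
    worst: int
    thresh: int


# One attribute row: ID, name, hex flag, then VALUE, WORST and THRESH.
# Assumes the column order of smartctl's attribute table.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s"
)


def parse_attributes(text):
    """Collect the normalized value, worst and threshold of each row."""
    attrs = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs.append(Attribute(int(m.group(1)), m.group(2),
                                   int(m.group(3)), int(m.group(4)),
                                   int(m.group(5))))
    return attrs


def near_threshold(attrs, margin=10):
    """Return attributes whose worst value is within `margin` of THRESH.

    Rows with a threshold of 0 are skipped, on the assumption that they
    are informational only. The margin is an arbitrary illustrative
    choice, not a value recommended by smartmontools.
    """
    return [a for a in attrs if a.thresh > 0 and a.worst - a.thresh <= margin]


# Invented sample rows, NOT real drive data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   253   252   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   070   068   063    Pre-fail  Always       -       120
  9 Power_On_Hours          0x0032   220   220   000    Old_age   Always       -       9000
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       38
"""

if __name__ == "__main__":
    for a in near_threshold(parse_attributes(SAMPLE)):
        print(f"{a.name}: worst={a.worst} thresh={a.thresh}")
```

In practice the text would come from saving the output of 'smartctl -a' for
each disk and passing it to parse_attributes(). Since some attributes dip
under load and recover afterwards, a single flagged row is a hint to look
closer rather than proof of a dying drive.
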
A little googling turns up posts that UNC errors are Bad(TM), however that
would indicate that I have indeed 3(!) failing drives on my hands... Although
certainly possible (they are about 1-2 years old in continuous use), it does
sound improbable.

> You can also use smartctl to run online and offline self-tests.

I didn't have time to run the long tests, but all 4 drives indicated a
'passed' status for the online 'smartctl -t short' test. I take it the long
tests give better results? If so I'll take the time to run them on the next
rebuild downtime.

But anyway if the drives are dying, I'll accept that. I just don't know for
sure how to determine that. Do you have pointers for me to read more about
SMART statistics?

--Stijn

-- 
An Orb is for life, not just for Christmas.