From: Uwe Doering
Date: Fri, 01 Apr 2005 22:12:18 +0200
To: Don Bowman
Cc: freebsd-stable@freebsd.org
Subject: Re: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
Message-ID: <424DAB22.30405@geminix.org>
In-Reply-To: <2BCEB9A37A4D354AA276774EE13FB8C23A6939@mailserver.sandvine.com>

Don Bowman wrote:
> From: Uwe Doering [mailto:gemini@geminix.org]
> ...
>
>>As far as I understand this family of controllers the OS
>>drivers aren't involved at all in case of a disk drive
>>failure. It's strictly the controller's business to deal
>>with it internally. The OS just sits there and waits until
>>the controller is done with the retries and either drops into
>>degraded mode or recovers from the disk error.
>>
>>That's why I initially speculated that there might be a
>>timeout somewhere in PostgreSQL or FreeBSD that leads to data
>>loss if the controller is busy for too long.
>>
>>A somewhat radical way to at least make these failures as
>>rare an event as possible would be to deliberately fail all
>>remaining old disk drives, one after the other of course, in
>>order to get rid of them. And if you are lucky the problem
>>won't happen with newer drives anyway, in case the root cause
>>is an incompatibility between the controller and the old drives.
>
> Started that yesterday. I've got one 'old' one left.
> Sadly, the one that failed night before last was not one of the
> 'old' ones, so this is no guarantee :)
>
> From the raidutil -e log, I see this type of info. I'm not sure
> what the 'unknown' events are. The 'CRC Failure' is probably the
> problem? There's also Bad SCSI Status, unit attention, etc.
> Perhaps the driver doesn't deal with these properly?

In my opinion what the log shows in this case is internal
communication between the controller and the disk drives. The OS
driver is not involved. In the past I've seen CRC errors like these as
a result of bad cabling or contact problems.

You may want to check the SCSI cables. They have to be properly
terminated and there must not be any sharp kinks given the signal
frequencies involved these days. Also, pluggable drive bays can cause
this. Every electrical contact is a potential source of trouble.

Finally, faulty or overloaded power supplies can cause glitches like
these. This can be especially hard to debug.

When these hardware issues have been taken care of you may want to
start a RAID verification/correction run. If it shows any
inconsistencies this may be an indication of former hardware glitches.

I'm not sure whether you can trigger that process through 'raidutil'.
I've always used the X11 'dptmgr' program. You can terminate it after
having started the verification.
It continues to run in the background (inside the controller).

   Uwe
-- 
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini@geminix.org  |  http://www.escapebox.net