From: Borja Marcos
To: Scott Long
Cc: Steven Hartland, FreeBSD-scsi
Subject: Re: mpr(4) SAS3008 Repeated Crashing
Date: Fri, 4 Mar 2016 09:02:25 +0100

> On 03 Mar 2016, at 18:09, Scott Long wrote:
>
> SYNC CACHE seems to have been involved this time, and while it’s
> sometimes a source of trouble with SATA disks, I’m very hesitant to
> blame it. Given the seemingly random nature of your problems, I’m not
> as certain anymore to rule out a fault of the disk enclosure.
> This looks to be a different disk than your last report, and your
> statement that a sibling system exhibits no problems is very
> interesting. Maybe there’s an issue with the power supply, and the
> disks are getting under-voltage conditions periodically. If you can
> run smartctl against the disks, the output might be useful. Also, if
> you’re able, could you make sure that both this system and the one
> that is working well are being fed with sufficient and similar AC
> power? And if the power supply modules in your enclosures are
> swappable, maybe swap them between systems and see if the problem
> follows the module? If that doesn’t fix it then I’ll think of ways to
> provide more instrumentation.

The affected disks are completely random. I didn’t copy a lot of
instances to avoid too much litter, but each time it’s a different disk.

Both systems are in the same datacenter, and yes, the power
infrastructure is working. Swapping modules can be done if the dealer
sends us another one because I prefer not to mess with a working system.

The fact that it’s a different disk each time, and that the other system
works perfectly is what makes me quite certain that it’s a hardware
problem. Either some trouble with the backplane or a power problem.

I am tempted to go the oscilloscope route (monitoring the internal power
rails). But if the problem is in the power distribution of the backplane
itself I’ll need to destroy a broken disk to build a backplane power
probe :)

Borja.