From owner-freebsd-scsi@freebsd.org  Fri Mar  4 08:02:35 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id ACF0C9DA0CA
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Fri,  4 Mar 2016 08:02:35 +0000 (UTC)
 (envelope-from borjam@sarenet.es)
Received: from cu01176b.smtpx.saremail.com (cu01176b.smtpx.saremail.com
 [195.16.151.151])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 6E863F21
 for <freebsd-scsi@freebsd.org>; Fri,  4 Mar 2016 08:02:34 +0000 (UTC)
 (envelope-from borjam@sarenet.es)
Received: from [172.16.8.96] (izaro.sarenet.es [192.148.167.11])
 by proxypop01.sare.net (Postfix) with ESMTPSA id B5CB19DDF16;
 Fri,  4 Mar 2016 09:02:25 +0100 (CET)
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.2 \(3112\))
Subject: Re: mpr(4) SAS3008 Repeated Crashing
From: Borja Marcos <borjam@sarenet.es>
In-Reply-To: <F5E05621-FF84-4BED-B1A7-3252715CD53B@yahoo.com>
Date: Fri, 4 Mar 2016 09:02:25 +0100
Cc: Steven Hartland <killing@multiplay.co.uk>,
 FreeBSD-scsi <freebsd-scsi@freebsd.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <B2147AEC-2831-443C-8FA0-4148B37AAF95@sarenet.es>
References: <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk>
 <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es>
 <E74F5225-1EA8-4B60-ADDC-7B13E1003184@yahoo.com>
 <D7E0BCCE-EB44-4EF9-8F17-474C162F7D7C@sarenet.es>
 <56D805FD.50500@multiplay.co.uk>
 <F9B68610-12C6-4D32-88CA-A34A185F9AD1@sarenet.es>
 <F5E05621-FF84-4BED-B1A7-3252715CD53B@yahoo.com>
To: Scott Long <scott4long@yahoo.com>
X-Mailer: Apple Mail (2.3112)
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 04 Mar 2016 08:02:35 -0000


> On 03 Mar 2016, at 18:09, Scott Long <scott4long@yahoo.com> wrote:
>
>
> SYNC CACHE seems to have been involved this time, and while it’s
> sometimes a source of trouble with SATA disks, I’m very hesitant to
> blame it.  Given the seemingly random nature of your problems, I’m not
> as certain anymore to rule out a fault of the disk enclosure.  This
> looks to be a different disk than your last report, and your statement
> that a sibling system exhibits no problems is very interesting.  Maybe
> there’s an issue with the power supply, and the disks are getting
> under-voltage conditions periodically.  If you can run smartctl against
> the disks, the output might be useful.  Also, if you’re able, could you
> make sure that both this system and the one that is working well are
> being fed with sufficient and similar AC power?  And if the power
> supply modules in your enclosures are swappable, maybe swap them
> between systems and see if the problem follows the module?  If that
> doesn’t fix it then I’ll think of ways to provide more instrumentation.

The affected disks are completely random. I didn’t paste many
instances, to avoid cluttering the list, but each time it’s a
different disk.

Both systems are in the same datacenter and, yes, the power
infrastructure is working. Swapping modules can be done if the dealer
sends us another one, because I prefer not to mess with a working
system.

The fact that it’s a different disk each time, and that the other
system works perfectly, is what makes me quite certain that it’s a
hardware problem: either some trouble with the backplane or a power
problem.

I am tempted to go the oscilloscope route (monitoring the internal
power rails). But if the problem is in the power distribution of the
backplane itself, I’ll need to destroy a broken disk to build a
backplane power probe :)


Borja.