Date:      Fri, 4 Mar 2016 09:16:22 +0000
From:      Steven Hartland <killing@multiplay.co.uk>
To:        Borja Marcos <borjam@sarenet.es>, Scott Long <scott4long@yahoo.com>
Cc:        FreeBSD-scsi <freebsd-scsi@freebsd.org>
Subject:   Re: mpr(4) SAS3008 Repeated Crashing
Message-ID:  <56D95266.301@multiplay.co.uk>
In-Reply-To: <B2147AEC-2831-443C-8FA0-4148B37AAF95@sarenet.es>
References:  <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk> <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es> <E74F5225-1EA8-4B60-ADDC-7B13E1003184@yahoo.com> <D7E0BCCE-EB44-4EF9-8F17-474C162F7D7C@sarenet.es> <56D805FD.50500@multiplay.co.uk> <F9B68610-12C6-4D32-88CA-A34A185F9AD1@sarenet.es> <F5E05621-FF84-4BED-B1A7-3252715CD53B@yahoo.com> <B2147AEC-2831-443C-8FA0-4148B37AAF95@sarenet.es>

On 04/03/2016 08:02, Borja Marcos wrote:
>> On 03 Mar 2016, at 18:09, Scott Long <scott4long@yahoo.com> wrote:
>>
>>
>> SYNC CACHE seems to have been involved this time, and while it’s
>> sometimes a source of trouble with SATA disks, I’m very hesitant to
>> blame it.  Given the seemingly random nature of your problems, I’m
>> not as certain anymore to rule out a fault of the disk enclosure.
>> This looks to be a different disk than your last report, and your
>> statement that a sibling system exhibits no problems is very
>> interesting.  Maybe there’s an issue with the power supply, and the
>> disks are getting under-voltage conditions periodically.  If you can
>> run smartctl against the disks, the output might be useful.  Also, if
>> you’re able, could you make sure that both this system and the one
>> that is working well are being fed with sufficient and similar AC
>> power?  And if the power supply modules in your enclosures are
>> swappable, maybe swap them between systems and see if the problem
>> follows the module?  If that doesn’t fix it then I’ll think of ways
>> to provide more instrumentation.
> The affected disks are completely random. I didn’t copy a lot of
> instances to avoid too much litter, but each time it’s a different
> disk.
>
> Both systems are in the same datacenter, and yes, the power
> infrastructure is working. Swapping modules can be done if
> the dealer sends us another one because I prefer not to mess with a
> working system.
>
> The fact that it’s a different disk each time, and that the other
> system works perfectly is what makes me quite certain that it’s a
> hardware problem. Either some trouble
> with the backplane or a power problem.
>
> I am tempted to go the oscilloscope route (monitoring the internal
> power rails). But if the problem is in the power distribution of the
> backplane itself
> I’ll need to destroy a broken disk to build a backplane power probe :)
>
It's very rare, but we've also seen this type of behaviour from a failing
Intel CPU. There was no other indication the CPU had an issue, which one
might expect, so I just wanted to make you aware of the possibility.

That said, the most common cause of this we've seen, when it's not the
same disk or disks each time, is a bad backplane or bad cabling to the
backplane.

     Regards
     Steve



