Date: Fri, 4 Mar 2016 09:16:22 +0000
From: Steven Hartland <killing@multiplay.co.uk>
To: Borja Marcos <borjam@sarenet.es>, Scott Long <scott4long@yahoo.com>
Cc: FreeBSD-scsi <freebsd-scsi@freebsd.org>
Subject: Re: mpr(4) SAS3008 Repeated Crashing
Message-ID: <56D95266.301@multiplay.co.uk>
In-Reply-To: <B2147AEC-2831-443C-8FA0-4148B37AAF95@sarenet.es>
References: <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk>
 <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es>
 <E74F5225-1EA8-4B60-ADDC-7B13E1003184@yahoo.com>
 <D7E0BCCE-EB44-4EF9-8F17-474C162F7D7C@sarenet.es>
 <56D805FD.50500@multiplay.co.uk>
 <F9B68610-12C6-4D32-88CA-A34A185F9AD1@sarenet.es>
 <F5E05621-FF84-4BED-B1A7-3252715CD53B@yahoo.com>
 <B2147AEC-2831-443C-8FA0-4148B37AAF95@sarenet.es>
On 04/03/2016 08:02, Borja Marcos wrote:
>> On 03 Mar 2016, at 18:09, Scott Long <scott4long@yahoo.com> wrote:
>>
>> SYNC CACHE seems to have been involved this time, and while it’s
>> sometimes a source of trouble with SATA disks, I’m very hesitant to
>> blame it.  Given the seemingly random nature of your problems, I’m
>> not as certain anymore to rule out a fault of the disk enclosure.
>> This looks to be a different disk than your last report, and your
>> statement that a sibling system exhibits no problems is very
>> interesting.  Maybe there’s an issue with the power supply, and the
>> disks are getting under-voltage conditions periodically.  If you can
>> run smartctl against the disks, the output might be useful.  Also,
>> if you’re able, could you make sure that both this system and the
>> one that is working well are being fed with sufficient and similar
>> AC power?  And if the power supply modules in your enclosures are
>> swappable, maybe swap them between systems and see if the problem
>> follows the module?  If that doesn’t fix it then I’ll think of ways
>> to provide more instrumentation.
>
> The affected disks are completely random. I didn’t copy a lot of
> instances to avoid too much litter, but each time it’s a different
> disk.
>
> Both systems are in the same datacenter, and yes, the power
> infrastructure is working. Swapping modules can be done if the dealer
> sends us another one because I prefer not to mess with a working
> system.
>
> The fact that it’s a different disk each time, and that the other
> system works perfectly is what makes me quite certain that it’s a
> hardware problem. Either some trouble with the backplane or a power
> problem.
>
> I am tempted to go the oscilloscope route (monitoring the internal
> power rails). But if the problem is in the power distribution of the
> backplane itself I’ll need to destroy a broken disk to build a
> backplane power probe :)
>

It's very rare, but we've also seen this type of behaviour from a
failing Intel CPU. There was no other indication the CPU had an issue,
which one might expect, so just wanted to make you aware of the
possibility.

That said, the most common cause of this we've seen, when it's not a
common disk or disks, is a bad backplane or cabling to the backplane.

    Regards
    Steve