Date: Mon, 25 Apr 2016 10:29:04 +0200
From: Borja Marcos <borjam@sarenet.es>
To: Steven Hartland <killing@multiplay.co.uk>
Cc: Scott Long <scott4long@yahoo.com>, FreeBSD-scsi <freebsd-scsi@freebsd.org>
Subject: Re: mpr(4) SAS3008 Repeated Crashing, LSI's spiritual advice would be appreciated
Message-ID: <610C4F08-C1A4-4AB4-87B3-1253C45F8C38@sarenet.es>
In-Reply-To: <56D96C84.7070507@multiplay.co.uk>
References: <56D5FDB8.8040402@freebsd.org>
 <56D612FA.6090909@multiplay.co.uk>
 <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es>
 <E74F5225-1EA8-4B60-ADDC-7B13E1003184@yahoo.com>
 <D7E0BCCE-EB44-4EF9-8F17-474C162F7D7C@sarenet.es>
 <56D805FD.50500@multiplay.co.uk>
 <F9B68610-12C6-4D32-88CA-A34A185F9AD1@sarenet.es>
 <F5E05621-FF84-4BED-B1A7-3252715CD53B@yahoo.com>
 <B2147AEC-2831-443C-8FA0-4148B37AAF95@sarenet.es>
 <56D95266.301@multiplay.co.uk>
 <BC3018EA-A1F3-4C7C-A179-58553457A938@sarenet.es>
 <56D96C84.7070507@multiplay.co.uk>

> On 04 Mar 2016, at 12:07, Steven Hartland <killing@multiplay.co.uk> wrote:
>
> On 04/03/2016 10:58, Borja Marcos wrote:
>>> On 04 Mar 2016, at 10:16, Steven Hartland <killing@multiplay.co.uk> wrote:
>>>
>>> It's very rare but we've also seen this type of behaviour from a failing Intel CPU. There was no other indication the CPU had an issue, which one might expect, so just wanted to make you aware of the possibility.
>>>
>>> That said the most common cause of this we've seen, when it's not a common disk or disks, is a bad backplane or cabling to the backplane.
>> Now I’m really curious!
>>
>> How did you determine that it was the CPU? And what kind of issue was it causing? Noise in the power rails? Interference?
> After a month or so of fixing mfi so it recovered from all bad events and prevented all the various kernel panics, the machine stayed running long enough to log an MCA which pointed to a failing CPU cache.
>
> We were lucky it was CPU #2 so we disabled all cores for said CPU in /boot/loader.conf and all the issues disappeared. We replaced the CPU and no more issues.
>
> We were in the same situation as you, two machines identical configs, one which was constantly panicking in mfi the other was rock solid.

An update, long overdue.

After complete inaction by IBM’s so-called “support”, who blamed us for using unofficial operating systems, we complained quite loudly (and harshly) and they agreed to “replace a backplane for mere reasons of customer satisfaction”, despite my insisting that they also bring an HBA, because we really didn’t know what was wrong.

So they sent a technician with one of the three almost passive boards of the backplane, even though I had told them that the issue was spread among all 24 disks, not just one group of 8.
He changed one of them at random (I was on vacation when he came) and, as I had expected, the issue wasn’t solved at all.

Tired of dealing with them, I pulled the SAS3 HBA and installed a classic LSI2008 card. That was a nightmare in itself, because the stupid firmware of the IBM server hangs during boot (“connecting RAID adapters and boot devices” or something like that; I left it in that state for 24 hours just to see whether it would eventually exit the loop). I had to erase the boot services flash from the HBA even though I had already disabled BIOS and UEFI services for the riser PCI card. Anyway, I digress.

Repeating all of our tests, with the LSI2008 card everything works like a charm, although I’ve seen some surprising behavior. I spent a lot of time running benchmarks. I could reproduce the error condition fairly reliably in less than an hour with the LSI3008 card, and I was unable to reproduce it with the LSI2008.

Of course, these days this is as sure as you can be, unless someone presents you with a proper oscilloscope and SAS pod.
I even suggested that to IBM, offering to do a serious diagnosis of the problem for them ;)

The odd behavior, for which LSI’s spiritual advice would be welcome, is this: 6 minutes after booting the system, while doing a scrub in order to generate I/O load, and before beginning to run the error-triggering benchmarks, I saw some surprising messages in /var/log/messages:

---
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: Element descriptor: 'SLOT 000'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: SAS Device Slot Element: 1 Phys at Slot 0, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd99
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da16,pass17: Element descriptor: 'SLOT 001'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da16,pass17: SAS Device Slot Element: 1 Phys at Slot 1, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9a
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da17,pass18: Element descriptor: 'SLOT 002'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da17,pass18: SAS Device Slot Element: 1 Phys at Slot 2, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9b
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da18,pass19: Element descriptor: 'SLOT 003'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da18,pass19: SAS Device Slot Element: 1 Phys at Slot 3, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9c
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da19,pass20: Element descriptor: 'SLOT 004'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da19,pass20: SAS Device Slot Element: 1 Phys at Slot 4, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9d
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da20,pass21: Element descriptor: 'SLOT 005'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da20,pass21: SAS Device Slot Element: 1 Phys at Slot 5, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9e
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da21,pass22: Element descriptor: 'SLOT 006'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da21,pass22: SAS Device Slot Element: 1 Phys at Slot 6, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9f
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da22,pass23: Element descriptor: 'SLOT 007'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da22,pass23: SAS Device Slot Element: 1 Phys at Slot 7, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fda0
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da7,pass8: Element descriptor: 'SLOT 008'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da7,pass8: SAS Device Slot Element: 1 Phys at Slot 8, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd91
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da8,pass9: Element descriptor: 'SLOT 009'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da8,pass9: SAS Device Slot Element: 1 Phys at Slot 9, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd92
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da9,pass10: Element descriptor: 'SLOT 010'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da9,pass10: SAS Device Slot Element: 1 Phys at Slot 10, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd93
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da10,pass11: Element descriptor: 'SLOT 011'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da10,pass11: SAS Device Slot Element: 1 Phys at Slot 11, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd94
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da11,pass12: Element descriptor: 'SLOT 012'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da11,pass12: SAS Device Slot Element: 1 Phys at Slot 12, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd95
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da12,pass13: Element descriptor: 'SLOT 013'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da12,pass13: SAS Device Slot Element: 1 Phys at Slot 13, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd96
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da13,pass14: Element descriptor: 'SLOT 014'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da13,pass14: SAS Device Slot Element: 1 Phys at Slot 14, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd97
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da14,pass15: Element descriptor: 'SLOT 015'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da14,pass15: SAS Device Slot Element: 1 Phys at Slot 15, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd98
---

And at 17:41, something similar:

---
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da0,pass0: Element descriptor: 'SLOT 016'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da0,pass0: SAS Device Slot Element: 1 Phys at Slot 16, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d721
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da1,pass1: Element descriptor: 'SLOT 017'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da1,pass1: SAS Device Slot Element: 1 Phys at Slot 17, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d722
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da2,pass2: Element descriptor: 'SLOT 018'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da2,pass2: SAS Device Slot Element: 1 Phys at Slot 18, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d723
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da3,pass3: Element descriptor: 'SLOT 019'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da3,pass3: SAS Device Slot Element: 1 Phys at Slot 19, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d724
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da4,pass4: Element descriptor: 'SLOT 020'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da4,pass4: SAS Device Slot Element: 1 Phys at Slot 20, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d725
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da5,pass5: Element descriptor: 'SLOT 021'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da5,pass5: SAS Device Slot Element: 1 Phys at Slot 21, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d726
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da6,pass6: Element descriptor: 'SLOT 022'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da6,pass6: SAS Device Slot Element: 1 Phys at Slot 22, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d727
---

After those events I did a scrub just in case, and no errors were found. Could it be some expander oddity that somehow confused the LSI3008 and not the LSI2008?

The system is working like a charm anyway, but I wonder if there’s some non-obvious problem waiting to become a time bomb.

Regarding IBM, well, unless we can fix this, this expensive piece of hardware will be scrapped. And I really doubt any piece of kit from IBM/Lenovo (it seems that Lenovo is in charge of support for these servers now) will be purchased here on my watch, ever.

Thanks,

Borja.