Date:      Mon, 25 Apr 2016 10:29:04 +0200
From:      Borja Marcos <borjam@sarenet.es>
To:        Steven Hartland <killing@multiplay.co.uk>
Cc:        Scott Long <scott4long@yahoo.com>, FreeBSD-scsi <freebsd-scsi@freebsd.org>
Subject:   Re: mpr(4) SAS3008 Repeated Crashing, LSI's spiritual advice would be appreciated
Message-ID:  <610C4F08-C1A4-4AB4-87B3-1253C45F8C38@sarenet.es>
In-Reply-To: <56D96C84.7070507@multiplay.co.uk>
References:  <56D5FDB8.8040402@freebsd.org> <56D612FA.6090909@multiplay.co.uk> <A8859ECA-0B58-42A8-AA49-DF6AA3D52CC6@sarenet.es> <E74F5225-1EA8-4B60-ADDC-7B13E1003184@yahoo.com> <D7E0BCCE-EB44-4EF9-8F17-474C162F7D7C@sarenet.es> <56D805FD.50500@multiplay.co.uk> <F9B68610-12C6-4D32-88CA-A34A185F9AD1@sarenet.es> <F5E05621-FF84-4BED-B1A7-3252715CD53B@yahoo.com> <B2147AEC-2831-443C-8FA0-4148B37AAF95@sarenet.es> <56D95266.301@multiplay.co.uk> <BC3018EA-A1F3-4C7C-A179-58553457A938@sarenet.es> <56D96C84.7070507@multiplay.co.uk>


> On 04 Mar 2016, at 12:07, Steven Hartland <killing@multiplay.co.uk> wrote:
>
> On 04/03/2016 10:58, Borja Marcos wrote:
>>> On 04 Mar 2016, at 10:16, Steven Hartland <killing@multiplay.co.uk> wrote:
>>>
>>> It's very rare, but we've also seen this type of behaviour from a failing Intel CPU. There was no other indication the CPU had an issue, which one might expect, so just wanted to make you aware of the possibility.
>>>
>>> That said, the most common cause of this we've seen, when it's not a common disk or disks, is a bad backplane or cabling to the backplane.
>> Now I’m really curious!
>>
>> How did you determine that it was the CPU? And what kind of issue was it causing? Noise in the power rails? Interference?
> After a month or so of fixing mfi so it recovered from all bad events and prevented all the various kernel panics, the machine stayed running long enough to log an MCA which pointed to a failing CPU cache.
>
> We were lucky it was CPU #2, so we disabled all cores for said CPU in /boot/loader.conf and all the issues disappeared. We replaced the CPU and no more issues.
>
> We were in the same situation as you, two machines identical configs, one which was constantly panicking in mfi, the other was rock solid.

An update, long overdue. After the complete inaction of IBM’s so-called “support”, who blamed us for using non-official operating systems, we complained quite loudly (and harshly), and they agreed to “replace a backplane for mere reasons of customer satisfaction”. That was despite my insisting that they also bring an HBA, because we really didn’t know what was wrong.

So they sent a technician with one of the three almost passive boards of the backplane, even though I had told them that the issue was spread among the 24 disks, not just a group of 8. He changed one of them at random (I was on vacation when he came) and, as I expected, the issue wasn’t solved at all.

Tired of dealing with them, I pulled the SAS3 HBA and installed a classic LSI2008 card. A nightmare in itself, because the stupid firmware of the IBM hangs during boot (“connecting RAID adapters and boot devices” or something like that; I left it like that for 24 hours just to see if it eventually exited the loop). I had to erase the boot services flash from the HBA even though I had already disabled BIOS and UEFI services for the riser PCI card. Anyway, I digress.

Repeating all of our tests with the LSI2008 card, everything works like a charm, although I’ve seen some surprising behavior. I spent a lot of time running benchmarks. I could reproduce the error condition in less than an hour fairly reliably with the LSI3008 card, and I was unable to reproduce it with the LSI2008. Of course, these days this is as sure as you can be, unless someone presents you with a proper oscilloscope and SAS pod. I even suggested that to IBM, offering to do a serious diagnosis of the problem for them ;)

The odd behavior, for which LSI’s spiritual advice would be welcome, is this: 6 minutes after booting the system, while running a scrub in order to generate I/O load, and before beginning to run the error-triggering benchmarks, I saw some surprising messages in /var/log/messages:


------
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: Element descriptor: 'SLOT 000'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: SAS Device Slot Element: 1 Phys at Slot 0, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd99
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da16,pass17: Element descriptor: 'SLOT 001'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da16,pass17: SAS Device Slot Element: 1 Phys at Slot 1, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9a
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da17,pass18: Element descriptor: 'SLOT 002'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da17,pass18: SAS Device Slot Element: 1 Phys at Slot 2, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9b
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da18,pass19: Element descriptor: 'SLOT 003'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da18,pass19: SAS Device Slot Element: 1 Phys at Slot 3, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9c
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da19,pass20: Element descriptor: 'SLOT 004'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da19,pass20: SAS Device Slot Element: 1 Phys at Slot 4, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9d
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da20,pass21: Element descriptor: 'SLOT 005'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da20,pass21: SAS Device Slot Element: 1 Phys at Slot 5, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9e
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da21,pass22: Element descriptor: 'SLOT 006'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da21,pass22: SAS Device Slot Element: 1 Phys at Slot 6, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9f
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da22,pass23: Element descriptor: 'SLOT 007'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da22,pass23: SAS Device Slot Element: 1 Phys at Slot 7, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fda0
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da7,pass8: Element descriptor: 'SLOT 008'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da7,pass8: SAS Device Slot Element: 1 Phys at Slot 8, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd91
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da8,pass9: Element descriptor: 'SLOT 009'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da8,pass9: SAS Device Slot Element: 1 Phys at Slot 9, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd92
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da9,pass10: Element descriptor: 'SLOT 010'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da9,pass10: SAS Device Slot Element: 1 Phys at Slot 10, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd93
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da10,pass11: Element descriptor: 'SLOT 011'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da10,pass11: SAS Device Slot Element: 1 Phys at Slot 11, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd94
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da11,pass12: Element descriptor: 'SLOT 012'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da11,pass12: SAS Device Slot Element: 1 Phys at Slot 12, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd95
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da12,pass13: Element descriptor: 'SLOT 013'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da12,pass13: SAS Device Slot Element: 1 Phys at Slot 13, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd96
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da13,pass14: Element descriptor: 'SLOT 014'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da13,pass14: SAS Device Slot Element: 1 Phys at Slot 14, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd97
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da14,pass15: Element descriptor: 'SLOT 015'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da14,pass15: SAS Device Slot Element: 1 Phys at Slot 15, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd98
------



And at 17:41, something similar:



------
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da0,pass0: Element descriptor: 'SLOT 016'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da0,pass0: SAS Device Slot Element: 1 Phys at Slot 16, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d721
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da1,pass1: Element descriptor: 'SLOT 017'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da1,pass1: SAS Device Slot Element: 1 Phys at Slot 17, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d722
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da2,pass2: Element descriptor: 'SLOT 018'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da2,pass2: SAS Device Slot Element: 1 Phys at Slot 18, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d723
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da3,pass3: Element descriptor: 'SLOT 019'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da3,pass3: SAS Device Slot Element: 1 Phys at Slot 19, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d724
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da4,pass4: Element descriptor: 'SLOT 020'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da4,pass4: SAS Device Slot Element: 1 Phys at Slot 20, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d725
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da5,pass5: Element descriptor: 'SLOT 021'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da5,pass5: SAS Device Slot Element: 1 Phys at Slot 21, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d726
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da6,pass6: Element descriptor: 'SLOT 022'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da6,pass6: SAS Device Slot Element: 1 Phys at Slot 22, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d727
------


After those events I ran a scrub just in case, and no errors were found. Could it be some expander oddity that somehow confused the LSI3008 but not the LSI2008?

The system is working like a charm anyway, but I wonder if there’s some non-obvious problem waiting to become a time bomb.
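
To keep an eye on whether these enclosure announcements recur, the ses lines could be grouped by timestamp and enclosure. What follows is only a minimal sketch in Python, assuming the syslog line layout shown in the excerpts above (one message per line, as it appears in /var/log/messages). The function name summarize_ses_events and the regular expression are illustrative and not part of any FreeBSD tool.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: Element descriptor: 'SLOT 000'
# The pattern is an assumption based on the log excerpts in this message.
DESCRIPTOR_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(?P<ses>ses\d+):\s+(?P<dev>da\d+),pass\d+:\s+"
    r"Element descriptor:\s+'SLOT (?P<slot>\d+)'"
)


def summarize_ses_events(lines):
    """Group 'Element descriptor' lines by (timestamp, enclosure).

    Returns a dict mapping (timestamp, sesN) to a sorted list of
    (slot number, da device) tuples seen in that burst.
    """
    events = defaultdict(list)
    for line in lines:
        match = DESCRIPTOR_RE.match(line)
        if match:
            key = (match["stamp"], match["ses"])
            events[key].append((int(match["slot"]), match["dev"]))
    return {key: sorted(slots) for key, slots in events.items()}


if __name__ == "__main__":
    # Hypothetical invocation: read the system log and print one line per burst.
    with open("/var/log/messages", errors="replace") as log:
        for (stamp, ses), slots in summarize_ses_events(log).items():
            print(stamp, ses, len(slots), "slots:", slots)
```

Each burst then shows up as a single line, which makes it easier to see whether a whole enclosure was re-announced at once and whether the bursts line up with scrubs or benchmark runs.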

Regarding IBM, well, unless we can fix this, the expensive piece of hardware will be scrapped. And I really doubt any piece of kit from IBM/Lenovo (it seems that Lenovo is in charge of support for these servers now) will be purchased here on my watch, ever.



Thanks,

Borja.
