Date: Tue, 11 Sep 2018 14:13:36 -0700
From: Dave Robison <davewrobison@gmail.com>
To: freebsd-current@freebsd.org
Cc: cayford.burrell@fisglobal.com, hiro.matsunami@necam.com, rainier@ultra-secure.de
Subject: Routine Panic on HP Proliant G10
Message-ID: <0BFC62B6-FA94-4FA9-9AC9-3200DB7007BF@gmail.com>
Hiya,

I'm currently evaluating two classes of server which we source through NEC; however, the motherboards for these machines are HP. I can routinely panic both of these machines using 12.0-A4, as well as 11.1-R with a shoehorned-in SES/SMARTPQI driver, and 11.2-R with its native SES/SMARTPQI driver. NEC seems to think this is a ZFS issue, and they may be correct. If so, I suspect ARC, though as I explain further down, I haven't had a problem on other hardware.

I've managed to get a core dump on 11.1 and 11.2, but on 12.0, when the panic occurs, I can backtrace and force a panic and the system claims it is writing out a core dump, yet on reboot there is no core dump.

Machine A: HP ProLiant DL360 Gen10 with a Xeon Bronze 3106, 16 gigs of RAM, and three hard drives.

Machine B: HP ProLiant DL380 Gen10 with a Xeon Silver 4114, 32 gigs of RAM, and five hard drives.

I install 12.0-A4 using ZFS on root, with 8 gigs of swap but otherwise a standard FreeBSD install. I can panic these machines rather easily in 10-15 minutes by firing up six instances of bonnie++ and a few memtesters, three using 2g and three using 4g (rough reproduction commands are below). I've done this on the 11.x installs without memtester and gotten panics within 10-15 minutes. Those gave me core dumps, but the panic error is different from the one on 12.0-A4. I have run some tests using UFS2 and did not manage to force a panic.

At first I thought the problem was the HPE RAID card, which uses the SES driver, so I put in a recent LSI MegaRAID card using the MRSAS driver, and I can panic that as well. I've managed to panic Machine B while using either RAID card to create two mirrors and one hot spare, and I've also managed to panic it when letting the RAID cards pass the hard drives through so I could create a raidz of four drives and one hot spare. I know many people immediately think "Don't use a RAID card with ZFS!", but I've done this for years without a problem using the LSI MegaRAID in a variety of configurations.

It really seems to me that when ARC starts to ramp up and hits a lot of memory contention, a panic occurs. However, I've been running the same test on a previous-generation NEC server with an LSI MegaRAID using the MRSAS driver under 11.2-R, and it has been running like clockwork for 11 days. We use this iteration of server extensively. If this were a problem with ARC, I assume (perhaps presumptuously) that I would see the same problems there. I also have servers running 11.2-R with ZFS and rather large, very heavily used JBOD arrays, and have never had an issue.
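For reference, the load I described above is roughly the following (the directory names are illustrative, not copied from the test box; I point bonnie++ at scratch directories on the pool):

    # six bonnie++ instances hammering the pool in parallel
    for i in 1 2 3 4 5 6; do
        mkdir -p /stress/$i
        bonnie++ -d /stress/$i -u root > /stress/$i.log 2>&1 &
    done

    # memory pressure: three 2g memtesters and three 4g memtesters,
    # left looping until the machine panics or I stop them
    for sz in 2g 2g 2g 4g 4g 4g; do
        memtester $sz > /dev/null &
    done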
The HPE RAID card info, from pciconf -lv:

smartpqi0@pci0:92:0:0: class=0x010700 card=0x0654103c chip=0x028f9005 rev=0x01 hdr=0x00
    vendor     = 'Adaptec'
    device     = 'Smart Storage PQI 12G SAS/PCIe 3'
    class      = mass storage
    subclass   = SAS

And from dmesg:

root@hvm2d:~ # dmesg | grep smartpq
smartpqi0: <E208i-a SR Gen10> port 0x8000-0x80ff mem 0xe6c00000-0xe6c07fff at device 0.0 on pci9
smartpqi0: using MSI-X interrupts (40 vectors)
da0 at smartpqi0 bus 0 scbus0 target 0 lun 0
da1 at smartpqi0 bus 0 scbus0 target 1 lun 0
ses0 at smartpqi0 bus 0 scbus0 target 69 lun 0
pass3 at smartpqi0 bus 0 scbus0 target 1088 lun 0

However, since I can panic these with either RAID card, I don't suspect the HPE RAID card as the culprit.

Here is an image with the scant bt info I got from the last panic:

https://ibb.co/dzFOn9

This thread from Saturday on -stable sounded all too familiar:

https://lists.freebsd.org/pipermail/freebsd-stable/2018-September/089623.html

I'm at a loss, so I have gathered as much info as I can to anticipate questions and requests for more info. I'm hoping someone can point me in the right direction for further troubleshooting, or at least toward isolating the problem to a specific area.

Thanks for your time,

Dave
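P.S. In case the pool layout matters: for the passthrough test on Machine B, the raidz pool was built along these lines (the pool name and device numbering here are illustrative, not copied from the box):

    zpool create tank raidz da0 da1 da2 da3 spare da4

The two-mirrors-plus-spare runs used volumes built by the RAID card itself, with ZFS on top of the resulting logical drives.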