Date:      Tue, 11 Sep 2018 14:13:36 -0700
From:      Dave Robison <davewrobison@gmail.com>
To:        freebsd-current@freebsd.org
Cc:        cayford.burrell@fisglobal.com, hiro.matsunami@necam.com, rainier@ultra-secure.de
Subject:   Routine Panic on HP Proliant G10
Message-ID:  <0BFC62B6-FA94-4FA9-9AC9-3200DB7007BF@gmail.com>

Hiya,

I'm currently evaluating two classes of server that we source through NEC,
though the motherboards in these machines are HP. I can routinely panic both
machines using 12.0-A4, as well as 11.1-R with a shoehorned-in SES/SMARTPQI
driver and 11.2-R with its native SES/SMARTPQI driver. NEC seems to think
this is a ZFS issue, and they may be correct. If so, I suspect ARC, though as
I explain further down, I haven't had a problem on other hardware.

I've managed to get a core dump on 11.1 and 11.2. On 12.0, when the panic
occurs, I can get a backtrace and force a panic, and the system claims it is
writing out a core dump, but after the reboot there is no core dump.

Machine A: HP ProLiant DL360 Gen 10 with a Xeon Bronze 3106, 16 GB of RAM,
and three hard drives.

Machine B: HP ProLiant DL380 Gen 10 with a Xeon Silver 4114, 32 GB of RAM,
and five hard drives.

I install 12.0-A4 using ZFS on root with 8 GB of swap; otherwise it's a
standard FreeBSD install. I can panic these machines rather easily, within
10-15 minutes, by firing up 6 instances of bonnie++ and six memtesters, three
using 2 GB and three using 4 GB. I've done the same on the 11.x installs
without memtester and gotten panics within 10-15 minutes. Those gave me core
dumps, but the panic error is different from the one on 12.0-A4. I have run
some tests using UFS2 and did not manage to force a panic.
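
For anyone who wants to try a similar load, here is a minimal sketch of it.
The exact options are not stated above, so everything here is an assumption:
TEST_DIR, RUN_USER and DRY_RUN are hypothetical names, and the bonnie++ flags
are just the basic -d (directory) and -u (user) options. By default it only
prints the commands it would start.

```shell
#!/bin/sh
# Sketch of the load described above: 6 bonnie++ runs plus 6 memtesters
# (three at 2G, three at 4G). All variable names below are hypothetical.
TEST_DIR="${TEST_DIR:-/tmp/bonnie}"   # assumed scratch directory on the ZFS pool
RUN_USER="${RUN_USER:-root}"          # bonnie++ wants -u when started as root
DRY_RUN="${DRY_RUN:-1}"               # 1 = print commands only, 0 = really run

# Print the command in dry-run mode, otherwise start it in the background.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@" >/dev/null 2>&1 &
    fi
}

start_load() {
    i=1
    while [ "$i" -le 6 ]; do
        # Each bonnie++ instance gets its own working directory.
        [ "$DRY_RUN" = "1" ] || mkdir -p "$TEST_DIR/$i"
        run bonnie++ -d "$TEST_DIR/$i" -u "$RUN_USER"
        i=$((i + 1))
    done
    for size in 2G 2G 2G 4G 4G 4G; do
        run memtester "$size"
    done
    # In a real run, block until the background jobs exit (or the box panics).
    [ "$DRY_RUN" = "1" ] || wait
}

start_load
```

Running it with DRY_RUN=0 starts the actual load. Note that the memtesters
alone ask for 18 GB, which is more than Machine A has in RAM.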

At first I thought the problem was the HPE RAID card, which uses the SES
driver, so I put in a recent LSI MegaRAID card using the MRSAS driver, and I
can panic that as well. I've managed to panic Machine B while it was using
either RAID card to create two mirrors and one hot spare, and I've managed to
panic it when letting the RAID cards pass the hard drives through so I could
create a raidz of 4 drives and one hot spare. I know many people immediately
think "Don't use a RAID card with ZFS!" but I've done this for years without
a problem using the LSI MegaRAID in a variety of configurations.

It really seems to me that when ARC starts to ramp up and hits a lot of
memory contention, a panic occurs. However, I've been running the same test
on a previous-generation NEC server with an LSI MegaRAID using the MRSAS
driver under 11.2-R, and it has been running like clockwork for 11 days. We
use this iteration of server extensively. If this were a problem with ARC, I
assume (perhaps presumptuously) that I would see the same problems there. I
also have servers running 11.2-R with ZFS and rather large, very heavily used
JBOD arrays, and have never had an issue.
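
One way to check the ARC hunch is to log the ARC size against its cap while
the load runs, so the last samples before a panic show how far it had grown.
This is only a sketch: INTERVAL and the function names are hypothetical, and
it assumes the kstat.zfs.misc.arcstats.size and kstat.zfs.misc.arcstats.c_max
sysctls are present on the release being tested.

```shell
#!/bin/sh
# Sample ARC size and cap every few seconds and print one line per sample.
INTERVAL="${INTERVAL:-5}"   # hypothetical sampling interval in seconds

# Convert a byte count to whole MiB.
to_mib() {
    echo $(( $1 / 1048576 ))
}

# Format one sample; arguments are ARC size and ARC cap, both in bytes.
arc_line() {
    echo "arc_size=$(to_mib "$1")MiB arc_max=$(to_mib "$2")MiB"
}

# Loop until interrupted; stop early if either sysctl cannot be read.
watch_arc() {
    while :; do
        size=$(sysctl -n kstat.zfs.misc.arcstats.size) || return 1
        max=$(sysctl -n kstat.zfs.misc.arcstats.c_max) || return 1
        echo "$(date +%T) $(arc_line "$size" "$max")"
        sleep "$INTERVAL"
    done
}

# Only start sampling when invoked as: sh arcwatch.sh run
if [ "${1:-}" = "run" ]; then
    watch_arc
fi
```

Running it over an ssh session or a serial console from another machine keeps
the output around after the box goes down.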

The HPE RAID card info, from pciconf -lv:

smartpqi0@pci0:92:0:0:  class=0x010700 card=0x0654103c chip=0x028f9005 rev=0x01 hdr=0x00
    vendor     = 'Adaptec'
    device     = 'Smart Storage PQI 12G SAS/PCIe 3'
    class      = mass storage
    subclass   = SAS

And from dmesg:

root@hvm2d:~ # dmesg | grep smartpq
smartpqi0: <E208i-a SR Gen10> port 0x8000-0x80ff mem 0xe6c00000-0xe6c07fff at device 0.0 on pci9
smartpqi0: using MSI-X interrupts (40 vectors)
da0 at smartpqi0 bus 0 scbus0 target 0 lun 0
da1 at smartpqi0 bus 0 scbus0 target 1 lun 0
ses0 at smartpqi0 bus 0 scbus0 target 69 lun 0
pass3 at smartpqi0 bus 0 scbus0 target 1088 lun 0
smartpqi0: <E208i-a SR Gen10> port 0x8000-0x80ff mem 0xe6c00000-0xe6c07fff at device 0.0 on pci9
smartpqi0: using MSI-X interrupts (40 vectors)
da0 at smartpqi0 bus 0 scbus0 target 0 lun 0
da1 at smartpqi0 bus 0 scbus0 target 1 lun 0
ses0 at smartpqi0 bus 0 scbus0 target 69 lun 0
pass3 at smartpqi0 bus 0 scbus0 target 1088 lun 0

However, since I can panic these machines with either RAID card, I don't
suspect the HPE RAID card as the culprit.

Here is an image with the scant bt info I got from the last panic:

https://ibb.co/dzFOn9

This thread from Saturday on -stable sounded all too familiar:

https://lists.freebsd.org/pipermail/freebsd-stable/2018-September/089623.html

I'm at a loss, so I have gathered as much info as I can to anticipate
questions and requests for more info. I'm hoping someone can point me in the
right direction for further troubleshooting, or at least help isolate the
problem to a specific area.

Thanks for your time,

Dave



