From: Dave Robison
Date: Tue, 11 Sep 2018 14:13:36 -0700
Subject: Routine Panic on HP Proliant G10
Message-Id: <0BFC62B6-FA94-4FA9-9AC9-3200DB7007BF@gmail.com>
To: freebsd-current@freebsd.org
Cc: cayford.burrell@fisglobal.com, hiro.matsunami@necam.com, rainier@ultra-secure.de

Hiya,

I'm currently evaluating two classes of server which we source through NEC. However, the motherboards for these machines are HP. I can routinely panic both of these machines using 12.0-A4, as well as 11.1-R with a shoehorned in SES/SMARTPQI driver, and 11.2-R with its native SES/SMARTPQI driver. NEC seems to think this is a ZFS issue and they may be correct.
If so, I suspect ARC, though as I explain further down, I haven't had a problem on other hardware.

I've managed to get a core dump on 11.1 and 11.2, but on 12.0 when the panic occurs, I can backtrace and force a panic and the system claims it is writing out a core dump, but on reboot there is no core dump.

Machine A: HP ProLiant DL360 Gen 10 with a Xeon Bronze 3106 and 16 gigs RAM and three hard drives.
Machine B: HP ProLiant DL380 Gen 10 with a Xeon Silver 4114 and 32 gigs RAM and five hard drives.

I install 12.0-A4 using ZFS on root. I install with 8 gigs of swap but otherwise it's a standard FreeBSD install. I can panic these machines rather easily in 10-15 minutes by firing up 6 instances of bonnie++ and a few memtesters, three using 2g and three using 4g. I've done this on the 11.x installs without memtester and gotten panics within 10-15 minutes. Those gave me core dumps, but the panic error is different than with 12.0-A4. I have run some tests using UFS2 and did not manage to force a panic.

At first I thought the problem was the HPE RAID card which uses the SES driver, so I put in a recent LSI MegaRAID card using the MRSAS driver, and can panic that as well. I've managed to panic Machine B while it was using either RAID card to create two mirrors and one hot spare, and I've managed to panic it when letting the RAID cards pass through the hard drives so I could create a raidz of 4 drives and one hot spare. I know many people immediately think "Don't use a RAID card with ZFS!" but I've done this for years without a problem using the LSI MegaRAID in a variety of configurations.

It really seems to me that when ARC starts to ramp up and hits a lot of memory contention, a panic occurs. However, I've been running the same test on a previous generation NEC server with an LSI MegaRAID using the MRSAS driver under 11.2-R and it has been running like clockwork for 11 days. We use this iteration of server extensively.
If this were a problem with ARC, I assume (perhaps presumptuously) that I would see the same problems. I also have servers running 11.2-R with ZFS and rather large and very heavily used JBOD arrays and have never had an issue.

The HPE RAID card info, from pciconf -lv:

smartpqi0@pci0:92:0:0: class=0x010700 card=0x0654103c chip=0x028f9005 rev=0x01 hdr=0x00
    vendor     = 'Adaptec'
    device     = 'Smart Storage PQI 12G SAS/PCIe 3'
    class      = mass storage
    subclass   = SAS

And from dmesg:

root@hvm2d:~ # dmesg | grep smartpq
smartpqi0: port 0x8000-0x80ff mem 0xe6c00000-0xe6c07fff at device 0.0 on pci9
smartpqi0: using MSI-X interrupts (40 vectors)
da0 at smartpqi0 bus 0 scbus0 target 0 lun 0
da1 at smartpqi0 bus 0 scbus0 target 1 lun 0
ses0 at smartpqi0 bus 0 scbus0 target 69 lun 0
pass3 at smartpqi0 bus 0 scbus0 target 1088 lun 0
smartpqi0: port 0x8000-0x80ff mem 0xe6c00000-0xe6c07fff at device 0.0 on pci9
smartpqi0: using MSI-X interrupts (40 vectors)
da0 at smartpqi0 bus 0 scbus0 target 0 lun 0
da1 at smartpqi0 bus 0 scbus0 target 1 lun 0
ses0 at smartpqi0 bus 0 scbus0 target 69 lun 0
pass3 at smartpqi0 bus 0 scbus0 target 1088 lun 0

However, since I can panic these with either RAID card, I don't suspect the HPE RAID card as the culprit.

Here is an image with the scant bt info I got from the last panic:

https://ibb.co/dzFOn9

This thread from Saturday on -stable sounded all too familiar:

https://lists.freebsd.org/pipermail/freebsd-stable/2018-September/089623.html

I'm at a loss so I have gathered as much info as I can to predict questions and requests for more info. Hoping someone can point me in the right direction for further troubleshooting or at least isolation of the problem to a specific area.

Thanks for your time,

Dave