Date: Mon, 17 Dec 2018 16:52:22 +0100
From: Mark Martinec
To: freebsd-stable@freebsd.org
Subject: mps and LSI SAS2308: controller resets on 12.0 - IOC Fault 0x40000d04, Resetting
Organization: Jozef Stefan Institute
Message-ID: <515deae15368aaa8c8deb241e71f87db@ijs.si>

One of our servers that was upgraded from 11.2 to 12.0 (to RC2 initially, then to RC3 and lastly to 12.0-RELEASE) is suffering severe instability of its disk controller, which resets itself a couple of times a day. The resets are usually associated with high disk usage (like poudriere builds, zfs scrub or nightly file system scans). The same setup was rock-solid under 11.2 (and still/again is).

The disk controller is an LSI SAS2308. It has four disks attached as JBODs, one pair of SSDs and one pair of hard disks, each pair forming its own zpool. A controller reset can occur regardless of which pair is in heavy use.

The following can be found in the logs just before the machine becomes unusable (although it is not always logged, as the disks may be dropped before syslog has a chance of writing anything):

  xxx kernel: [2382] mps0: IOC Fault 0x40000d04, Resetting
  xxx kernel: [2382] mps0: Reinitializing controller
  xxx kernel: [2383] mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd
  xxx kernel: [2383] mps0: IOCCapabilities: 5a85c
  xxx kernel: [2383] (da0:mps0:0:0:0): Invalidating pack

The IOC Fault location is always the same.

Apparently the disk controller resets, all disk devices are dropped and ZFS finds itself with no disks.
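For anyone who wants to check whether the fault code really is identical across resets, here is a minimal sketch in Python. It assumes log lines in the same format as the excerpt quoted in this message; the function name count_ioc_faults and the sample list are illustrative only and not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   "xxx kernel: [2382] mps0: IOC Fault 0x40000d04, Resetting"
FAULT_RE = re.compile(r"(mps\d+): IOC Fault (0x[0-9a-fA-F]+)")


def count_ioc_faults(lines):
    """Return a Counter keyed by (controller, fault_code)."""
    faults = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            faults[(match.group(1), match.group(2).lower())] += 1
    return faults


# Sample copied from the log excerpt quoted in this message.
sample = [
    "xxx kernel: [2382] mps0: IOC Fault 0x40000d04, Resetting",
    "xxx kernel: [2382] mps0: Reinitializing controller",
    "xxx kernel: [2383] (da0:mps0:0:0:0): Invalidating pack",
]

if __name__ == "__main__":
    for (controller, code), n in count_ioc_faults(sample).items():
        print(f"{controller} {code}: {n}")
```

Feeding it the lines of a saved log file instead of the sample would show at a glance whether more than one distinct fault code ever appears.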
The machine still responds to ping, and if I am logged in during the event and running zpool status -v 1, zfs reports loss of all devices for each pool:

  pool: data0
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub repaired 0 in 0 days 03:53:41 with 0 errors on Sat Nov 17 00:22:38 2018
config:

        NAME                      STATE     READ WRITE CKSUM
        data0                     UNAVAIL      0     0     0
          mirror-0                UNAVAIL      0    24     0
            2396428274137360341   REMOVED      0     0     0  was /dev/gpt/da2-PN1334PCKAKD4S
            16738407333921736610  REMOVED      0     0     0  was /dev/gpt/da3-PN2338P4GJ1XYC

(and similar for the other pool)

At this point the machine is unusable and needs to be hard-reset.

My guess is that after the controller resets, the disk devices come up again (according to the report seen on the console, which states 'periph destroyed' first, then lists full info on each disk), but zfs ignores them.

I don't see any mention of changes to the mps driver in the 12.0 release notes, although diffing its sources between 11.2 and 12.0 shows plenty of nontrivial changes.

After suffering this instability for some time, I finally downgraded the OS to 11.2, and things are back to normal again!

This downgrade path was nontrivial, as I had foolishly upgraded the pool features to what comes with 12.0. Downgrading therefore involved dismantling both zfs mirror pools, recreating the pools without the two new features, and copying the data with zfs send/receive, all while the machine hung during some of these operations. Not something for the faint of heart. I know, foolish of me to upgrade pools after just one day of uptime with 12.0.
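Before upgrading pool features it can help to see which feature flags a pool already has in which state. The Python sketch below groups the feature@ properties by their value; it assumes the usual four-column NAME PROPERTY VALUE SOURCE text that 'zpool get all' prints, and the function name features_by_state, the pool name tank and the sample text are made up for illustration (they are not from the machine described here). Features reported as active have changed the on-disk format, which is generally what prevents an import on software lacking that feature.

```python
def features_by_state(zpool_get_output):
    """Group feature@ properties by value (disabled, enabled, active)."""
    states = {}
    for line in zpool_get_output.splitlines():
        fields = line.split()
        # Expected columns: NAME PROPERTY VALUE SOURCE
        if len(fields) < 3 or not fields[1].startswith("feature@"):
            continue  # skips the header row and non-feature properties
        name = fields[1][len("feature@"):]
        states.setdefault(fields[2], []).append(name)
    return states


# Made-up sample for illustration; not output from a real pool.
sample = """\
NAME   PROPERTY               VALUE     SOURCE
tank   size                   928G      -
tank   feature@async_destroy  enabled   local
tank   feature@lz4_compress   active    local
tank   feature@large_blocks   disabled  local
"""

if __name__ == "__main__":
    for state, names in sorted(features_by_state(sample).items()):
        print(f"{state}: {', '.join(names)}")
```

Comparing the result from before and after an upgrade would show exactly which features a 'zpool upgrade' has turned on.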
Some info on the controller:

  kernel: mps0: port 0xf000-0xf0ff mem 0xfbe40000-0xfbe4ffff,0xfbe00000-0xfbe3ffff irq 64 at device 0.0 numa-domain 1 on pci11
  kernel: mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd

mpsutil shows:

  mps0 Adapter:
         Board Name: LSI2308-IT
     Board Assembly:
          Chip Name: LSISAS2308
      Chip Revision: ALL
      BIOS Revision: 7.39.00.00
  Firmware Revision: 20.00.02.00
    Integrated RAID: no

So, what has changed in the mps driver for this to be happening? Would it be possible to take the mps driver sources from 11.2, transplant them to 12.0, recompile, and use that? Could the new mps driver be using some new feature of the controller and hitting a firmware bug? I have resisted upgrading the SAS2308 firmware and its BIOS, as it is working very well under 11.2.

Has anyone else seen problems with the mps driver and the LSI SAS2308 controller? (btw, on another machine the mps driver with an LSI SAS2004 is working just fine under 12.0)

  Mark