Date: Tue, 7 Feb 2012 11:01:38 GMT
From: Johannes Reinhard <johannes.reinhard@physik.uni-erlangen.de>
To: freebsd-gnats-submit@FreeBSD.org
Subject: kern/164844: [zfs] [mpt] Kernel Panic with ZFS and LSI Logic SAS/SATA controller
Message-ID: <201202071101.q17B1ch8082640@red.freebsd.org>
Resent-Message-ID: <201202071110.q17BABwF083606@freefall.freebsd.org>
>Number: 164844
>Category: kern
>Synopsis: [zfs] [mpt] Kernel Panic with ZFS and LSI Logic SAS/SATA controller
>Confidential: no
>Severity: serious
>Priority: low
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Tue Feb 07 11:10:11 UTC 2012
>Closed-Date:
>Last-Modified:
>Originator: Johannes Reinhard
>Release: FreeBSD 8.1
>Organization:
FAU Erlangen-Nürnberg
>Environment:
FreeBSD fileserv 8.1-RELEASE-p5 FreeBSD 8.1-RELEASE-p5 #0: Tue Sep 27 16:49:00 UTC 2011 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
>Description:
We are using a SUN Fire v40z with two different LSI controllers both managed by the mpt driver. One is used for the system drives (mpt0), the other one (mpt1) for an external storage enclosure in JBOD mode (Sun J4200).

# pciconf -lvv | grep <...>
mpt0@pci0:2:4:0: class=0x010000 card=0x002017c2 chip=0x00301000 rev=0x08 hdr=0x00
    vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
    device = 'PCI-X to Ultra320 SCSI Controller (LSI53C1020/1030)'
    class = mass storage
    subclass = SCSI
mpt1@pci0:36:1:0: class=0x010000 card=0x30e01000 chip=0x00541000 rev=0x02 hdr=0x00
    vendor = 'LSI Logic (Was: Symbios Logic, NCR)'
    device = 'SAS 3000 series, 8-port with 1068 -StorPort'
    class = mass storage
    subclass = SCSI

# mptutil -u 0 show adapter
mpt0 Adapter:
    Board Name: 0
    Board Assembly: 0
    Chip Name: C1030
    Chip Revision: 0
    RAID Levels: RAID1, RAID1E

# mptutil -u 1 show adapter
mpt1 Adapter:
    Board Name: SAS3801X
    Board Assembly: L3-00146-02D
    Chip Name: C1068
    Chip Revision: UNUSED
    RAID Levels: none

On the JBOD we are using ZFS. When the kernel panic occurs, typical symptoms in /var/log/messages look like this ..

Feb 5 15:14:55 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=654675777536 size=14848
Feb 5 15:14:55 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da2 offset=654676050432 size=14848
Feb 5 15:14:55 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da9 offset=654676182016 size=14848
Feb 5 15:15:12 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=655119295488 size=14848
Feb 5 15:15:12 fileserv kernel: mpt1: mpt_intr: no target cmd ptrs
Feb 5 15:15:16 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=461695998976 size=14848
Feb 5 15:15:50 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da10 offset=655741660160 size=14336
Feb 5 15:15:50 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=655741674496 size=14336
Feb 5 15:15:52 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=655768189440 size=14848
Feb 5 15:15:52 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=655768204288 size=14848

.. or immediately before failure like this (captured via remote console, normally most of the log is lost after reboot) ..

Feb 7 00:21:01 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=909219043840 size=14848
Feb 7 00:21:01 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=909219087872 size=14848
Feb 7 00:21:01 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=909219293184 size=14848
Feb 7 00:21:03 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=323225315328 size=14336
Feb 7 00:21:03 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da11 offset=323225417728 size=14336
Feb 7 00:21:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=323291898880 size=14336
Feb 7 00:21:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=323291869184 size=14336
Feb 7 00:21:27 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=516894794752 size=14336
Feb 7 00:21:27 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=516894795264 size=14336
Feb 7 00:21:31 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da9 offset=517280401920 size=14336
Feb 7 00:21:31 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da6 offset=517280489984 size=14848
Feb 7 00:21:31 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=517280182272 size=14336
Feb 7 00:21:33 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da2 offset=813396810240 size=14848
Feb 7 00:21:37 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da12 offset=813434911744 size=14336
Feb 7 00:21:37 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=813435424768 size=14336
Feb 7 00:21:38 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=517349635584 size=14336
Feb 7 00:21:38 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=517349723648 size=14848
Feb 7 00:21:38 fileserv kernel: mpt1: Context Reply 0x00000003?
Feb 7 00:21:38 fileserv kernel:
Feb 7 00:21:40 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da2 offset=517380763648 size=14848
Feb 7 00:21:40 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=517380807680 size=14848
Feb 7 00:21:44 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=517434250752 size=14848
Feb 7 00:21:44 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=517434368512 size=14336
Feb 7 00:21:44 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da6 offset=517434265600 size=14336
Feb 7 00:21:45 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=813524531200 size=14336
Feb 7 00:21:45 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da2 offset=813524502016 size=14336
Feb 7 00:22:38 fileserv kernel: mpt1: request 0xffffff80003dc610:58518 timed out for ccb 0xffffff0011cf1000 (req->ccb 0xffffff0011cf1000)
Feb 7 00:22:38 fileserv kernel: mpt1:
Feb 7 00:22:38 fileserv kernel: attempting to abort req 0xffffff80003dc610:58518 function 0
Feb 7 00:22:39 fileserv kernel: mpt1: mpt_wait_req(1) timed out
Feb 7 00:22:39 fileserv kernel:
Feb 7 00:22:39 fileserv kernel: mpt1: mpt_recover_commands: abort timed-out. Resetting controller
Feb 7 00:22:39 fileserv kernel: mpt1: mpt_cam_event: 0x0
Feb 7 00:22:39 fileserv kernel: mpt1: mpt_cam_event: 0x0
Feb 7 00:22:39 fileserv kernel: mpt1: completing timedout/aborted req 0xffffff80003dc610:58518
Feb 7 00:22:51 fileserv kernel: mpt1: mpt_cam_event: 0x16
Feb 7 00:22:51 fileserv kernel: mpt1: mpt_cam_event: 0x12
Feb 7 00:22:51 fileserv kernel: mpt1: mpt_cam_event: 0x1b
Feb 7 00:22:51 fileserv kernel: mpt1: mpt_cam_event: 0x12
Feb 7 00:22:51 fileserv last message repeated 19 times
Feb 7 00:22:51 fileserv kernel: mpt1: mpt_cam_event: 0x16
Feb 7 00:22:51 fileserv last message repeated 2 times
Feb 7 00:22:51 fileserv kernel: (da5:mpt1:0:4:0): READ(10). CDB: 28 0 3c 3a 6b 64 0 0 72 0
Feb 7 00:22:51 fileserv kernel: (da5:mpt1:0:4:0): CAM status: SCSI Status Error
Feb 7 00:22:51 fileserv kernel: (da5:mpt1:0:4:0): SCSI status: Check Condition
Feb 7 00:22:51 fileserv kernel: (da5:mpt1:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 7 00:22:51 fileserv kernel: (da6:mpt1:0:5:0): READ(10). CDB: 28 0 a 4e dd 3 0 0 1c 0
Feb 7 00:22:54 fileserv kernel:
Feb 7 00:22:54 fileserv kernel: (da6:mpt1:0:5:0): CAM status: SCSI Status Error
Feb 7 00:22:54 fileserv kernel: (da6:mpt1:0:5:0): SCSI status: Check Condition
Feb 7 00:22:54 fileserv kernel: (da6:mpt1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 7 00:22:54 fileserv kernel: (da3:mpt1:0:2:0): READ(10). CDB: 28 0 a 4e dd 3 0 0 1c 0
Feb 7 00:22:54 fileserv kernel: (da3:mpt1:0:2:0): CAM status: SCSI Status Error
Feb 7 00:22:54 fileserv kernel: (da3:mpt1:0:2:0): SCSI status: Check Condition
Feb 7 00:22:54 fileserv kernel: (da3:mpt1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 7 00:22:54 fileserv kernel: (da2:mpt1:0:1:0): READ(10). CDB: 28 0 a 4e dd 3 0 0 1c 0
Feb 7 00:22:54 fileserv kernel: (da2:mpt1:0:1:0): CAM status: SCSI Status Error
Feb 7 00:22:54 fileserv kernel: (da2:mpt1:0:1:0): SCSI status: Check Condition
Feb 7 00:22:54 fileserv kernel: (da2:mpt1:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 7 00:22:54 fileserv kernel: (da1:mpt1:0:0:0): READ(10). CDB: 28 0 a 4e dd 3 0 0 1c 0
Feb 7 00:22:54 fileserv kernel: (da1:mpt1:0:0:0): CAM status: SCSI Status Error
Feb 7 00:22:54 fileserv kernel: (da1:mpt1:0:0:0): SCSI status: Check Condition
Feb 7 00:22:54 fileserv kernel: (da1:mpt1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 7 00:22:54 fileserv kernel: (da8:mpt1:0:7:0): READ(10). CDB: 28 0 a 4e dd 2 0 0 1d 0
Feb 7 00:22:54 fileserv kernel: (da8:mpt1:0:7:0): CAM status: SCSI Status Error
Feb 7 00:22:54 fileserv kernel: (da8:mpt1:0:7:0): SCSI status: Check Condition
Feb 7 00:22:54 fileserv kernel: (da8:mpt1:0:7:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 7 00:22:54 fileserv kernel: (da4:mpt1:0:3:0): READ(10). CDB: 28 0 a 4e dd 2 0 0 1d 0
Feb 7 00:22:54 fileserv kernel: (da4:mpt1:0:3:0): CAM status: SCSI Status Error
Feb 7 00:22:54 fileserv kernel: (da4:mpt1:0:3:0): SCSI status: Check Condition
Feb 7 00:22:54 fileserv kernel: (da4:mpt1:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 7 00:22:54 fileserv kernel: (da9:mpt1:0:8:0): READ(10). CDB: 28 0 a 4e dd 2 0 0 1d 0
Feb 7 00:22:54 fileserv kernel: (da9:mpt1:0:8:0): CAM status: SCSI Status Error
Feb 7 00:22:54 fileserv kernel: (da9:mpt1:0:8:0): SCSI status: Check Condition
Feb 7 00:22:54 fileserv kernel: (da9:mpt1:0:8:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 7 00:22:54 fileserv kernel: (da11:mpt1:0:10:0): READ(10). CDB: 28 0 a 4e dd 2 0 0 1d 0
Feb 7 00:22:54 fileserv kernel: (da11:mpt1:0:10:0): CAM status: SCSI Status Error
Feb 7 00:22:54 fileserv kernel: (da11:mpt1:0:10:0): SCSI status: Check Condition
Feb 7 00:22:54 fileserv kernel: (da11:mpt1:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 7 00:22:55 fileserv kernel: (da10:mpt1:0:9:0): READ(10). CDB: 28 0 a 4e dc e6 0 0 1c 0
Feb 7 00:22:55 fileserv kernel: (da10:mpt1:0:9:0): CAM status: SCSI Status Error
Feb 7 00:22:55 fileserv kernel: (da10:mpt1:0:9:0): SCSI status: Check Condition
Feb 7 00:22:55 fileserv kernel: (da10:mpt1:0:9:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 7 00:22:55 fileserv kernel: (da12:mpt1:0:11:0): READ(10). CDB: 28 0 a 4e dc e6 0 0 1c 0
Feb 7 00:22:55 fileserv kernel: (da12:mpt1:0:11:0): CAM status: SCSI Status Error
Feb 7 00:22:55 fileserv kernel: (da12:mpt1:0:11:0): SCSI status: Check Condition
Feb 7 00:22:55 fileserv kernel: (da12:mpt1:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 7 00:22:55 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=517352114176 size=14848
Feb 7 00:23:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=813768227840 size=14848
Feb 7 00:23:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=813768081408 size=14336
Feb 7 00:23:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da10 offset=813768580096 size=14336
Feb 7 00:23:33 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=193050448896 size=14336
Feb 7 00:23:39 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=193222309888 size=14336
Feb 7 00:23:39 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=193222353408 size=14336
Feb 7 00:23:44 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da2 offset=193329220096 size=14848
Feb 7 00:23:44 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=193329454592 size=14336
Feb 7 00:23:44 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da11 offset=193329483776 size=14848
Feb 7 00:23:54 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=188087916032 size=14848
Feb 7 00:23:59 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=188941305344 size=14848
Feb 7 00:23:59 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da10 offset=188941773312 size=14848
Feb 7 00:23:59 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=188941847552 size=14336
Feb 7 00:23:59 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=188941861888 size=14848
Feb 7 00:23:59 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da9 offset=188942083584 size=14336
Feb 7 00:24:01 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da11 offset=189180443648 size=14848
Feb 7 00:24:01 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=189179795456 size=14336
Feb 7 00:24:05 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=189496972800 size=14848
Feb 7 00:24:05 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da10 offset=189496565248 size=14848
Feb 7 00:24:05 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=189497314816 size=14848
Feb 7 00:24:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da12 offset=187531636736 size=14848
Feb 7 00:24:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=187531988480 size=14848
Feb 7 00:24:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=187528953344 size=14848
Feb 7 00:24:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da11 offset=187528983040 size=14848
Feb 7 00:24:12 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=188422677504 size=14336
Feb 7 00:24:12 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da6 offset=188423581184 size=14336
Feb 7 00:24:12 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da12 offset=188424466432 size=14336
Feb 7 00:24:13 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=189423034880 size=14848
Feb 7 00:24:13 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da11 offset=189424520192 size=14848

..

We updated the firmware to the latest version, but unfortunately the problem still occurred. After a lot of debugging, we believe this is most likely a driver issue. The failure typically occurs when both sets of drives (the system drives on mpt0 and the JBOD on mpt1) are used heavily at the same time, e.g. when a scrub is running on the zpool. Our current workaround is to avoid such periods (e.g. by choosing the backup schedule appropriately), and everything has worked fine so far. The worst that can happen is that a reboot is needed; at least there is no data loss. However, this is still neither satisfying nor reassuring.
>How-To-Repeat:
Start a zpool scrub. Additionally, put some load on the system drives. Wait.
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted:
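
To see how the checksum mismatches are distributed over the pool members, the ZFS lines quoted in the description can be tallied per device. What follows is only a minimal sketch in Python and not part of the original report: the function name tally_mismatches and the regular expression are illustrative assumptions derived from the message format shown above, and the sample lines are copied from the Feb 5 excerpt.

```python
import re
from collections import Counter

# Pattern for the ZFS messages as they appear in the quoted log excerpts.
# This is an assumption based on those excerpts, not an official format.
MISMATCH = re.compile(
    r"ZFS: checksum mismatch, zpool=(\S+) path=(\S+) offset=(\d+) size=(\d+)"
)


def tally_mismatches(lines):
    """Count checksum-mismatch messages per (pool, device path)."""
    counts = Counter()
    for line in lines:
        match = MISMATCH.search(line)
        if match:
            pool, path = match.group(1), match.group(2)
            counts[(pool, path)] += 1
    return counts


# Sample input: the first lines of the Feb 5 excerpt from the report.
SAMPLE = [
    "Feb 5 15:14:55 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=654675777536 size=14848",
    "Feb 5 15:14:55 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da2 offset=654676050432 size=14848",
    "Feb 5 15:14:55 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da9 offset=654676182016 size=14848",
    "Feb 5 15:15:12 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=655119295488 size=14848",
    "Feb 5 15:15:12 fileserv kernel: mpt1: mpt_intr: no target cmd ptrs",
    "Feb 5 15:15:16 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=461695998976 size=14848",
]

if __name__ == "__main__":
    # Print devices sorted by descending mismatch count.
    for (pool, path), count in tally_mismatches(SAMPLE).most_common():
        print(f"{pool} {path} {count}")
```

On a live system the same function could be given the lines of /var/log/messages instead of the sample list. Counts spread across many da devices rather than concentrated on a single one would be consistent with a controller- or driver-level cause, although such a tally alone does not prove it.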