Date:      Tue, 7 Feb 2012 11:01:38 GMT
From:      Johannes Reinhard <johannes.reinhard@physik.uni-erlangen.de>
To:        freebsd-gnats-submit@FreeBSD.org
Subject:   kern/164844: [zfs] [mpt] Kernel Panic with ZFS and LSI Logic SAS/SATA controller
Message-ID:  <201202071101.q17B1ch8082640@red.freebsd.org>
Resent-Message-ID: <201202071110.q17BABwF083606@freefall.freebsd.org>


>Number:         164844
>Category:       kern
>Synopsis:       [zfs] [mpt] Kernel Panic with ZFS and LSI Logic SAS/SATA controller
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Feb 07 11:10:11 UTC 2012
>Closed-Date:
>Last-Modified:
>Originator:     Johannes Reinhard
>Release:        FreeBSD 8.1
>Organization:
FAU Erlangen-Nürnberg
>Environment:
FreeBSD fileserv 8.1-RELEASE-p5 FreeBSD 8.1-RELEASE-p5 #0: Tue Sep 27 16:49:00 UTC 2011     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
>Description:
We are using a SUN Fire v40z with two different LSI controllers both
managed by the mpt driver. One is used for the system drives (mpt0), the
other one (mpt1) for an external storage enclosure in JBOD mode (Sun J4200).

# pciconf -lvv | grep <...>
mpt0@pci0:2:4:0:        class=0x010000 card=0x002017c2 chip=0x00301000 rev=0x08 hdr=0x00
    vendor     = 'LSI Logic (Was: Symbios Logic, NCR)'
    device     = 'PCI-X to Ultra320 SCSI Controller (LSI53C1020/1030)'
    class      = mass storage
    subclass   = SCSI
mpt1@pci0:36:1:0:       class=0x010000 card=0x30e01000 chip=0x00541000 rev=0x02 hdr=0x00
    vendor     = 'LSI Logic (Was: Symbios Logic, NCR)'
    device     = 'SAS 3000 series, 8-port with 1068 -StorPort'
    class      = mass storage
    subclass   = SCSI

# mptutil -u 0 show adapter
mpt0 Adapter:
       Board Name: 0
   Board Assembly: 0
        Chip Name: C1030
    Chip Revision: 0
      RAID Levels: RAID1, RAID1E

# mptutil -u 1 show adapter
mpt1 Adapter:
       Board Name: SAS3801X
   Board Assembly: L3-00146-02D
        Chip Name: C1068
    Chip Revision: UNUSED
      RAID Levels: none


We use ZFS on the JBOD. When the kernel panic occurs, typical
symptoms in /var/log/messages look like this:

..
Feb  5 15:14:55 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=654675777536 size=14848
Feb  5 15:14:55 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da2 offset=654676050432 size=14848
Feb  5 15:14:55 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da9 offset=654676182016 size=14848
Feb  5 15:15:12 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=655119295488 size=14848
Feb  5 15:15:12 fileserv kernel: mpt1: mpt_intr: no target cmd ptrs
Feb  5 15:15:16 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=461695998976 size=14848
Feb  5 15:15:50 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da10 offset=655741660160 size=14336
Feb  5 15:15:50 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=655741674496 size=14336
Feb  5 15:15:52 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=655768189440 size=14848
Feb  5 15:15:52 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=655768204288 size=14848
..

or, immediately before the failure, like this (captured via remote
console, since most of the log is normally lost after the reboot):

..
Feb  7 00:21:01 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=909219043840 size=14848
Feb  7 00:21:01 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=909219087872 size=14848
Feb  7 00:21:01 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=909219293184 size=14848
Feb  7 00:21:03 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=323225315328 size=14336
Feb  7 00:21:03 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da11 offset=323225417728 size=14336
Feb  7 00:21:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=323291898880 size=14336
Feb  7 00:21:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=323291869184 size=14336
Feb  7 00:21:27 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=516894794752 size=14336
Feb  7 00:21:27 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=516894795264 size=14336
Feb  7 00:21:31 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da9 offset=517280401920 size=14336
Feb  7 00:21:31 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da6 offset=517280489984 size=14848
Feb  7 00:21:31 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=517280182272 size=14336
Feb  7 00:21:33 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da2 offset=813396810240 size=14848
Feb  7 00:21:37 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da12 offset=813434911744 size=14336
Feb  7 00:21:37 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=813435424768 size=14336
Feb  7 00:21:38 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=517349635584 size=14336
Feb  7 00:21:38 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=517349723648 size=14848
Feb  7 00:21:38 fileserv kernel: mpt1: Context Reply 0x00000003?
Feb  7 00:21:38 fileserv kernel:
Feb  7 00:21:40 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da2 offset=517380763648 size=14848
Feb  7 00:21:40 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=517380807680 size=14848
Feb  7 00:21:44 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=517434250752 size=14848
Feb  7 00:21:44 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=517434368512 size=14336
Feb  7 00:21:44 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da6 offset=517434265600 size=14336
Feb  7 00:21:45 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=813524531200 size=14336
Feb  7 00:21:45 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da2 offset=813524502016 size=14336
Feb  7 00:22:38 fileserv kernel: mpt1: request 0xffffff80003dc610:58518 timed out for ccb 0xffffff0011cf1000 (req->ccb 0xffffff0011cf1000)
Feb  7 00:22:38 fileserv kernel: mpt1:
Feb  7 00:22:38 fileserv kernel: attempting to abort req 0xffffff80003dc610:58518 function 0
Feb  7 00:22:39 fileserv kernel: mpt1: mpt_wait_req(1) timed out
Feb  7 00:22:39 fileserv kernel:
Feb  7 00:22:39 fileserv kernel: mpt1: mpt_recover_commands: abort timed-out. Resetting controller
Feb  7 00:22:39 fileserv kernel: mpt1: mpt_cam_event: 0x0
Feb  7 00:22:39 fileserv kernel: mpt1: mpt_cam_event: 0x0
Feb  7 00:22:39 fileserv kernel: mpt1: completing timedout/aborted req 0xffffff80003dc610:58518
Feb  7 00:22:51 fileserv kernel: mpt1: mpt_cam_event: 0x16
Feb  7 00:22:51 fileserv kernel: mpt1: mpt_cam_event: 0x12
Feb  7 00:22:51 fileserv kernel: mpt1: mpt_cam_event: 0x1b
Feb  7 00:22:51 fileserv kernel: mpt1: mpt_cam_event: 0x12
Feb  7 00:22:51 fileserv last message repeated 19 times
Feb  7 00:22:51 fileserv kernel: mpt1: mpt_cam_event: 0x16
Feb  7 00:22:51 fileserv last message repeated 2 times
Feb  7 00:22:51 fileserv kernel: (da5:mpt1:0:4:0): READ(10). CDB: 28 0 3c 3a 6b 64 0 0 72 0
Feb  7 00:22:51 fileserv kernel: (da5:mpt1:0:4:0): CAM status: SCSI Status Error
Feb  7 00:22:51 fileserv kernel: (da5:mpt1:0:4:0): SCSI status: Check Condition
Feb  7 00:22:51 fileserv kernel: (da5:mpt1:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb  7 00:22:51 fileserv kernel: (da6:mpt1:0:5:0): READ(10). CDB: 28 0 a 4e dd 3 0 0 1c 0
Feb  7 00:22:54 fileserv kernel:
Feb  7 00:22:54 fileserv kernel: (da6:mpt1:0:5:0): CAM status: SCSI Status Error
Feb  7 00:22:54 fileserv kernel: (da6:mpt1:0:5:0): SCSI status: Check Condition
Feb  7 00:22:54 fileserv kernel: (da6:mpt1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb  7 00:22:54 fileserv kernel: (da3:mpt1:0:2:0): READ(10). CDB: 28 0 a 4e dd 3 0 0 1c 0
Feb  7 00:22:54 fileserv kernel: (da3:mpt1:0:2:0): CAM status: SCSI Status Error
Feb  7 00:22:54 fileserv kernel: (da3:mpt1:0:2:0): SCSI status: Check Condition
Feb  7 00:22:54 fileserv kernel: (da3:mpt1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb  7 00:22:54 fileserv kernel: (da2:mpt1:0:1:0): READ(10). CDB: 28 0 a 4e dd 3 0 0 1c 0
Feb  7 00:22:54 fileserv kernel: (da2:mpt1:0:1:0): CAM status: SCSI Status Error
Feb  7 00:22:54 fileserv kernel: (da2:mpt1:0:1:0): SCSI status: Check Condition
Feb  7 00:22:54 fileserv kernel: (da2:mpt1:0:1:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb  7 00:22:54 fileserv kernel: (da1:mpt1:0:0:0): READ(10). CDB: 28 0 a 4e dd 3 0 0 1c 0
Feb  7 00:22:54 fileserv kernel: (da1:mpt1:0:0:0): CAM status: SCSI Status Error
Feb  7 00:22:54 fileserv kernel: (da1:mpt1:0:0:0): SCSI status: Check Condition
Feb  7 00:22:54 fileserv kernel: (da1:mpt1:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb  7 00:22:54 fileserv kernel: (da8:mpt1:0:7:0): READ(10). CDB: 28 0 a 4e dd 2 0 0 1d 0
Feb  7 00:22:54 fileserv kernel: (da8:mpt1:0:7:0): CAM status: SCSI Status Error
Feb  7 00:22:54 fileserv kernel: (da8:mpt1:0:7:0): SCSI status: Check Condition
Feb  7 00:22:54 fileserv kernel: (da8:mpt1:0:7:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb  7 00:22:54 fileserv kernel: (da4:mpt1:0:3:0): READ(10). CDB: 28 0 a 4e dd 2 0 0 1d 0
Feb  7 00:22:54 fileserv kernel: (da4:mpt1:0:3:0): CAM status: SCSI Status Error
Feb  7 00:22:54 fileserv kernel: (da4:mpt1:0:3:0): SCSI status: Check Condition
Feb  7 00:22:54 fileserv kernel: (da4:mpt1:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb  7 00:22:54 fileserv kernel: (da9:mpt1:0:8:0): READ(10). CDB: 28 0 a 4e dd 2 0 0 1d 0
Feb  7 00:22:54 fileserv kernel: (da9:mpt1:0:8:0): CAM status: SCSI Status Error
Feb  7 00:22:54 fileserv kernel: (da9:mpt1:0:8:0): SCSI status: Check Condition
Feb  7 00:22:54 fileserv kernel: (da9:mpt1:0:8:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb  7 00:22:54 fileserv kernel: (da11:mpt1:0:10:0): READ(10). CDB: 28 0 a 4e dd 2 0 0 1d 0
Feb  7 00:22:54 fileserv kernel: (da11:mpt1:0:10:0): CAM status: SCSI Status Error
Feb  7 00:22:54 fileserv kernel: (da11:mpt1:0:10:0): SCSI status: Check Condition
Feb  7 00:22:54 fileserv kernel: (da11:mpt1:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb  7 00:22:55 fileserv kernel: (da10:mpt1:0:9:0): READ(10). CDB: 28 0 a 4e dc e6 0 0 1c 0
Feb  7 00:22:55 fileserv kernel: (da10:mpt1:0:9:0): CAM status: SCSI Status Error
Feb  7 00:22:55 fileserv kernel: (da10:mpt1:0:9:0): SCSI status: Check Condition
Feb  7 00:22:55 fileserv kernel: (da10:mpt1:0:9:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb  7 00:22:55 fileserv kernel: (da12:mpt1:0:11:0): READ(10). CDB: 28 0 a 4e dc e6 0 0 1c 0
Feb  7 00:22:55 fileserv kernel: (da12:mpt1:0:11:0): CAM status: SCSI Status Error
Feb  7 00:22:55 fileserv kernel: (da12:mpt1:0:11:0): SCSI status: Check Condition
Feb  7 00:22:55 fileserv kernel: (da12:mpt1:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb  7 00:22:55 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=517352114176 size=14848
Feb  7 00:23:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=813768227840 size=14848
Feb  7 00:23:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=813768081408 size=14336
Feb  7 00:23:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da10 offset=813768580096 size=14336
Feb  7 00:23:33 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=193050448896 size=14336
Feb  7 00:23:39 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=193222309888 size=14336
Feb  7 00:23:39 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=193222353408 size=14336
Feb  7 00:23:44 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da2 offset=193329220096 size=14848
Feb  7 00:23:44 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=193329454592 size=14336
Feb  7 00:23:44 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da11 offset=193329483776 size=14848
Feb  7 00:23:54 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=188087916032 size=14848
Feb  7 00:23:59 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=188941305344 size=14848
Feb  7 00:23:59 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da10 offset=188941773312 size=14848
Feb  7 00:23:59 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=188941847552 size=14336
Feb  7 00:23:59 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da1 offset=188941861888 size=14848
Feb  7 00:23:59 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da9 offset=188942083584 size=14336
Feb  7 00:24:01 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da11 offset=189180443648 size=14848
Feb  7 00:24:01 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=189179795456 size=14336
Feb  7 00:24:05 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=189496972800 size=14848
Feb  7 00:24:05 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da10 offset=189496565248 size=14848
Feb  7 00:24:05 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=189497314816 size=14848
Feb  7 00:24:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da12 offset=187531636736 size=14848
Feb  7 00:24:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da4 offset=187531988480 size=14848
Feb  7 00:24:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da8 offset=187528953344 size=14848
Feb  7 00:24:09 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da11 offset=187528983040 size=14848
Feb  7 00:24:12 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da3 offset=188422677504 size=14336
Feb  7 00:24:12 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da6 offset=188423581184 size=14336
Feb  7 00:24:12 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da12 offset=188424466432 size=14336
Feb  7 00:24:13 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da5 offset=189423034880 size=14848
Feb  7 00:24:13 fileserv root: ZFS: checksum mismatch, zpool=bigpond path=/dev/da11 offset=189424520192 size=14848
..
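
The READ(10) lines in the excerpt above print the raw CDB bytes in hex.
As a reading aid, here is a minimal Python sketch (not part of the
original report) that decodes such a line using the standard READ(10)
layout: opcode 0x28 in byte 0, a big-endian logical block address in
bytes 2-5, and a big-endian transfer length in bytes 7-8. The function
name `decode_read10` is hypothetical.

```python
def decode_read10(cdb_text):
    """Decode a READ(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks). The input format mirrors the 'CDB:' field
    in the kernel messages quoted above (leading zeros omitted).
    """
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2..5
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7..8
    return lba, blocks


# CDB taken from the da5 line in the log excerpt.
lba, blocks = decode_read10("28 0 3c 3a 6b 64 0 0 72 0")
print(f"LBA {lba:#x}, {blocks} blocks")
```

If the disks use 512-byte sectors (an assumption, the report does not
state the sector size), the transfer lengths 0x1c and 0x1d seen on most
of the failed reads correspond to 14336 and 14848 bytes, the same sizes
that appear in the checksum mismatch lines.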

We updated the firmware to the latest version, but unfortunately the
problem still occurred.

After extensive debugging, we believe the problem is most likely a
driver issue. The failure typically occurs when both sets of drives
(the system drives on mpt0 and the JBOD on mpt1) are under heavy load
at the same time, e.g. when a scrub is running on the zpool.
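
One way to check whether the errors point at a single disk or at the
whole controller path is to tally the checksum mismatch lines per
device. The following Python sketch is illustrative only and not part
of the original report; the function name `tally_mismatches` and the
regular expression are assumptions modelled on the log lines quoted
above.

```python
import re
from collections import Counter

# Pattern modelled on the "ZFS: checksum mismatch" lines in this report.
MISMATCH_RE = re.compile(
    r"ZFS: checksum mismatch, zpool=(?P<pool>\S+) path=(?P<path>\S+) "
    r"offset=(?P<offset>\d+) size=(?P<size>\d+)"
)


def tally_mismatches(lines):
    """Count checksum mismatch log lines per device path."""
    counts = Counter()
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            counts[match.group("path")] += 1
    return counts


# A few lines copied from the first log excerpt.
sample = [
    "Feb  5 15:14:55 fileserv root: ZFS: checksum mismatch, "
    "zpool=bigpond path=/dev/da8 offset=654675777536 size=14848",
    "Feb  5 15:14:55 fileserv root: ZFS: checksum mismatch, "
    "zpool=bigpond path=/dev/da2 offset=654676050432 size=14848",
    "Feb  5 15:15:12 fileserv kernel: mpt1: mpt_intr: no target cmd ptrs",
    "Feb  5 15:15:16 fileserv root: ZFS: checksum mismatch, "
    "zpool=bigpond path=/dev/da8 offset=461695998976 size=14848",
]
counts = tally_mismatches(sample)
for path, n in counts.most_common():
    print(path, n)
```

To run it on a real log, pass an open file instead of the sample list,
e.g. `tally_mismatches(open("/var/log/messages"))`.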

Our current workaround is to avoid these situations (e.g. by choosing
the backup schedule appropriately), and everything has worked fine so
far. The worst that can happen is a required reboot; at least there is
no data loss. However, this is still neither satisfying nor reassuring.
>How-To-Repeat:
Start a zpool scrub. Additionally, put some load on the system drives. Wait.
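
The steps above could be scripted roughly as follows. This is a hedged
Python sketch, not the procedure actually used for this report: the
helper names `start_scrub` and `generate_disk_load` are hypothetical,
the pool name comes from the log excerpts, and the load generator simply
writes a scratch file and reads it back. The demo at the end uses a
temporary directory; for a real attempt the path would have to point at
a file on the system drives and the load would be repeated in a loop
while the scrub runs.

```python
import os
import subprocess
import tempfile


def start_scrub(pool):
    """Start a scrub on the given pool (needs root and a ZFS system)."""
    subprocess.run(["zpool", "scrub", pool], check=True)


def generate_disk_load(path, total_bytes, chunk_size=1 << 20):
    """Write total_bytes to path, sync to disk, then read everything back.

    Returns (bytes_written, bytes_read).
    """
    chunk = os.urandom(chunk_size)
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # push the data out of the buffer cache
    read_back = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            read_back += len(data)
    return written, read_back


# Small self-contained demo of the load generator only.
with tempfile.TemporaryDirectory() as scratch_dir:
    result = generate_disk_load(
        os.path.join(scratch_dir, "load.bin"), 3 * 1024 * 1024 + 5
    )
print(result)
```

On the affected machine one would call `start_scrub("bigpond")` first and
then run `generate_disk_load` repeatedly against the system drives.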
>Fix:


>Release-Note:
>Audit-Trail:
>Unformatted:


