From owner-freebsd-stable@FreeBSD.ORG  Sat Oct 20 21:39:30 2012
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
 by hub.freebsd.org (Postfix) with ESMTP id 295C8660
 for <freebsd-stable@FreeBSD.org>; Sat, 20 Oct 2012 21:39:30 +0000 (UTC)
 (envelope-from bra@fsn.hu)
Received: from people.fsn.hu (people.fsn.hu [195.228.252.137])
 by mx1.freebsd.org (Postfix) with ESMTP id 95ABE8FC0A
 for <freebsd-stable@FreeBSD.org>; Sat, 20 Oct 2012 21:39:28 +0000 (UTC)
Received: by people.fsn.hu (Postfix, from userid 1001)
 id 51FC3EAB237; Sat, 20 Oct 2012 23:39:21 +0200 (CEST)
Received: from [192.168.3.2] (japan.t-online.co.hu [195.228.243.99])
 by people.fsn.hu (Postfix) with ESMTPSA id 73661EAB22C
 for <freebsd-stable@FreeBSD.org>; Sat, 20 Oct 2012 23:39:20 +0200 (CEST)
Message-ID: <50831A07.20803@fsn.hu>
Date: Sat, 20 Oct 2012 23:39:19 +0200
From: Attila Nagy <bra@fsn.hu>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US;
 rv:1.8.1.23) Gecko/20090817 Thunderbird/2.0.0.23 Mnenhy/0.7.6.0
MIME-Version: 1.0
To: freebsd-stable@FreeBSD.org
Subject: mpt doesn't propagate read errors and dies on a single sector?
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
X-List-Received-Date: Sat, 20 Oct 2012 21:39:30 -0000

Hi,

I have a Sun X4540 with LSI C1068E-based SAS controllers (FW version:
1.27.02.00-IT).
My problem is that if one drive starts to fail with read errors, the
machine becomes completely unusable (running stable/9 with ZFS). It
seems that ZFS can't see that there are read errors on the device, and
the mpt driver (or the controller, or the kernel?) re-issues the
operation endlessly.

Here is a verbose (dev.mpt.0.debug=7 level) dump:
mpt0: Address Reply:
SCSI IO Request Reply @ 0xffffff87ffcfdc00
         IOC Status    Success
         IOCLogInfo    0x00000000
         MsgLength     0x09
         MsgFlags      0x00
         MsgContext    0x000200eb
         Bus:          0
         TargetID      3
         CDBLength     10
         SCSI Status:  Check Condition
         SCSI State:   (0x00000001)AutoSense_Valid
         TransferCnt   0x20000
         SenseCnt      0x0012
         ResponseInfo  0x00000000
(da3:mpt0:0:3:0): READ(10). CDB: 28 0 3a 38 5d e 0 1 0 0
(da3:mpt0:0:3:0): CAM status: SCSI Status Error
(da3:mpt0:0:3:0): SCSI status: Check Condition
(da3:mpt0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da3:mpt0:0:3:0): Info: 0x3a385d1a
(da3:mpt0:0:3:0): Error 5, Unretryable error
SCSI IO Request @ 0xffffff80003046f0
         Chain Offset  0x00
         MsgFlags      0x00
         MsgContext    0x000200ea
         Bus:                0
         TargetID            3
         SenseBufferLength   32
         LUN:              0x0
         Control           0x02000000  READ  SIMPLEQ
         DataLength      0x00020000
         SenseBufAddr    0x0c65d5e0
         CDB[0:10]       28 00 3a 38 5e 0e 00 01 00 00
         SE64 0xffffff87ffd1c430: Addr=0x000000010e858000 FlagsLength=0xd3020000
          64_BIT_ADDRESSING LAST_ELEMENT END_OF_BUFFER END_OF_LIST
mpt0: Address Reply:
SCSI IO Request Reply @ 0xffffff87ffcfdd00
         IOC Status    Success
         IOCLogInfo    0x00000000
         MsgLength     0x09
         MsgFlags      0x00
         MsgContext    0x000200ea
         Bus:          0
         TargetID      3
         CDBLength     10
         SCSI Status:  Check Condition
         SCSI State:   (0x00000001)AutoSense_Valid
         TransferCnt   0x20000
         SenseCnt      0x0012
         ResponseInfo  0x00000000
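
To make the CDBs in the dump easier to read, here is a minimal Python sketch that decodes them. It assumes the standard READ(10) layout (opcode 0x28 in byte 0, big-endian LBA in bytes 2-5, transfer length in blocks in bytes 7-8) and that the "Info:" field of the sense data is the LBA of the unreadable block. The helper name decode_read10 is made up for this illustration.

```python
def decode_read10(cdb_text):
    """Return (lba, blocks) for a READ(10) CDB given as hex bytes in text form."""
    cdb = bytes(int(tok, 16) for tok in cdb_text.split())
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length
    return lba, blocks


# CDB of the failed command, as printed by CAM
failed_lba, failed_blocks = decode_read10("28 0 3a 38 5d e 0 1 0 0")
# CDB of the next SCSI IO Request in the dump
next_lba, next_blocks = decode_read10("28 00 3a 38 5e 0e 00 01 00 00")

info = 0x3a385d1a         # "Info:" field from the sense data
data_length = 0x00020000  # "DataLength" from the SCSI IO Request

print("failed read: LBA", hex(failed_lba), "blocks", failed_blocks)
print("bad block offset within the read:", info - failed_lba)
print("next read starts", next_lba - failed_lba, "blocks later")
print("implied block size:", data_length // next_blocks)
```

Under these assumptions the failed command is a 256-block read, the unreadable block is 12 blocks into it, and the implied block size is 512 bytes. In this excerpt the next request starts 256 blocks after the failed one, so it may be worth checking in the full log whether the same range is really being retried or whether consecutive ranges are being read.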

And I get these Check Condition SCSI errors endlessly. If ZFS is enabled
at boot, the machine can't even start because of this (zpool import
never finishes). If I boot without ZFS and then try to import, the zpool
command gets stuck in the vdev_g state:
  1163 root          1  20    0 35440K  5200K vdev_g  6   0:01 0.10% zpool
procstat -k 1163
   PID    TID COMM             TDNAME KSTACK
  1163 100116 zpool            -                mi_switch 
sleepq_timedwait _sleep biowait vdev_geom_read_guid vdev_geom_open 
vdev_open vdev_open_children vdev_raidz_open vdev_open 
vdev_open_children vdev_root_open vdev_open spa_load spa_tryimport 
zfs_ioc_pool_tryimport zfsdev_ioctl devfs_ioctl_f

Could it be that GEOM/ZFS doesn't receive this read error and waits 
indefinitely for the command to complete?
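
One way to test this outside ZFS would be to read the suspect block straight from the disk device and see whether the read returns an error or never returns. The following is only a sketch: probe_sector is a hypothetical helper, the 512-byte block size is an assumption, and the LBA in the usage comment assumes the "Info:" value above is the unreadable block.

```python
import errno
import os


def probe_sector(device_path, lba, block_size=512):
    """Read one block at the given LBA.

    Returns "ok", "short read", or the errno name (for example "EIO")
    if the kernel reports the failure to userland.
    """
    fd = os.open(device_path, os.O_RDONLY)
    try:
        data = os.pread(fd, block_size, lba * block_size)
        return "ok" if len(data) == block_size else "short read"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)


# Example call on the affected machine (run as root):
#   print(probe_sector("/dev/da3", 0x3a385d1a))
```

If this call returns "EIO", the read error does reach userland through GEOM, and the hang would more likely be above that layer. If the call itself blocks, the error is probably being lost or retried further down.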