From owner-freebsd-fs@FreeBSD.ORG  Sun Sep 25 16:59:50 2011
Date: Sun, 25 Sep 2011 09:59:46 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Adam Nowacki <nowakpl@platinum.linux.pl>
Message-ID: <20110925165946.GA42447@icarus.home.lan>
References: <4E7F49A7.1020909@platinum.linux.pl>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4E7F49A7.1020909@platinum.linux.pl>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS and 3ware controller resets

On Sun, Sep 25, 2011 at 05:32:55PM +0200, Adam Nowacki wrote:
> I have a 20 disk storage system, every now and then a disk dies and
> causes 3ware controller to reset because of disk timeouts. This cuts
> out ZFS from all disks, even healthy ones and the system requires a
> hard reset.
> Two issues here:
> 1) Why the controller has to reset? Thats a completely insane way of
> dealing with drive timeout.
> 2) ZFS not reopening the disk after controller reset.
> 
> FreeBSD version: 8.1-RELEASE-p1
> 
> /c0 Driver Version = 3.80.06.003
> /c0 Model = 9650SE-16ML
> /c0 Available Memory = 224MB
> /c0 Firmware Version = FE9X 4.10.00.007
> /c0 Bios Version = BE9X 4.08.00.002
> /c0 Boot Loader Version = BL9X 3.08.00.001
> 
>   pool: zp2
>  state: ONLINE
>  scrub: none requested
> config:
> 
>         NAME        STATE     READ WRITE CKSUM
>         zp2         ONLINE       0     0     0
>           raidz2    ONLINE       0     0     0
>             da1p1   ONLINE       0     0     0
>             da2p1   ONLINE       0     0     0
>             da3p1   ONLINE       0     0     0
>             da4p1   ONLINE       0     0     0
>             da5p1   ONLINE       0     0     0
>             da6p1   ONLINE       0     0     0
>             da7p1   ONLINE       0     0     0
>             da9p1   ONLINE       0     0     0
>             da8p1   ONLINE       0     0     0
>             da10p1  ONLINE       0     0     0
> 
> 
> Then when disk starts behaving:
> 
> 
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a3 f4 e7 60 0 0 8 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a5 4 83 80 0 0 80 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a5 4 83 80 0 0 80 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a5 4 83 80 0 0 80 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a5 4 83 80 0 0 80 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a5 4 83 80 0 0 80 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 cb 7c 43 b8 0 0 10 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 ce e5 ca 30 0 0 20 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a4 2d 2d f8 0 0 8 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 cb 91 7c f8 0 0 20 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: Request 72 timed out!
> twa0: INFO: (0x16: 0x1108): Resetting controller...:
> twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=0
> twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=3
> twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1
> twa0: [ITHREAD]
> (da1:twa0:0:1:0): lost device
> (da2:twa0:0:2:0): lost device
> (da3:twa0:0:3:0): lost device
> (da4:twa0:0:4:0): lost device
> (da5:twa0:0:5:0): lost device
> (da6:twa0:0:6:0): lost device
> (da7:twa0:0:7:0): lost device
> (da8:twa0:0:8:0): lost device
> (da9:twa0:0:9:0): lost device
> (da10:twa0:0:10:0): lost device
> (da11:twa0:0:11:0): lost device
> (da12:twa0:0:12:0): lost device
> (da13:twa0:0:13:0): lost device
> (da1:twa0:0:1:0): removing device entry
> da1 at twa0 bus 0 scbus0 target 1 lun 0
> da1: <AMCC 9650SE-16M DISK 4.10> Fixed Direct Access SCSI-5 device
> da1: 100.000MB/s transfers
> da1: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da2:twa0:0:2:0): removing device entry
> da2 at twa0 bus 0 scbus0 target 2 lun 0
> da2: <AMCC 9650SE-16M DISK 4.10> Fixed Direct Access SCSI-5 device
> da2: 100.000MB/s transfers
> da2: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da3:twa0:0:3:0): removing device entry
> da3 at twa0 bus 0 scbus0 target 3 lun 0
> da3: <AMCC 9650SE-16M DISK 4.10> Fixed Direct Access SCSI-5 device
> da3: 100.000MB/s transfers
> da3: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da4:twa0:0:4:0): removing device entry
> da4 at twa0 bus 0 scbus0 target 4 lun 0
> da4: <AMCC 9650SE-16M DISK 4.10> Fixed Direct Access SCSI-5 device
> da4: 100.000MB/s transfers
> da4: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da5:twa0:0:5:0): removing device entry
> da5 at twa0 bus 0 scbus0 target 5 lun 0
> da5: <AMCC 9650SE-16M DISK 4.10> Fixed Direct Access SCSI-5 device
> da5: 100.000MB/s transfers
> da5: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da6:twa0:0:6:0): removing device entry
> da6 at twa0 bus 0 scbus0 target 6 lun 0
> da6: <AMCC 9650SE-16M DISK 4.10> Fixed Direct Access SCSI-5 device
> da6: 100.000MB/s transfers
> da6: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da7:twa0:0:7:0): removing device entry
> da7 at twa0 bus 0 scbus0 target 7 lun 0
> da7: <AMCC 9650SE-16M DISK 4.10> Fixed Direct Access SCSI-5 device
> da7: 100.000MB/s transfers
> da7: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da8:twa0:0:8:0): removing device entry
> da8 at twa0 bus 0 scbus0 target 8 lun 0
> da8: <AMCC 9650SE-16M DISK 4.10> Fixed Direct Access SCSI-5 device
> da8: 100.000MB/s transfers
> da8: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da9:twa0:0:9:0): removing device entry
> da9 at twa0 bus 0 scbus0 target 9 lun 0
> da9: <AMCC 9650SE-16M DISK 4.10> Fixed Direct Access SCSI-5 device
> da9: 100.000MB/s transfers
> da9: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da10:twa0:0:10:0): removing device entry
> da10 at twa0 bus 0 scbus0 target 10 lun 0
> da10: <AMCC 9650SE-16M DISK 4.10> Fixed Direct Access SCSI-5 device
> da10: 100.000MB/s transfers
> da10: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da11:twa0:0:11:0): removing device entry
> da11 at twa0 bus 0 scbus0 target 11 lun 0
> da11: <AMCC 9650SE-16M DISK 4.10> Fixed Direct Access SCSI-5 device
> da11: 100.000MB/s transfers
> da11: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da12:twa0:0:12:0): removing device entry
> da12 at twa0 bus 0 scbus0 target 12 lun 0
> da12: <AMCC 9650SE-16M DISK 4.10> Fixed Direct Access SCSI-5 device
> da12: 100.000MB/s transfers
> da12: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da13:twa0:0:13:0): removing device entry
> da13 at twa0 bus 0 scbus0 target 13 lun 0
> da13: <AMCC 9650SE-16M DISK 4.10> Fixed Direct Access SCSI-5 device
> da13: 100.000MB/s transfers
> da13: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> 
>   pool: zp2
>  state: ONLINE
> status: One or more devices are faulted in response to IO failures.
> action: Make sure the affected devices are connected, then run
> 'zpool clear'.
>    see: http://www.sun.com/msg/ZFS-8000-HC
>  scrub: none requested
> config:
> 
>         NAME        STATE     READ WRITE CKSUM
>         zp2         ONLINE       7    11     0
>           raidz2    ONLINE      16    32     0
>             da1p1   ONLINE       4    10     0
>             da2p1   ONLINE       4    10     0
>             da3p1   ONLINE       5   642     1
>             da4p1   ONLINE       3     8     0
>             da5p1   ONLINE       3    12     0
>             da6p1   ONLINE       3    12     0
>             da7p1   ONLINE       3    12     0
>             da9p1   ONLINE       3    12     0
>             da8p1   ONLINE       3    14     0
>             da10p1  ONLINE       3    10     0
> 
> errors: 10 data errors, use '-v' for a list
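
For anyone reading the log above: the READ(10) lines can be decoded to
see which sectors the drive is failing on.  Below is a minimal sketch,
assuming the CDB is printed as ten space-separated hex bytes as in the
dmesg output quoted above.  The function name decode_read10 is made up
for illustration; the byte layout (opcode 0x28, 4-byte big-endian LBA in
bytes 2-5, 2-byte transfer length in bytes 7-8) is the standard READ(10)
layout.

```python
def decode_read10(cdb_text):
    """Decode a READ(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks): the starting logical block address and the
    number of blocks requested.
    """
    cdb = [int(b, 16) for b in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    # Bytes 2..5 hold the LBA, bytes 7..8 the transfer length (big-endian).
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")
    return lba, blocks


# First failing request from the quoted log.
lba, blocks = decode_read10("28 0 a3 f4 e7 60 0 0 8 0")
print(f"LBA {lba}, {blocks} blocks")
```

Note that the CDB "28 0 a5 4 83 80 0 0 80 0" shows up several times in a
row in the log, i.e. the same request is being reported as failed
repeatedly.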

The behaviour seems to match what was reported in this PR:

http://www.freebsd.org/cgi/query-pr.cgi?pr=149968

Now before someone flames me and says "that's a different issue", one
has to look closely at the driver diff.  It appears to implement a
different type of controller reset (soft vs. hard), among other
changes.  I am very inclined to believe an updated twa(4) driver will
address your problem.

I would suggest you try FreeBSD 8.2-STABLE instead.  Your 8.1-RELEASE
will not have this fix; 8.1-RELEASE is from July 2010, while this
commit was made in September 2010.

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/twa/

Otherwise you can try to build src/sys/dev/twa from a RELENG_8
checkout on your 8.1 box, but I make no guarantees this will work.

As for your comments about "why is a reset required, insane blah blah",
this is often done when a single port itself cannot be reset (e.g. the
controller firmware, or silicon itself, does not truly have a way of
"hard resetting" a single port).

Finally, I do not understand what you mean by "ZFS not reopening the
disk after controller reset".  You'll need to explain what you mean by
that.

And besides, when an underlying storage controller says "this disk is
having problems" and drops it from the bus (which is what should be
happening -- see beginning of my comments, your complaint, etc.), you
**do not** want the OS to re-attach the same disk it just dropped, else
you end up in an infinite loop where the controller drops the drive
from the bus and the OS reattaches it, over and over.  That makes no
sense, whether the cause is a failing disk, bad cabling, or something
else.  Administrator intervention is always required in this situation.
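
Since the administrator has to step in anyway, it helps to spot the
failing port before the controller gives up and resets.  Here is a rough
sketch that tallies the twa(4) "Drive timeout detected" messages per
port.  It assumes the message format shown in your dmesg output; the
function name count_drive_timeouts and the sample text are only
illustrative, and this is not part of any FreeBSD or 3ware tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
# An optional leading "> " (mail quoting) is tolerated.
TIMEOUT_RE = re.compile(
    r"^(?:>\s*)?(twa\d+): ERROR: .*Drive timeout detected: port=(\d+)"
)


def count_drive_timeouts(log_text):
    """Return a Counter keyed by (controller, port) of timeout messages."""
    counts = Counter()
    for line in log_text.splitlines():
        m = TIMEOUT_RE.match(line.strip())
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts


# Small sample taken from the quoted log.
sample = """\
twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
(da3:twa0:0:3:0): CAM status: SCSI Status Error
twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
twa0: INFO: (0x16: 0x1108): Resetting controller...:
"""
for (ctrl, port), n in count_drive_timeouts(sample).most_common():
    print(f"{ctrl} port {port}: {n} timeouts")
```

Feeding it the text of /var/log/messages instead of the sample would
show which port is generating the timeouts, so the disk can be pulled or
replaced by hand.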

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |