From: Jeremy Chadwick
To: Adam Nowacki
Cc: freebsd-fs@freebsd.org
Date: Sun, 25 Sep 2011 09:59:46 -0700
Subject: Re: ZFS and 3ware controller resets
Message-ID: <20110925165946.GA42447@icarus.home.lan>
In-Reply-To: <4E7F49A7.1020909@platinum.linux.pl>
References: <4E7F49A7.1020909@platinum.linux.pl>

On Sun, Sep 25, 2011 at 05:32:55PM +0200, Adam Nowacki wrote:
> I have a 20 disk storage system, every now and then a disk dies and
> causes 3ware controller to reset because of disk timeouts. This cuts
> out ZFS from all disks, even healthy ones and the system requires a
> hard reset.
> Two issues here:
> 1) Why the controller has to reset?
> Thats a completely insane way of
> dealing with drive timeout.
> 2) ZFS not reopening the disk after controller reset.
>
> FreeBSD version: 8.1-RELEASE-p1
>
> /c0 Driver Version = 3.80.06.003
> /c0 Model = 9650SE-16ML
> /c0 Available Memory = 224MB
> /c0 Firmware Version = FE9X 4.10.00.007
> /c0 Bios Version = BE9X 4.08.00.002
> /c0 Boot Loader Version = BL9X 3.08.00.001
>
>   pool: zp2
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         zp2         ONLINE       0     0     0
>           raidz2    ONLINE       0     0     0
>             da1p1   ONLINE       0     0     0
>             da2p1   ONLINE       0     0     0
>             da3p1   ONLINE       0     0     0
>             da4p1   ONLINE       0     0     0
>             da5p1   ONLINE       0     0     0
>             da6p1   ONLINE       0     0     0
>             da7p1   ONLINE       0     0     0
>             da9p1   ONLINE       0     0     0
>             da8p1   ONLINE       0     0     0
>             da10p1  ONLINE       0     0     0
>
>
> Then when disk starts behaving:
>
>
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a3 f4 e7 60 0 0 8 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a5 4 83 80 0 0 80 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a5 4 83 80 0 0 80 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a5 4 83 80 0 0 80 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a5 4 83 80 0 0 80 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a5 4 83 80 0 0 80 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 cb 7c 43 b8 0 0 10 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 ce e5 ca 30 0 0 20 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 a4 2d 2d f8 0 0 8 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> twa0: ERROR: (0x04: 0x0009): Drive timeout detected: port=2
> (da3:twa0:0:3:0): READ(10). CDB: 28 0 cb 91 7c f8 0 0 20 0
> (da3:twa0:0:3:0): CAM status: SCSI Status Error
> (da3:twa0:0:3:0): SCSI status: Check Condition
> (da3:twa0:0:3:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> twa0: Request 72 timed out!
> twa0: INFO: (0x16: 0x1108): Resetting controller...:
> twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=0
> twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=3
> twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1
> twa0: [ITHREAD]
> (da1:twa0:0:1:0): lost device
> (da2:twa0:0:2:0): lost device
> (da3:twa0:0:3:0): lost device
> (da4:twa0:0:4:0): lost device
> (da5:twa0:0:5:0): lost device
> (da6:twa0:0:6:0): lost device
> (da7:twa0:0:7:0): lost device
> (da8:twa0:0:8:0): lost device
> (da9:twa0:0:9:0): lost device
> (da10:twa0:0:10:0): lost device
> (da11:twa0:0:11:0): lost device
> (da12:twa0:0:12:0): lost device
> (da13:twa0:0:13:0): lost device
> (da1:twa0:0:1:0): removing device entry
> da1 at twa0 bus 0 scbus0 target 1 lun 0
> da1: Fixed Direct Access SCSI-5 device
> da1: 100.000MB/s transfers
> da1: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da2:twa0:0:2:0): removing device entry
> da2 at twa0 bus 0 scbus0 target 2 lun 0
> da2: Fixed Direct Access SCSI-5 device
> da2: 100.000MB/s transfers
> da2: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da3:twa0:0:3:0): removing device entry
> da3 at twa0 bus 0 scbus0 target 3 lun 0
> da3: Fixed Direct Access SCSI-5 device
> da3: 100.000MB/s transfers
> da3: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da4:twa0:0:4:0): removing device entry
> da4 at twa0 bus 0 scbus0 target 4 lun 0
> da4: Fixed Direct Access SCSI-5 device
> da4: 100.000MB/s transfers
> da4: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da5:twa0:0:5:0): removing device entry
> da5 at twa0 bus 0 scbus0 target 5 lun 0
> da5: Fixed Direct Access SCSI-5 device
> da5: 100.000MB/s transfers
> da5: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da6:twa0:0:6:0): removing device entry
> da6 at twa0 bus 0 scbus0 target 6 lun 0
> da6: Fixed Direct Access SCSI-5 device
> da6: 100.000MB/s transfers
> da6: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da7:twa0:0:7:0): removing device entry
> da7 at twa0 bus 0 scbus0 target 7 lun 0
> da7: Fixed Direct Access SCSI-5 device
> da7: 100.000MB/s transfers
> da7: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da8:twa0:0:8:0): removing device entry
> da8 at twa0 bus 0 scbus0 target 8 lun 0
> da8: Fixed Direct Access SCSI-5 device
> da8: 100.000MB/s transfers
> da8: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da9:twa0:0:9:0): removing device entry
> da9 at twa0 bus 0 scbus0 target 9 lun 0
> da9: Fixed Direct Access SCSI-5 device
> da9: 100.000MB/s transfers
> da9: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da10:twa0:0:10:0): removing device entry
> da10 at twa0 bus 0 scbus0 target 10 lun 0
> da10: Fixed Direct Access SCSI-5 device
> da10: 100.000MB/s transfers
> da10: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da11:twa0:0:11:0): removing device entry
> da11 at twa0 bus 0 scbus0 target 11 lun 0
> da11: Fixed Direct Access SCSI-5 device
> da11: 100.000MB/s transfers
> da11: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da12:twa0:0:12:0): removing device entry
> da12 at twa0 bus 0 scbus0 target 12 lun 0
> da12: Fixed Direct Access SCSI-5 device
> da12: 100.000MB/s transfers
> da12: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
> (da13:twa0:0:13:0): removing device entry
> da13 at twa0 bus 0 scbus0 target 13 lun 0
> da13: Fixed Direct Access SCSI-5 device
> da13: 100.000MB/s transfers
> da13: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
>
>   pool: zp2
>  state: ONLINE
> status: One or more devices are faulted in response to IO failures.
> action: Make sure the affected devices are connected, then run
>         'zpool clear'.
>    see: http://www.sun.com/msg/ZFS-8000-HC
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         zp2         ONLINE       7    11     0
>           raidz2    ONLINE      16    32     0
>             da1p1   ONLINE       4    10     0
>             da2p1   ONLINE       4    10     0
>             da3p1   ONLINE       5   642     1
>             da4p1   ONLINE       3     8     0
>             da5p1   ONLINE       3    12     0
>             da6p1   ONLINE       3    12     0
>             da7p1   ONLINE       3    12     0
>             da9p1   ONLINE       3    12     0
>             da8p1   ONLINE       3    14     0
>             da10p1  ONLINE       3    10     0
>
> errors: 10 data errors, use '-v' for a list

The behaviour here seems to match something reported here:

http://www.freebsd.org/cgi/query-pr.cgi?pr=149968

Now before someone flames me and says "that's a different issue", one
has to look closely at the driver diff. It seems that a different type
of controller reset is implemented (soft vs. hard), amongst some other
details.

I am very inclined to believe an updated twa(4) driver will address your
problem. I would suggest you try FreeBSD 8.2-STABLE instead. Do not
try 8.2-RELEASE, as it will not have this fix; 8.2-RELEASE is from July
2010, while this commit was done September 2010.

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/twa/

Otherwise you can try to build src/sys/dev/twa from a RELENG_8 checkout
on your 8.1 box, but I make no guarantees this will work.

As for your comments about "why is a reset required, insane blah blah",
this is often done when a single port itself cannot be reset (e.g. the
controller firmware, or silicon itself, does not truly have a way of
"hard resetting" a single port).

Finally, I do not understand what you mean by "ZFS not reopening the
disk after controller reset". You'll need to explain what you mean by
that.

And besides, when an underlying storage controller says "this disk is
having problems" and drops it from the bus (which is what should be
happening -- see beginning of my comments, your complaint, etc.), you
*do not* want the OS to re-attach the same disk it just dropped, else
you end up in this infinite loop where the controller is dropping a
drive from the bus and reattaching, over and over. Makes no sense, even
if the issue is bad cabling or otherwise. Administrator intervention is
always required in this situation.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
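
For anyone triaging a log like the one quoted above, the READ(10) lines
can be decoded to see which block ranges on da3 are failing. The sketch
below assumes the standard 10-byte READ(10) layout (opcode 0x28, a
4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes
7-8) and the space-separated hex format that CAM prints in these
messages. The function name decode_read10 is purely illustrative and is
not part of any FreeBSD tool.

```python
def decode_read10(cdb_text):
    """Decode a READ(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks). Hypothetical helper for reading CAM log
    lines such as "CDB: 28 0 a3 f4 e7 60 0 0 8 0".
    """
    cdb = [int(byte, 16) for byte in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    # Bytes 2..5: logical block address, big-endian.
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    # Bytes 7..8: transfer length in blocks, big-endian.
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")
    return lba, blocks


if __name__ == "__main__":
    # Two of the CDBs that appear in the quoted log.
    for text in ("28 0 a3 f4 e7 60 0 0 8 0", "28 0 a5 4 83 80 0 0 80 0"):
        lba, blocks = decode_read10(text)
        print(f"LBA 0x{lba:08x}, {blocks} blocks")
```

For the first failing command in the log this gives LBA 0xa3f4e760 with
a length of 8 blocks, which is 4 KB at the 512-byte sector size the
probe messages report.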