From owner-freebsd-fs@FreeBSD.ORG Sat Feb 23 03:57:10 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id A65CB649 for ; Sat, 23 Feb 2013 03:57:10 +0000 (UTC) (envelope-from jdc@koitsu.org) Received: from qmta03.emeryville.ca.mail.comcast.net (qmta03.emeryville.ca.mail.comcast.net [IPv6:2001:558:fe2d:43:76:96:30:32]) by mx1.freebsd.org (Postfix) with ESMTP id 6D85FA29 for ; Sat, 23 Feb 2013 03:57:10 +0000 (UTC) Received: from omta21.emeryville.ca.mail.comcast.net ([76.96.30.88]) by qmta03.emeryville.ca.mail.comcast.net with comcast id 3cG11l0021u4NiLA3fx9F3; Sat, 23 Feb 2013 03:57:09 +0000 Received: from koitsu.strangled.net ([67.180.84.87]) by omta21.emeryville.ca.mail.comcast.net with comcast id 3fx81l00c1t3BNj8hfx9RQ; Sat, 23 Feb 2013 03:57:09 +0000 Received: by icarus.home.lan (Postfix, from userid 1000) id D8FB673A31; Fri, 22 Feb 2013 19:57:08 -0800 (PST) Date: Fri, 22 Feb 2013 19:57:08 -0800 From: Jeremy Chadwick To: Alexander Motin Subject: Re: disk "flipped" - a known problem? Message-ID: <20130223035708.GA23614@icarus.home.lan> References: <20130121221617.GA23909@icarus.home.lan> <50FED818.7070704@FreeBSD.org> <20130125083619.GA51096@icarus.home.lan> <20130125211232.GA3037@icarus.home.lan> <20130125212559.GA1772@icarus.home.lan> <20130125213209.GA1858@icarus.home.lan> <20130126011754.GA1806@icarus.home.lan> <51267055.3040500@FreeBSD.org> <20130221233609.GA92249@icarus.home.lan> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130221233609.GA92249@icarus.home.lan> User-Agent: Mutt/1.5.21 (2010-09-15) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=comcast.net; s=q20121106; t=1361591829; bh=CXCEcgFQ1KOgXnPMZpxlAAIvGfe8VMQgvzd8wdtnrXU=; h=Received:Received:Received:Date:From:To:Subject:Message-ID: MIME-Version:Content-Type; b=MDvDZxv6U98jkGZVE5C70i74I4EH5tS9nUwGmzmYDv00uJ9vqcVpgkPon5H/BbF/X QwvLPL1anb6N6WrJGPi05CYRNMmEQCg8Glij3qmQBMGOIvAGDTLTqFszh7t29uXT47 tzP9u0Ry3DOxvNuS5A3ejtk3sWLrKBkK2Ng/va7G+xTzGOkX8Y0leR6CJMY4UtdM4x UjC24+YBglX+cBkmBCWkq7CXWFzY29OTAm4zvub3eLEU9pI9ZuBhtnnFKfbagFNE6b TSIucXBNNaTn9VzpXLUjz6cFYkxARczAJTOIR+7rQBNJnrHLL1C1g0w0kmueZd/acI c9bYAb/da5iRA== Cc: freebsd-fs@freebsd.org, avg@freebsd.org X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 23 Feb 2013 03:57:10 -0000 On Thu, Feb 21, 2013 at 03:36:09PM -0800, Jeremy Chadwick wrote: > On Thu, Feb 21, 2013 at 09:07:01PM +0200, Alexander Motin wrote: > > On 26.01.2013 03:17, Jeremy Chadwick wrote: > > > Okay, I've figured out the exact, 100% reproducible condition that > > > causes the situation. It took me a lot of tries and a digital pocket > > > recorder to take verbal notes (there are just too many things to look at > > > simultaneously), but I've figured it out. > > > > > > I'm sorry for the verbosity, but it's necessary. > > > > > > Assume the disk we're talking about is /dev/ada5. > > > > > > 1. Prior to any issues, we have this: > > > > > > root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5* > > > crw-r----- 1 root operator 0x8c Jan 25 16:41 /dev/ada5 > > > crw------- 1 root operator 0x75 Jan 25 16:35 /dev/pass5 > > > crw------- 1 root operator 0x51 Jan 25 16:35 /dev/xpt0 > > > > > > 2. ada5 begins experiencing issues -- ATA commands (CDBs) submit do not > > > get a response (not going to discuss how/why that can happen). > > > > > > 3. These types of messages are seen on console (naturally the CDB and > > > request type will vary -- in this case it was because I was doing the dd > > > zero'ing, thus tickling the bad sector/naughty firmware on the drive): > > > > > > Jan 25 16:29:28 icarus kernel: ahcich5: Timeout on slot 0 port 0 > > > Jan 25 16:29:28 icarus kernel: ahcich5: is 00000000 cs 00000000 ss 00000001 rs 00000001 tfd 40 serr 00000000 cmd 0004c017 > > > Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset... > > > Jan 25 16:29:28 icarus kernel: ahcich5: SATA connect time=1000us status=00000113 > > > Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset: device found > > > Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 80 80 77 01 40 00 00 00 00 00 00 > > > Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout > > > Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): Retrying command > > > > > > 4. Any I/O submit to ada5 during this time blocks (this is normal). > > > > > > 5. **While this situation is happening**, something using xpt(4) > > > attempts to submit a CDB to the disk (ex. smartctl -a /dev/ada5). > > > This request also blocks (again, normal). > > > > > > 6. Physical device falls off bus, or CAM kicks the disk off the bus. > > > Doesn't matter which. We see messages resembling this (boy am I tired > > > of this interspersed output problem): > > > > > > Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): lost device > > > Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): lost device > > > Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): removing device entry > > > Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): passdevgonecb: devfs entry is gone > > > > > > 7. Standard I/O requests fail with errno=6 "Device not configured". > > > xpt(4) requests also fail with the same errno. > > > > > > 8. Device-wise, at this stage all we have is: > > > > > > root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5* > > > crw------- 1 root operator 0x51 Jan 25 16:35 /dev/xpt0 > > > > > > 9. Device comes back online for whatever reason. FreeBSD sees the disk, > > > blah blah blah: > > > > > > Jan 25 16:30:16 icarus kernel: GEOM: new disk ada5 > > > Jan 25 16:30:16 icarus kernel: ada5: ATA-7 SATA 1.x device > > > Jan 25 16:30:16 icarus kernel: ada5: Serial Number WD-WMAP41573589 > > > Jan 25 16:30:16 icarus kernel: ada5: 150.000MB/s transfers (SATA 1.x, UDMA6, PIO 8192bytes) > > > Jan 25 16:30:16 icarus kernel: ada5: Command Queueing enabled > > > Jan 25 16:30:16 icarus kernel: ada5: 143089MB (293046768 512 byte sectors: 16H 63S/T 16383C) > > > Jan 25 16:30:16 icarus kernel: ada5: Previously was known as ad14 > > > > > > ...um, where's pass5? > > > > > > 10. /dev/pass5 is now completely (permanently) missing: > > > > > > root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5* > > > crw-r----- 1 root operator 0x99 Jan 25 16:42 /dev/ada5 > > > crw------- 1 root operator 0x51 Jan 25 16:35 /dev/xpt0 > > > > > > 11. Any further attempts to communicate via xpt(4) with ada5 fail. > > > Detaching and reattaching the disk does not fix the issue; the only fix > > > is to reboot the system. > > > > > > 12. "camcontrol debug -IPXp scbus5" results in tons and tons of output > > > all pertaining to xpt(4). It looks like xpt(4) is in some kind of > > > loop. > > > > > > Below is my verbose boot (with non-kernel things removed), which > > > also includes "camcontrol debug" output once things are in a bad state: > > > > > > http://jdc.koitsu.org/freebsd/xpt_oddity.log > > > > > > In this log you'll see that after 1 CAM timeout I yanked the drive, then > > > roughly 30 seconds later reinserted it. > > > > > > If you need me to turn on CAM debugging *prior* to the above, I can do > > > that, just let me know. > > > > > > The important step is #5. Without that, the problem shown in #9/10/11 > > > does not happen. > > > > > > It's a good thing I don't run smartd(8) -- most users I see using that > > > software set the interval to something like 180s or 60s. Imagine this > > > frustration: "okay so the disk fell off the bus, but what, now I can't > > > talk to it with SMART? Uhhh... Err, works now? Whatever". > > > > I think, the problem may already be fixed in HEAD by r244014 by ken@. > > I've just merged it to 9-STABLE at r247115. So if it is still possible > > to reproduce the situation, it would be good to try. > > Yep, I saw the commit per svn-src-stable-9@freebsd.org, along with > a bunch of others; I wasn't sure if r247114 or r247115 fixed it, so ws > waiting for a follow-up from you. :-) > > I'll rebuild world/kernel and try it out + report back. Thank you (and > ken@ too!) for the work on this. Got around to this today -- I can confirm as of r247132 on stable/9 the above problem is gone. Verification details below, for those who care: Initial attachment: Feb 22 19:40:32 icarus kernel: ada5 at ahcich5 bus 0 scbus5 target 0 lun 0 Feb 22 19:40:32 icarus kernel: ada5: ATA-8 SATA 2.x device Feb 22 19:40:32 icarus kernel: ada5: 300.000MB/s transfers (SATA 2.x, UDMA5, PIO 8192bytes) Feb 22 19:40:32 icarus kernel: ada5: Command Queueing enabled Feb 22 19:40:32 icarus kernel: ada5: 715404MB (1465149168 512 byte sectors: 16H 63S/T 16383C) Feb 22 19:40:32 icarus kernel: ada5: Previously was known as ad14 Ran dd if=/dev/zero of=/dev/ada5 bs=64k. Timeouts occurring due to physical issues with the disk itself: Feb 22 19:44:01 icarus kernel: ahcich5: Timeout on slot 0 port 0 Feb 22 19:44:01 icarus kernel: ahcich5: is 00000000 cs 00000000 ss 00000001 rs 00000001 tfd 40 serr 00000000 cmd 0004c017 Feb 22 19:44:01 icarus kernel: (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 80 00 0c 00 40 00 00 00 00 00 00 Feb 22 19:44:01 icarus kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout Feb 22 19:44:01 icarus kernel: (ada5:ahcich5:0:0:0): Retrying command Feb 22 19:44:33 icarus kernel: ahcich5: Timeout on slot 0 port 0 Feb 22 19:44:33 icarus kernel: ahcich5: is 00000000 cs 00000000 ss 00000001 rs 00000001 tfd 40 serr 00000000 cmd 0004c017 Feb 22 19:44:33 icarus kernel: (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 80 00 18 00 40 00 00 00 00 00 00 Feb 22 19:44:33 icarus kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout Feb 22 19:44:33 icarus kernel: (ada5:ahcich5:0:0:0): Retrying command Initiated smartctl -a /dev/ada5, which blocked as expected. Timeouts still happening, and to speed up the process I yanked the disk: Feb 22 19:45:32 icarus kernel: ahcich5: Timeout on slot 0 port 0 Feb 22 19:45:32 icarus kernel: ahcich5: is 00000000 cs 00000000 ss 00000001 rs 00000001 tfd 40 serr 00000000 cmd 0004c017 Feb 22 19:45:32 icarus kernel: (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 80 80 25 00 40 00 00 00 00 00 00 Feb 22 19:45:32 icarus kernel: (ada5:ahcich5:0:0:0): CAM status: Command timeout Feb 22 19:45:32 icarus kernel: (ada5:ahcich5:0:0:0): Retrying command Feb 22 19:45:55 icarus kernel: (ada5:ahcich5:0:0:0): lost device Feb 22 19:45:55 icarus kernel: (ada5:ahcich5:0:0:0): removing device entry After yanking: root@icarus:~ # smartctl -a /dev/ada5 smartctl 6.0 2012-10-10 r3643 [FreeBSD 9.1-STABLE amd64] (local build) Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org (pass5:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 01 00 (pass5:ahcich5:0:0:0): CAM status: Unconditionally Re-queue Request smartctl: cam_send_ccb: Device not configured Checking devices: root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5* crw------- 1 root operator 0x52 Feb 21 21:37 /dev/xpt0 Reinserted disk: Feb 22 19:47:53 icarus kernel: ada5 at ahcich5 bus 0 scbus5 target 0 lun 0 Feb 22 19:47:53 icarus kernel: ada5: ATA-8 SATA 2.x device Feb 22 19:47:53 icarus kernel: ada5: 300.000MB/s transfers (SATA 2.x, UDMA5, PIO 8192bytes) Feb 22 19:47:53 icarus kernel: ada5: Command Queueing enabled Feb 22 19:47:53 icarus kernel: ada5: 715404MB (1465149168 512 byte sectors: 16H 63S/T 16383C) Feb 22 19:47:53 icarus kernel: ada5: Previously was known as ad14 Devices look good: root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5* crw-r----- 1 root operator 0x98 Feb 22 19:47 /dev/ada5 crw------- 1 root operator 0x96 Feb 22 19:47 /dev/pass5 crw------- 1 root operator 0x52 Feb 21 21:37 /dev/xpt0 And smartctl works fine. :-) (Footnote for readers: The previous WD disk I was testing with went belly up in an even worse way so I couldn't use it, but thankfully I've lots of bad disks that exhibit repeated timeouts during I/O. *pats his angry ST3750630AS*) -- | Jeremy Chadwick jdc@koitsu.org | | UNIX Systems Administrator http://jdc.koitsu.org/ | | Mountain View, CA, US | | Making life hard for others since 1977. PGP 4BD6C0CB |