From owner-freebsd-fs@FreeBSD.ORG Fri Mar 1 20:43:53 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 2DC6CCCC; Fri, 1 Mar 2013 20:43:53 +0000 (UTC) (envelope-from utisoft@gmail.com) Received: from mail-ia0-x22d.google.com (mail-ia0-x22d.google.com [IPv6:2607:f8b0:4001:c02::22d]) by mx1.freebsd.org (Postfix) with ESMTP id E3B8B172A; Fri, 1 Mar 2013 20:43:52 +0000 (UTC) Received: by mail-ia0-f173.google.com with SMTP id h37so3043435iak.18 for ; Fri, 01 Mar 2013 12:43:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:mime-version:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=2g9kPW/UNrvBQ2y31CE+9hcRlAGhq335WEw+3Nrux7U=; b=mtWTcfOt5u88QYbcR1TfnYITgplpKRngh+tX1j3IyJ9Ddgnj8QrjOAQKyuH3VQhR5N X7SBcot2NCug+7abb5BdSiXRo/Zmf8hOYe8BegYoL9IsbGqIqmRphUJbejMsGPjaA5vb zcPjSXESytJLwGdTy33JSWfc7X/Yq2dvHbEfWGbgc2O2cn1giKZA3J2QuWRR9PYfQSw7 pI553v+fvGwVDluF0JuDjbJNJTdrchXlySfDB6xV3rqp0YCsxVjzCU9PTyvkbKC4pbXE cC3vRYbwjHxa4GzLBiLMKG5Fq7WQFYg74Vox5QpyXdwnhgbSkG37aE3AsA2qRuYFN0uU bMSA== X-Received: by 10.42.126.133 with SMTP id e5mr7497058ics.17.1362170632573; Fri, 01 Mar 2013 12:43:52 -0800 (PST) MIME-Version: 1.0 Received: by 10.64.63.12 with HTTP; Fri, 1 Mar 2013 12:43:22 -0800 (PST) In-Reply-To: References: <20130121221617.GA23909@icarus.home.lan> <50FED818.7070704@FreeBSD.org> <20130125083619.GA51096@icarus.home.lan> <20130125211232.GA3037@icarus.home.lan> <20130125212559.GA1772@icarus.home.lan> <20130125213209.GA1858@icarus.home.lan> <20130126011754.GA1806@icarus.home.lan> <51267055.3040500@FreeBSD.org> From: Chris Rees Date: Fri, 1 Mar 2013 20:43:22 +0000 Message-ID: Subject: Re: disk "flipped" - a known problem? To: mav@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 Cc: Jeremy Chadwick , "freebsd-fs@freebsd.org" , Andriy Gapon X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Mar 2013 20:43:53 -0000 On 23 February 2013 09:39, Chris Rees wrote: > > On 21 Feb 2013 19:07, "Alexander Motin" wrote: >> >> On 26.01.2013 03:17, Jeremy Chadwick wrote: >> > Okay, I've figured out the exact, 100% reproducible condition that >> > causes the situation. It took me a lot of tries and a digital pocket >> > recorder to take verbal notes (there are just too many things to look at >> > simultaneously), but I've figured it out. >> > >> > I'm sorry for the verbosity, but it's necessary. >> > >> > Assume the disk we're talking about is /dev/ada5. >> > >> > 1. Prior to any issues, we have this: >> > >> > root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5* >> > crw-r----- 1 root operator 0x8c Jan 25 16:41 /dev/ada5 >> > crw------- 1 root operator 0x75 Jan 25 16:35 /dev/pass5 >> > crw------- 1 root operator 0x51 Jan 25 16:35 /dev/xpt0 >> > >> > 2. ada5 begins experiencing issues -- ATA commands (CDBs) submit do not >> > get a response (not going to discuss how/why that can happen). >> > >> > 3. These types of messages are seen on console (naturally the CDB and >> > request type will vary -- in this case it was because I was doing the dd >> > zero'ing, thus tickling the bad sector/naughty firmware on the drive): >> > >> > Jan 25 16:29:28 icarus kernel: ahcich5: Timeout on slot 0 port 0 >> > Jan 25 16:29:28 icarus kernel: ahcich5: is 00000000 cs 00000000 ss >> > 00000001 rs 00000001 tfd 40 serr 00000000 cmd 0004c017 >> > Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset... >> > Jan 25 16:29:28 icarus kernel: ahcich5: SATA connect time=1000us >> > status=00000113 >> > Jan 25 16:29:28 icarus kernel: ahcich5: AHCI reset: device found >> > Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. >> > ACB: 61 80 80 77 01 40 00 00 00 00 00 00 >> > Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): CAM status: Command >> > timeout >> > Jan 25 16:29:28 icarus kernel: (ada5:ahcich5:0:0:0): Retrying command >> > >> > 4. Any I/O submit to ada5 during this time blocks (this is normal). >> > >> > 5. **While this situation is happening**, something using xpt(4) >> > attempts to submit a CDB to the disk (ex. smartctl -a /dev/ada5). >> > This request also blocks (again, normal). >> > >> > 6. Physical device falls off bus, or CAM kicks the disk off the bus. >> > Doesn't matter which. We see messages resembling this (boy am I tired >> > of this interspersed output problem): >> > >> > Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): lost device >> > Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): lost device >> > Jan 25 16:29:32 icarus kernel: (ada5:ahcich5:0:0:0): removing device >> > entry >> > Jan 25 16:29:32 icarus kernel: (pass5:ahcich5:0:0:0): passdevgonecb: >> > devfs entry is gone >> > >> > 7. Standard I/O requests fail with errno=6 "Device not configured". >> > xpt(4) requests also fail with the same errno. >> > >> > 8. Device-wise, at this stage all we have is: >> > >> > root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5* >> > crw------- 1 root operator 0x51 Jan 25 16:35 /dev/xpt0 >> > >> > 9. Device comes back online for whatever reason. FreeBSD sees the disk, >> > blah blah blah: >> > >> > Jan 25 16:30:16 icarus kernel: GEOM: new disk ada5 >> > Jan 25 16:30:16 icarus kernel: ada5: >> > ATA-7 SATA 1.x device >> > Jan 25 16:30:16 icarus kernel: ada5: Serial Number WD-WMAP41573589 >> > Jan 25 16:30:16 icarus kernel: ada5: 150.000MB/s transfers (SATA 1.x, >> > UDMA6, PIO 8192bytes) >> > Jan 25 16:30:16 icarus kernel: ada5: Command Queueing enabled >> > Jan 25 16:30:16 icarus kernel: ada5: 143089MB (293046768 512 byte >> > sectors: 16H 63S/T 16383C) >> > Jan 25 16:30:16 icarus kernel: ada5: Previously was known as ad14 >> > >> > ...um, where's pass5? >> > >> > 10. /dev/pass5 is now completely (permanently) missing: >> > >> > root@icarus:~ # ls -l /dev/ada5* /dev/xpt* /dev/pass5* >> > crw-r----- 1 root operator 0x99 Jan 25 16:42 /dev/ada5 >> > crw------- 1 root operator 0x51 Jan 25 16:35 /dev/xpt0 >> > >> > 11. Any further attempts to communicate via xpt(4) with ada5 fail. >> > Detaching and reattaching the disk does not fix the issue; the only fix >> > is to reboot the system. >> > >> > 12. "camcontrol debug -IPXp scbus5" results in tons and tons of output >> > all pertaining to xpt(4). It looks like xpt(4) is in some kind of >> > loop. >> > >> > Below is my verbose boot (with non-kernel things removed), which >> > also includes "camcontrol debug" output once things are in a bad state: >> > >> > http://jdc.koitsu.org/freebsd/xpt_oddity.log >> > >> > In this log you'll see that after 1 CAM timeout I yanked the drive, then >> > roughly 30 seconds later reinserted it. >> > >> > If you need me to turn on CAM debugging *prior* to the above, I can do >> > that, just let me know. >> > >> > The important step is #5. Without that, the problem shown in #9/10/11 >> > does not happen. >> > >> > It's a good thing I don't run smartd(8) -- most users I see using that >> > software set the interval to something like 180s or 60s. Imagine this >> > frustration: "okay so the disk fell off the bus, but what, now I can't >> > talk to it with SMART? Uhhh... Err, works now? Whatever". >> >> I think, the problem may already be fixed in HEAD by r244014 by ken@. >> I've just merged it to 9-STABLE at r247115. So if it is still possible >> to reproduce the situation, it would be good to try. > > I think I've been having the same troubles since upgrading from 9.0, so I'm > going to try applying that to 9.1-R and I'll also give feedback. Yup, I no longer get weird disconnects after this patch (5 days later now). Thank you very much! Chris