Date: Mon, 04 Oct 2010 17:03:47 +0300
From: Alexander Motin <mav@FreeBSD.org>
To: Alexander Leidinger <Alexander@Leidinger.net>
Cc: freebsd-stable <freebsd-stable@freebsd.org>, Steve Polyack <korvus@comcast.net>, Jeremy Chadwick <freebsd@jdc.parodius.com>, Dan Langille <dan@langille.org>
Subject: Re: out of HDD space - zfs degraded
Message-ID: <4CA9DEC3.1000302@FreeBSD.org>
In-Reply-To: <20101003110338.00004197@unknown>
References: <4CA73702.5080203@langille.org> <20101002141921.GC70283@icarus.home.lan> <4CA7AD95.9040703@langille.org> <20101002223626.GB78136@icarus.home.lan> <4CA7BEE4.9050201@langille.org> <20101002235024.GA80643@icarus.home.lan> <4CA7E4AE.4060607@langille.org> <4CA7E98E.3040701@comcast.net> <20101003110338.00004197@unknown>

Alexander Leidinger wrote:
> On Sat, 02 Oct 2010 22:25:18 -0400 Steve Polyack <korvus@comcast.net>
> wrote:
>
>> I think it's worth it to think about TLER (or the absence of it) here -
>> http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery . Your
>> consumer / SATA Hitachi drives likely do not put a limit on the time
>> the drive may block on a command while handling internal errors. If
>> we consider that gpt/gisk06-live encountered some kind of error and
>> had to relocate a significant number of blocks or perform some other
>> error recovery, then it very well may have timed out long enough for
>> siis(4) to drop the device. I have no idea what the timeouts are set
>> to in the siis(4) driver, nor does anything in your SMART report
>> stick out to me (though I'm certainly no expert with SMART data, and
>> my understanding is that many drive manufacturers report the various
>> parameters in different ways).

Command timeouts are usually defined by the ada(4) peripheral driver and
the ATA transport layer of CAM. Most of them are set to 30 seconds. The
only time value defined by siis(4) is the hard reset time, currently
15 seconds.

Since the drive did not reappear after `camcontrol reset/rescan ...` was
run a significant time later, and required a power cycle instead, I doubt
that any timeout value could have helped. It is also theoretically
possible that the controller firmware got stuck, not the drive. It would
be interesting to power cycle that specific drive if the problem repeats.

> IIRC mav@ (CCed) made a commit regarding this to -current in the not so
> distant past. I do not know about the MFC status of this, or if it may
> have helped or not in this situation.

My last commit to siis(4), two weeks ago (merged recently), fixed a
specific bug in timeout handling that led to a system crash. I don't see
similar symptoms here.

If there were any messages before "Oct 2 00:50:53 kraken kernel:
(ada0:siisch0:0:0:0): lost device", they could give some hints about the
original problem. Messages after it could be a consequence. Enabling
verbose kernel messages could give some more information about what
happened there.

-- 
Alexander Motin
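
The point about kernel messages logged before the "lost device" line can be turned into a small log filter. What follows is only a minimal sketch and not something from the original message: the function names `context_before` and `print_context`, the default of 50 lines, and the use of /var/log/messages as the log file are all assumptions made for illustration.

```python
from collections import deque
from typing import Iterable, List


def context_before(lines: Iterable[str],
                   marker: str = "lost device",
                   count: int = 50) -> List[str]:
    """Return up to `count` lines preceding the first line that contains `marker`.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    window = deque(maxlen=count)  # keeps only the most recent `count` lines
    for line in lines:
        if marker in line:
            return list(window)
        window.append(line.rstrip("\n"))
    return []  # marker never seen, so there is no context to report


def print_context(path: str = "/var/log/messages") -> None:
    """Print the lines leading up to the first "lost device" message.

    The default path is an assumption; point it at wherever syslog
    stores kernel messages on the machine in question.
    """
    with open(path, errors="replace") as log:
        for entry in context_before(log):
            print(entry)
```

Calling `print_context()` on the affected host would show whatever was logged just before the device was dropped, which is the part of the log asked for above. Only the first occurrence is examined, so later "lost device" events would need a separate pass.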