From: Lee Dilkie
Date: Mon, 30 Jan 2012 12:07:40 -0500
To: Miroslav Lachman <000.fbsd@quip.cz>
Cc: freebsd-geom
Subject: Re: gmirror question, drive missing
Message-ID: <4F26CE5C.20003@dilkie.com>
In-Reply-To: <4F249997.1010502@quip.cz>

On 1/28/2012 7:57 PM, Miroslav Lachman wrote:
> Lee Dilkie wrote:
>> additional.
>>
>> like I said, the original setup had the drives swapped.
>> ad10, now ad11, was the source of the failure.
>>
>> from the log files..
>>
>> +ad10: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=1740673583
>> +ad10: FAILURE - READ_DMA48 status=51 error=10 LBA=1740673583
>> +GEOM_MIRROR: Request failed (error=5). ad10[READ(offset=891224874496, length=4096)]
>> +GEOM_MIRROR: Device gm0: provider ad10 disconnected.
>
> Your problem is exactly the above error - the disk is marked as BROKEN
> by gmirror and thus not connected (re-synchronized) to gm0 anymore.
>
> If you are really sure you want to re-add this broken disk into
> gmirror gm0, you must clear metadata on it, then remove info about
> this provider from gmirror configuration and then insert it into
> gmirror again.
>
> Example (if ad11 is the broken disk):
>
> gmirror clear -v ad11
> gmirror forget -v gm0
> gmirror insert -v gm0 ad11
>
> Maybe you can use gmirror remove instead of clear and forget, I am not
> sure.
>
> PS: I recommend you to check the disk with sysutils/smartmontools:
>
> smartctl -a /dev/ad11

Thanks Miroslav (sorry for the late reply; my home internet connection went down on the weekend and hasn't recovered).

Yes, I did run a smartctl "long" test on the drive and it came back clean, so I'm not sure it was a drive failure.

It was a very odd failure, actually. The system should have continued to run with the one drive gone, but it didn't. It stayed up and I was able to access it over ssh for a while, but then it became clear that the filesystem had "gone". Running applications couldn't access the filesystem and eventually even ssh refused connections. It didn't happen all at once, though; the first indication was imap complaining that it couldn't access user mailboxes...

When I got someone to go in and reboot, the server wouldn't come up: the failed drive, ad10 at the time, had no boot loader available... I don't know what happened to that drive, but it was corrupted somehow (it is still in the same state, if someone has an idea of what I could look for).

I swapped drive cables to the other drive from the mirror and it came right up and has been running fine ever since (I did a manual fsck to fix the unclean shutdown).

What I *think* happened was some sort of system h/w failure (maybe) that made both drives stop working properly. There's no indication of that in the logs, but the logs seem to have stopped after the entry quoted above, so I think the remaining drive was no longer writable. Just guessing.

-lee