From owner-freebsd-geom@FreeBSD.ORG  Mon Jan 30 17:07:43 2012
Return-Path: <owner-freebsd-geom@FreeBSD.ORG>
Delivered-To: freebsd-geom@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id EFC3B1065670
	for <freebsd-geom@freebsd.org>; Mon, 30 Jan 2012 17:07:43 +0000 (UTC)
	(envelope-from lee@dilkie.com)
Received: from data.snhdns.com (data.snhdns.com [208.76.82.136])
	by mx1.freebsd.org (Postfix) with ESMTP id B75578FC0C
	for <freebsd-geom@freebsd.org>; Mon, 30 Jan 2012 17:07:42 +0000 (UTC)
Received: from 66-46-196-229.dedicated.allstream.net ([66.46.196.229]
	helo=[127.0.0.1])
	by data.snhdns.com with esmtpsa (TLSv1:AES256-SHA:256) (Exim 4.69)
	(envelope-from <lee@dilkie.com>)
	id 1Rruhf-0002Ke-66; Mon, 30 Jan 2012 12:07:39 -0500
Message-ID: <4F26CE5C.20003@dilkie.com>
Date: Mon, 30 Jan 2012 12:07:40 -0500
From: Lee Dilkie <lee@dilkie.com>
User-Agent: Mozilla/5.0 (Windows NT 5.2; WOW64;
	rv:9.0) Gecko/20111222 Thunderbird/9.0.1
MIME-Version: 1.0
To: Miroslav Lachman <000.fbsd@quip.cz>
References: <4F24785F.20607@dilkie.com> <4F247D69.6000105@dilkie.com>
	<4F249997.1010502@quip.cz>
In-Reply-To: <4F249997.1010502@quip.cz>
X-Enigmail-Version: 1.3.5
X-AntiAbuse: This header was added to track abuse,
	please include it with any abuse report
X-AntiAbuse: Primary Hostname - data.snhdns.com
X-AntiAbuse: Original Domain - freebsd.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - dilkie.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Cc: freebsd-geom <freebsd-geom@freebsd.org>
Subject: Re: gmirror question, drive missing
X-BeenThere: freebsd-geom@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: GEOM-specific discussions and implementations
	<freebsd-geom.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-geom>,
	<mailto:freebsd-geom-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-geom>
List-Post: <mailto:freebsd-geom@freebsd.org>
List-Help: <mailto:freebsd-geom-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-geom>,
	<mailto:freebsd-geom-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 30 Jan 2012 17:07:44 -0000


On 1/28/2012 7:57 PM, Miroslav Lachman wrote:
> Lee Dilkie wrote:
>> additional.
>>
>> like I said, the original setup had the drives swapped. ad10, now ad11,
>> was the source of the failure.
>>
>> from the log files..
>>
>> +ad10: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=1740673583
>> +ad10: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=1740673583
>> +GEOM_MIRROR: Request failed (error=5). ad10[READ(offset=891224874496, length=4096)]
>> +GEOM_MIRROR: Device gm0: provider ad10 disconnected.
>
> Your problem is exactly the above error - the disk is marked as BROKEN
> by gmirror and thus not connected (re-synchronized) to gm0 anymore.
>
> If you are really sure you want to re-add this broken disk into
> gmirror gm0, you must clear metadata on it, then remove info about
> this provider from gmirror configuration and then insert it into
> gmirror again.
>
> Example (if ad11 is the broken disk):
> gmirror clear -v ad11
> gmirror forget -v gm0
> gmirror insert -v gm0 ad11
>
> Maybe you can use gmirror remove instead of clear and forget, I am not
> sure.
>
> PS: I recommend you to check the disk with sysutils/smartmontools:
>
> smartctl -a /dev/ad11

Thanks Miroslav (sorry for the late reply, my home internet connection
went down on the weekend and hasn't recovered).

Yes, I did do a smartctl "long" test on the drive and it came back clean
so I'm not sure it was a drive failure.
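
For what it's worth, one way to double-check that clean result would be
to re-read the exact region that failed. The sketch below only converts
the byte offset from the GEOM_MIRROR message into a sector number and
prints a read-only dd command for it. It assumes 512-byte logical
sectors and that the disk is still ad11; the function name
offset_to_lba is made up for illustration.

```shell
#!/bin/sh
# Convert a byte offset (as logged by GEOM_MIRROR) into a sector number.
# The sector size defaults to 512 bytes, which is an assumption here;
# pass a second argument if the drive reports something else.
offset_to_lba() {
    echo $(( $1 / ${2:-512} ))
}

# Offset taken from the "Request failed" line in the quoted log.
lba=$(offset_to_lba 891224874496)
echo "LBA: $lba"

# Print (do not run) a read-only dd covering the 4096-byte request:
# 4096 bytes / 512 bytes per sector = 8 sectors.
echo "dd if=/dev/ad11 of=/dev/null bs=512 skip=$lba count=8"
```

With those assumptions the computed sector is 1740673583, the same LBA
the ad10 FAILURE line reports, so both messages point at one spot on
the disk.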

It was a very odd failure actually. The system should have continued to
run with the one drive gone, but it didn't. It stayed up and I was able
to access it using ssh for a while, but then it became clear that the
filesystem had "gone". Running applications couldn't access the
filesystem and eventually even ssh refused connections. It didn't happen
all at once, though; the first indication was imap complaining that it
couldn't access user mailboxes...

When I got someone to go in and reboot, the server wouldn't come up: the
failed drive, ad10 at the time, had no boot loader available. I don't
know what happened to that drive, but it was corrupted somehow (it is
still in the same state, if someone has an idea of what I could look for).

I swapped drive cables to the other drive from the mirror and it came
right up and has been running fine ever since (I did a manual fsck to
fix the unclean shutdown).


What I *think* happened was some sort of system h/w failure (maybe) that
made both drives not work properly. There's no indication of that in the
logs, but the logs seem to have stopped after the entries quoted above,
so I think the remaining drive was no longer writable. Just guessing.

-lee