From owner-freebsd-geom@FreeBSD.ORG  Thu Aug  3 09:12:46 2006
Message-ID: <44D1BE0B.9090709@quip.cz>
Date: Thu, 03 Aug 2006 11:12:43 +0200
From: Miroslav Lachman <000.fbsd@quip.cz>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;
	rv:1.7.12) Gecko/20050915
X-Accept-Language: cs, cz, en, en-us
MIME-Version: 1.0
To: rick-freebsd@kiwi-computer.com
References: <44D06650.1030803@quip.cz>
	<20060802183001.GA14279@megan.kiwi-computer.com>
	<44D10D1D.9040700@quip.cz>
	<20060802210709.GA15310@megan.kiwi-computer.com>
	<44D126EF.9070503@quip.cz> <44D12A80.9040802@quip.cz>
	<20060802233255.GB16385@megan.kiwi-computer.com>
In-Reply-To: <20060802233255.GB16385@megan.kiwi-computer.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Cc: freebsd-geom@freebsd.org
Subject: Re: gmirror Cannot add disk ad5 to gm0 (error=22)

Rick C. Petty wrote:

> On Thu, Aug 03, 2006 at 12:43:12AM +0200, Miroslav Lachman wrote:
> 
>>Something is definitely wrong. Gmirror status still shows 0% after a
>>couple of minutes (normally synchronization progresses at about 1% per minute)
> 
> 
> Under what conditions do you define "normally"?  I think you can tweak
> the numbers to make it go faster or slower, and I think it's dependent
> upon (disk) idle time.

normally = a few days ago, with the same HW, same BIOS settings, etc. The
whole synchronization of the 250GB disks finished in about 90 minutes.
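
As a rough sanity check on those numbers, here is a small sketch of the arithmetic. It assumes "250GB" means decimal gigabytes and that the systat figure quoted below (0.45 MB/s) is in units of 10^6 bytes per second; both are assumptions, not measurements.

```python
# Rough resync-time estimate from the figures quoted in this thread.
DISK_BYTES = 250 * 10**9    # assumption: "250GB" means decimal gigabytes
NORMAL_SECONDS = 90 * 60    # a full synchronization took about 90 minutes
SLOW_RATE = 0.45 * 10**6    # assumption: systat "MB/s" read as 10^6 bytes/s


def average_rate_mb_s(total_bytes, seconds):
    """Average throughput in MB/s needed to copy total_bytes in seconds."""
    return total_bytes / seconds / 10**6


def hours_to_sync(total_bytes, bytes_per_second):
    """Hours needed to copy total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / 3600


normal_rate = average_rate_mb_s(DISK_BYTES, NORMAL_SECONDS)
slow_hours = hours_to_sync(DISK_BYTES, SLOW_RATE)

print(f"normal average rate: {normal_rate:.1f} MB/s")
print(f"full resync at 0.45 MB/s: {slow_hours:.0f} hours")
```

Under these assumptions the normal run averages roughly 46 MB/s, which is in line with the usual 40MB/s, while at 0.45 MB/s a full resync would need about 154 hours, more than six days.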

>>systat -vmstat shows less than 1MB/s instead of the usual 40MB/s, but 100% busy.
>>
>>Disks   ad4   ad5
>>KB/t    121   128
>>tps       4     4
>>MB/s   0.45  0.45
>>% busy   83   103
> 
> 
> What other activity is happening on the box?  Are you in the middle of a
> background fsck?

Almost no other activity. The system has apache, mysql, postfix etc.
installed, but is not serving any requests. Fsck was not running.

> What does the output of "atacontrol mode ad4" (and ad5) show?  Are you
> sure your "normal" synchronization happened when you were in IDE mode
> instead of AHCI?

Yes, the "normal" synchronization was in IDE mode. IDE mode was set more
than a week ago, and as I have been playing with gmirror I have run the
synchronization many times.

# atacontrol mode ad4
current mode = SATA150
# atacontrol mode ad5
current mode = SATA150

>>Is there any chance to find the source of the problems without
>>step-by-step replacement of each component?
> 
> 
> That depends upon the problems.  To diagnose anything, you need to be
> able to reliably bring down the mirror-- e.g. heavy disk activity.
> 
> 
>>I can't believe that I have bad cables in 
>>4 new machines or bad hard drives in each machine... :o(
> 
> 
> I bought identical machines (cpus, boards, disks, cables, etc.) and had
> different results on each.  Especially when you buy identical stuff,
> there is a small probability that they'll all have the same problems--
> for example, a bad batch of disks.  In your case, I'd investigate which
> steps you have to perform to repeatably cause the failures.  On my
> systems, the heavier the disk load, the higher the probability of failure.
> Upgrading to the latest 6.1-STABLE might help in some cases.

Same here - the heavier the disk load, the more frequent the failures.
After a few crashes, the disks disappeared in the middle of a gmirror
synchronization (heavy disk load). The disk was replaced with a new one
without success, then the whole server was replaced and ran fine for
about 1 week under a heavy test load (concurrent copying of the ports
tree in an infinite loop). Now the problem mentioned above has occurred.

Now it seems that it is a disk problem this time. The synchronization
ran the whole night with tens or hundreds of messages like this:
ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=9719424
ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad5: error issuing SETFEATURES SET TRANSFER MODE command
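
To put a number on "tens or hundreds", the messages can be tallied per device and command. The following is only a sketch: the regular expression is my guess at the shape of these ata(4) messages based on the lines shown in this mail, and `count_ata_events` is a hypothetical helper name, not an existing tool.

```python
import re
from collections import Counter

# Assumed message shape: "<device>: <KIND> - <COMMAND WORDS> <details...>".
# The pattern is inferred from the log lines in this mail only.
ATA_EVENT = re.compile(
    r"^(ad\d+): (TIMEOUT|FAILURE|WARNING) - "
    r"([A-Z_]+(?: [A-Z_]+)*)(?=\s|$)"
)


def count_ata_events(lines):
    """Count (device, kind, command) triples in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ATA_EVENT.match(line.strip())
        if match:
            counts[match.groups()] += 1
    return counts


sample = [
    "ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=9719424",
    "ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - "
    "completing request directly",
    "ad5: error issuing SETFEATURES SET TRANSFER MODE command",
]

counts = count_ata_events(sample)
for (device, kind, command), n in sorted(counts.items()):
    print(f"{device} {kind:8} {command}: {n}")
```

Lines that do not follow the assumed shape, such as the "error issuing" line, are skipped. On a real system the input would be the lines of the kernel log file rather than the three-line sample used here.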

After six hours I got a message from smartd:
Device: /dev/ad5, FAILED SMART self-check. BACK UP DATA NOW!
Device: /dev/ad5, 52 Currently unreadable (pending) sectors
Device: /dev/ad5, 52 Offline uncorrectable sectors

90 minutes later the system rebooted itself and tried to rebuild
provider ad5, and /var/log/messages is full of:
ad5: FAILURE - SETFEATURES SET TRANSFER MODE status=71<READY,DMA_READY,DSC,ERROR> error=4<ABORTED>
ad5: FAILURE - SETFEATURES ENABLE RCACHE status=71<READY,DMA_READY,DSC,ERROR> error=4<ABORTED>
ad5: FAILURE - SETFEATURES ENABLE WCACHE status=71<READY,DMA_READY,DSC,ERROR> error=4<ABORTED>
ad5: FAILURE - SET_MULTI status=71<READY,DMA_READY,DSC,ERROR> error=4<ABORTED>
ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=1

1 hour later:
ad5: FAILURE - ATA_IDENTIFY status=71<READY,DMA_READY,DSC,ERROR> error=4<ABORTED> LBA=0
ad5: FAILURE - ATAPI_IDENTIFY status=71<READY,DMA_READY,DSC,ERROR> error=4<ABORTED> LBA=0

smartd[506]: Device: /dev/ad5, failed to read SMART Attribute Data

In the MRTG graphs I track disk temperature (38°C) and Reallocated
Sector Count, which has been increasing since the synchronization
started; after 5 hours the number of reallocated sectors went above
2000! (out of range of the graph)

After a manual reboot, there is no ad5 device. I hope a new drive helps,
but I am still nervous, because I have had similar troubles with 2
machines (both replaced with new ones - so I have played with 4 machines)...

Thank you for your help.

Miroslav Lachman