From owner-freebsd-geom@FreeBSD.ORG Thu Aug 3 09:12:46 2006
Message-ID: <44D1BE0B.9090709@quip.cz>
Date: Thu, 03 Aug 2006 11:12:43 +0200
From: Miroslav Lachman <000.fbsd@quip.cz>
To: rick-freebsd@kiwi-computer.com
Cc: freebsd-geom@freebsd.org
Subject: Re: gmirror Cannot add disk ad5 to gm0 (error=22)
In-Reply-To: <20060802233255.GB16385@megan.kiwi-computer.com>
References: <44D06650.1030803@quip.cz>
 <20060802183001.GA14279@megan.kiwi-computer.com>
 <44D10D1D.9040700@quip.cz>
 <20060802210709.GA15310@megan.kiwi-computer.com>
 <44D126EF.9070503@quip.cz>
 <44D12A80.9040802@quip.cz>
 <20060802233255.GB16385@megan.kiwi-computer.com>
List-Id: GEOM-specific discussions and implementations

Rick C. Petty wrote:
> On Thu, Aug 03, 2006 at 12:43:12AM +0200, Miroslav Lachman wrote:
>
>>Something is definitely wrong.
>>Gmirror status still shows 0% after a
>>couple of minutes (normally synchronization progress is about 1% per minute)
>
>
> Under what conditions do you define "normally"?  I think you can tweak
> the numbers to make it go faster or slower, and I think it's dependent
> upon (disk) idle time.

normally = a few days ago, same HW, same BIOS settings etc. The whole
synchronization of the 250GB disks was done after about 90 minutes.

>>systat -vmstat shows less than 1MB/s instead of the usual 40MB/s, but 100% busy.
>>
>>Disks    ad4   ad5
>>KB/t     121   128
>>tps        4     4
>>MB/s    0.45  0.45
>>% busy    83   103
>
>
> What other activity is happening on the box?  Are you in the middle of a
> background fsck?

Almost no other activity. The system has apache, mysql, postfix etc.
installed, but it is not serving any requests. Fsck was not running.

> What does the output of "atacontrol mode ad4" (and ad5) show?  Are you
> sure your "normal" synchronization happened when you were in IDE mode
> instead of AHCI?

Yes, the "normal" synchronization was in IDE mode. IDE mode was set more
than a week ago, and as I have been playing with gmirror I have run the
synchronization many times.

# atacontrol mode ad4
current mode = SATA150
# atacontrol mode ad5
current mode = SATA150

>>Is there any chance to find the source of the problems without step by step
>>replacement of each component?
>
>
> That depends upon the problems.  To diagnose anything, you need to be
> able to reliably bring down the mirror-- e.g. heavy disk activity.
>
>
>>I can't believe that I have bad cables in
>>4 new machines or bad hard drives in each machine... :o(
>
>
> I bought identical machines (cpus, boards, disks, cables, etc.) and had
> different results on each.  Especially when you buy identical stuff,
> there is a small probability that they'll all have the same problems--
> for example, a bad batch of disks.  In your case, I'd investigate which
> steps you have to perform to repeatably cause the failures.  On my
> systems, the heavier the disk load, the higher the probability of failure.
> Upgrading to the latest 6.1-STABLE might help in some cases.

Same here: the heavier the disk load, the more frequent the failures.
After a few crashes, disks disappeared in the middle of a gmirror
synchronization (heavy disk load). The disk was replaced with a new one
without success, then the whole server was replaced and ran fine for
about 1 week under heavy test load (concurrent copying of the ports tree
in an infinite loop). Now the mentioned problem has occurred.

Now it seems that it is a disk problem this time. Synchronization was
running the whole night with tens or hundreds of messages like this:

ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=9719424
ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad5: error issuing SETFEATURES SET TRANSFER MODE command

After six hours I got a message from smartd:

Device: /dev/ad5, FAILED SMART self-check. BACK UP DATA NOW!
Device: /dev/ad5, 52 Currently unreadable (pending) sectors
Device: /dev/ad5, 52 Offline uncorrectable sectors

90 minutes later the system rebooted itself and tried to rebuild provider
ad5, and /var/log/messages is full of:

ad5: FAILURE - SETFEATURES SET TRANSFER MODE status=71 error=4
ad5: FAILURE - SETFEATURES ENABLE RCACHE status=71 error=4
ad5: FAILURE - SETFEATURES ENABLE WCACHE status=71 error=4
ad5: FAILURE - SET_MULTI status=71 error=4
ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=1

1 hour later:

ad5: FAILURE - ATA_IDENTIFY status=71 error=4 LBA=0
ad5: FAILURE - ATAPI_IDENTIFY status=71 error=4 LBA=0
smartd[506]: Device: /dev/ad5, failed to read SMART Attribute Data

In the MRTG graphs I have the disk temperature (38°C) and the Reallocated
Sector Count, which has been increasing since the synchronization started.
After 5 hours the number of reallocated sectors went above 2000! (out of
range of the graph)

After a manual reboot, there is no ad5 device.

I hope a new drive helps, but I am still nervous, because I have had
similar troubles with 2 machines (both replaced with new ones - so I have
played with 4 machines)...
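
A rough sketch of how kernel messages like the ones above could be tallied per device and severity, to check whether the errors cluster on a single disk. The regular expression and the `tally` helper are illustrative assumptions derived only from the sample lines quoted in this message; they are not part of any FreeBSD tool, and real logs may contain formats this pattern does not match.

```python
import re
from collections import Counter

# Sample lines copied from the messages quoted in this email.
SAMPLE_LOG = """\
ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=9719424
ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad5: error issuing SETFEATURES SET TRANSFER MODE command
ad5: FAILURE - SETFEATURES SET TRANSFER MODE status=71 error=4
ad5: FAILURE - SETFEATURES ENABLE RCACHE status=71 error=4
ad5: FAILURE - SETFEATURES ENABLE WCACHE status=71 error=4
ad5: FAILURE - SET_MULTI status=71 error=4
ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=1
ad5: FAILURE - ATA_IDENTIFY status=71 error=4 LBA=0
ad5: FAILURE - ATAPI_IDENTIFY status=71 error=4 LBA=0
smartd[506]: Device: /dev/ad5, failed to read SMART Attribute Data
"""

# Assumed pattern: "<device>: <SEVERITY> - ...", possibly preceded by a
# syslog timestamp and host. Lines in other formats are simply skipped.
LINE_RE = re.compile(r"\b(ad\d+): (TIMEOUT|WARNING|FAILURE) - ")


def tally(log_text: str) -> Counter:
    """Count matching messages per (device, severity) pair."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


counts = tally(SAMPLE_LOG)
for (device, severity), n in sorted(counts.items()):
    print(device, severity, n)
```

Run against a full /var/log/messages, the same tally would show at a glance whether ad4 reports anything at all or whether every timeout and failure belongs to ad5.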
Thank you for your help.

Miroslav Lachman
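
As a footnote to the throughput figures in this message, here is a small worked estimate of what they imply. It assumes that "250GB" means 250 * 10^9 bytes, that systat's MB/s can be treated as 10^6 bytes per second, and that the copy rate stays constant; none of these is confirmed above, so the results are only approximate.

```python
def sync_hours(disk_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to copy disk_bytes at a constant rate."""
    return disk_bytes / rate_bytes_per_s / 3600.0


DISK_BYTES = 250e9  # assumption: "250GB" = 250 * 10^9 bytes
MB = 1e6            # assumption: 1 MB = 10^6 bytes

# Average rate implied by a full synchronization in about 90 minutes.
healthy_rate = DISK_BYTES / (90 * 60)

# Time a full synchronization would take at the observed 0.45 MB/s.
slow_hours = sync_hours(DISK_BYTES, 0.45 * MB)

print(f"rate implied by a 90-minute sync: {healthy_rate / MB:.1f} MB/s")
print(f"full sync at 0.45 MB/s: {slow_hours:.0f} hours ({slow_hours / 24:.1f} days)")
```

Under these assumptions the 90-minute run works out to roughly 46 MB/s, in line with the usual 40MB/s reported by systat, while 0.45 MB/s would stretch a full synchronization to several days.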