From: Johan Ström
Date: Wed, 16 Aug 2006 03:28:27 +0200
To: freebsd-stable@freebsd.org
Subject: Re: ATA problems again ... This time system froze!

On Jul 28, 2006, at 13:15 , Johan Ström wrote:

>
> On 17 jul 2006, at 17.40, Miroslav Lachman wrote:
>
>> Mike Tancsa wrote:
>> [..]
>>> Install the smartmontools from
>>> /usr/ports/sysutils/smartmontools/
>>> and post the output of
>>> smartctl -a /dev/ad8
>>
>> smartmontools was previously installed and running as daemon without any bad reports.
>> I can not run "smartctl -a /dev/ad8" now, because my server housing provider replaced HDD with the new one and after an hour of synchronization "ad8: FAILURE - device detached". So provider replaced whole server, only ad4 is original piece of HW.
>> On new server synchronization was much faster then in previous server (1:30 hour compared to 5 hours in previous server) - so I think it was HW problem.
>> Now I am running stresstest with copying /usr/ports to another partition in infinite loop.
>> I will post results later. (On bad server, test failed after about 30 minutes. On another server the test is running fine second day, so I think if disk will not fail after 1 day, problem is solved)
>>
>> At last - now I think this was not GEOM/gmirror related. I tried remove ad8 provider from gmirror (gm0), boot up system from gm0 with one provider (ad4) and test ad8 mounted separately - ad8 failed again.
>
> Just got another one..
>
> Jul 25 13:30:47 elfi kernel: ad4: FAILURE - device detached
> Jul 25 13:30:47 elfi kernel: subdisk4: detached
> Jul 25 13:30:47 elfi kernel: ad4: detached
> Jul 25 13:30:47 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad4s1 disconnected.
> Jul 25 13:30:47 elfi kernel: g_vfs_done():mirror/gm0s1f[READ(offset=46318008320, length=2048)]error = 6
> Jul 25 13:30:47 elfi kernel: g_vfs_done():mirror/gm0s1f[READ(offset=77269614592, length=16384)]error = 6
>
> 6 days uptime when this occured... Both disks are tested with PowerMax without a single problem (same with smartctl), both SATA cables are new. So the only hwproblem that I cant rule out would be the mobo, but that is quite new too...
>
> Solutions? Try RELENG_6 as recommended earlier?

Okay, still on 6.1-RELEASE:

FreeBSD elfi.stromnet.org 6.1-RELEASE FreeBSD 6.1-RELEASE #3: Tue May 9 20:40:23 CEST 2006 johan@elfi.stromnet.org:/usr/obj/usr/src/sys/GENERIC i386

Uptime approx 12 days since the last reboot for the raid fix... Just got home to find a box that doesn't respond to SSH.. the monitor tells me it has crashed totally.

From /var/log/messages:

Aug 16 00:58:37 elfi kernel: ad4: FAILURE - device detached
Aug 16 00:58:37 elfi kernel: subdisk4: detached
Aug 16 00:58:37 elfi kernel: ad4: detached
Aug 16 00:58:37 elfi kernel: GEOM_MIRROR: Cannot write metadata on ad4s1 (device=gm0s1, error=6).
Aug 16 00:58:37 elfi kernel: GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
Aug 16 00:58:37 elfi last message repeated 2 times
Aug 16 00:58:37 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad4s1 disconnected.
Aug 16 00:58:37 elfi kernel: g_vfs_done():mirror/gm0s1f[READ(offset=112910630912, length=32768)]error = 6
Aug 16 00:58:37 labdator kernel: nfs: server 192.168.1.2 not responding, still trying
Aug 16 00:58:37 labdator kernel: nfs: server 192.168.1.2 OK
Aug 16 03:04:21 elfi syslogd: kernel boot file is /boot/kernel/kernel
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325168128, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325184512, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325200896, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325217280, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325233664, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325250048, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2319169536, length=2048)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2312404992, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: Copyright (c) 1992-2006 The FreeBSD Project.
Aug 16 03:04:21 elfi kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Aug 16 03:04:21 elfi kernel: The Regents of the University of California. All rights reserved.
Aug 16 03:04:21 elfi kernel: FreeBSD 6.1-RELEASE #3: Tue May 9 20:40:23 CEST 2006
...(regular boot stuff)...

(labdator is a box that has an NFS export from elfi mounted)

dmesg shows me some other stuff that is not in messages:

ad4: FAILURE - device detached
subdisk4: detached
ad4: detached
GEOM_MIRROR: Cannot write metadata on ad4s1 (device=gm0s1, error=6).
GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
GEOM_MIRROR: Device gm0s1: provider ad4s1 disconnected.
g_vfs_done():mirror/gm0s1f[READ(offset=112910630912, length=32768)]error = 6
ad6: FAILURE - device detached
subdisk6: detached
ad6: detached
GEOM_MIRROR: Cannot write metadata on ad6s1 (device=gm0s1, error=6).
GEOM_MIRROR: Cannot update metadata on disk ad6s1 (error=6).
GEOM_MIRROR: Device gm0s1: provider ad6s1 disconnected.
GEOM_MIRROR: Device gm0s1: provider mirror/gm0s1 destroyed.
GEOM_MIRROR: Device gm0s1 destroyed.
g_vfs_done():mirror/gm0s1f[READ(offset=27868381184, length=32768)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324807680, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324824064, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324840448, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324856832, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324873216, length=16384)]error = 6
g_vfs_done():mirror/gm0s1f[READ(offset=17173594112, length=32768)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325168128, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325184512, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325200896, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325217280, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325233664, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325250048, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2319169536, length=2048)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2312404992, length=16384)]error = 6
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 6.1-RELEASE #3: Tue May 9 20:40:23 CEST 2006
(...boot..)

03:04 was when I got home; from other sources I've been told the box died around 01:21 (IRC pinged out, though maybe that was just logs failing to write to disk, which froze irssi or something).

Ok, so this time it didn't just fail the raid (which it has done before; a reboot and it started to rebuild..), this time it took the whole box down with it.. This is the first time it has happened since I got that new motherboard (read earlier thread)..

Later in boot:

Aug 16 03:04:21 elfi kernel: ad4: 286188MB at ata2-master SATA150
Aug 16 03:04:21 elfi kernel: ad6: 286188MB at ata3-master SATA150
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1 created (id=4118114647).
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad4s1 detected.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6s1 detected.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Component ad4s1 (device gm0s1) broken, skipping.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6s1 activated.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider mirror/gm0s1 launched.

Usually, when the box has been rebooted before, the failed component has been rebuilt automatically.. This time I solved it with:

$ gmirror forget
$ gmirror insert gm0s1 ad4s1

And now it's rebuilding ad4 again...

Any new hints? Should I try RELENG_6 instead?

Johan
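
For anyone sifting through logs like the ones above, here is a minimal sketch (not part of the original report) that tallies the g_vfs_done() failures per provider and operation and maps the numeric error to its errno name. Error 6 is ENXIO, which FreeBSD describes as "Device not configured"; that fits a provider that has disappeared underneath the filesystem. The names `summarize` and `G_VFS_DONE` are made up for illustration, and the pattern only assumes the line layout seen in this message.

```python
import errno
import re
from collections import Counter

# Matches kernel lines in the layout quoted in this message, e.g.
#   g_vfs_done():mirror/gm0s1d[WRITE(offset=2325168128, length=16384)]error = 6
# (hypothetical helper pattern; optional whitespace before "error" is tolerated)
G_VFS_DONE = re.compile(
    r"g_vfs_done\(\):(?P<provider>[^\[]+)\["
    r"(?P<op>READ|WRITE)\(offset=(?P<offset>\d+), length=(?P<length>\d+)\)\]"
    r"\s*error = (?P<error>\d+)"
)


def summarize(lines):
    """Count failed requests per (provider, operation, errno name)."""
    counts = Counter()
    for line in lines:
        match = G_VFS_DONE.search(line)
        if match is None:
            continue  # not a g_vfs_done() line, e.g. "ad4: detached"
        code = int(match.group("error"))
        # Fall back to the raw number if the platform has no name for it.
        name = errno.errorcode.get(code, str(code))
        counts[(match.group("provider"), match.group("op"), name)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "ad4: FAILURE - device detached",
        "g_vfs_done():mirror/gm0s1f[READ(offset=112910630912, length=32768)]error = 6",
        "g_vfs_done():mirror/gm0s1d[WRITE(offset=2325168128, length=16384)]error = 6",
    ]
    for (provider, op, name), n in sorted(summarize(sample).items()):
        print(f"{provider} {op} {name}: {n}")
```

Feeding it the whole dmesg excerpt gives a quick picture of which partitions (gm0s1d, gm0s1f) saw failed reads and writes after the mirror was destroyed.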