Date: Wed, 16 Aug 2006 03:28:27 +0200
From: Johan Ström <johan@stromnet.org>
To: freebsd-stable@freebsd.org
Subject: Re: ATA problems again ... This time system froze!
Message-ID: <A30712DF-85D8-4393-AA88-D45732147BFA@stromnet.org>
In-Reply-To: <0B43BAB0-BBF0-4E2C-875D-6E1E00BAB1D4@stromnet.org>
References: <DAFCD4DC-D2D4-4574-ACBF-367D642D9729@stromnet.org> <8D08DDB6-6AC1-45B6-B2CE-08782F54968A@stromnet.org> <884C01BC-3E97-46EC-AA8B-E70C3931F3A4@stromnet.org> <36895211-2796-4213-B336-6279AB3AC3CB@stromnet.org> <20060713132357.Y61840@fledge.watson.org> <44B7EA39.4060509@quip.cz> <6.2.3.4.0.20060716185019.12a29240@64.7.153.2> <44BBAF52.9080007@quip.cz> <0B43BAB0-BBF0-4E2C-875D-6E1E00BAB1D4@stromnet.org>
On Jul 28, 2006, at 13:15, Johan Ström wrote:

>
> On 17 jul 2006, at 17.40, Miroslav Lachman wrote:
>
>> Mike Tancsa wrote:
>> [..]
>>> Install the smartmontools from
>>> /usr/ports/sysutils/smartmontools/
>>> and post the output of
>>> smartctl -a /dev/ad8
>>
>> smartmontools was previously installed and running as daemon
>> without any bad reports.
>> I can not run "smartctl -a /dev/ad8" now, because my server
>> housing provider replaced HDD with the new one and after an hour
>> of synchronization "ad8: FAILURE - device detached". So provider
>> replaced whole server, only ad4 is original piece of HW.
>> On new server synchronization was much faster than in previous
>> server (1:30 hour compared to 5 hours in previous server) - so I
>> think it was HW problem.
>> Now I am running stresstest with copying /usr/ports to another
>> partition in infinite loop.
>> I will post results later. (On bad server, test failed after about
>> 30 minutes. On another server the test is running fine second day,
>> so I think if disk will not fail after 1 day, problem is solved)
>>
>> At last - now I think this was not GEOM/gmirror related. I tried
>> remove ad8 provider from gmirror (gm0), boot up system from gm0
>> with one provider (ad4) and test ad8 mounted separately - ad8
>> failed again.
>
> Just got another one..
>
> Jul 25 13:30:47 elfi kernel: ad4: FAILURE - device detached
> Jul 25 13:30:47 elfi kernel: subdisk4: detached
> Jul 25 13:30:47 elfi kernel: ad4: detached
> Jul 25 13:30:47 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad4s1 disconnected.
> Jul 25 13:30:47 elfi kernel: g_vfs_done():mirror/gm0s1f[READ(offset=46318008320, length=2048)]error = 6
> Jul 25 13:30:47 elfi kernel: g_vfs_done():mirror/gm0s1f[READ(offset=77269614592, length=16384)]error = 6
>
> 6 days uptime when this occurred...
> Both disks are tested with PowerMax without a single problem (same
> with smartctl), both SATA cables are new. So the only hardware
> problem that I can't rule out would be the mobo, but that is quite
> new too...
>
> Solutions? Try RELENG_6 as recommended earlier?

Okay, still on 6.1-RELEASE:

FreeBSD elfi.stromnet.org 6.1-RELEASE FreeBSD 6.1-RELEASE #3: Tue May 9 20:40:23 CEST 2006 johan@elfi.stromnet.org:/usr/obj/usr/src/sys/GENERIC i386

Uptime approx. 12 days since the last reboot for the RAID fix... Just got
home to find a box which doesn't respond to SSH; the monitor tells me it
has crashed totally.

From /var/log/messages:

Aug 16 00:58:37 elfi kernel: ad4: FAILURE - device detached
Aug 16 00:58:37 elfi kernel: subdisk4: detached
Aug 16 00:58:37 elfi kernel: ad4: detached
Aug 16 00:58:37 elfi kernel: GEOM_MIRROR: Cannot write metadata on ad4s1 (device=gm0s1, error=6).
Aug 16 00:58:37 elfi kernel: GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
Aug 16 00:58:37 elfi last message repeated 2 times
Aug 16 00:58:37 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad4s1 disconnected.
Aug 16 00:58:37 elfi kernel: g_vfs_done():mirror/gm0s1f[READ(offset=112910630912, length=32768)]error = 6
Aug 16 00:58:37 labdator kernel: nfs: server 192.168.1.2 not responding, still trying
Aug 16 00:58:37 labdator kernel: nfs: server 192.168.1.2 OK
Aug 16 03:04:21 elfi syslogd: kernel boot file is /boot/kernel/kernel
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325168128, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325184512, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325200896, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325217280, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325233664, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2325250048, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2319169536, length=2048)]error = 6
Aug 16 03:04:21 elfi kernel: g_vfs_done():mirror/gm0s1d[WRITE(offset=2312404992, length=16384)]error = 6
Aug 16 03:04:21 elfi kernel: Copyright (c) 1992-2006 The FreeBSD Project.
Aug 16 03:04:21 elfi kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Aug 16 03:04:21 elfi kernel: The Regents of the University of California. All rights reserved.
Aug 16 03:04:21 elfi kernel: FreeBSD 6.1-RELEASE #3: Tue May 9 20:40:23 CEST 2006
...(regular boot stuff)...

(labdator is a box with an elfi NFS export mounted)

dmesg shows me some other stuff not in messages:

ad4: FAILURE - device detached
subdisk4: detached
ad4: detached
GEOM_MIRROR: Cannot write metadata on ad4s1 (device=gm0s1, error=6).
GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
GEOM_MIRROR: Cannot update metadata on disk ad4s1 (error=6).
GEOM_MIRROR: Device gm0s1: provider ad4s1 disconnected.
g_vfs_done():mirror/gm0s1f[READ(offset=112910630912, length=32768)]error = 6
ad6: FAILURE - device detached
subdisk6: detached
ad6: detached
GEOM_MIRROR: Cannot write metadata on ad6s1 (device=gm0s1, error=6).
GEOM_MIRROR: Cannot update metadata on disk ad6s1 (error=6).
GEOM_MIRROR: Device gm0s1: provider ad6s1 disconnected.
GEOM_MIRROR: Device gm0s1: provider mirror/gm0s1 destroyed.
GEOM_MIRROR: Device gm0s1 destroyed.
g_vfs_done():mirror/gm0s1f[READ(offset=27868381184, length=32768)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324807680, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324824064, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324840448, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324856832, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[READ(offset=2324873216, length=16384)]error = 6
g_vfs_done():mirror/gm0s1f[READ(offset=17173594112, length=32768)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325168128, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325184512, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325200896, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325217280, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325233664, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2325250048, length=16384)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2319169536, length=2048)]error = 6
g_vfs_done():mirror/gm0s1d[WRITE(offset=2312404992, length=16384)]error = 6
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 6.1-RELEASE #3: Tue May 9 20:40:23 CEST 2006
(...boot..)

03:04 was when I got home; from other sources I've been told the box
died around 01:21 (IRC pinged out; maybe it was just logs failing to
write to disk that froze irssi or something).

OK, so this time it didn't just fail the RAID (which it has done
before; a reboot and it started to rebuild). This time it took the
whole box down with it. This is the first time it has happened since
I got that new motherboard (see the earlier thread).

Later in boot:

Aug 16 03:04:21 elfi kernel: ad4: 286188MB <Maxtor 7L300S0 BANC1G10> at ata2-master SATA150
Aug 16 03:04:21 elfi kernel: ad6: 286188MB <Maxtor 7L300S0 BANC1G10> at ata3-master SATA150
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1 created (id=4118114647).
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad4s1 detected.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6s1 detected.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Component ad4s1 (device gm0s1) broken, skipping.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6s1 activated.
Aug 16 03:04:21 elfi kernel: GEOM_MIRROR: Device gm0s1: provider mirror/gm0s1 launched.

Usually, when the box has been rebooted before, the failed component
has been rebuilt automatically.

Solved with:

$ gmirror forget
$ gmirror insert gm0s1 ad4s1

And now it's rebuilding ad4 again...

Any new hints? Should I try RELENG_6 instead?

Johan
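
For anyone comparing such log excerpts, the g_vfs_done() failures quoted above can be tallied per provider and operation with a short script. This is only a rough sketch: the regular expression and the function name summarize_g_vfs_errors are assumptions made for illustration and are not part of any FreeBSD tool. It tolerates the optional whitespace that mail wrapping may introduce around "(offset=" and "error". Error 6 on FreeBSD is ENXIO ("Device not configured").

```python
import re
from collections import Counter

# Hypothetical pattern (an assumption, not an official format definition).
# It matches lines such as:
#   g_vfs_done():mirror/gm0s1d[WRITE(offset=2325168128, length=16384)]error = 6
G_VFS_RE = re.compile(
    r"g_vfs_done\(\):(?P<prov>[^\[\s]+)\[(?P<op>READ|WRITE)\s*"
    r"\(offset=(?P<offset>\d+),\s*length=(?P<length>\d+)\)\]\s*"
    r"error\s*=\s*(?P<errno>\d+)"
)


def summarize_g_vfs_errors(text):
    """Count g_vfs_done failures per (provider, operation, errno)."""
    counts = Counter()
    for m in G_VFS_RE.finditer(text):
        key = (m.group("prov"), m.group("op"), int(m.group("errno")))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the dmesg output in this message.
    sample = (
        "g_vfs_done():mirror/gm0s1f[READ(offset=112910630912, length=32768)]error = 6\n"
        "g_vfs_done():mirror/gm0s1d[WRITE(offset=2325168128, length=16384)]error = 6\n"
        "g_vfs_done():mirror/gm0s1d[WRITE(offset=2325184512, length=16384)]error = 6\n"
    )
    for (prov, op, err), n in sorted(summarize_g_vfs_errors(sample).items()):
        print(f"{prov} {op} errno={err}: {n}")
```

Feeding it the whole of /var/log/messages or saved dmesg output would show at a glance which mirror partitions were hit and whether the failures were reads or writes.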