From: Johan Ström <johan@stromnet.se>
To: freebsd-geom@freebsd.org, freebsd-stable@freebsd.org
Date: Tue, 21 Aug 2007 14:15:08 +0200
Subject: Crashed gmirror, single disk marked SYNC and wont boot...
Message-Id: <8039436E-1824-4C2E-915B-9069DEF23B10@stromnet.se>

Hi

FreeBSD gw-1.stromnet.se 6.2-RELEASE-p1 FreeBSD 6.2-RELEASE-p1 #7: Tue Feb 13 18:24:34 CET 2007 johan@elfi.stromnet.se:/usr/obj/usr/src/sys/ROUTER.POLLING i386

(ROUTER.POLLING is GENERIC + options DEVICE_POLLING and ALTQ, IPSEC,
also pfsync and carp)

This weekend I had a disk fail on me in a machine running gmirror gm0
with 2 providers (ad0 and ad6). The whole box froze with no screen
output, and on hard reboot I got some LBA errors etc. from ad0. After a
few reboots it got up and running though (I wasn't at the screen and
had to do it by phone, so I couldn't really debug very well).

As soon as the box got up, I removed ad0 from the gmirror, so ad6 was
the only provider.

Today I got a new disk to replace ad0. Now remember, ad6 was the only
disk in the mirror. I took the box down fine and replaced the disk. ad0
was now gone and instead I had ad4 (ad4 and ad6 are SATA, ad0 was IDE).
I changed the boot order so that I booted off the old SATA disk.

Okay, there came the first problem: the boot loader gave me the usual
options, F1 FreeBSD, F5 Disk 2 (or whatever it said). If I pressed F1 I
got the same prompt again. F5, nothing at all. Funny! The system
refused to load the loader (or whatever the 1-9 menu thingy is called),
the kernel, or anything.

So I finally plugged the old ad0 disk into the machine to at least get
it booted, thinking it would come up on the gmirror. Nope:
(I took the new ad4 out here)

  ad0: 38166MB at ata0-master UDMA100
  ad6: 152627MB at ata3-master SATA150
  GEOM_MIRROR: Device gm0 created (id=4029378995).
  GEOM_MIRROR: Device gm0: provider ad6 detected.
  Root mount waiting for: GMIRROR
  Root mount waiting for: GMIRROR
  Root mount waiting for: GMIRROR
  Root mount waiting for: GMIRROR
  GEOM_MIRROR: Force device gm0 start due to timeout.
  Trying to mount root from ufs:/dev/mirror/gm0s1a

  Manual root filesystem specification:
    <fstype>:<device>  Mount <device> using filesystem <fstype>
                         eg. ufs:da0s1a
    ?                  List valid disk boot devices
    <empty line>       Abort manual input

  mountroot>

Okay... so why wouldn't it load my mirror from ad6 now?? I had just
done a clean shutdown without problems. It didn't even recognize any
slices on ad6s1 (although ad6s1 itself was found).

I entered ad0s1 as root and booted from there. Of course I ended up in
the emergency shell, since fstab looked for the gmirror devices, which
didn't exist.

After some more digging into gmirror, I did a gmirror dump ad6:

  Metadata on /dev/ad6:
       magic: GEOM::MIRROR
     version: 3
        name: gm0
         mid: 4029378995
         did: 449032193
         all: 3
       genid: 0
      syncid: 5
    priority: 0
       slice: 4096
     balance: round-robin
   mediasize: 20416757248
  sectorsize: 512
  syncoffset: 0
      mflags: NONE
      dflags: SYNCHRONIZING
  hcprovider:
    provsize: 160041885696
    MD5 hash: 6e1e8ca80a27e0e1b0460feab595c39f

Some googling indicated that SYNCHRONIZING means that the disk is not
"complete" and won't mount. Is that correct? Why would it be in that
state then, when I had just shut it down fine? And where the f*ck did
my slices go??

I did a sysctl kern.geom.mirror.debug=2 and tried to gmirror activate
the mirror:

  GEOM_MIRROR[1]: Creating device gm0 (id=4029378995).
  GEOM_MIRROR[0]: Device gm0 created (id=4029378995).
  GEOM_MIRROR[1]: root_mount_hold 0xc3539510
  GEOM_MIRROR[1]: Adding disk ad6 to gm0.
  GEOM_MIRROR[2]: Adding disk ad6.
  GEOM_MIRROR[2]: Disk ad6 connected.
  GEOM_MIRROR[1]: Disk ad6 state changed from NONE to NEW (device gm0).
  GEOM_MIRROR[0]: Device gm0: provider ad6 detected.
  GEOM_MIRROR[2]: Tasting ad6s1.
  GEOM_MIRROR[0]: Force device gm0 start due to timeout.
  GEOM_MIRROR[1]: root_mount_rel[2169] 0xc3539510
  GEOM_MIRROR[2]: No I/O requests for gm0, it can be destroyed.
  GEOM_MIRROR[2]: Metadata on ad6 updated.
  GEOM_MIRROR[2]: Access ad6 r-1w-1e-1 = 0
  GEOM_MIRROR[0]: Device gm0 destroyed.
  GEOM_MIRROR[1]: Thread exiting.
  GEOM_MIRROR[1]: Consumer ad6 destroyed.

So, what is going on here? Anyone with some clues? I'm currently
running on the ad0 disk with no RAID at all. Let's hope it doesn't die
on me (I haven't seen any signs of that since Sunday, when it froze and
gave boot errors, so I'm hoping).

The data loss from using ad0 instead of ad6 is probably minimal. It's a
router, so more or less only logging seems to have been lost. For now I
just want to get clear about wth happened here and how to prevent it,
and how to get back up on a gmirror with ad6 and ad4 (to be plugged in)
so I can throw ad0 out.

Thanks

--
Johan Ström
Stromnet
johan@stromnet.se
http://www.stromnet.se/
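
For anyone who wants to compare the numbers in the gmirror dump quoted in this message with the disk sizes in the boot output, here is a minimal sketch in Python. It assumes only the "key: value" layout shown in the message. The function names parse_gmirror_dump and summarize are hypothetical helpers, not part of gmirror or any FreeBSD tool, and the sketch does not interpret what the flags mean.

```python
# Sketch only: parses the "key: value" text quoted in the message.
# parse_gmirror_dump() and summarize() are hypothetical helper names.

SAMPLE = """\
Metadata on /dev/ad6:
     magic: GEOM::MIRROR
   version: 3
      name: gm0
       mid: 4029378995
       did: 449032193
       all: 3
     genid: 0
    syncid: 5
  priority: 0
     slice: 4096
   balance: round-robin
 mediasize: 20416757248
sectorsize: 512
syncoffset: 0
    mflags: NONE
    dflags: SYNCHRONIZING
hcprovider:
  provsize: 160041885696
  MD5 hash: 6e1e8ca80a27e0e1b0460feab595c39f
"""


def parse_gmirror_dump(text):
    """Split each line at its first ':' and return a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a colon
        fields[key.strip()] = value.strip()
    return fields


def summarize(fields):
    """Report the mirror name, the SYNCHRONIZING flag, and sizes in whole MiB."""
    mib = 1024 * 1024
    return {
        "name": fields["name"],
        "synchronizing": "SYNCHRONIZING" in fields.get("dflags", ""),
        "mediasize_mib": int(fields["mediasize"]) // mib,
        "provsize_mib": int(fields["provsize"]) // mib,
    }


if __name__ == "__main__":
    for key, value in summarize(parse_gmirror_dump(SAMPLE)).items():
        print(f"{key}: {value}")
```

With the values from the message, provsize comes out at 152627 MiB, which matches the size the kernel printed for ad6, while mediasize is about 19470 MiB.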