Date: Tue, 21 Aug 2007 14:15:08 +0200
From: Johan Ström <johan@stromnet.se>
To: freebsd-geom@freebsd.org, freebsd-stable@freebsd.org
Subject: Crashed gmirror, single disk marked SYNC and wont boot...
Message-ID: <8039436E-1824-4C2E-915B-9069DEF23B10@stromnet.se>
Hi
FreeBSD gw-1.stromnet.se 6.2-RELEASE-p1 FreeBSD 6.2-RELEASE-p1 #7: Tue Feb 13 18:24:34 CET 2007 johan@elfi.stromnet.se:/usr/obj/usr/src/sys/ROUTER.POLLING i386
(ROUTER.POLLING is GENERIC + options DEVICE_POLLING and ALTQ,
IPSEC, also pfsync and carp)
This weekend I had a disk failing on me in a machine running gmirror
gm0 with 2 providers (ad0 and ad6). The whole box froze with no
screen output, and on hard reboot I got some LBA errors etc. from ad0.
After a few reboots it got up and running though (I wasn't at the
screen and had to do it by phone, so I couldn't really debug very well).
As soon as the box got up, I removed ad0 from the gmirror, so ad6 was
the only provider. Today I got a new disk to replace ad0.
Now remember, ad6 was the only disk in the mirror. I took the box down
cleanly and replaced the disk. ad0 was now gone and instead I had ad4
(ad4+6 are SATA, ad0 was IDE). I changed it so I booted off the old
SATA disk.
Okay, here came the first problem: the boot loader gave me the usual
options, F1 FreeBSD, F5 Disk 2 (or whatever it said). If I pressed F1
I got the same prompt again, and F5 did nothing at all. Funny! The
system refused to load the loader (or whatever the 1-9 menu thingy is
called), the kernel, or anything.
So I finally plugged the old ad0 disk back into the machine to at least
get it booted, thinking it would come up on the gmirror. Nope:
(I had taken the new ad4 out at this point)
ad0: 38166MB <WDC WD400BB-00CAA1 17.07W17> at ata0-master UDMA100
ad6: 152627MB <SAMSUNG HD160JJ ZM100-41> at ata3-master SATA150
GEOM_MIRROR: Device gm0 created (id=4029378995).
GEOM_MIRROR: Device gm0: provider ad6 detected.
Root mount waiting for: GMIRROR
Root mount waiting for: GMIRROR
Root mount waiting for: GMIRROR
Root mount waiting for: GMIRROR
GEOM_MIRROR: Force device gm0 start due to timeout.
Trying to mount root from ufs:/dev/mirror/gm0s1a
Manual root filesystem specification:
<fstype>:<device> Mount <device> using filesystem <fstype>
eg. ufs:da0s1a
? List valid disk boot devices
<empty line> Abort manual input
mountroot>
Okay... so why wouldn't it load my mirror from ad6 now? I had just done
a clean shutdown without problems. It didn't even recognize any slices
on ad6s1 (although ad6s1 itself was found).
I entered ad0s1 as root and booted from there. Of course I got dropped
to the emergency shell, since fstab looked for the gmirror devices,
which didn't exist.
After some more digging into gmirror, I did a gmirror dump ad6:
Metadata on /dev/ad6:
magic: GEOM::MIRROR
version: 3
name: gm0
mid: 4029378995
did: 449032193
all: 3
genid: 0
syncid: 5
priority: 0
slice: 4096
balance: round-robin
mediasize: 20416757248
sectorsize: 512
syncoffset: 0
mflags: NONE
dflags: SYNCHRONIZING
hcprovider:
provsize: 160041885696
MD5 hash: 6e1e8ca80a27e0e1b0460feab595c39f
Some googling indicated that SYNCHRONIZING means that it's not
"complete" and won't mount? Is that correct? Why would it be in that
state then, when I just shut it down fine? And where the f*ck did my
slices go?
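For anyone comparing against their own output, here is a minimal sketch of reading the `key: value` lines of a gmirror dump and picking out the disk flags. The sample text is copied from the dump above (shortened). The helper names `parse_gmirror_dump` and `disk_flags` are made up for illustration, and splitting multiple flags on whitespace or commas is an assumption about the output format, not something confirmed in this thread.

```python
# Sample copied (shortened) from the `gmirror dump ad6` output in this message.
DUMP = """\
Metadata on /dev/ad6:
magic: GEOM::MIRROR
version: 3
name: gm0
all: 3
genid: 0
syncid: 5
mediasize: 20416757248
syncoffset: 0
mflags: NONE
dflags: SYNCHRONIZING
hcprovider:
provsize: 160041885696
"""


def parse_gmirror_dump(text):
    """Collect 'key: value' lines into a dict (hypothetical helper).

    The 'Metadata on <device>:' header is stored under the key 'device'.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("Metadata on "):
            fields["device"] = line[len("Metadata on "):].rstrip(":")
            continue
        # Split on the first colon only, so 'GEOM::MIRROR' stays intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def disk_flags(fields):
    """Return the set of dflags, treating 'NONE' as the empty set.

    Assumption: several flags would be separated by spaces or commas.
    """
    raw = fields.get("dflags", "NONE")
    flags = set(raw.replace(",", " ").split())
    flags.discard("NONE")
    return flags


if __name__ == "__main__":
    fields = parse_gmirror_dump(DUMP)
    print(fields["device"], "flags:", sorted(disk_flags(fields)))
    print("mediasize:", int(fields["mediasize"]),
          "provsize:", int(fields["provsize"]))
```

This only reports what the metadata says; it does not explain why the flag was set or change anything on disk.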
Did a sysctl kern.geom.mirror.debug=2 and tried to gmirror activate
the mirror:
GEOM_MIRROR[1]: Creating device gm0 (id=4029378995).
GEOM_MIRROR[0]: Device gm0 created (id=4029378995).
GEOM_MIRROR[1]: root_mount_hold 0xc3539510
GEOM_MIRROR[1]: Adding disk ad6 to gm0.
GEOM_MIRROR[2]: Adding disk ad6.
GEOM_MIRROR[2]: Disk ad6 connected.
GEOM_MIRROR[1]: Disk ad6 state changed from NONE to NEW (device gm0).
GEOM_MIRROR[0]: Device gm0: provider ad6 detected.
GEOM_MIRROR[2]: Tasting ad6s1.
GEOM_MIRROR[0]: Force device gm0 start due to timeout.
GEOM_MIRROR[1]: root_mount_rel[2169] 0xc3539510
GEOM_MIRROR[2]: No I/O requests for gm0, it can be destroyed.
GEOM_MIRROR[2]: Metadata on ad6 updated.
GEOM_MIRROR[2]: Access ad6 r-1w-1e-1 = 0
GEOM_MIRROR[0]: Device gm0 destroyed.
GEOM_MIRROR[1]: Thread exiting.
GEOM_MIRROR[1]: Consumer ad6 destroyed.
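A rough sketch of pulling the debug level and the disk state transitions out of a log like the one above, which makes it easier to see that the only transition recorded for ad6 is NONE to NEW before the device is destroyed. The sample lines are copied from the log above; the function names and regular expressions are illustrative assumptions based on these lines only, not on the GEOM_MIRROR source.

```python
import re

# A few lines copied from the debug output in this message.
LOG = """\
GEOM_MIRROR[1]: Creating device gm0 (id=4029378995).
GEOM_MIRROR[1]: Disk ad6 state changed from NONE to NEW (device gm0).
GEOM_MIRROR[0]: Device gm0: provider ad6 detected.
GEOM_MIRROR[0]: Force device gm0 start due to timeout.
GEOM_MIRROR[0]: Device gm0 destroyed.
"""

# Pattern guessed from the lines above: 'GEOM_MIRROR[<level>]: <message>'.
LINE_RE = re.compile(r"^GEOM_MIRROR\[(\d+)\]: (.*)$")
STATE_RE = re.compile(r"Disk (\S+) state changed from (\S+) to (\S+)")


def parse_log(text):
    """Return a list of (debug_level, message) tuples for matching lines."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append((int(match.group(1)), match.group(2)))
    return entries


def state_changes(entries):
    """Return (disk, old_state, new_state) for every state-change message."""
    changes = []
    for _level, message in entries:
        match = STATE_RE.search(message)
        if match:
            changes.append(match.groups())
    return changes


if __name__ == "__main__":
    entries = parse_log(LOG)
    for disk, old, new in state_changes(entries):
        print(f"{disk}: {old} -> {new}")
    # Level-0 messages are the ones that also show up without debugging.
    for level, message in entries:
        if level == 0:
            print("level 0:", message)
```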
So... what is going on here? Anyone with some clues? I'm currently
running on the ad0 disk, with no RAID at all. Let's hope it doesn't die
on me (I haven't had any signs of that since Sunday, when it froze and
gave boot errors, so I'm hoping). The data loss from using ad0
instead of ad6 is probably minimal; it's a router, so it's more or less
only logging that seems to have been lost. For now I just want to get
clear about wth happened here and how to prevent it, and how to get
back up on a gmirror with ad6 and ad4 (to be plugged in) so I can
throw ad0 out.
Thanks
--
Johan Ström
Stromnet
johan@stromnet.se
http://www.stromnet.se/
