Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 20 Jul 2006 03:00:52 +0100
From:      "Andrew - Supernews" <andrew@supernews.net>
To:        freebsd-geom@freebsd.org
Cc:        ade@freebsd.org
Subject:   gmirror panics on startup, and some other cases
Message-ID:  <E1G3Nqb-000BqI-Iv@trinity.supernews.net>

next in thread | raw e-mail | index | archive | help
Running RELENG_6 as of June 21 2006, we ran into what looks like a
couple of related gmirror bugs relating to synchronization problems
during destruction. In the worst case these can cause a kernel panic
before reaching single-user mode, if geom_mirror is loaded from
loader.conf.

We can reproduce it as follows (might require two CPUs, haven't
tested it without SMP):

1. create a new gmirror device from only one disk, e.g.
    gmirror label m0 da10

2. insert a new disk to the existing device, e.g.
    gmirror insert m0 da20
This will start synchronization of the new disk.

3. Remove the original disk (what actually happened to us was that
this disk partially failed, but that's not necessary to demonstrate
the problem):
    gmirror remove m0 da10

Most likely you will get an immediate panic from doing that (I've had
at least two different ones that I've not tracked down yet).

Subsequently, though, you'll get a further panic any time that gmirror
tastes the remaining disk (da20 in this example). What is happening in
that case appears to be:

g_mirror_taste sees a disk it likes, so it creates the new gmirror
geom (m0), and the various ancillary bits, starts a kernel thread for
it, and sends a "new disk" message to that thread; having sent the
message, it waits for it to be processed, and then re-acquires a lock
on &sc->sc_lock.

In the m0 kernel thread, the "new disk" message is processed. The disk
goes to state "new", and the device state changed to "running".  The
thread then posts another message to itself changing the disk state
from "new" to "synchronizing" (which isn't really meaningful since
there is only one disk in the mirror now); the device state check then
sees that there are no disks in either the "new" or "active" state,
and destroys the whole device, including destroying a lock out from
under the tasting thread which is waiting on it, hence the panic.

Log appended below. Haven't been able to get a really usable crashdump
out of it yet.

-- 
Andrew, Supernews
http://www.supernews.com

--8=--

g_mirror_taste(MIRROR, da20)
GEOM_MIRROR[2]: Tasting da20.
g_access(0xb0ae5280(da20), 1, 0, 0)
open delta:[r1w0e0] old:[r0w0e0] provider:[r0w0e0] 0xb0aba180(da20)
g_disk_access(da20, 1, 0, 0)
bio_request(0xb0ab6294) from 0xb0ae5280(mirror:taste) to 0xb0aba180(da20) cmd 1
g_io_deliver(0xb0ab6294) from 0xb0ae5280(mirror:taste) to 0xb0aba180(da20) cmd 1 error 0 off 147015821312 len 512
g_access(0xb0ae5280(da20), -1, 0, 0)
open delta:[r-1w0e0] old:[r1w0e0] provider:[r1w0e0] 0xb0aba180(da20)
g_disk_access(da20, -1, 0, 0)
g_detach(0xb0ae5280)
g_destroy_consumer(0xb0ae5280)
g_destroy_geom(0xb0abad00(mirror:taste))
     magic: GEOM::MIRROR
   version: 3
      name: m0
       mid: 3504796118
       did: 2410144083
       all: 1
     genid: 0
    syncid: 1
  priority: 0
     slice: 4096
   balance: split
 mediasize: 147015821312
sectorsize: 512
syncoffset: 1297350656
    mflags: NONE
    dflags: SYNCHRONIZING
hcprovider: 
  provsize: 147015821824
  MD5 hash: eb7116ccf587743165eb48f285103a01
GEOM_MIRROR[1]: Creating device m0 (id=3504796118).
GEOM_MIRROR[0]: Device m0 created (id=3504796118).
GEOM_MIRROR[1]: root_mount_hold 0xb0ab4c30
GEOM_MIRROR[1]: Adding disk da20 to m0.
GEOM_MIRROR[2]: Adding disk da20.
g_access(0xb0ae5240(da20), 1, 1, 1)
open delta:[r1w1e1] old:[r0w0e0] provider:[r0w0e0] 0xb0aba180(da20)
g_disk_access(da20, 1, 1, 1)
g_post_event_x(0xa04bd774, 0xb0aba180, 2, 0)
  ref 0xb0aba180
GEOM_MIRROR[2]: Disk da20 connected.
GEOM_MIRROR[4]: g_mirror_event_send: Sending event 0xb0ab7aa0.
GEOM_MIRROR[4]: g_mirror_event_send: Waking up 0xb0a7fa00.
GEOM_MIRROR[4]: g_mirror_event_send: Sleeping 0xb0ab7aa0.
GEOM_MIRROR[4]: g_mirror_event_send: Sleeping 0xb0ab7aa0.
GEOM_MIRROR[5]: g_mirror_worker: Let's see...
GEOM_MIRROR[3]: Running event for disk da20.
GEOM_MIRROR[3]: Changing disk da20 state from NONE to NEW.
GEOM_MIRROR[1]: Disk da20 state changed from NONE to NEW (device m0).
GEOM_MIRROR[0]: Device m0: provider da20 detected.
GEOM_MIRROR[1]: Device m0 state changed from STARTING to RUNNING.
GEOM_MIRROR[3]: State for da20 disk: SYNCHRONIZING.
GEOM_MIRROR[4]: g_mirror_event_send: Sending event 0xb0ab7260.
GEOM_MIRROR[4]: g_mirror_event_send: Waking up 0xb0a7fa00.
GEOM_MIRROR[4]: g_mirror_worker: Waking up 0xb0ab7aa0.
GGEOM_MIRROR[4]: g_mirror_event_send: Woken up 0xb0ab7aa0.
EOM_MIRROR[5]: g_mirror_worker: I'm here 1.
GEOM_MIRROR[5]: g_mirror_worker: Let's see...
GEOM_MIRROR[3]: Running event for disk da20.
GEOM_MIRROR[3]: Changing disk da20 state from NEW to SYNCHRONIZING.
GEOM_MIRROR[1]: Disk da20 state changed from NEW to SYNCHRONIZING (device m0).
GEOM_MIRROR[1]: root_mount_rel[1683] 0xb0ab4c30
GEOM_MIRROR[2]: No I/O requests for m0, it can be destroyed.
bio_request(0xb0b14108) from 0xb0ae5240(m0) to 0xb0aba180(da20) cmd 2
g_io_deliver(0xb0b14108) from 0xb0ae5240(m0) to 0xb0aba180(da20) cmd 2 error 0 off 147015821312 len 512
GEOM_MIRROR[2]: Metadata on da20 updated.
GEOM_MIRROR[2]: Access da20 r-1w-1e-1 = 0
g_access(0xb0ae5240(da20), -1, -1, -1)
open delta:[r-1w-1e-1] old:[r1w1e1] provider:[r1w1e1] 0xb0aba180(da20)
g_disk_access(da20, -1, -1, -1)
g_post_event_x(0xa04bd0a0, 0xb0aba180, 2, 0)
  ref 0xb0aba180
g_post_event_x(0xb0b2c904, 0xb0ae5240, 2, 0)
g_wither_geom(0xb0a86000(m0.sync))
GEOM_MIRROR[0]: Device m0 destroyed.
g_wither_geom(0xb0abae80(m0))
GEOM_MIRROR[1]: Thread exiting.


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x1c
fault code              = supervisor write, page not present
instruction pointer     = 0x20:0xa04f81e5
stack pointer           = 0x28:0xd28c6b64
frame pointer           = 0x28:0xd28c6b6c
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 2 (g_event)
trap number             = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
kdb_backtrace(100,b07b9000,28,d28c6b24,c) at kdb_backtrace+0x29
panic(a06b4123,a06d4d09,0,fffff,b07bf09b) at panic+0x114
trap_fatal(d28c6b24,1c,b07b9000,0,c) at trap_fatal+0x2ce
trap_pfault(d28c6b24,0,1c) at trap_pfault+0x1f7
trap(a0510008,28,d28c0028,b0a7fa2c,4) at trap+0x325
calltrap() at calltrap+0x5
--- trap 0xc, eip = 0xa04f81e5, esp = 0xd28c6b64, ebp = 0xd28c6b6c ---
_sx_xlock(b0a7fa2c,b0b35a45,d1,b0a7fb18,b0abb900) at _sx_xlock+0x79
g_mirror_event_send(b0abb900,1,2,0,b0a7fa00) at g_mirror_event_send+0x1ce
g_mirror_add_disk(b0a7fa00,b0aba180,d28c6c18,b0a7fa2c,b0b35a45) at g_mirror_add_disk+0x1ab
g_mirror_taste(b0b37040,b0aba180,0) at g_mirror_taste+0x1d5
g_load_class(b0ab4950,0) at g_load_class+0x143
one_event(d28c6d10,a04bbb71,258,190,b07b7000) at one_event+0x188
g_run_events(258,190,b07b7000,a04bbad8,d28c6d24) at g_run_events+0x9
g_event_procbody(0,d28c6d38) at g_event_procbody+0x99
fork_exit(a04bbad8,0,d28c6d38) at fork_exit+0x71
fork_trampoline() at fork_trampoline+0x8
--- trap 0x1, eip = 0, esp = 0xd28c6d6c, ebp = 0 ---
Uptime: 3m8s
Dumping 3327 MB (2 chunks)
  chunk 0: 1MB (155 pages) ... ok
  chunk 1: 3327MB (851568 pages) 3311panic: ahd_run_qoutfifo recursion
cpuid = 0
Uptime: 3m11s
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
cpu_reset: Stopping other CPUs


--8=----




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?E1G3Nqb-000BqI-Iv>