Date: Mon, 28 Mar 2011 13:06:46 -0700
From: Freddie Cash <fjwcash@gmail.com>
To: Mikolaj Golub <trociny@freebsd.org>
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>, FreeBSD Stable <freebsd-stable@freebsd.org>, FreeBSD-Current <freebsd-current@freebsd.org>, Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Any success stories for HAST + ZFS?
Message-ID: <AANLkTi=zXX93Tzd1fYq3bJ4BEuvUf43y=94fT3rXd6j9@mail.gmail.com>
In-Reply-To: <86zkogep2o.fsf@kopusha.home.net>
References: <AANLkTi=hP9RoGRKLacxQKSL_6XzwKJZxAh_OeoT2W3EX@mail.gmail.com> <20110325075541.GA1742@garage.freebsd.pl> <AANLkTinmQY7G4Bh3LQdsa4M4B3sNL3zMqVo%2BFiSJnR07@mail.gmail.com> <86zkogep2o.fsf@kopusha.home.net>

On Sun, Mar 27, 2011 at 5:16 AM, Mikolaj Golub <trociny@freebsd.org> wrote:
> On Sat, 26 Mar 2011 10:52:08 -0700 Freddie Cash wrote:
>
>  FC> hastd backtrace is here:
>  FC> http://www.sd73.bc.ca/downloads/crash/hast-backtrace.png
>
> It is not a hastd crash, but a kernel crash triggered by hastd process.

Ah, interesting.

> I am not sure I got the same crash as you but apparently the race is possible
> in g_gate on device creation.

About 95% of the crashes happened when creating the /dev/hast/* devices
(switching to primary role). Most of them happened when doing
"hastctl role primary all", but they would occasionally happen when
switching each resource manually. Creating the resources by hand, one
every 2 seconds or so, would usually create them all without crashing.

The other 5% of the time, the crashes occurred either when importing
the ZFS pool, or when running multiple parallel rsyncs to the pool.

hastd was always shown as the last running process in the backtrace onscreen.

> I got the following crash starting many hast providers simultaneously:
>
> fault virtual address   = 0x0
>
> #8  0xc0c11adc in calltrap () at /usr/src/sys/i386/i386/exception.s:168
> #9  0xc086ac6b in g_gate_ioctl (dev=0xc6a24300, cmd=3374345472,
>     addr=0xc9fec000 "\002", flags=3, td=0xc7ff0b80)
>     at /usr/src/sys/geom/gate/g_gate.c:410
> #10 0xc0853c5b in devfs_ioctl_f (fp=0xc9b9e310, com=3374345472,
>     data=0xc9fec000, cred=0xc8c9c200, td=0xc7ff0b80)
>     at /usr/src/sys/fs/devfs/devfs_vnops.c:678
> #11 0xc09210cd in kern_ioctl (td=0xc7ff0b80, fd=3, com=3374345472,
>     data=0xc9fec000 "\002") at file.h:262
> #12 0xc0921254 in ioctl (td=0xc7ff0b80, uap=0xf5edbcec)
>     at /usr/src/sys/kern/sys_generic.c:679
> #13 0xc0916616 in syscallenter (td=0xc7ff0b80, sa=0xf5edbce4)
>     at /usr/src/sys/kern/subr_trap.c:315
> #14 0xc0c2b9ff in syscall (frame=0xf5edbd28)
>     at /usr/src/sys/i386/i386/trap.c:1086
> #15 0xc0c11b71 in Xint0x80_syscall ()
>     at /usr/src/sys/i386/i386/exception.s:266
>
> Or just creating many ggate devices simultaneously:
>
> for i in `jot 100`; do
>     ./ggiocreate $i&
> done
>
> ggiocreate.c is attached.
>
> In my case the kernel crashes in g_gate_create() when checking for name
> collisions in strcmp():
>
>         /* Check for name collision. */
>         for (unit = 0; unit < g_gate_maxunits; unit++) {
>                 if (g_gate_units[unit] == NULL)
>                         continue;
>                 if (strcmp(name, g_gate_units[unit]->sc_provider->name) != 0)
>                         continue;
>                 mtx_unlock(&g_gate_units_lock);
>                 mtx_destroy(&sc->sc_queue_mtx);
>                 free(sc, M_GATE);
>                 return (EEXIST);
>         }
>
> I think the issue is the following. When preparing sc we take
> g_gate_units_lock, check for name collision, fill sc fields except
> sc->sc_provider, and register sc in g_gate_units[unit]. sc_provider is filled
> later, when g_gate_units_lock is released. So the following scenario is
> possible:
>
> 1) Thread A registers sc in g_gate_units[unit] with
> g_gate_units[unit]->sc_provider still null and releases g_gate_units_lock.
>
> 2) Thread B traverses g_gate_units[] when checking for name collision and
> crashes accessing g_gate_units[unit]->sc_provider->name.
>
> The attached patch fixes the issue in my case.

Patch applied cleanly to 8-STABLE with ZFSv28 patch also applied.
Just to be safe, did a full buildworld/kernel cycle, running GENERIC
kernel.

So far, I have not been able to produce a crash in hastd, through
several reboots, switching from primary to secondary and back, and
just switching from primary to init and back.

So far, so good. Now to see if I can reproduce any of the ZFS crashes
I had earlier.

-- 
Freddie Cash
fjwcash@gmail.com
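
For readers following the thread, the two-step scenario quoted above can be reproduced in miniature in userland. The sketch below is not the attached patch and not kernel code. It is a pthread illustration of one way to close that kind of window: copy the name into the entry before it is published under the lock, so the collision check never dereferences a pointer that is filled in later. The names unit_create(), create_many(), sc_name, MAXUNITS and NAMELEN are made up for this example.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXUNITS 128
#define NAMELEN  32

struct provider { char name[NAMELEN]; };

struct softc {
	char sc_name[NAMELEN];        /* hypothetical field: set before publishing */
	struct provider *sc_provider; /* filled in later, after the lock is dropped */
};

static struct softc *units[MAXUNITS];
static pthread_mutex_t units_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0, or EEXIST on a name collision, ENOSPC if the table is full. */
static int
unit_create(const char *name)
{
	struct softc *sc;
	struct provider *pp;
	int unit, slot = -1;

	sc = calloc(1, sizeof(*sc));
	if (sc == NULL)
		return (ENOMEM);
	/* The name is complete before any other thread can see sc. */
	snprintf(sc->sc_name, sizeof(sc->sc_name), "%s", name);

	pthread_mutex_lock(&units_lock);
	for (unit = 0; unit < MAXUNITS; unit++) {
		if (units[unit] == NULL) {
			if (slot < 0)
				slot = unit;
			continue;
		}
		/* Compare against sc_name; sc_provider may still be NULL here. */
		if (strcmp(name, units[unit]->sc_name) != 0)
			continue;
		pthread_mutex_unlock(&units_lock);
		free(sc);
		return (EEXIST);
	}
	if (slot < 0) {
		pthread_mutex_unlock(&units_lock);
		free(sc);
		return (ENOSPC);
	}
	units[slot] = sc;
	pthread_mutex_unlock(&units_lock);

	/* Stand-in for the later provider setup done without the lock. */
	pp = calloc(1, sizeof(*pp));
	if (pp != NULL)
		snprintf(pp->name, sizeof(pp->name), "%s", name);
	sc->sc_provider = pp;
	return (0);
}

static void *
worker(void *arg)
{
	int *slot = arg;
	char name[NAMELEN];

	snprintf(name, sizeof(name), "unit%d", *slot);
	*slot = unit_create(name);	/* reuse the argument to report the result */
	return (NULL);
}

/* Creates up to n uniquely named units in parallel; returns how many succeeded. */
static int
create_many(int n)
{
	pthread_t tid[MAXUNITS];
	int arg[MAXUNITS];
	int i, started = 0, ok = 0;

	if (n > MAXUNITS)
		n = MAXUNITS;
	for (i = 0; i < n; i++) {
		arg[i] = i;
		if (pthread_create(&tid[i], NULL, worker, &arg[i]) != 0)
			break;
		started++;
	}
	for (i = 0; i < started; i++) {
		pthread_join(tid[i], NULL);
		if (arg[i] == 0)
			ok++;
	}
	return (ok);
}
```

Calling create_many(100) mirrors the "jot 100" loop from the quoted message. If the comparison were switched back to sc_provider->name, the same call could hit the NULL dereference described in step 2.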