From: Matthew Fleming
To: Matthew Jacob
Cc: freebsd-hackers@freebsd.org
Date: Thu, 16 Sep 2010 12:45:48 -0700
Subject: Re: race conditions for destroying and opening a dev
In-Reply-To: <4C92694D.1070705@feral.com>

On Thu, Sep 16, 2010 at 12:00 PM, Matthew Jacob wrote:
>
> Has anyone seen this scenario before? I am seeing it in RELENG_7, but the
> code in question exists through to head.
>
> Thread 1:
>
> (kgdb) where
> #0  sched_switch (td=0xffffff003a04ea80, newtd=0xffffff00210b4000,
> flags=Variable "flags" is not available.
> ) at ../../../kern/sched_ule.c:1944
> #1  0xffffffff803b6091 in mi_switch (flags=1, newtd=0x0) at
> ../../../kern/kern_synch.c:450
> #2  0xffffffff80402399 in sleepq_switch (wchan=0xffffff8413b50b60) at
> ../../../kern/subr_sleepqueue.c:497
> #3  0xffffffff80402e8c in sleepq_timedwait (wchan=0xffffff8413b50b60) at
> ../../../kern/subr_sleepqueue.c:615
> #4  0xffffffff803b682d in _sleep (ident=0xffffff8413b50b60,
> lock=0xffffffff80b0ee00, priority=76, wmesg=0xffffffff806583bb "devdrn",
> timo=100) at ../../../kern/kern_synch.c:228
> #5  0xffffffff8037640c in destroy_devl (dev=0xffffff003aaf0000) at
> ../../../kern/kern_conf.c:874
> #6  0xffffffff80376759 in destroy_dev (dev=0xffffff003aaf0000) at
> ../../../kern/kern_conf.c:916
> #7  0xffffffff8034c939 in g_dev_orphan (cp=0xffffff003a544800) at
> ../../../geom/geom_dev.c:438
> #8  0xffffffff803506a0 in g_run_events () at ../../../geom/geom_event.c:164
> #9  0xffffffff80351f1c in g_event_procbody () at
> ../../../geom/geom_kern.c:141
> #10 0xffffffff8038a73a in fork_exit (callout=0xffffffff80351eb0
> , arg=0x0,
> frame=0xffffff8413b50c80) at ../../../kern/kern_fork.c:829
> #11 0xffffffff805a747e in fork_trampoline () at
> ../../../amd64/amd64/exception.S:564
> #12 0x0000000000000000 in ?? ()
>
> This thread is waiting on the threadcount to go away- i.e., the last close
> of the device to occur ("da16" in this case).
>
> Thread 2:
>
> (kgdb) where
> #0  sched_switch (td=0xffffff009bb4ca80, newtd=0xffffff003af43380,
> flags=Variable "flags" is not available.
> ) at ../../../kern/sched_ule.c:1944
> #1  0xffffffff803b6091 in mi_switch (flags=1, newtd=0x0) at
> ../../../kern/kern_synch.c:450
> #2  0xffffffff80402399 in sleepq_switch (wchan=0xffffffff80b0e040) at
> ../../../kern/subr_sleepqueue.c:497
> #3  0xffffffff80402f84 in sleepq_wait (wchan=0xffffffff80b0e040) at
> ../../../kern/subr_sleepqueue.c:580
> #4  0xffffffff803b5385 in _sx_xlock_hard (sx=0xffffffff80b0e040,
> tid=18446742976810240640, opts=Variable "opts" is not available.
> ) at ../../../kern/kern_sx.c:562
> #5  0xffffffff803b5731 in _sx_xlock (sx=0xffffffff80b0e040, opts=0,
> file=0xffffffff80652d27 "../../../geom/geom_dev.c", line=196) at sx.h:154
> #6  0xffffffff8034d1bc in g_dev_open (dev=0xffffff003aaf0000, flags=1,
> fmt=Variable "fmt" is not available.
> ) at ../../../geom/geom_dev.c:196
> #7  0xffffffff80333741 in devfs_open (ap=0xffffff841dea88b0) at
> ../../../fs/devfs/devfs_vnops.c:902
> #8  0xffffffff80601daf in VOP_OPEN_APV (vop=0xffffffff8089fb80,
> a=0xffffff841dea88b0) at vnode_if.c:371
> #9  0xffffffff80467246 in vn_open_cred (ndp=0xffffff841dea8a00,
> flagp=0xffffff841dea894c, cmode=Variable "cmode" is not available.
> ) at vnode_if.h:199
> #10 0xffffffff80463770 in kern_open (td=0xffffff009bb4ca80, path=0x5114a0
> , pathseg=Variable "pathseg" is not
> available.
> ) at ../../../kern/vfs_syscalls.c:1054
> #11 0xffffffff805c599e in syscall (frame=0xffffff841dea8c80) at
> ../../../amd64/amd64/trap.c:911
> #12 0xffffffff805a723b in Xfast_syscall () at
> ../../../amd64/amd64/exception.S:349
> #13 0x00000008009a219c in ?? ()
>
> This thread was opening the device, bumped the refcount, but then wedged on
> the geom topology lock .....
>
> the refcount field is protected under devmtx....
>
> Anyone seen this?
>
> I'm half inclined to either add in CDP_SCHED_DTR when one calls destroy_dev,
> or make dev_refthread look at CDP_ACTIVE, leaning more toward the latter.
>
> Any thoughts on this?

We had a similar bug at Isilon, but in our case it was in
cam/scsi/scsi_pass.c::passcleanup() calling destroy_dev(). We
switched it to destroy_dev_sched() to fix the si_threadcount deadlock.

Cheers,
matthew
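
For anyone reading this in the archive, here is a rough userspace model of
why deferring the destroy helps. It is a sketch, not kernel code: it uses
pthreads, and every name in it (topology, devmtx, threadcount, opener,
deferred_destroy, run_model) is made up for illustration and only echoes
the kernel names. It also assumes the GEOM event thread holds the topology
lock while it runs g_dev_orphan(), which is what the two backtraces suggest.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model only; nothing here is FreeBSD API. "topology" stands in
 * for the GEOM topology lock, "devmtx" and "threadcount" for devmtx and
 * si_threadcount.
 */
static pthread_mutex_t topology = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t devmtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t count_changed = PTHREAD_COND_INITIALIZER;
static int threadcount;
static bool destroyed;

/* Models thread 2: take a thread reference, then need the topology lock. */
static void *
opener(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&devmtx);
	threadcount++;
	pthread_cond_broadcast(&count_changed);
	pthread_mutex_unlock(&devmtx);

	pthread_mutex_lock(&topology);	/* blocks while the event path holds it */
	pthread_mutex_unlock(&topology);

	pthread_mutex_lock(&devmtx);
	threadcount--;
	assert(threadcount >= 0);
	pthread_cond_broadcast(&count_changed);
	pthread_mutex_unlock(&devmtx);
	return (NULL);
}

/* Models the deferred destroy: drains the count with no topology lock held. */
static void *
deferred_destroy(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&devmtx);
	while (threadcount != 0)
		pthread_cond_wait(&count_changed, &devmtx);
	destroyed = true;
	pthread_mutex_unlock(&devmtx);
	return (NULL);
}

/* Returns 1 if the device was destroyed after all references drained. */
int
run_model(void)
{
	pthread_t open_td, destroy_td;

	threadcount = 0;
	destroyed = false;

	pthread_mutex_lock(&topology);	/* the event thread's orphan path */
	if (pthread_create(&open_td, NULL, opener, NULL) != 0) {
		pthread_mutex_unlock(&topology);
		return (0);
	}

	/* Reproduce the reported interleaving: the opener already holds a ref. */
	pthread_mutex_lock(&devmtx);
	while (threadcount == 0)
		pthread_cond_wait(&count_changed, &devmtx);
	pthread_mutex_unlock(&devmtx);

	/*
	 * Waiting here for threadcount == 0 with topology still held would
	 * never finish. Hand the wait to another thread and drop the lock.
	 */
	if (pthread_create(&destroy_td, NULL, deferred_destroy, NULL) != 0) {
		pthread_mutex_unlock(&topology);
		pthread_join(open_td, NULL);
		return (0);
	}
	pthread_mutex_unlock(&topology);

	pthread_join(open_td, NULL);
	pthread_join(destroy_td, NULL);
	return (destroyed && threadcount == 0);
}
```

Calling run_model() from a small main() (built with -pthread) should return
1. If the drain loop is moved into run_model() before the topology unlock,
the model hangs the same way the two threads above do.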