Message-ID: <518A8F92.30100@delphij.net>
Date: Wed, 08 May 2013 10:46:58 -0700
From: Xin Li
Organization: The FreeBSD Project
Reply-To: d@delphij.net
To: Garrett Cooper
Cc: jfv@freebsd.org, freebsd-net@freebsd.org, Haven Hash, Xin LI, jeff@freebsd.org
Subject: Re: LOR: "taskqueue_drain with the following non-sleepable locks held" with if_em
References: <518988F0.2080902@delphij.net>
List-Id: Networking and TCP/IP with FreeBSD

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 05/07/13 21:55, Garrett Cooper wrote:
> On Tue, May 7, 2013 at 4:06
> PM, Xin Li wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA512
>>
>> On 05/07/13 15:03, Garrett Cooper wrote:
>>> Saw the following LOR on a CURRENT build as of yesterday with
>>> an almost idle machine processing ARP requests:
>>>
>>> root@wf220:/mnt # taskqueue_drain with the following non-sleepable locks held:
>>> exclusive rw lle (lle) r = 0 (0xfffffe001450b410) locked @ /usr/src/sys/netinet/in.c:1484
>>> KDB: stack backtrace:
>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffff848d4f7690
>>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xffffff848d4f7740
>>> witness_warn() at witness_warn+0x4a8/frame 0xffffff848d4f7800
>>> taskqueue_drain() at taskqueue_drain+0x3a/frame 0xffffff848d4f7840
>>> set_timeout() at set_timeout+0x4a/frame 0xffffff848d4f7860
>>> netevent_callback() at netevent_callback+0x16/frame 0xffffff848d4f7870
>>> arpintr() at arpintr+0x9b5/frame 0xffffff848d4f7930
>>> netisr_dispatch_src() at netisr_dispatch_src+0x60/frame 0xffffff848d4f79a0
>>> ether_demux() at ether_demux+0x130/frame 0xffffff848d4f79d0
>>> ether_nh_input() at ether_nh_input+0x369/frame 0xffffff848d4f7a30
>>> netisr_dispatch_src() at netisr_dispatch_src+0x60/frame 0xffffff848d4f7aa0
>>> em_rxeof() at em_rxeof+0x30e/frame 0xffffff848d4f7b10
>>> em_msix_rx() at em_msix_rx+0x33/frame 0xffffff848d4f7b40
>>> intr_event_execute_handlers() at intr_event_execute_handlers+0x80/frame 0xffffff848d4f7b70
>>> ithread_loop() at ithread_loop+0x128/frame 0xffffff848d4f7bb0
>>> fork_exit() at fork_exit+0x71/frame 0xffffff848d4f7bf0
>>> fork_trampoline() at fork_trampoline+0xe/frame 0xffffff848d4f7bf0
>>> --- trap 0, rip = 0, rsp = 0xffffff848d4f7cb0, rbp = 0 ---
>>> root@wf220:/mnt # uname -a
>>> FreeBSD wf220.west.isilon.com 10.0-CURRENT FreeBSD 10.0-CURRENT #1: Tue May 7 08:04:59 PDT 2013 root@wf220.west.isilon.com:/usr/obj/usr/src/sys/ISI-GENERIC amd64
>>>
>>> I've seen this issue before for a few weeks/months, so it's
>>> nothing new
>>> (but probably should be fixed...). Thanks!
>>
>> This has nothing to do with em(4); it looks like a bug in our
>> Linux compatibility wrapper. In the InfiniBand code,
>> _handle_arp_update_event() calls netevent_callback() with
>> NETEVENT_NEIGH_UPDATE, where a cancel_delayed_work() causes the
>> drain.
>>
>> Looking at the Linux code, it seems that we just shouldn't do
>> the drain in the cancel_delayed_work() wrapper
>> (sys/ofed/include/linux/workqueue.h), so we need something like
>> this:
>>
>> Index: sys/ofed/include/linux/workqueue.h
>> ===================================================================
>> --- sys/ofed/include/linux/workqueue.h (revision 250337)
>> +++ sys/ofed/include/linux/workqueue.h (working copy)
>> @@ -184,9 +184,9 @@
>>  {
>>
>>  	callout_stop(&work->timer);
>> -	if (work->work.taskqueue &&
>> -	    taskqueue_cancel(work->work.taskqueue, &work->work.work_task, NULL))
>> -		taskqueue_drain(work->work.taskqueue, &work->work.work_task);
>> +	if (work->work.taskqueue)
>> +		return (taskqueue_cancel(work->work.taskqueue,
>> +		    &work->work.work_task, NULL) != 0);
>>  	return 0;
>>  }
>>
>> I've added Jeff to Cc.
>
> The patch LGTM (I haven't hit the issue after 10 minutes of use;
> generally it pops up almost immediately after boot or within the
> first couple of minutes).

Committed as r250374. (The return value is inverted in the version
quoted above; I committed what I believe is correct, based on my
reading of the Linux documentation. The return value does not affect
your test result, though, as it is discarded anyway.)

Cheers,
- --
Xin LI https://www.delphij.net/
FreeBSD - The Power to Serve!
Live free or die
-----BEGIN PGP SIGNATURE-----

iQEcBAEBCgAGBQJRio+SAAoJEG80Jeu8UPuzvCUH+QHAXi3UCqyoBfUsNTkHofmB
riKFONZem5QsR425tg1qPcYwpgcQKAaZpu6a5ILsWZ2IPliN3QysrXFDmkVsL53/
yYK4Pcpa9TA11EjyHj3Bt1hnUqRldz5Olwhpb+RExAWaBZ0Nczf26H2GDOZvEXB4
99OXYje7bR1mbZOUoPkVcDqr4Mh0EZDHct5SxQv3eMagble5iaEiVkvunS0/P3nk
njpFbODbfMM9qs3QVxvukp3rA9M7E5cbyhl0WNDHs5h192kvy+rh5C4w3LYi+Vx9
Wlpjy9t1kxA8bLi2d0fyLqsigo2Yz6BHAwB9zs9nQ02Mg3wOPsBIIkr4y1DFiOY=
=uH91
-----END PGP SIGNATURE-----