From owner-freebsd-net@FreeBSD.ORG  Wed May  8 04:55:17 2013
In-Reply-To: <518988F0.2080902@delphij.net>
References: <CAGHfRMDEerVRBYvreUm0SyEVWa92q0SXYrHeSbFTNRKrHvzx4Q@mail.gmail.com>
 <518988F0.2080902@delphij.net>
Date: Tue, 7 May 2013 21:55:16 -0700
Message-ID: <CAGHfRMAw=B3shZ+fCjFc2TURsJTEskfhvUBK5Ocwc0Oi+5giSg@mail.gmail.com>
Subject: Re: LOR: "taskqueue_drain with the following non-sleepable locks
 held" with if_em
From: Garrett Cooper <yaneurabeya@gmail.com>
To: Xin LI <d@delphij.net>
Content-Type: text/plain; charset=ISO-8859-1
Cc: jfv@freebsd.org, freebsd-net@freebsd.org,
 Haven Hash <haven.hash@isilon.com>, jeff@freebsd.org

On Tue, May 7, 2013 at 4:06 PM, Xin Li <delphij@delphij.net> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA512
>
> On 05/07/13 15:03, Garrett Cooper wrote:
>> Saw the following LOR on a CURRENT build as of yesterday with an
>> almost idle machine processing ARP requests:
>>
>> root@wf220:/mnt # taskqueue_drain with the following non-sleepable locks held:
>> exclusive rw lle (lle) r = 0 (0xfffffe001450b410) locked @ /usr/src/sys/netinet/in.c:1484
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffff848d4f7690
>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xffffff848d4f7740
>> witness_warn() at witness_warn+0x4a8/frame 0xffffff848d4f7800
>> taskqueue_drain() at taskqueue_drain+0x3a/frame 0xffffff848d4f7840
>> set_timeout() at set_timeout+0x4a/frame 0xffffff848d4f7860
>> netevent_callback() at netevent_callback+0x16/frame 0xffffff848d4f7870
>> arpintr() at arpintr+0x9b5/frame 0xffffff848d4f7930
>> netisr_dispatch_src() at netisr_dispatch_src+0x60/frame 0xffffff848d4f79a0
>> ether_demux() at ether_demux+0x130/frame 0xffffff848d4f79d0
>> ether_nh_input() at ether_nh_input+0x369/frame 0xffffff848d4f7a30
>> netisr_dispatch_src() at netisr_dispatch_src+0x60/frame 0xffffff848d4f7aa0
>> em_rxeof() at em_rxeof+0x30e/frame 0xffffff848d4f7b10
>> em_msix_rx() at em_msix_rx+0x33/frame 0xffffff848d4f7b40
>> intr_event_execute_handlers() at intr_event_execute_handlers+0x80/frame 0xffffff848d4f7b70
>> ithread_loop() at ithread_loop+0x128/frame 0xffffff848d4f7bb0
>> fork_exit() at fork_exit+0x71/frame 0xffffff848d4f7bf0
>> fork_trampoline() at fork_trampoline+0xe/frame 0xffffff848d4f7bf0
>> --- trap 0, rip = 0, rsp = 0xffffff848d4f7cb0, rbp = 0 ---
>> root@wf220:/mnt # uname -a
>> FreeBSD wf220.west.isilon.com 10.0-CURRENT FreeBSD 10.0-CURRENT #1: Tue May  7 08:04:59 PDT 2013 root@wf220.west.isilon.com:/usr/obj/usr/src/sys/ISI-GENERIC  amd64
>>
>> I've been seeing this issue for a few weeks/months, so it's nothing
>> new (but it should probably be fixed...). Thanks!
>
> This has nothing to do with em(4), but looks like a bug in our Linux
> compatibility wrapper.  In the InfiniBand code, its
> _handle_arp_update_event() calls netevent_callback() with
> NETEVENT_NEIGH_UPDATE, where a cancel_delayed_work() causes the drain.
>
> Looking at the Linux code, it seems that we just shouldn't do the
> drain in the cancel_delayed_work() wrapper
> (sys/ofed/include/linux/workqueue.h) so it seems like we need
> something like this:
>
> Index: sys/ofed/include/linux/workqueue.h
> ===================================================================
> --- sys/ofed/include/linux/workqueue.h        (revision 250337)
> +++ sys/ofed/include/linux/workqueue.h  (working copy)
> @@ -184,9 +184,9 @@
>  {
>
>         callout_stop(&work->timer);
> -       if (work->work.taskqueue &&
> -           taskqueue_cancel(work->work.taskqueue, &work->work.work_task, NULL))
> -               taskqueue_drain(work->work.taskqueue, &work->work.work_task);
> +       if (work->work.taskqueue)
> +               return (taskqueue_cancel(work->work.taskqueue,
> +                   &work->work.work_task, NULL) != 0);
>         return 0;
>  }
>
>
>
> I've added Jeff to Cc.

    The patch LGTM (I haven't hit the issue after 10 minutes of use;
generally it pops up almost immediately after boot or within the first
couple of minutes).
Thanks a million!
-Garrett
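
For readers of the archive, here is a minimal userland sketch of the
distinction the patch relies on. It is not kernel code: the model_*
names and the struct are hypothetical stand-ins, and the cancel
behaviour is only modelled on the contract documented in taskqueue(9)
(clear the pending state, report EBUSY if the handler is running, never
sleep). A drain, by contrast, has to wait for a queued or running
handler, which is what WITNESS objects to while the lle rwlock is held.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userland stand-in for a task; not the kernel's struct task. */
struct model_task {
	bool pending;	/* queued, handler not started yet */
	bool running;	/* handler currently executing */
};

/*
 * Non-blocking cancel, modelled on the documented taskqueue_cancel(9)
 * contract: report the old pending state through pendp, clear it, and
 * return EBUSY if the handler is running. It never waits.
 */
static int
model_cancel(struct model_task *t, unsigned int *pendp)
{
	if (pendp != NULL)
		*pendp = t->pending ? 1 : 0;
	t->pending = false;
	return (t->running ? EBUSY : 0);
}

/*
 * A drain must wait until the task is neither queued nor running, so it
 * may sleep. This helper only reports whether a wait would be needed.
 */
static bool
model_drain_would_sleep(const struct model_task *t)
{
	return (t->pending || t->running);
}

/* Same shape as the patched wrapper: cancel only, never drain. */
static int
model_cancel_delayed_work(struct model_task *t)
{
	return (model_cancel(t, NULL) != 0);
}
```

With the drain gone, the wrapper stays safe to call from a context that
holds a non-sleepable lock; the cost is that a handler which is already
running is left to finish on its own.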