From: Alan Somers
Date: Tue, 14 Jun 2016 09:26:22 -0600
Subject: Re: lagg(4): LOR, deadlock and panic
To: Sean Bruno
Cc: "freebsd-net@freebsd.org"
In-Reply-To: <459d2639-b490-beee-9cd4-05f38983eaed@freebsd.org>
References: <459d2639-b490-beee-9cd4-05f38983eaed@freebsd.org>
List-Id: Networking and TCP/IP with FreeBSD

On Tue, Jun 14, 2016 at 9:13 AM, Sean Bruno wrote:
> tl;dr --> https://reviews.freebsd.org/D6845
>
> Navdeep and I have been poking at an LOR that seems to be popping up
> in -current that is related to lagg(4) and lagg_get_counter().
>
> root@sysdev07:~ # ifconfig lagg0 create laggport ix0 laggproto lacp 192.168.100.11/24
> lagg0: link state changed to DOWN
> root@sysdev07:~ # ifconfig ix0 up
> lock order reversal:
>  1st 0xfffff8002d7c9190 if_addr_lock (if_addr_lock) @ /usr/home/sbruno/fbsd_head/sys/net/rtsock.c:1717
>  2nd 0xfffff800271a5808 if_lagg rmlock (if_lagg rmlock) @ /usr/home/sbruno/fbsd_head/sys/modules/if_lagg/../../net/if_lagg.c:1057
> stack backtrace:
> #0 0xffffffff80aa5ab0 at witness_debugger+0x70
> #1 0xffffffff80aa59a4 at witness_checkorder+0xe54
> #2 0xffffffff80a42521 at _rm_rlock_debug+0x111
> #3 0xffffffff82222b2c at lagg_get_counter+0x4c
> #4 0xffffffff80b2ebd1 at if_data_copy+0xa1
> #5 0xffffffff80b533bc at sysctl_rtsock+0x56c
> #6 0xffffffff80a53f0a at sysctl_root_handler_locked+0x8a
> #7 0xffffffff80a536c8 at sysctl_root+0x188
> #8 0xffffffff80a53cbe at userland_sysctl+0x16e
> #9 0xffffffff80a53b14 at sys___sysctl+0x74
> #10 0xffffffff80eb5b3b at amd64_syscall+0x2db
> #11 0xffffffff80e95c4b at Xfast_syscall+0xfb
>
> Running a netstat -w 1 in the background while repeatedly creating and
> destroying the interface lagg0 will lead to either a panic or a deadlock:
>
> e.g.
> netstat -w 1 > /dev/null &
> while [ 1 ]; do
>     ifconfig lagg0 destroy
>     ifconfig lagg0 create laggport ix0 laggproto lacp 192.168.100.11/24
> done
>
> When the system deadlocks on the console, kdb sees the locks held like
> this:
> KDB: enter: Break to debugger
> [ thread pid 11 tid 100007 ]
> Stopped at kdb_alt_break_internal+0x18e: movq $0,kdb_why
> db> show allocks
> No such command
> db> show alllocks
> Process 2173 (ifconfig) thread 0xfffff8002d125a00 (100186)
> exclusive rm if_lagg rmlock (if_lagg rmlock) r = 0 (0xfffff8002717e408) locked @ /usr/home/sbruno/fbsd_head/sys/modules/if_lagg/../../net/if_lagg.c:1530
> exclusive sleep mutex in6_multi_mtx (in6_multi_mtx) r = 0 (0xffffffff81d7e288) locked @ /usr/home/sbruno/fbsd_head/sys/netinet6/in6_mcast.c:1142
> Process 792 (netstat) thread 0xfffff80027e67a00 (100167)
> shared rw if_addr_lock (if_addr_lock) r = 0 (0xfffff80103e95190) locked @ /usr/home/sbruno/fbsd_head/sys/net/rtsock.c:1717
> shared rw ifnet_rw (ifnet_rw) r = 0 (0xffffffff81d7b760) locked @ /usr/home/sbruno/fbsd_head/sys/net/rtsock.c:1713
> exclusive sleep mutex Giant (Giant) r = 0 (0xffffffff81d55e08) locked @ /usr/home/sbruno/fbsd_head/sys/kern/kern_sysctl.c:164
>
> This looks like the netstat is causing a call into the counter
> function while the destruction or creation is ongoing.
>
> Removing the LAGG_RLOCK() calls from lagg_get_counter() makes the
> deadlock, LOR and panic go away, however this can't be that easy. I'm
> unsure what the RLOCK is for in lagg_get_counter(). It appears that
> there is a higher lock in the ifnet access that is protecting
> simultaneous access already, but I'm very ignorant of what's going on
> here.
>
> I don't see any other driver with locks in its get_counter()
> functions, so I'm not sure what the best course of action here is.
>
> Sean

I don't know the best answer either. But while you're in there, are you interested in fixing any other lagg panics too?
I've written some ATF torture tests for lagg, but I haven't checked them into head yet because most of them quickly panic.

-Alan
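
For anyone following the thread who has not read a lock order reversal report before, here is a toy Python model of how an order checker could arrive at the "1st if_addr_lock, 2nd if_lagg rmlock" message quoted above. It is only a sketch and not FreeBSD code: the class LockOrderChecker, the function demo(), and the thread labels are hypothetical names. It also assumes that some ifconfig-side path takes the lagg rmlock before if_addr_lock, which the kdb transcript does not show directly.

```python
from collections import defaultdict


class LockOrderChecker:
    """Toy lock-order tracker (hypothetical; not the kernel's WITNESS code)."""

    def __init__(self):
        self.order = set()              # (earlier, later) pairs seen so far
        self.held = defaultdict(list)   # thread label -> locks currently held

    def acquire(self, thread, lock):
        """Record an acquisition; return any reversals it introduces."""
        held = self.held[thread]
        # A reversal is a held lock that was previously taken AFTER `lock`.
        reversals = [(h, lock) for h in held if (lock, h) in self.order]
        # Remember the order established by this acquisition.
        self.order.update((h, lock) for h in held)
        held.append(lock)
        return reversals

    def release(self, thread, lock):
        self.held[thread].remove(lock)


def demo():
    w = LockOrderChecker()
    # Assumed ifconfig-side order: lagg rmlock first, then if_addr_lock.
    w.acquire("ifconfig", "if_lagg rmlock")
    w.acquire("ifconfig", "if_addr_lock")
    w.release("ifconfig", "if_addr_lock")
    w.release("ifconfig", "if_lagg rmlock")
    # netstat-side order from the backtrace: if_addr_lock (rtsock.c),
    # then the lagg rmlock inside lagg_get_counter().
    w.acquire("netstat", "if_addr_lock")
    return w.acquire("netstat", "if_lagg rmlock")


if __name__ == "__main__":
    for first, second in demo():
        print("lock order reversal: 1st", first, "2nd", second)
```

In this model the reversal is reported as soon as both orders have been observed, even if the two paths never actually overlap in time, which is why a report like this can appear well before any deadlock does.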