Date: Thu, 10 May 2018 19:54:36 +0200
From: Harry Schmalzbauer <freebsd@omnilan.de>
To: Stephen Hurd <shurd@llnw.com>
Cc: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>, Stephen Hurd <shurd@freebsd.org>, Kevin Bowling <kevin.bowling@kev009.com>
Subject: Re: iflib-if_em tests with HEAD and lagg panic [Was: Re: svn commit: r333338 - in stable/11/sys: dev/bnxt kern net sys]
Message-ID: <5AF4875C.5000201@omnilan.de>
In-Reply-To: <CAGK_Ob1XR_D_B=vXeHtMQwHA2yXhhWPfMtpwHKwDbGfoWgaOVw@mail.gmail.com>
References: <201805072142.w47LgN1R041002@repo.freebsd.org>
 <5AF16B8B.7030703@omnilan.de>
 <CAK7dMtBkCvLgPVnsf%2BECcrdbKNvOShONeZ=vqvg3dJ5ZeuoP5w@mail.gmail.com>
 <5AF17134.7020602@omnilan.de>
 <CAK7dMtB3V1F=2AxtsbUznn5DO81G3Zkh9UYiN3eWkyOfV_CYmg@mail.gmail.com>
 <5AF1CF0F.4040909@omnilan.de>
 <65972f0d-2873-42ea-464c-a3db543abafb@freebsd.org>
 <5AF1E073.5010701@omnilan.de>
 <CAGK_Ob1XR_D_B=vXeHtMQwHA2yXhhWPfMtpwHKwDbGfoWgaOVw@mail.gmail.com>

Regarding Stephen Hurd's message of 08.05.2018 20:58 (localtime):
> Can you test the review here: https://reviews.freebsd.org/D15355
>
> It looks like there are two different locks protecting the same data
> everywhere but in lagg_ioctl(). This is a rough first-pass, and there may
> be some lingering recursion and performance regressions with it.
>
…
>> Sleeping on "e1000_delay" with the following non-sleepable locks held:
>> exclusive rm if_lagg rmlock (if_lagg rmlock) r = 0 (0xfffff80014228c08)
>> locked @ /usr/src/sys/net/if_lagg.c:1433
>> stack backtrace:
>> #0 0xffffffff80701113 at witness_debugger+0x73
>> #1 0xffffffff807024f1 at witness_warn+0x461
>> #2 0xffffffff806a42cc at _sleep+0x6c
>> #3 0xffffffff806a4b34 at pause_sbt+0x144
>> #4 0xffffffff80440e21 at e1000_write_phy_reg_mdic+0xf1
>> #5 0xffffffff804446bf at e1000_enable_phy_wakeup_reg_access_bm+0x2f
>> #6 0xffffffff80432e0a at e1000_update_mc_addr_list_pch2lan+0x3a
>> #7 0xffffffff8041408f at em_if_multi_set+0x1bf
>> #8 0xffffffff807bc02e at iflib_if_ioctl+0xfe
>> #9 0xffffffff82111a15 at lagg_ioctl+0x115
>> #10 0xffffffff807dd348 at inm_release_task+0x218
>> #11 0xffffffff806dea29 at gtaskqueue_run_locked+0x139
>> #12 0xffffffff806de7a8 at gtaskqueue_thread_loop+0x88
>> #13 0xffffffff80659d84 at fork_exit+0x84
>> #14 0xffffffff809b767e at fork_trampoline+0xe
>> Sleeping thread (tid 100017, pid 0) owns a non-sleepable lock
>> KDB: stack backtrace of thread 100017:
>> sched_switch() at sched_switch+0x945/frame 0xfffffe00750dc5d0
>> mi_switch() at mi_switch+0x18c/frame 0xfffffe00750dc600
>> sleepq_switch() at sleepq_switch+0x10d/frame 0xfffffe00750dc640
>> sleepq_timedwait() at sleepq_timedwait+0x50/frame 0xfffffe00750dc680
>> _sleep() at _sleep+0x307/frame 0xfffffe00750dc730
>> pause_sbt() at pause_sbt+0x144/frame 0xfffffe00750dc780
>> e1000_write_phy_reg_mdic() at e1000_write_phy_reg_mdic+0xf1/frame 0xfffffe00750dc7c0
>> e1000_enable_phy_wakeup_reg_access_bm() at e1000_enable_phy_wakeup_reg_access_bm+0x2f/frame 0xfffffe00750dc7e0
>> e1000_update_mc_addr_list_pch2lan() at e1000_update_mc_addr_list_pch2lan+0x3a/frame 0xfffffe00750dc820
>> em_if_multi_set() at em_if_multi_set+0x1bf/frame 0xfffffe00750dc870
>> iflib_if_ioctl() at iflib_if_ioctl+0xfe/frame 0xfffffe00750dc8e0
>> lagg_ioctl() at lagg_ioctl+0x115/frame 0xfffffe00750dc990
>> inm_release_task() at inm_release_task+0x218/frame 0xfffffe00750dc9f0
>> gtaskqueue_run_locked() at gtaskqueue_run_locked+0x139/frame 0xfffffe00750dca40
>> gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x88/frame 0xfffffe00750dca70
>> fork_exit() at fork_exit+0x84/frame 0xfffffe00750dcab0
>> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00750dcab0
>> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
>> panic: sleeping thread
>> cpuid = 3
>> time = 1525794682
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008fe180e0
>> vpanic() at vpanic+0x1a3/frame 0xfffffe008fe18140
>> panic() at panic+0x43/frame 0xfffffe008fe181a0
>> propagate_priority() at propagate_priority+0x335/frame 0xfffffe008fe181e0
>> turnstile_wait() at turnstile_wait+0x38d/frame 0xfffffe008fe18230
>> __mtx_lock_sleep() at __mtx_lock_sleep+0x1e1/frame 0xfffffe008fe182b0
>> __mtx_lock_flags() at __mtx_lock_flags+0xf9/frame 0xfffffe008fe18300
>> _rm_rlock() at _rm_rlock+0x280/frame 0xfffffe008fe18330
>> _rm_rlock_debug() at _rm_rlock_debug+0x14c/frame 0xfffffe008fe18380
>> lagg_transmit() at lagg_transmit+0x38/frame 0xfffffe008fe183f0
>> ether_output_frame() at ether_output_frame+0xaa/frame 0xfffffe008fe18420
>> ether_output() at ether_output+0x68b/frame 0xfffffe008fe184c0
>> arprequest() at arprequest+0x474/frame 0xfffffe008fe185c0
>> arp_ifinit() at arp_ifinit+0x58/frame 0xfffffe008fe18600
>> ether_ioctl() at ether_ioctl+0x1d1/frame 0xfffffe008fe18630
>> lagg_ioctl() at lagg_ioctl+0x602/frame 0xfffffe008fe186e0
>> in_control() at in_control+0x8f5/frame 0xfffffe008fe18780
>> ifioctl() at ifioctl+0x19c6/frame 0xfffffe008fe18850
>> kern_ioctl() at kern_ioctl+0x2b9/frame 0xfffffe008fe188b0
>> sys_ioctl() at sys_ioctl+0x168/frame 0xfffffe008fe18980
>> amd64_syscall() at amd64_syscall+0x2cc/frame 0xfffffe008fe18ab0
>> fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe008fe18ab0
>> --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x8004820ba, rsp = 0x7fffffffe1c8, rbp = 0x7fffffffe210 ---
>> KDB: enter: panic

I can confirm that the D15355 version I tested eliminates that panic.
Also no LOR with em0+em1 as laggports.

From the kawela report:
> Regarding Kevin Bowling's message of 08.05.2018 11:52 (localtime):
>> On Tue, May 8, 2018 at 2:43 AM, Harry Schmalzbauer <freebsd@omnilan.de> wrote:
> …
>>> But if the simple iflib/hw-support test with kawela+hartwell helps I'm
>>> happy to do.
>>
>> At this point it would be helpful, we think e1000 is nearing pretty
>> good shape and I need to become familiar with any outstanding bugs.
>
> Here's the results for kawela (82576) which, to my surprise, still shows
> up as "igb" – I thought it would be "emX".
…
> Running simple NFS4 copies with all offloading bells and whistles
> enabled and MTU 9000 work fine (over IPv6 and LACP) at full line rate.
>
> Only one LACP LOR (no panic as with em0+em1 lagg, where I saw pages full
> of LORs):
> lock order reversal: (sleepable after non-sleepable)
> 1st 0xfffff80002bc9208 if_lagg rmlock (if_lagg rmlock) @ /usr/src/sys/net/if_lagg.c:1433
> 2nd 0xfffff80002c04550 iflib ctx lock (iflib ctx lock) @ /usr/src/sys/net/iflib.c:3999
> stack backtrace:
> #0 0xffffffff80701113 at witness_debugger+0x73
> #1 0xffffffff80700f94 at witness_checkorder+0xe34
> #2 0xffffffff806a26a8 at _sx_xlock+0x68
> #3 0xffffffff807bbfbc at iflib_if_ioctl+0x8c
> #4 0xffffffff8079e5f4 at if_addmulti+0x264
> #5 0xffffffff821144a8 at lagg_setmulti+0x108
> #6 0xffffffff82111a28 at lagg_ioctl+0x128
> #7 0xffffffff8079e5f4 at if_addmulti+0x264
> #8 0xffffffff807d8b7e at in_joingroup_locked+0x1ce
> #9 0xffffffff807d8982 at in_joingroup+0x42
> #10 0xffffffff807d47cb at in_control+0x93b
> #11 0xffffffff8079d656 at ifioctl+0x19c6
> #12 0xffffffff807068c9 at kern_ioctl+0x2b9
> #13 0xffffffff80706598 at sys_ioctl+0x168
> #14 0xffffffff809dab2c at amd64_syscall+0x2cc
> #15 0xffffffff809b71ad at fast_syscall_common+0x101

This LOR (igb0+igb1 as laggports) also vanished with the D15355 version
I tested.

Please excuse that I'm not familiar with Phabricator; I just did a "raw
diff download" after briefly skimming the comments.
According to st_mtime this was on May 9th, 08:14:02 UTC (10:14 local
time, CEST).
I have no idea what timezone Phabricator reports to me, most likely
local time, which would mean the latest revision was part of my test,
but I'm not sure...
Shall I redo it?

-harry