Date: Thu, 20 Oct 2016 15:22:31 +1100
From: Kubilay Kocak <koobs@FreeBSD.org>
To: Cassiano Peixoto <peixotocassiano@gmail.com>, Donald Baud <donaldbaud@yahoo.com>
Cc: "net@freebsd.org" <net@freebsd.org>
Subject: Re: FreeBSD10.3-RELEASE. Kernel panic.
Message-ID: <16994026-ae44-4a39-822b-c4a218a71ce7@FreeBSD.org>
In-Reply-To: <CAJajdNVdsRzq5KdshEBiOzPT6VeLakpv7E_z1=RJun6%2BU4P9wQ@mail.gmail.com>
References: <CAAFYNruF4gFAiTCAhyRUQzcovW2osrKn4ehiuNR0btJCZbnOGg@mail.gmail.com> <57FC859F.5000200@grosbein.net> <CAJajdNUXOrzWDKVmSB1Xm_G6zqBhMsZ2vesDcAw2CPGFBU0xtg@mail.gmail.com> <2033449965.65391.1476244568309@mail.yahoo.com> <a450f0eb-378a-2bd5-2f24-a0eb6b941856@freebsd.org> <86183ea5-5855-5fb3-22f6-d25454859186@yahoo.com> <CACpH0McW4KkDbCnfL4DKc4aQiOhnuMYC0q%2B8ELJn6dtDs0HW3A@mail.gmail.com> <958e01c2-8459-9614-ddd6-d0953fc86c02@yahoo.com> <CAJajdNXkGBLWsVWJiZZQYo=5vqcFzcYOJv0zVA8vzORexjB91A@mail.gmail.com> <CAJajdNVdsRzq5KdshEBiOzPT6VeLakpv7E_z1=RJun6%2BU4P9wQ@mail.gmail.com>
On 19/10/2016 3:23 AM, Cassiano Peixoto wrote:
> Hi guys,
>
> I have an update on this issue. After my last email I had 3 crashes.
> Two of them had the same message in the kernel debugger:
>
> (kgdb) list *0xffffffff8228c918
> 0xffffffff8228c918 is in trim_map_seg_compare
> (/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/trim_map.c:108).
> 103     trim_map_seg_compare(const void *x1, const void *x2)
> 104     {
> 105         const trim_seg_t *s1 = x1;
> 106         const trim_seg_t *s2 = x2;
> 107
> 108         if (s1->ts_start < s2->ts_start) {
> 109             if (s1->ts_end > s2->ts_start)
> 110                 return (0);
> 111             return (-1);
> 112         }
> Current language: auto; currently minimal
> (kgdb) bt
> #0 doadump (textdump=<value optimized out>) at pcpu.h:221
> #1 0xffffffff80ad8e69 in kern_reboot (howto=260) at
> /usr/src/sys/kern/kern_shutdown.c:366
> #2 0xffffffff80ad941b in vpanic (fmt=<value optimized out>, ap=<value
> optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
> #3 0xffffffff80ad9253 in panic (fmt=0x0) at
> /usr/src/sys/kern/kern_shutdown.c:690
> #4 0xffffffff80fa0d31 in trap_fatal (frame=0xfffffe02374957f0,
> eva=4294967343) at /usr/src/sys/amd64/amd64/trap.c:841
> #5 0xffffffff80fa0f23 in trap_pfault (frame=0xfffffe02374957f0,
> usermode=0) at /usr/src/sys/amd64/amd64/trap.c:691
> #6 0xffffffff80fa04cc in trap (frame=0xfffffe02374957f0) at
> /usr/src/sys/amd64/amd64/trap.c:442
> #7 0xffffffff80f84141 in calltrap () at
> /usr/src/sys/amd64/amd64/exception.S:236
> #8 0xffffffff8228c918 in trim_map_seg_compare (x1=0xfffffe0237495920,
> x2=0x100000007) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/trim_map.c:108
> #9 0xffffffff821a98e1 in avl_find (tree=<value optimized out>,
> value=<value optimized out>, where=0x0) at
> /usr/src/sys/cddl/contrib/opensolaris/common/avl/avl.c:268
> #10 0xffffffff8228ce9e in trim_map_write_start (zio=<value optimized out>)
> at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/trim_map.c:363
> #11 0xffffffff822592df in zio_vdev_io_start
> (zio=0xfffff802191ea000) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:2866
> #12 0xffffffff82255b26 in zio_execute (zio=<value optimized out>) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1556
> #13 0xffffffff822551e9 in zio_nowait (zio=0xfffff802191ea000) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1610
> #14 0xffffffff8223c738 in vdev_queue_io_done (zio=<value optimized out>) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c:884
> #15 0xffffffff822594a9 in zio_vdev_io_done (zio=0xfffff8006daad000) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:2895
> #16 0xffffffff82255b26 in zio_execute (zio=<value optimized out>) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1556
> #17 0xffffffff80b363ca in taskqueue_run_locked (queue=<value optimized
> out>) at /usr/src/sys/kern/subr_taskqueue.c:449
> #18 0xffffffff80b372d8 in taskqueue_thread_loop (arg=<value optimized out>)
> at /usr/src/sys/kern/subr_taskqueue.c:703
> #19 0xffffffff80a90055 in fork_exit (callout=0xffffffff80b371f0
> <taskqueue_thread_loop>, arg=0xfffff8001006b920, frame=0xfffffe0237495c00)
> at /usr/src/sys/kern/kern_fork.c:1038
> #20 0xffffffff80f8467e in fork_trampoline () at
> /usr/src/sys/amd64/amd64/exception.S:611
> #21 0x0000000000000000 in ?? ()
> (kgdb) up 8
> #8 0xffffffff8228c918 in trim_map_seg_compare (x1=0xfffffe0237495920,
> x2=0x100000007) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/trim_map.c:108
> 108         if (s1->ts_start < s2->ts_start) {
>
> But my last crash had a different message:
>
> (kgdb) list *0xffffffff80b3a89c
> 0xffffffff80b3a89c is in turnstile_broadcast
> (/usr/src/sys/kern/subr_turnstile.c:837).
> 832
> 833         /*
> 834          * Transfer the blocked list to the pending list.
> 835          */
> 836         mtx_lock_spin(&td_contested_lock);
> 837         TAILQ_CONCAT(&ts->ts_pending, &ts->ts_blocked[queue], td_lockq);
> 838         mtx_unlock_spin(&td_contested_lock);
> 839
> 840         /*
> 841          * Give a turnstile to each thread. The last thread gets
> Current language: auto; currently minimal
> (kgdb) bt
> #0 doadump (textdump=<value optimized out>) at pcpu.h:221
> #1 0xffffffff80ad8e69 in kern_reboot (howto=260) at
> /usr/src/sys/kern/kern_shutdown.c:366
> #2 0xffffffff80ad941b in vpanic (fmt=<value optimized out>, ap=<value
> optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
> #3 0xffffffff80ad9253 in panic (fmt=0x0) at
> /usr/src/sys/kern/kern_shutdown.c:690
> #4 0xffffffff80fa0d31 in trap_fatal (frame=0xfffffe0237384870, eva=48) at
> /usr/src/sys/amd64/amd64/trap.c:841
> #5 0xffffffff80fa0f23 in trap_pfault (frame=0xfffffe0237384870,
> usermode=0) at /usr/src/sys/amd64/amd64/trap.c:691
> #6 0xffffffff80fa04cc in trap (frame=0xfffffe0237384870) at
> /usr/src/sys/amd64/amd64/trap.c:442
> #7 0xffffffff80f84141 in calltrap () at
> /usr/src/sys/amd64/amd64/exception.S:236
> #8 0xffffffff80b3a89c in turnstile_broadcast (ts=0x0, queue=1) at
> /usr/src/sys/kern/subr_turnstile.c:837
> #9 0xffffffff80ad48cf in __rw_wunlock_hard (c=0xfffff8024f3c2960,
> tid=<value optimized out>, file=<value optimized out>, line=<value
> optimized out>)
> at /usr/src/sys/kern/kern_rwlock.c:1027
> #10 0xffffffff80e1a75c in vm_map_delete (map=<value optimized out>,
> start=<value optimized out>, end=<value optimized out>) at
> /usr/src/sys/vm/vm_map.c:2960
> #11 0xffffffff80e1828e in vmspace_exit (td=<value optimized out>) at
> /usr/src/sys/vm/vm_map.c:3077
> #12 0xffffffff80a88686 in exit1 (td=0xfffff80015533a00, rval=268849920,
> signo=0) at /usr/src/sys/kern/kern_exit.c:398
> #13 0xffffffff80a87e1d in sys_sys_exit (td=0x0, uap=<value optimized out>)
> at /usr/src/sys/kern/kern_exit.c:178
> #14 0xffffffff80fa168e in amd64_syscall (td=<value optimized out>,
> traced=0) at subr_syscall.c:135
> #15 0xffffffff80f8442b in Xfast_syscall () at
> /usr/src/sys/amd64/amd64/exception.S:396
> #16 0x0000000800b661aa in ?? ()
> Previous frame inner to this frame (corrupt stack?)
> (kgdb) up 8
> #8 0xffffffff80b3a89c in turnstile_broadcast (ts=0x0, queue=1) at
> /usr/src/sys/kern/subr_turnstile.c:837
> 837         TAILQ_CONCAT(&ts->ts_pending, &ts->ts_blocked[queue], td_lockq);
>
> As you can see we are dealing with random crashes. I feel I'm not moving
> forward here. It's not a hardware problem because I have 3 different
> servers with the same issue.
>
> Donald, did you have a chance to try 11-RELEASE? Any other behavior?
>
> Does anyone have an idea that could help?
>
> Thanks.
>
>
> On Thu, Oct 13, 2016 at 12:24 PM, Cassiano Peixoto <
> peixotocassiano@gmail.com> wrote:
>
>> Hi guys,
>>
>> First of all, thanks for sharing your thoughts about this issue. I think
>> it's really important to find a solution for this issue together.
>>
>> I can see two related behaviors, but for me the root cause is the same:
>>
>> 1- mpd5 process stuck with umtxn flag
>> 2- system crash
>>
>> I've tested recently on FreeBSD 10.3 and FreeBSD-11-RC3. I've tried all
>> suggested tunings with no success.
>>
>> My environment is:
>> - About 430 clients connected (but I can add more)
>> - Using ZFS
>> - igb NICs.
>> - Generic kernel
>>
>> Two days ago I updated my system to FreeBSD 11-RELEASE-p1 and after this
>> my system seems stable for almost 3 days. No crashes anymore. I need more
>> days to feel confident that something has changed. But anyway, my crashes
>> before happened every day.
>>
>> If it crashes again I'll apply Donald's recommendation and let you guys know.
>>
>> Let's keep in touch, to try to fix it at last.
>>
>> Thanks.
>>
>> On Wed, Oct 12, 2016 at 8:24 PM, Donald Baud via freebsd-net <
>> freebsd-net@freebsd.org> wrote:
>>
>>> On 10/12/16 3:24 PM, Zaphod Beeblebrox wrote:
>>>
>>>> While my mpd5 servers are possibly less busy (I haven't had common
>>>> crashes), I have noticed a "group" of problems.
>>>>
>>>> 1. The carrier dropping communication (ie: fiber cut or l2 switch
>>>> breakage) of the L2TP streams can leave mpd5 in a state where it will not
>>>> die and will not destroy interfaces (requires reboot to clear).
>>>>
>>> I've encountered that once on 10.3 and I had tweaked some sysctl values
>>> while monitoring:
>>>     vmstat -z | head -1; vmstat -z | grep -i netgraph
>>>
>>> You might want to search other people's experience with the following
>>> values:
>>> # net.graph.maxdgram  #this is set in /etc/sysctl.conf
>>> # net.graph.recvspace #this is set in /etc/sysctl.conf
>>> # net.graph.maxdata   #this is set in /boot/loader.conf
>>> # net.graph.maxalloc  #this is set in /boot/loader.conf
>>>
>>> I'll leave others to comment on what's best to set as values with their
>>> experience on FreeBSD10.3.
>>> In my case, as I had explained, one of the recipes that worked for me is
>>> to comment out and leave those kernel values to their default.
>>>
>>> I've read on the mpd5 mailing list some saying that FreeBSD-11 has had
>>> upgrades to the netgraph modules.
>>> I am now using FreeBSD-11 and it looks like I don't need any of the
>>> kernel tweaks that I've described.
>>>
>>> Also, may I suggest you troubleshoot the fiber-cut or L2 switch breakage
>>> by playing with some ipfw values to simulate a fiber-cut:
>>> ex: ipfw add 100 deny ip from 10.10.10.10 to me
>>>
>>>> 2. There are race conditions between quagga and mpd5 for adding/dropping
>>>> routes.
>>>>
>>> While troubleshooting the crashes of the mpd5, I have removed net/quagga
>>> and installed net/bird instead.
>>> I am now using net/bird. I've written a little howto to get you started
>>> with net/bird,
>>> see: https://forums.freebsd.org/threads/56988/
>>>
>>>> 3. if A is a pppoe client and B is the mpd5 server, A cannot access TCP
>>>> services on B. It can access tcp services _beyond_ B, but not on B. (there
>>>> is a ticket open for this).
>>>>
>>>> On Wed, Oct 12, 2016 at 10:51 AM, Donald Baud via freebsd-net <
>>>> freebsd-net@freebsd.org> wrote:
>>>>
>>>>
>>>> On 10/12/16 1:13 AM, Julian Elischer wrote:
>>>>
>>>> On 11/10/2016 8:56 PM, Donald Baud via freebsd-net wrote:
>>>>
>>>> I've been plagued with these =daily= panics until I tried
>>>> the following recipes and the server has been up for 30
>>>> days so far:
>>>>
>>>> Normally I should experiment more to see which one of the
>>>> recipes is really the fix, but I'm just glad that the
>>>> server is stable for now.
>>>>
>>>>
>>>> this is really great information.
>>>> It makes debugging a lot more possible.
>>>> I know it is a hard question, but do you have a way to
>>>> simulate this workload?
>>>>
>>>> I have no real way to simulate this kind of workload
>>>>
>>>>
>>>> Sadly, I don't have a way to simulate the workload but I am very
>>>> interested to help fix these crashes since, as Cassiano said, this
>>>> makes mpd5/freebsd useless for pppoe/l2tp termination.
>>>>
>>>> At this point, I would suggest that Cassiano and Андрей confirm
>>>> that they don't get panics when they apply the recipes that I am
>>>> using.
>>>>
>>>> I am still running many other cisco-vpdn gateways that I would
>>>> convert into mpd5/freebsd but my plan was stalled by the daily
>>>> crashes.
>>>> I'll wait a couple of weeks to be sure that my recipes are a valid
>>>> workaround before converting my remaining cisco gateways to mpd5.
>>>>
>>>> -Dbaud
>>>>
>>>>
>>>>
>>>> recipe-1: Don't let mpd5 start automatically when server
>>>> boots:
>>>> i.e.
>>>> in: /etc/rc.conf
>>>>     mpd5_enable="NO"
>>>> and wait about 5 minutes after server boots then issue:
>>>>     /usr/local/etc/rc.d/mpd5 onestart
>>>>
>>>>
>>>> recipe-2: recompile the kernel with the NETGRAPH_DEBUG
>>>> option:
>>>>     options NETGRAPH
>>>>     options NETGRAPH_DEBUG
>>>>     options NETGRAPH_KSOCKET
>>>>     options NETGRAPH_L2TP
>>>>     options NETGRAPH_SOCKET
>>>>     options NETGRAPH_TEE
>>>>     options NETGRAPH_VJC
>>>>     options NETGRAPH_PPP
>>>>     options NETGRAPH_IFACE
>>>>     options NETGRAPH_MPPC_COMPRESSION
>>>>     options NETGRAPH_MPPC_ENCRYPTION
>>>>     options NETGRAPH_TCPMSS
>>>>     options IPFIREWALL
>>>>
>>>> recipe-3: recompile the kernel and disable the IPv6 and
>>>> SCTP options:
>>>>     nooptions INET6
>>>>     nooptions SCTP
>>>>
>>>> recipe-4: Don't use any of the sysctl optimizations
>>>> in other words I commented out all values in sysctl.conf:
>>>>     # net.graph.maxdgram=20480 (this is the default)
>>>>     # net.graph.recvspace=20480 (this is the default)
>>>>
>>>> recipe-5: Don't use any of the loader.conf optimizations
>>>> in other words I commented out all values in loader.conf
>>>>     # net.graph.maxdata=4096 (this is the default)
>>>>     # net.graph.maxalloc=4096 (this is the default)
>>>>
>>>> ================================
>>>> In my case, I had the panics with 10.3 and 11-PRERELEASE
>>>> 11.0-PRERELEASE FreeBSD 11.0-PRERELEASE #2 r305587
>>>>
>>>> With those recipes, I have been running without any crash
>>>> for a month and counting. That's 300 l2tp tunnels and
>>>> 1400 l2tp sessions generating 700Mbit/s.
>>>>
>>>>
>>>> -DBaud
>>>>
>>>>
>>>> On Tuesday, October 11, 2016 7:30 AM, Cassiano Peixoto
>>>> <peixotocassiano@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> There are many users complaining about this:
>>>>
>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=186114
>>>>
>>>> I've been dealing with this issue for one year with no
>>>> solution.
>>>> mpd5 as pppoe server on FreeBSD is useless with this bug.
>>>>
>>>> I really would like to see it working again, I think it's
>>>> quite important to both the project and many users.
>>>>
>>>> Thanks.
>>>>
>>>> On Tue, Oct 11, 2016 at 3:24 AM, Eugene Grosbein
>>>> <eugen@grosbein.net> wrote:
>>>>
>>>> 11.10.2016 11:02, Андрей Леушкин wrote:
>>>>
>>>> Hello. I have a problem with "FreeBSD nas
>>>> 10.3-RELEASE FreeBSD 10.3-RELEASE
>>>> #0: Fri Oct 7 21:12:56 YEKT 2016
>>>> nas@nas:/usr/obj/usr/src/sys/nasv3
>>>> amd64"
>>>>
>>>> The kernel panic repeats at intervals of 2-3 days.
>>>> At first I thought that
>>>> the problem is in the hardware, but the problem
>>>> did not go away after
>>>> replacing the server platform.
>>>>
>>>> Coredumps and more info at this link:
>>>> https://drive.google.com/open?id=0BxciMy2q7ZjTTkIxem9wTE1tM2M
>>>>
>>>> Sorry for my English.
>>>> I'll wait for an answer.
>>>>
>>>> This is a known and long-standing problem in the FreeBSD
>>>> network stack.
>>>> It shows up when you have lots of network interfaces
>>>> created/removed frequently,
>>>> like in your case of a Network Access Server (PPtP,
>>>> PPPoE etc).
>>>>
>>>> Generally, people run into this problem using the mpd5
>>>> network daemon.
>>>> mpd5 uses the NETGRAPH kernel subsystem to process traffic
>>>> and
>>>> if an interface disappears (e.g., a user disconnected)
>>>> while the kernel still processes traffic obtained from
>>>> this interface, it panics.
>>>>
>>>> There were lots of reports of this problem. No one
>>>> seems to be working on it at the moment.
>>>> You should file a PR using Bugzilla and attach your
>>>> logs to it.
>>>>
>>>> Eugene Grosbein
>>>>
>>>>
>>> _______________________________________________
>>> freebsd-net@freebsd.org mailing list
>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>>>
>>
>>
>

For anyone experiencing these mpd hangs/crashes, if you believe your
issue is the same as that described in Issue 186114 [1], please add your
comments there, including full system version information and crash
backtraces (*as attachments*) if experiencing panics.

Resolution of this problem is contingent on clear test/reproduction
cases (ideally as reduced as possible).

[1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=186114

./koobs
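
A note on reading the two quoted backtraces, since they look unrelated at first glance. In both, the faulting address that trap_fatal reports (eva) sits a small distance above the bad pointer shown in frame #8: x2=0x100000007 in trim_map_seg_compare, and ts=0x0 in turnstile_broadcast. That is the usual signature of a structure member being read through an invalid pointer. The sketch below only redoes that arithmetic with the values printed in the backtraces. The helper name fault_offset is hypothetical, and reading the results as member offsets inside trim_seg_t and struct turnstile is an assumption that has not been checked against the source tree.

```python
def fault_offset(eva: int, base_pointer: int) -> int:
    """Return how far the faulting address lies above the pointer that was dereferenced."""
    return eva - base_pointer


# First crash: eva=4294967343 from trap_fatal, x2=0x100000007 from frame #8.
first_offset = fault_offset(4294967343, 0x100000007)

# Second crash: eva=48 from trap_fatal, ts=0x0 from frame #8.
second_offset = fault_offset(48, 0x0)

print(f"trim_map_seg_compare: eva={4294967343:#x}, offset={first_offset}")
print(f"turnstile_broadcast:  eva={48:#x}, offset={second_offset}")
```

Small offsets like these suggest that both panics are dereferences of a pointer that was already bad when the function was entered, which is consistent with memory being corrupted or freed elsewhere rather than with two independent bugs. It does not identify the culprit; the attached core dumps are still needed for that.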