Date: Tue, 14 Feb 2017 17:08:27 -0800
From: Mark Millard <markmi@dsl-only.net>
To: Andrew Turner <andrew@fubar.geek.nz>
Cc: freebsd-arm <freebsd-arm@freebsd.org>
Subject: Re: A potential fix for arm64's: sh`forkshell child-process path after fork sometimes has a bad stack pointer value
Message-ID: <142FC38B-48F6-4456-8CD1-D180EDB6A73C@dsl-only.net>
In-Reply-To: <6EED2BFF-CAFB-4F58-8D0D-8E060319278C@dsl-only.net>
References: <DC3CC3BE-9D8C-41ED-ADD0-AFD4019B2E90@dsl-only.net> <2D04FF37-DEC8-42CE-961D-AE8CD58A0EAA@dsl-only.net> <93064627-5F72-4167-90B1-0A98ABF4C99C@dsl-only.net> <3BC697B9-4A3E-49FF-AB11-1106E2EF8399@dsl-only.net> <20170214165644.15dedf6e@zapp> <6EED2BFF-CAFB-4F58-8D0D-8E060319278C@dsl-only.net>

On 2017-Feb-14, at 9:17 AM, Mark Millard <markmi@dsl-only.net> wrote:

> On 2017-Feb-14, at 8:56 AM, Andrew Turner <andrew at fubar.geek.nz> wrote:
>
> On Tue, 14 Feb 2017 08:35:54 -0800
>> Mark Millard <markmi at dsl-only.net> wrote:
>>
>>> The following change has let my test run for 8.5 hours so far without
>>> a fork-failure in sh`forkshell :
>>>
>>> # svnlite diff /usr/src/sys/arm64/arm64/swtch.S
>>> Index: /usr/src/sys/arm64/arm64/swtch.S
>>> ===================================================================
>>> --- /usr/src/sys/arm64/arm64/swtch.S (revision 312982)
>>> +++ /usr/src/sys/arm64/arm64/swtch.S (working copy)
>>> @@ -241,6 +241,12 @@
>>>      mov fp, #0 /* Stack traceback stops here. */
>>>      bl _C_LABEL(fork_exit)
>>>
>>> +    /*
>>> +     * Disable interrupts to avoid
>>> +     * overwriting sp_el0 and spsr_el1 by an IRQ exception.
>>> +     */
>>> +    msr daifset, #2
>>> +
>>>      /* Restore sp and lr */
>>>      ldp x0, x1, [sp]
>>>      msr sp_el0, x0
>>> @@ -263,12 +269,6 @@
>>>      ldp x28, x29, [sp, #TF_X + 28 * 8]
>>>      /* Skip x30 as it was restored above as lr */
>>>
>>> -    /*
>>> -     * Disable interrupts to avoid
>>> -     * overwriting spsr_el1 by an IRQ exception.
>>> -     */
>>> -    msr daifset, #2
>>> -
>>>      /* Restore elr and spsr */
>>>      ldp x0, x1, [sp, #16]
>>>      msr elr_el1, x0
>>>
>>> I'm going to switch to attempting a self-hosted buildworld
>>> buildkernel again.
>>
>> Can you try the patch in https://reviews.freebsd.org/D9593. It moves
>> loading of sp_el0 until after interrupts have been disabled.
>>
>> Andrew
>
> Sure. I'll stop the self-hosted buildworld buildkernel and
> switch over to your source.
>
> One minor point:
>
> /* Skip x30 as it was restored above as lr */
>
> now should say something like:
>
> /* Skip x30 as it is restored below as lr */

As reported on https://reviews.freebsd.org/D9593 the buildworld
buildkernel test stopped in buildworld with two sh processes
failing. But the core files do not suggest a stack corruption
to me, nor was fork active. My test code recorded its before-fork
and after-fork stack address examples and they were equal, as
they should be.

It appeared that simply restarting the buildworld buildkernel
would let it continue, so I restarted it. It has in fact continued
and is still building.

I see no reason to count the stoppage against the change, and
I'll say so in new comments in https://reviews.freebsd.org/D9593
once the build completes or fails and I report on that.

Failure details (both cores are basically the same for these
details):

(lldb) up
frame #9: 0x000000004054c82c libc.so.7`ifree(tsd=<unavailable>, ptr=<unavailable>, tcache=<unavailable>, slow_path=<unavailable>) + 304 at jemalloc_jemalloc.c:1889
   1886             usize = isalloc(tsd_tsdn(tsd), ptr, config_prof);
   1887             prof_free(tsd, ptr, usize);
   1888         } else if (config_stats || config_valgrind)
-> 1889             usize = isalloc(tsd_tsdn(tsd), ptr, config_prof);
   1890         if (config_stats)
   1891             *tsd_thread_deallocatedp_get(tsd) += usize;
   1892
(lldb) print config_stats
(const bool) $0 = true
(lldb) print config_valgrind
(const bool) $1 = false

So the new failure was actually during config_stats activity,
which is apparently enabled by default for how I built -r312982.
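
To make the branch selection concrete, here is a minimal C sketch
of the control flow around line 1889. It is not the jemalloc
source: ifree_path, free_path_t and the prof_active parameter are
hypothetical names used only for illustration, and prof_active
stands in for the guard on the first branch, which the listing
does not show.

```c
#include <stdbool.h>

/* Hypothetical labels for the three ways the size lookup can go. */
typedef enum {
	PATH_PROF,	/* first branch: isalloc() followed by prof_free() */
	PATH_STATS,	/* "else if" branch: isalloc() only (line 1889) */
	PATH_NONE	/* neither branch: isalloc() is never called */
} free_path_t;

/*
 * Model of the if / else-if in the listing. prof_active is an
 * assumption standing in for the guard that the listing omits.
 */
free_path_t
ifree_path(bool prof_active, bool config_stats, bool config_valgrind)
{
	if (prof_active)
		return (PATH_PROF);
	else if (config_stats || config_valgrind)
		return (PATH_STATS);
	return (PATH_NONE);
}
```

With the values lldb printed, and assuming the first branch was
not taken (the frame is stopped at line 1889, inside the else-if),
ifree_path(false, true, false) gives PATH_STATS. In this model
config_stats being true is the only reason isalloc() is called.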

The actual abort initiation was from:

(lldb) up
frame #3: 0x00000000405340fc libc.so.7`huge_node_get [inlined] __je_rtree_get(dependent=true) + 308 at rtree.h:328
   325          RTREE_GET_LEAF(RTREE_HEIGHT_MAX-1)
   326  #undef RTREE_GET_SUBTREE
   327  #undef RTREE_GET_LEAF
-> 328          default: not_reached();
   329          }
   330  #undef RTREE_GET_BIAS
   331          not_reached();

The back traces look similar to this one of the pair:

(lldb) bt
* thread #1: tid = 100137, 0x0000000040554e54 libc.so.7`_thr_kill + 8, name = 'sh', stop reason = signal SIGABRT
  * frame #0: 0x0000000040554e54 libc.so.7`_thr_kill + 8
    frame #1: 0x0000000040554e18 libc.so.7`__raise(s=6) + 64 at raise.c:52
    frame #2: 0x0000000040554d8c libc.so.7`abort + 84 at abort.c:65
    frame #3: 0x00000000405340fc libc.so.7`huge_node_get [inlined] __je_rtree_get(dependent=true) + 308 at rtree.h:328
    frame #4: 0x00000000405340dc libc.so.7`huge_node_get [inlined] __je_chunk_lookup(dependent=true) at chunk.h:89
    frame #5: 0x00000000405340dc libc.so.7`huge_node_get(ptr=<unavailable>) + 276 at jemalloc_huge.c:11
    frame #6: 0x0000000040534114 libc.so.7`__je_huge_salloc(tsdn=<unavailable>, ptr=<unavailable>) + 24 at jemalloc_huge.c:434
    frame #7: 0x000000004054c84c libc.so.7`ifree [inlined] __je_arena_salloc(demote=false) + 32 at arena.h:1426
    frame #8: 0x000000004054c82c libc.so.7`ifree [inlined] __je_isalloc(demote=false) at jemalloc_internal.h:1045
    frame #9: 0x000000004054c82c libc.so.7`ifree(tsd=<unavailable>, ptr=<unavailable>, tcache=<unavailable>, slow_path=<unavailable>) + 304 at jemalloc_jemalloc.c:1889
    frame #10: 0x000000004054cd94 libc.so.7`__free(ptr=0x0000000040a17520) + 148 at jemalloc_jemalloc.c:2016
    frame #11: 0x0000000000411328 sh`ckfree(p=<unavailable>) + 32 at memalloc.c:88
    frame #12: 0x0000000000407cd8 sh`clearcmdentry + 76 at exec.c:505
    frame #13: 0x0000000000406bfc sh`evalcommand(cmd=<unavailable>, flags=<unavailable>, backcmd=<unavailable>) + 3476 at eval.c:1182
    frame #14: 0x0000000000405570 sh`evaltree(n=0x0000000040a1c270, flags=<unavailable>) + 212 at eval.c:290
    frame #15: 0x000000000041105c sh`cmdloop(top=<unavailable>) + 252 at main.c:231
    frame #16: 0x0000000000410ed0 sh`main(argc=<unavailable>, argv=<unavailable>) + 660 at main.c:178
    frame #17: 0x0000000000402f30 sh`__start + 360
    frame #18: 0x0000000040434658 ld-elf.so.1`.rtld_start + 24 at rtld_start.S:41

===
Mark Millard
markmi at dsl-only.net