Date: Fri, 5 Feb 2016 02:38:59 -0800 From: Mark Millard <markmi@dsl-only.net> To: FreeBSD PowerPC ML <freebsd-ppc@freebsd.org> Subject: Re: projects/clang380-import -r294962+ based powerpc (32-bit) buildworld -j 6: make gets SEGV, a partial smoking gun? Message-ID: <CDCA5549-E22D-4101-B203-F076A328E2DF@dsl-only.net> In-Reply-To: <BCF659EE-11AD-4E59-8426-8327BA795C90@dsl-only.net> References: <BCF659EE-11AD-4E59-8426-8327BA795C90@dsl-only.net>
next in thread | previous in thread | raw e-mail | index | archive | help
On 2016-Feb-4, at 11:46 PM, Mark Millard <markmi@dsl-only.net> wrote: >=20 > The problem: >=20 > For a clang 3.8.0 based buildworld TARGET_ARCH=3Dpowerpc installation = attempting "make -j 6 buildworld" (run on 4 powerpc cores) eventually = gets a segmentation fault in a make instance. (More details later.) = "make buildworld" does not fault. >=20 > I expect that the details that I describe below implies some form of = intermittency, such as a race condition. >=20 > (This is with the content of sys/powerpc/powerpc/sigcode32.S -r295186 = in place so that signal delivery maintains the modulo 16 byte = stack/frame alignment status instead of changing the alignment.) >=20 > (clang 3.8.0 targeting powerpc (32-bit) is known to be able to = introduce more stack alignment dependencies by sometimes "or-ing" in = offset-bits into some aligned-address lower bits instead of using = addition. But I do not know if that is involved here somehow.) >=20 >=20 >=20 > What is always involved and what varies: >=20 > In all cases the failure was r31 being used as a frame-pointer with = the value zero in r31 at the time of the address calculation, even when = when the dereference of the address was later. r1 still seemed to be a = valid stack pointer in all cases. >=20 > In every case the faulting routine had called one or more routines = during its operation --and those had returned. There was an example or = two of a self-contained routine that was recursive that got the failure. >=20 > In some cases prior calls in the faulting routine had non-zero r31 = values when they returned. (There was later r31 usage that did not = fault.) >=20 > Overall the call chains varied widely for various example faults, = although some call context is more common as a failure point. >=20 > Use of ktrace with "-di -t cs" and use of kdump for extracting for the = failing process shows the same 5 line sequence before every example = "PSIG SIGSEGV". What was before those 5 lines varied across the various = kdsump outputs. >=20 > I used ktrace/kdump commands of the structure: >=20 > ktrace -di -f /usr/obj/make.out -t cs -p ??? > kdump -E -f /usr/obj/make.out -p ??? > = /var/tmp/make_ktrace_sigsegv_??.txt >=20 > Example results (showing the 5 lines and PSIG SIGSEGV): >=20 > (3 prior "sigreturn JUSTRETURN" among what is not shown) >> 65158 make 0.205791 PSIG SIGCHLD caught handler=3D0x180aae0 = mask=3D0x0 code=3DCLD_EXITED >> 65158 make 0.205822 CALL write(0x3,0x189e914,0x1) >> 65158 make 0.205847 RET write 1 >> 65158 make 0.205869 CALL sigreturn(0xffffbb50) >> 65158 make 0.205923 RET sigreturn JUSTRETURN >> 65158 make 0.205962 PSIG SIGSEGV SIG_DFL code=3DSEGV_MAPERR >=20 > (365 prior "sigreturn JUSTRETURN" among what is not shown) >> 599 make 5.552305 PSIG SIGCHLD caught handler=3D0x180aae0 = mask=3D0x0 code=3DCLD_EXITED >> 599 make 5.552323 CALL write(0x3,0x189e914,0x1) >> 599 make 5.552337 RET write 1 >> 599 make 5.552347 CALL sigreturn(0xffffbb30) >> 599 make 5.552358 RET sigreturn JUSTRETURN >> 599 make 5.552381 PSIG SIGSEGV SIG_DFL code=3DSEGV_MAPERR >=20 > (287 prior "sigreturn JUSTRETURN" among what is not shown) >> 75728 make 4.141097 PSIG SIGCHLD caught handler=3D0x180aae0 = mask=3D0x0 code=3DCLD_EXITED >> 75728 make 4.141116 CALL write(0x3,0x189e914,0x1) >> 75728 make 4.141154 RET write 1 >> 75728 make 4.141349 CALL sigreturn(0xffffbaa0) >> 75728 make 4.141366 RET sigreturn JUSTRETURN >> 75728 make 4.141404 PSIG SIGSEGV SIG_DFL code=3DSEGV_MAPERR >=20 > (273 prior "sigreturn JUSTRETURN" among what is not shown) >> 12195 make 27.213277 PSIG SIGCHLD caught handler=3D0x180aae0 = mask=3D0x0 code=3DCLD_EXITED >> 12195 make 27.213322 CALL write(0x3,0x189e914,0x1) >> 12195 make 27.213346 RET write 1 >> 12195 make 27.213361 CALL sigreturn(0xffffb1e0) >> 12195 make 27.213383 RET sigreturn JUSTRETURN >> 12195 make 27.213418 PSIG SIGSEGV SIG_DFL code=3DSEGV_MAPERR >=20 > (789 prior "sigreturn JUSTRETURN" among what is not shown) >> 50545 make 80.255162 PSIG SIGCHLD caught handler=3D0x180aae0 = mask=3D0x0 code=3DCLD_EXITED >> 50545 make 80.255192 CALL write(0x3,0x189e914,0x1) >> 50545 make 80.255219 RET write 1 >> 50545 make 80.255241 CALL sigreturn(0xffffafa0) >> 50545 make 80.255265 RET sigreturn JUSTRETURN >> 50545 make 80.255317 PSIG SIGSEGV SIG_DFL code=3DSEGV_MAPERR >=20 > The 5 line sequence is not sufficient for the problem to occur but = appears to be necessary: There were sometimes hundreds of prior "PSIG = SIGCHLD". . ."RET sigreturn JUSTRETURNS" sequences for which they were = not followed by "PSIG SIGSEGV". But every failure tested with ktrace has = the 5 lines as an immediate prefix in the list for the process. >=20 > Which instance of make varied and where in make the fault happens = varied. The "-E" elapsed times above and those JUSTRETURN counts give a = solid clue to there being variability in when the fault happens. >=20 > I'll use some script log file sizes for the buidlworld as another = indication of variability. I've sorted them: >=20 > 2942664 > 3304207 > 3342660 > 3474585 > 3941983 >=20 > so spanning from 2.9 MBytes to 3.9 MBytes. I've since gotten a few = with less and some with more. >=20 >=20 > Note: A couple of times with ktrace being involved it failed at an = earlier stage than I've seen otherwise. It may be that ktrace being = involved makes the problem more likely/frequent. >=20 >=20 >=20 > Context basics (quad core PowerMac running TARGET_ARCH=3Dpowerpc = (32-bit)): >=20 > # freebsd-version -ku; uname -aKU > 11.0-CURRENT > 11.0-CURRENT > FreeBSD FBSDG4C1 11.0-CURRENT FreeBSD 11.0-CURRENT #2 r294962M: Mon = Feb 1 00:31:03 PST 2016 = markmi@FreeBSDx64:/usr/obj/clang_gcc421/powerpc.powerpc/usr/src/sys/GENERI= Cvtsc-NODEBUG powerpc 1100097 1100097 >=20 > This is with the content of sys/powerpc/powerpc/sigcode32.S -r295186 = in place so that signal delivery maintains the modulo 16 byte = stack/frame alignment status instead of changing the alignment. >=20 > buildkernel was via gcc 4.2.1 > buildworld was via clang 3.8.0 >=20 >=20 >=20 > I'm not sure that I'm going to get much farther in tracking down the = source of the race(?) that leads to the SEGV's. I should have given an example of clang 3.8.0 produced post-amble code = that ends up picking up the 0x0 and putting it is r31 (the Frame Pointer = for clang's produced code). 0x1801b8c <Buf_AddBytes+104>: lwz r30,24(r31) 0x1801b90 <Buf_AddBytes+108>: lwz r29,20(r31) 0x1801b94 <Buf_AddBytes+112>: lwz r28,16(r31) 0x1801b98 <Buf_AddBytes+116>: lwz r27,12(r31) 0x1801b9c <Buf_AddBytes+120>: lwz r26,8(r31) 0x1801ba0 <Buf_AddBytes+124>: addi r1,r1,32 # Stack pointer = adjusted first 0x1801ba4 <Buf_AddBytes+128>: lwz r0,4(r1) 0x1801ba8 <Buf_AddBytes+132>: lwz r31,-4(r1) # Then Frame Pointer = load happens # "outside" the new = stack range. 0x1801bac <Buf_AddBytes+136>: mtlr r0 0x1801bb0 <Buf_AddBytes+140>: blr Here is the related bl and following code: 0x182516c <VarSYSVMatch+248>: bl 0x1801b24 <Buf_AddBytes> 0x1825170 <VarSYSVMatch+252>: lis r3,396 =3D> 0x1825174 <VarSYSVMatch+256>: lwz r4,16(r31) 0x1825178 <VarSYSVMatch+260>: lwz r3,-13260(r3) 0x182517c <VarSYSVMatch+264>: subf. r3,r4,r3 0x1825180 <VarSYSVMatch+268>: bne 0x18251c4 = <VarSYSVMatch+336> The "=3D>" points to a "sometimes SEGV" place from r31=3D0x0 being = involved in a clang 3.8.0 buildworld's based "make" after it resumes = normal operation after handling a SIGCHLD delivery. So the "lwz r31,-4(r1)" earlier above resulted in a 0x0 in r31 just = before the SEGV. By contrast the gcc 4.2.1 generated code does not use r31 as a frame = pointer and has: 0x1801db8 <Buf_AddBytes+96>: lwz r0,52(r1) 0x1801dbc <Buf_AddBytes+100>: lwz r27,28(r1) 0x1801dc0 <Buf_AddBytes+104>: lwz r28,32(r1) 0x1801dc4 <Buf_AddBytes+108>: mtlr r0 0x1801dc8 <Buf_AddBytes+112>: lwz r29,36(r1) 0x1801dcc <Buf_AddBytes+116>: lwz r30,40(r1) 0x1801dd0 <Buf_AddBytes+120>: lwz r31,44(r1) 0x1801dd4 <Buf_AddBytes+124>: addi r1,r1,48 # Stack pointer is = adjusted last. # The prior code = worked in the range. 0x1801dd8 <Buf_AddBytes+128>: blr and: 0x1823cac <VarSYSVMatch+340>: bl 0x1801d58 <Buf_AddBytes> 0x1823cb0 <VarSYSVMatch+344>: b 0x1823c0c <VarSYSVMatch+180> . . . backing up to the jump point . . . 0x1823c0c <VarSYSVMatch+180>: lwz r0,28(r1) # Stack pointer used, = not a Frame Pointer 0x1823c10 <VarSYSVMatch+184>: lwz r9,0(r25) 0x1823c14 <VarSYSVMatch+188>: xor. r0,r0,r9 0x1823c18 <VarSYSVMatch+192>: li r9,0 0x1823c1c <VarSYSVMatch+196>: li r3,1 0x1823c20 <VarSYSVMatch+200>: bne- 0x1823cb4 <VarSYSVMatch+348> As r31 is not used as a frame pointer the 4.2.1 code can not get the = SEGV that the clang 3.8.0 code sometimes gets. =3D=3D=3D Mark Millard markmi at dsl-only.net
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CDCA5549-E22D-4101-B203-F076A328E2DF>