Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 5 Feb 2016 02:38:59 -0800
From:      Mark Millard <markmi@dsl-only.net>
To:        FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: projects/clang380-import -r294962+ based powerpc (32-bit) buildworld -j 6: make gets SEGV, a partial smoking gun?
Message-ID:  <CDCA5549-E22D-4101-B203-F076A328E2DF@dsl-only.net>
In-Reply-To: <BCF659EE-11AD-4E59-8426-8327BA795C90@dsl-only.net>
References:  <BCF659EE-11AD-4E59-8426-8327BA795C90@dsl-only.net>

next in thread | previous in thread | raw e-mail | index | archive | help
On 2016-Feb-4, at 11:46 PM, Mark Millard <markmi@dsl-only.net> wrote:
>=20
> The problem:
>=20
> For a clang 3.8.0 based buildworld TARGET_ARCH=3Dpowerpc installation =
attempting "make -j 6 buildworld"  (run on 4 powerpc cores) eventually =
gets a segmentation fault in a  make instance. (More details later.) =
"make buildworld" does not fault.
>=20
> I expect that the details that I describe below implies some form of =
intermittency, such as a race condition.
>=20
> (This is with the content of sys/powerpc/powerpc/sigcode32.S -r295186 =
in place so that signal delivery maintains the modulo 16 byte =
stack/frame alignment status instead of changing the alignment.)
>=20
> (clang 3.8.0 targeting powerpc (32-bit) is known to be able to =
introduce more stack alignment dependencies by sometimes "or-ing" in =
offset-bits into some aligned-address lower bits instead of using =
addition. But I do not know if that is involved here somehow.)
>=20
>=20
>=20
> What is always involved and what varies:
>=20
> In all cases the failure was r31 being used as a frame-pointer with =
the value zero in r31 at the time of the address calculation, even when =
when the dereference of the address was later. r1 still seemed to be a =
valid stack pointer in all cases.
>=20
> In every case the faulting routine had called one or more routines =
during its operation --and those had returned. There was an example or =
two of a self-contained routine that was recursive that got the failure.
>=20
> In some cases prior calls in the faulting routine had non-zero r31 =
values when they returned. (There was later r31 usage that did not =
fault.)
>=20
> Overall the call chains varied widely for various example faults, =
although some call context is more common as a failure point.
>=20
> Use of ktrace with "-di -t cs" and use of kdump for extracting for the =
failing process shows the same 5 line sequence before every example =
"PSIG SIGSEGV". What was before those 5 lines varied across the various =
kdsump outputs.
>=20
> I used ktrace/kdump commands of the structure:
>=20
> ktrace -di -f /usr/obj/make.out -t cs -p ???
> kdump -E -f /usr/obj/make.out -p ??? > =
/var/tmp/make_ktrace_sigsegv_??.txt
>=20
> Example results (showing the 5 lines and PSIG SIGSEGV):
>=20
> (3 prior "sigreturn JUSTRETURN" among what is not shown)
>> 65158 make     0.205791 PSIG  SIGCHLD caught handler=3D0x180aae0 =
mask=3D0x0 code=3DCLD_EXITED
>> 65158 make     0.205822 CALL  write(0x3,0x189e914,0x1)
>> 65158 make     0.205847 RET   write 1
>> 65158 make     0.205869 CALL  sigreturn(0xffffbb50)
>> 65158 make     0.205923 RET   sigreturn JUSTRETURN
>> 65158 make     0.205962 PSIG  SIGSEGV SIG_DFL code=3DSEGV_MAPERR
>=20
> (365 prior "sigreturn JUSTRETURN" among what is not shown)
>>   599 make     5.552305 PSIG  SIGCHLD caught handler=3D0x180aae0 =
mask=3D0x0 code=3DCLD_EXITED
>>   599 make     5.552323 CALL  write(0x3,0x189e914,0x1)
>>   599 make     5.552337 RET   write 1
>>   599 make     5.552347 CALL  sigreturn(0xffffbb30)
>>   599 make     5.552358 RET   sigreturn JUSTRETURN
>>   599 make     5.552381 PSIG  SIGSEGV SIG_DFL code=3DSEGV_MAPERR
>=20
> (287 prior "sigreturn JUSTRETURN" among what is not shown)
>> 75728 make     4.141097 PSIG  SIGCHLD caught handler=3D0x180aae0 =
mask=3D0x0 code=3DCLD_EXITED
>> 75728 make     4.141116 CALL  write(0x3,0x189e914,0x1)
>> 75728 make     4.141154 RET   write 1
>> 75728 make     4.141349 CALL  sigreturn(0xffffbaa0)
>> 75728 make     4.141366 RET   sigreturn JUSTRETURN
>> 75728 make     4.141404 PSIG  SIGSEGV SIG_DFL code=3DSEGV_MAPERR
>=20
> (273 prior "sigreturn JUSTRETURN" among what is not shown)
>> 12195 make     27.213277 PSIG  SIGCHLD caught handler=3D0x180aae0 =
mask=3D0x0 code=3DCLD_EXITED
>> 12195 make     27.213322 CALL  write(0x3,0x189e914,0x1)
>> 12195 make     27.213346 RET   write 1
>> 12195 make     27.213361 CALL  sigreturn(0xffffb1e0)
>> 12195 make     27.213383 RET   sigreturn JUSTRETURN
>> 12195 make     27.213418 PSIG  SIGSEGV SIG_DFL code=3DSEGV_MAPERR
>=20
> (789 prior "sigreturn JUSTRETURN" among what is not shown)
>> 50545 make     80.255162 PSIG  SIGCHLD caught handler=3D0x180aae0 =
mask=3D0x0 code=3DCLD_EXITED
>> 50545 make     80.255192 CALL  write(0x3,0x189e914,0x1)
>> 50545 make     80.255219 RET   write 1
>> 50545 make     80.255241 CALL  sigreturn(0xffffafa0)
>> 50545 make     80.255265 RET   sigreturn JUSTRETURN
>> 50545 make     80.255317 PSIG  SIGSEGV SIG_DFL code=3DSEGV_MAPERR
>=20
> The 5 line sequence is not sufficient for the problem to occur but =
appears to be necessary: There were sometimes hundreds of prior "PSIG =
SIGCHLD". . ."RET sigreturn JUSTRETURNS" sequences for which they were =
not followed by "PSIG SIGSEGV". But every failure tested with ktrace has =
the 5 lines as an immediate prefix in the list for the process.
>=20
> Which instance of make varied and where in make the fault happens =
varied. The "-E" elapsed times above and those JUSTRETURN counts give a =
solid clue to there being variability in when the fault happens.
>=20
> I'll use some script log file sizes for the buidlworld as another =
indication of variability. I've sorted them:
>=20
> 2942664
> 3304207
> 3342660
> 3474585
> 3941983
>=20
> so spanning from 2.9 MBytes to 3.9 MBytes. I've since gotten a few =
with less and some with more.
>=20
>=20
> Note: A couple of times with ktrace being involved it failed at an =
earlier stage than I've seen otherwise. It may be that ktrace being =
involved makes the problem more likely/frequent.
>=20
>=20
>=20
> Context basics (quad core PowerMac running TARGET_ARCH=3Dpowerpc =
(32-bit)):
>=20
> # freebsd-version -ku; uname -aKU
> 11.0-CURRENT
> 11.0-CURRENT
> FreeBSD FBSDG4C1 11.0-CURRENT FreeBSD 11.0-CURRENT #2 r294962M: Mon =
Feb  1 00:31:03 PST 2016     =
markmi@FreeBSDx64:/usr/obj/clang_gcc421/powerpc.powerpc/usr/src/sys/GENERI=
Cvtsc-NODEBUG  powerpc 1100097 1100097
>=20
> This is with the content of sys/powerpc/powerpc/sigcode32.S -r295186 =
in place so that signal delivery maintains the modulo 16 byte =
stack/frame alignment status instead of changing the alignment.
>=20
> buildkernel was via gcc 4.2.1
> buildworld was via clang 3.8.0
>=20
>=20
>=20
> I'm not sure that I'm going to get much farther in tracking down the =
source of the race(?) that leads to the SEGV's.

I should have given an example of clang 3.8.0 produced post-amble code =
that ends up picking up the 0x0 and putting it is r31 (the Frame Pointer =
for clang's produced code).

0x1801b8c <Buf_AddBytes+104>:	lwz     r30,24(r31)
0x1801b90 <Buf_AddBytes+108>:	lwz     r29,20(r31)
0x1801b94 <Buf_AddBytes+112>:	lwz     r28,16(r31)
0x1801b98 <Buf_AddBytes+116>:	lwz     r27,12(r31)
0x1801b9c <Buf_AddBytes+120>:	lwz     r26,8(r31)
0x1801ba0 <Buf_AddBytes+124>:	addi    r1,r1,32   # Stack pointer =
adjusted first
0x1801ba4 <Buf_AddBytes+128>:	lwz     r0,4(r1)
0x1801ba8 <Buf_AddBytes+132>:	lwz     r31,-4(r1) # Then Frame Pointer =
load happens
                                                   # "outside" the new =
stack range.
0x1801bac <Buf_AddBytes+136>:	mtlr    r0
0x1801bb0 <Buf_AddBytes+140>:	blr

Here is the related bl and following code:

   0x182516c <VarSYSVMatch+248>:	bl      0x1801b24 <Buf_AddBytes>
   0x1825170 <VarSYSVMatch+252>:	lis     r3,396
=3D> 0x1825174 <VarSYSVMatch+256>:	lwz     r4,16(r31)
   0x1825178 <VarSYSVMatch+260>:	lwz     r3,-13260(r3)
   0x182517c <VarSYSVMatch+264>:	subf.   r3,r4,r3
   0x1825180 <VarSYSVMatch+268>:	bne     0x18251c4 =
<VarSYSVMatch+336>

The "=3D>" points to a "sometimes SEGV" place from r31=3D0x0 being =
involved in a clang 3.8.0 buildworld's based "make" after it resumes =
normal operation after handling a SIGCHLD delivery.

So the "lwz r31,-4(r1)" earlier above resulted in a 0x0 in r31 just =
before the SEGV.

By contrast the gcc 4.2.1 generated code does not use r31 as a frame =
pointer and has:

0x1801db8 <Buf_AddBytes+96>:	lwz     r0,52(r1)
0x1801dbc <Buf_AddBytes+100>:	lwz     r27,28(r1)
0x1801dc0 <Buf_AddBytes+104>:	lwz     r28,32(r1)
0x1801dc4 <Buf_AddBytes+108>:	mtlr    r0
0x1801dc8 <Buf_AddBytes+112>:	lwz     r29,36(r1)
0x1801dcc <Buf_AddBytes+116>:	lwz     r30,40(r1)
0x1801dd0 <Buf_AddBytes+120>:	lwz     r31,44(r1)
0x1801dd4 <Buf_AddBytes+124>:	addi    r1,r1,48   # Stack pointer is =
adjusted last.
                                                   # The prior code =
worked in the range.
0x1801dd8 <Buf_AddBytes+128>:	blr

and:

0x1823cac <VarSYSVMatch+340>:	bl      0x1801d58 <Buf_AddBytes>
0x1823cb0 <VarSYSVMatch+344>:	b       0x1823c0c <VarSYSVMatch+180>
. . . backing up to the jump point . . .
0x1823c0c <VarSYSVMatch+180>:	lwz     r0,28(r1) # Stack pointer used, =
not a Frame Pointer
0x1823c10 <VarSYSVMatch+184>:	lwz     r9,0(r25)
0x1823c14 <VarSYSVMatch+188>:	xor.    r0,r0,r9
0x1823c18 <VarSYSVMatch+192>:	li      r9,0
0x1823c1c <VarSYSVMatch+196>:	li      r3,1
0x1823c20 <VarSYSVMatch+200>:	bne-    0x1823cb4 <VarSYSVMatch+348>

As r31 is not used as a frame pointer the 4.2.1 code can not get the =
SEGV that the clang 3.8.0 code sometimes gets.

=3D=3D=3D
Mark Millard
markmi at dsl-only.net





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CDCA5549-E22D-4101-B203-F076A328E2DF>