Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 21 Jun 2014 00:08:28 -0700
From:      John-Mark Gurney <jmg@funkthat.com>
To:        Andrew Turner <andrew@fubar.geek.nz>, arm@FreeBSD.org
Subject:   Re: AVILA getting close!
Message-ID:  <20140621070827.GJ31367@funkthat.com>
In-Reply-To: <20140621010804.GD31367@funkthat.com>
References:  <20140618225808.GG31367@funkthat.com> <20140620151023.GZ31367@funkthat.com> <20140620200827.1c33c7da@bender.Home> <20140621010804.GD31367@funkthat.com>

next in thread | previous in thread | raw e-mail | index | archive | help
John-Mark Gurney wrote this message on Fri, Jun 20, 2014 at 18:08 -0700:
> Andrew Turner wrote this message on Fri, Jun 20, 2014 at 20:08 +0100:
> > On Fri, 20 Jun 2014 08:10:24 -0700
> > John-Mark Gurney <jmg@funkthat.com> wrote:
> > 
> > > John-Mark Gurney wrote this message on Wed, Jun 18, 2014 at 15:58
> > > -0700:
> > > > So, w/ the recent couple of patches that alc has provided, I no
> > > > longer receive kernel panics on my AVILA board!
> > > > 
> > > > $ uname -a
> > > > FreeBSD avila.funkthat.com 11.0-CURRENT FreeBSD 11.0-CURRENT #27
> > > > r267333:267349M: Wed Jun 11 09:57:58 PDT 2014
> > > > jmg@carbon.funkthat.com:/usr/obj/arm.armeb/usr/src.avila/sys/AVILA
> > > > arm $ uptime 12:15AM  up 1 day, 15 mins, 2 users, load averages:
> > > > 0.13, 0.11, 0.08
> > > > 
> > > > This survived a portsnap extract...  This is all over NFS...
> > > > 
> > > > Though the issue that I'm now having is that some binaries
> > > > (newsyslog) and sometimes other binaries (awk, grep) core dump...
> > > > 
> > > > I believe this is an issue w/ rtld, or related...  If I compile
> > > > newsyslog -static, it works fine...  Otherwise I get a SIGILL, and
> > > > that is because it jumps off into the weeds..  Though gdb on arm
> > > > isn't very useful..
> > > 
> > > ok, so the SIGILL only occures under gdb, and this is because single
> > > stepping into a RAS sequence doesn't work very well...  If you set a
> > > break point on the return (after the RAS sequence), you can get past
> > > this...
> > > 
> > > I got to the point in rtld.c code:
> > >     if (obj->pltrel)
> > >         rel = (const Elf_Rel *) ((caddr_t) obj->pltrel + reloff);
> > >     else
> > >         rel = (const Elf_Rel *) ((caddr_t) obj->pltrela + reloffand
> > > was seeing gdb try to execute the pltrela line, but: i;
> > > 
> > > and was seeing gdb try to execute the pltrela line, but:
> > > (gdb) print * (const Elf_Rel *) ((caddr_t) obj->pltrela + reloff)
> > > Error accessing memory address 0x118: Bad address.
> > > (gdb) print/x obj->pltrela
> > > $4 = 0x0
> > > (gdb) print /x reloff
> > > $5 = 0x118
> > > (gdb) print obj->pltrel
> > > $6 = (const Elf_Rel *) 0x94e8
> > Based on my copy of newsyslog I built for armeb this looks correct. To
> > verify it could you dump the .dynamic section from the binary?
> > Something like 'objdump -s newsyslog' will get it.
> 
> ok, available at:
> https://www.funkthat.com/~jmg/20140619/objdump.newsyslog
> 
> > > Hun? obj->pltrel is non-zero, so it should have executed the other
> > > line...
> > > 
> > > I recompiled rtld w/ -O0, and sure enough, newsyslog runs fine... If
> > > I compile w/o -O, or w/ -O1, it fails...
> > > 
> > > Comments or suggestions?
> > 
> > What is the value of rel after the if statement? In the -O/-O1 case the
> > asm looks like:
> > 
> > ldr     r2, [sp, #20]  ; Load obj to r2
> > ldr     r3, [r2, #124] ; Load obj->pltrel to r3
> > cmp     r3, #0  ; 0x0  ; if obj->pltrel:
> > ldrne   r2, [sp, #16]  ;  != NULL: Load reloff to r2
> > addne   r4, r3, r2     ;  != NULL: Add obj->pltrel + reloff to r4
> > ldreq   r2, [sp, #20]  ;  == NULL: Load obj to r2
> > ldreq   r3, [r2, #132] ;  == NULL: Load obj->pltrela to r3
> > ldreq   r2, [sp, #16]  ;  == NULL: Load reloff to r2
> > addeq   r4, r2, r3     ;  == NULL: Add obj->pltrela + reloff to r4
> > 
> > Given this I could see how gdb gets confused.
> > 
> > It may also pay to get the registers from gdb at this point.
> 
> Arg! This is frustrating, I'm getting such different behaviors from
> time to time..  now it isn't having that fault..  but it's getting
> farther, but...  I this is because our in tree gdb is messed up..
> 
> But, I am getting farther...  now the last break at rtld.c:3651
> looks like it's returning a bogus pointer:
> (gdb) print *req
> $12 = {name = 0x9190 "__aeabi_read_tp", hash = 0xf008a80, 
>   hash_gnu = 0x52432dd3, ventry = 0x2003b1d0, flags = 0x1, 
>   defobj_out = 0x2003c400, sym_out = 0x20062454, lockstate = 0xbfffeda0}
> 
> defobj_out looks bogus to me...  We don't have any object mapped
> there:
> (gdb) info shared  
> >From        To          Syms Read   Shared Object Library
> 0x200427c8  0x20048814  Yes         /lib/libgcc_s.so.1
> 0x2007a4e8  0x2017f320  Yes         /lib/libc.so.7
> 0x20018f14  0x2002c99c  Yes         /libexec/ld-elf.so.1
> 
> the data at 0x2003c400 doesn't look like code:
> (gdb) x/32x 0x2003c400
> 0x2003c400:     0xd550b87a      0x00000001      0x00000000      0x2003a080
> 0x2003c410:     0x00000000      0x00000001      0x00000000      0x20051000
> 0x2003c420:     0x0016a000      0x00143000      0x00000000      0x20051000
> 0x2003c430:     0x2019dfd0      0x2007a4e8      0x20051034      0x000000a0
> 0x2003c440:     0x00000000      0x00000007      0x00000002      0x2019bcf0
> 0x2003c450:     0x00000004      0x00000058      0x00000000      0x00000008
> 0x2003c460:     0x20051000      0x00000000      0x2019e0b8      0x200713e0
> 0x2003c470:     0x000040c0      0x00000000      0x00000000      0x200754a0
> 
> and then as I stepi out of symlook_global:
> (gdb) x/6i $pc
> 0x2001f0b4 <symlook_global+348>:        cmp     r0, #0  ; 0x0
> 0x2001f0b8 <symlook_global+352>:        moveq   r0, #3  ; 0x3
> 0x2001f0bc <symlook_global+356>:        movne   r0, #0  ; 0x0
> 0x2001f0c0 <symlook_global+360>:        add     sp, sp, #36     ; 0x24
> 0x2001f0c4 <symlook_global+364>:        pop     {r4, r5, r6, r7, lr}
> 0x2001f0c8 <symlook_global+368>:        bx      lr
> (gdb) info registers
> r0             0x20062454       0x20062454
> r1             0x933b   0x933b
> r2             0x0      0x0
> r3             0xa4     0xa4
> r4             0x0      0x0
> r5             0xbfffed3c       0xbfffed3c
> r6             0xbfffed08       0xbfffed08
> r7             0x20037af4       0x20037af4
> r8             0x0      0x0
> r9             0x1      0x1
> r10            0x8a2c   0x8a2c
> r11            0xbfffed30       0xbfffed30
> r12            0x23de   0x23de
> sp             0xbfffec94       0xbfffec94
> lr             0x2001efb0       0x2001efb0
> pc             0x2001f0b4       0x2001f0b4
> fps            0x0      0x0
> cpsr           0x60000010       0x60000010
> 
> Then stepi till 0x2001f0c8:
> (gdb) info registers
> r0             0x0      0x0
> r1             0x933b   0x933b
> r2             0x0      0x0
> r3             0xa4     0xa4
> r4             0x2003c000       0x2003c000
> r5             0xbfffed3c       0xbfffed3c
> r6             0x20037af4       0x20037af4
> r7             0xbfffece8       0xbfffece8
> r8             0x0      0x0
> r9             0x1      0x1
> r10            0x8a2c   0x8a2c
> r11            0xbfffed30       0xbfffed30
> r12            0x23de   0x23de
> sp             0xbfffeccc       0xbfffeccc
> lr             0x2003c000       0x2003c000
> pc             0x2001f0c8       0x2001f0c8
> fps            0x0      0x0
> cpsr           0x20000010       0x20000010
> 
> and now the lr is bogus...  it transfers control to 0x2003c000 which
> is before the fault at 0x2003c0f4...  And again, this looks like data,
> not code:
> (gdb) x/64x 0x2003c000
> 0x2003c000:     0xd550b87a      0x00000001      0x2003c200      0xbfffffb8
> 0x2003c010:     0x00000000      0x00000001      0x00000000      0x00008000
> 0x2003c020:     0x00012000      0x00009000      0x00008000      0x00000000
> 0x2003c030:     0x00018724      0x00009ca8      0x00008034      0x000000e0
> 0x2003c040:     0x00008114      0x00000007      0x00000000      0x00000000
> 0x2003c050:     0x00000000      0x00000000      0x00000000      0x00000000
> 0x2003c060:     0x00000000      0x00000000      0x0001881c      0x000094a0
> 0x2003c070:     0x00000048      0x00000000      0x00000000      0x000094e8
> 0x2003c080:     0x00000308      0x00000000      0x00000000      0x0000888c
> 0x2003c090:     0x00008f9c      0x000003b1      0x00009430      0x00000002
> 0x2003c0a0:     0x00000000      0x00000000      0x0000934e      0x00008180
> 0x2003c0b0:     0x00000061      0x00008304      0x00000071      0x00000061
> 0x2003c0c0:     0x00000005      0x0000001f      0x0000000a      0x00000071
> 0x2003c0d0:     0x000084d8      0x00008558      0x000086c8      0x00000000
> 0x2003c0e0:     0x00000000      0x2003d000      0x00000000      0x00000000
> 0x2003c0f0:     0x00000000      0x2003c0f0      0x2003b180      0x00000007
> 
> If I continue to stepi from here, it will fault at f4...
> 
> This looks like a stack smash issue as the lr we pop off the stack
> is incorrect..

So, forgot I could set a watchpoint on the stack address, so I did:
(gdb) c
Continuing.
Watchpoint 9: *0xbfffecc8

Old value = 0x2001f584
New value = 0x2003c000
0x20019960 in donelist_check (dlp=0xbfffed08, obj=0x2003c000)
    at /usr/src.avila/libexec/rtld-elf/rtld.c:1380
1380            dlp->objs[dlp->num_used++] = obj;

I tracked dlp, and it's the donelist pointer, and it is allocated
from symlook_default via donelist_init at the top of the function:
#define donelist_init(dlp)                                      \
    ((dlp)->objs = alloca(obj_count * sizeof (dlp)->objs[0]),   \
    assert((dlp)->objs != NULL),                                \
    (dlp)->num_alloc = obj_count,                               \
    (dlp)->num_used = 0)

Hmm...  since obj_count is a global, and we are in a possibly
threaded environment, we should probably move assigning num_alloc
before the alloca, and then use num_alloc instead of obj_count just
to be safe...  but taht won't be the issue here, as num_used is 0
in my case...

I've looked at the assembly for donelist_init for both the working
and non-working case, and besides an extra store in the good version,
which I think is just because of the unoptimized code, things look
the same to me...

I guess the next step is to make sure that the Donelist pointer is
the same...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20140621070827.GJ31367>