From owner-freebsd-arm@FreeBSD.ORG Sat Jun 21 16:01:37 2014 Return-Path: Delivered-To: arm@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id E221A494 for ; Sat, 21 Jun 2014 16:01:37 +0000 (UTC) Received: from nibbler.fubar.geek.nz (nibbler.fubar.geek.nz [199.48.134.198]) by mx1.freebsd.org (Postfix) with ESMTP id B1B7A2DE4 for ; Sat, 21 Jun 2014 16:01:37 +0000 (UTC) Received: from bender.Home (97e07ba1.skybroadband.com [151.224.123.161]) by nibbler.fubar.geek.nz (Postfix) with ESMTPSA id 250905DEF9; Sat, 21 Jun 2014 16:01:36 +0000 (UTC) Date: Sat, 21 Jun 2014 17:01:29 +0100 From: Andrew Turner To: John-Mark Gurney Subject: Re: AVILA getting close! Message-ID: <20140621170129.76e62c27@bender.Home> In-Reply-To: <20140621070827.GJ31367@funkthat.com> References: <20140618225808.GG31367@funkthat.com> <20140620151023.GZ31367@funkthat.com> <20140620200827.1c33c7da@bender.Home> <20140621010804.GD31367@funkthat.com> <20140621070827.GJ31367@funkthat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: arm@FreeBSD.org X-BeenThere: freebsd-arm@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: "Porting FreeBSD to ARM processors." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 21 Jun 2014 16:01:38 -0000 On Sat, 21 Jun 2014 00:08:28 -0700 John-Mark Gurney wrote: > John-Mark Gurney wrote this message on Fri, Jun 20, 2014 at 18:08 > -0700: > > Andrew Turner wrote this message on Fri, Jun 20, 2014 at 20:08 > > +0100: > > > On Fri, 20 Jun 2014 08:10:24 -0700 > > > John-Mark Gurney wrote: > > > > > > > John-Mark Gurney wrote this message on Wed, Jun 18, 2014 at > > > > 15:58 -0700: > > > > > So, w/ the recent couple of patches that alc has provided, I > > > > > no longer receive kernel panics on my AVILA board! > > > > > > > > > > $ uname -a > > > > > FreeBSD avila.funkthat.com 11.0-CURRENT FreeBSD 11.0-CURRENT > > > > > #27 r267333:267349M: Wed Jun 11 09:57:58 PDT 2014 > > > > > jmg@carbon.funkthat.com:/usr/obj/arm.armeb/usr/src.avila/sys/AVILA > > > > > arm $ uptime 12:15AM up 1 day, 15 mins, 2 users, load > > > > > averages: 0.13, 0.11, 0.08 > > > > > > > > > > This survived a portsnap extract... This is all over NFS... > > > > > > > > > > Though the issue that I'm now having is that some binaries > > > > > (newsyslog) and sometimes other binaries (awk, grep) core > > > > > dump... > > > > > > > > > > I believe this is an issue w/ rtld, or related... If I > > > > > compile newsyslog -static, it works fine... Otherwise I get > > > > > a SIGILL, and that is because it jumps off into the weeds.. > > > > > Though gdb on arm isn't very useful.. > > > > > > > > ok, so the SIGILL only occures under gdb, and this is because > > > > single stepping into a RAS sequence doesn't work very well... > > > > If you set a break point on the return (after the RAS > > > > sequence), you can get past this... > > > > > > > > I got to the point in rtld.c code: > > > > if (obj->pltrel) > > > > rel = (const Elf_Rel *) ((caddr_t) obj->pltrel + > > > > reloff); else > > > > rel = (const Elf_Rel *) ((caddr_t) obj->pltrela + > > > > reloffand was seeing gdb try to execute the pltrela line, but: > > > > i; > > > > > > > > and was seeing gdb try to execute the pltrela line, but: > > > > (gdb) print * (const Elf_Rel *) ((caddr_t) obj->pltrela + > > > > reloff) Error accessing memory address 0x118: Bad address. > > > > (gdb) print/x obj->pltrela > > > > $4 = 0x0 > > > > (gdb) print /x reloff > > > > $5 = 0x118 > > > > (gdb) print obj->pltrel > > > > $6 = (const Elf_Rel *) 0x94e8 > > > Based on my copy of newsyslog I built for armeb this looks > > > correct. To verify it could you dump the .dynamic section from > > > the binary? Something like 'objdump -s newsyslog' will get it. > > > > ok, available at: > > https://www.funkthat.com/~jmg/20140619/objdump.newsyslog > > > > > > Hun? obj->pltrel is non-zero, so it should have executed the > > > > other line... > > > > > > > > I recompiled rtld w/ -O0, and sure enough, newsyslog runs > > > > fine... If I compile w/o -O, or w/ -O1, it fails... > > > > > > > > Comments or suggestions? > > > > > > What is the value of rel after the if statement? In the -O/-O1 > > > case the asm looks like: > > > > > > ldr r2, [sp, #20] ; Load obj to r2 > > > ldr r3, [r2, #124] ; Load obj->pltrel to r3 > > > cmp r3, #0 ; 0x0 ; if obj->pltrel: > > > ldrne r2, [sp, #16] ; != NULL: Load reloff to r2 > > > addne r4, r3, r2 ; != NULL: Add obj->pltrel + reloff to r4 > > > ldreq r2, [sp, #20] ; == NULL: Load obj to r2 > > > ldreq r3, [r2, #132] ; == NULL: Load obj->pltrela to r3 > > > ldreq r2, [sp, #16] ; == NULL: Load reloff to r2 > > > addeq r4, r2, r3 ; == NULL: Add obj->pltrela + reloff to r4 > > > > > > Given this I could see how gdb gets confused. > > > > > > It may also pay to get the registers from gdb at this point. > > > > Arg! This is frustrating, I'm getting such different behaviors from > > time to time.. now it isn't having that fault.. but it's getting > > farther, but... I this is because our in tree gdb is messed up.. > > > > But, I am getting farther... now the last break at rtld.c:3651 > > looks like it's returning a bogus pointer: > > (gdb) print *req > > $12 = {name = 0x9190 "__aeabi_read_tp", hash = 0xf008a80, > > hash_gnu = 0x52432dd3, ventry = 0x2003b1d0, flags = 0x1, > > defobj_out = 0x2003c400, sym_out = 0x20062454, lockstate = > > 0xbfffeda0} > > > > defobj_out looks bogus to me... We don't have any object mapped > > there: > > (gdb) info shared > > >From To Syms Read Shared Object Library > > 0x200427c8 0x20048814 Yes /lib/libgcc_s.so.1 > > 0x2007a4e8 0x2017f320 Yes /lib/libc.so.7 > > 0x20018f14 0x2002c99c Yes /libexec/ld-elf.so.1 > > > > the data at 0x2003c400 doesn't look like code: > > (gdb) x/32x 0x2003c400 > > 0x2003c400: 0xd550b87a 0x00000001 0x00000000 > > 0x2003a080 0x2003c410: 0x00000000 0x00000001 > > 0x00000000 0x20051000 0x2003c420: 0x0016a000 > > 0x00143000 0x00000000 0x20051000 0x2003c430: > > 0x2019dfd0 0x2007a4e8 0x20051034 0x000000a0 > > 0x2003c440: 0x00000000 0x00000007 0x00000002 > > 0x2019bcf0 0x2003c450: 0x00000004 0x00000058 > > 0x00000000 0x00000008 0x2003c460: 0x20051000 > > 0x00000000 0x2019e0b8 0x200713e0 0x2003c470: > > 0x000040c0 0x00000000 0x00000000 0x200754a0 > > > > and then as I stepi out of symlook_global: > > (gdb) x/6i $pc > > 0x2001f0b4 : cmp r0, #0 ; 0x0 > > 0x2001f0b8 : moveq r0, #3 ; 0x3 > > 0x2001f0bc : movne r0, #0 ; 0x0 > > 0x2001f0c0 : add sp, sp, #36 ; > > 0x24 0x2001f0c4 : pop {r4, r5, r6, > > r7, lr} 0x2001f0c8 : bx lr > > (gdb) info registers > > r0 0x20062454 0x20062454 > > r1 0x933b 0x933b > > r2 0x0 0x0 > > r3 0xa4 0xa4 > > r4 0x0 0x0 > > r5 0xbfffed3c 0xbfffed3c > > r6 0xbfffed08 0xbfffed08 > > r7 0x20037af4 0x20037af4 > > r8 0x0 0x0 > > r9 0x1 0x1 > > r10 0x8a2c 0x8a2c > > r11 0xbfffed30 0xbfffed30 > > r12 0x23de 0x23de > > sp 0xbfffec94 0xbfffec94 The stack is bogus here, the stack should be 8-byte aligned in non-leaf functions. > > lr 0x2001efb0 0x2001efb0 > > pc 0x2001f0b4 0x2001f0b4 > > fps 0x0 0x0 > > cpsr 0x60000010 0x60000010 > > > > Then stepi till 0x2001f0c8: > > (gdb) info registers > > r0 0x0 0x0 > > r1 0x933b 0x933b > > r2 0x0 0x0 > > r3 0xa4 0xa4 > > r4 0x2003c000 0x2003c000 > > r5 0xbfffed3c 0xbfffed3c > > r6 0x20037af4 0x20037af4 > > r7 0xbfffece8 0xbfffece8 > > r8 0x0 0x0 > > r9 0x1 0x1 > > r10 0x8a2c 0x8a2c > > r11 0xbfffed30 0xbfffed30 > > r12 0x23de 0x23de > > sp 0xbfffeccc 0xbfffeccc It's still bogus suggesting someone in the call chain failed at aligning it correctly. > > lr 0x2003c000 0x2003c000 > > pc 0x2001f0c8 0x2001f0c8 > > fps 0x0 0x0 > > cpsr 0x20000010 0x20000010 > > > > and now the lr is bogus... it transfers control to 0x2003c000 which > > is before the fault at 0x2003c0f4... And again, this looks like > > data, not code: When the stack is incorrectly aligned all bets are off, I've seen strange errors when it's not 8-byte aligned. The compiler relies on this assumption to load/store stack based data. > > (gdb) x/64x 0x2003c000 > > 0x2003c000: 0xd550b87a 0x00000001 0x2003c200 > > 0xbfffffb8 0x2003c010: 0x00000000 0x00000001 > > 0x00000000 0x00008000 0x2003c020: 0x00012000 > > 0x00009000 0x00008000 0x00000000 0x2003c030: > > 0x00018724 0x00009ca8 0x00008034 0x000000e0 > > 0x2003c040: 0x00008114 0x00000007 0x00000000 > > 0x00000000 0x2003c050: 0x00000000 0x00000000 > > 0x00000000 0x00000000 0x2003c060: 0x00000000 > > 0x00000000 0x0001881c 0x000094a0 0x2003c070: > > 0x00000048 0x00000000 0x00000000 0x000094e8 > > 0x2003c080: 0x00000308 0x00000000 0x00000000 > > 0x0000888c 0x2003c090: 0x00008f9c 0x000003b1 > > 0x00009430 0x00000002 0x2003c0a0: 0x00000000 > > 0x00000000 0x0000934e 0x00008180 0x2003c0b0: > > 0x00000061 0x00008304 0x00000071 0x00000061 > > 0x2003c0c0: 0x00000005 0x0000001f 0x0000000a > > 0x00000071 0x2003c0d0: 0x000084d8 0x00008558 > > 0x000086c8 0x00000000 0x2003c0e0: 0x00000000 > > 0x2003d000 0x00000000 0x00000000 0x2003c0f0: > > 0x00000000 0x2003c0f0 0x2003b180 0x00000007 > > > > If I continue to stepi from here, it will fault at f4... > > > > This looks like a stack smash issue as the lr we pop off the stack > > is incorrect.. > > So, forgot I could set a watchpoint on the stack address, so I did: > (gdb) c > Continuing. > Watchpoint 9: *0xbfffecc8 > > Old value = 0x2001f584 > New value = 0x2003c000 > 0x20019960 in donelist_check (dlp=0xbfffed08, obj=0x2003c000) > at /usr/src.avila/libexec/rtld-elf/rtld.c:1380 > 1380 dlp->objs[dlp->num_used++] = obj; > > I tracked dlp, and it's the donelist pointer, and it is allocated > from symlook_default via donelist_init at the top of the function: > #define donelist_init(dlp) \ > ((dlp)->objs = alloca(obj_count * sizeof (dlp)->objs[0]), \ > assert((dlp)->objs != NULL), \ > (dlp)->num_alloc = obj_count, \ > (dlp)->num_used = 0) > > Hmm... since obj_count is a global, and we are in a possibly > threaded environment, we should probably move assigning num_alloc > before the alloca, and then use num_alloc instead of obj_count just > to be safe... but taht won't be the issue here, as num_used is 0 > in my case... > > I've looked at the assembly for donelist_init for both the working > and non-working case, and besides an extra store in the good version, > which I think is just because of the unoptimized code, things look > the same to me... I suspect it's more likely the working version is not making assumptions on the stack alignment, where the broken one is. Andrew