Date: Sat, 15 Nov 2003 11:30:39 -0800 From: Marcel Moolenaar <marcel@xcllnt.net> To: deischen@freebsd.org Cc: davidxu@freebsd.org Subject: Re: KSE/ia64 broken Message-ID: <20031115193039.GA55917@dhcp01.pn.xcllnt.net> In-Reply-To: <Pine.GSO.4.10.10311151223450.5390-100000@pcnet5.pcnet.com> References: <20031115031906.GA52831@dhcp01.pn.xcllnt.net> <Pine.GSO.4.10.10311151223450.5390-100000@pcnet5.pcnet.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, Nov 15, 2003 at 12:36:42PM -0500, Daniel Eischen wrote: > On Fri, 14 Nov 2003, Marcel Moolenaar wrote: > > > Gang, > > > > The following change broke KSE on ia64: > > > > -------- > > revision 1.18 > > date: 2003/11/08 06:07:04; author: davidxu; state: Exp; lines: +16 -17 > > Use THR lock instead of KSE lock to avoid scheduler be blocked in spinlock. > > > > Reviewed by: deischen > > -------- > > > > We seem to be clobbering the thread structure instead of writing > > to the mailbox. This happens at initialization. Can it be that > > the change assumes PER_KSE and doesxn't work for PER_THREAD? > > I _think_ this may be because rltd-elf (at least for ia64) calls > malloc with the rtld lock held. I'm not sure how to test this > theory. No worries, I have a way to disproof it :-) Staticly linked binaries are as much broken as dynamicly linked binaries. So, if we have a rtld problem, it's not the only one: The following triggers the bug (from the configure script of Python): \begin{C} #include <pthread.h> void* routine(void* p){return NULL;} int main(){ pthread_t p; if(pthread_create(&p,NULL,routine,NULL)!=0) return 1; (void)pthread_detach(p); return 0; } \end{C} (gdb) run Starting program: /nfs/home/marcel/KSE/foo Program received signal SIGILL, Illegal instruction. _pthread_create (thread=0x200000000009b8d8, attr=0x9fffffffffffeb00, start_routine=@0x2000000000082520: 0x2000000000034cf0 <sig_daemon>, arg=0x0) at /nfs/freebsd/5.x/src/lib/libpthread/thread/thr_create.c:202 202 new_thread->tcb->tcb_tmbx.tm_context.uc_stack.ss_size = (gdb) bt #0 _pthread_create (thread=0x200000000009b8d8, attr=0x9fffffffffffeb00, start_routine=@0x2000000000082520: 0x2000000000034cf0 <sig_daemon>, arg=0x0) at /nfs/freebsd/5.x/src/lib/libpthread/thread/thr_create.c:202 #1 0x2000000000035430 in _thr_start_sig_daemon () at /nfs/freebsd/5.x/src/lib/libpthread/thread/thr_sig.c:236 #2 0x2000000000004e80 in _kse_setthreaded (threaded=1) at /nfs/freebsd/5.x/src/lib/libpthread/thread/thr_kern.c:449 #3 0x2000000000001490 in _pthread_create (thread=0x9fffffffffffebc0, attr=0x0, start_routine=@0x2000000000082090: 0x2000000000000380 <routine>, arg=0x0) at /nfs/freebsd/5.x/src/lib/libpthread/thread/thr_create.c:111 #4 0x20000000000003e0 in main () at foo.c:7 (gdb) p *new_thread $1 = {tcb = 0xfffffffffffefeff, magic = 4294967295, name = 0x0, uniqueid = 0, tle = {tqe_next = 0x9fffffffffffeaa0, tqe_prev = 0x0}, kle = { tqe_next = 0x2000000000001e00, tqe_prev = 0x4041}, gcle = { tqe_next = 0xc00000000000048f, tqe_prev = 0x8000000000000130}, hle = { le_next = 0x0, le_prev = 0x0}, lock = {l_head = 0x20000000000b1ad0, l_tail = 0xf, l_type = 2322596671, l_wait = 0, l_wakeup = @0x2000000000081d40: 0x20000000000055e0 <_thr_lock_wakeup>}, : (partially shown) (gdb) p *crit $2 = {tm_context = {uc_sigmask = {__bits = {0, 0, 0, 0}}, uc_mcontext = { mc_flags = 3, _reserved_ = 0, mc_special = {sp = 11529215046068464288, unat = 0, rp = 2305843009213701744, pr = 32833, pfs = 13835058055282164879, bspstore = 9223372036854776160, rnat = 0, __spare = 0, tp = 2305843009214421712, rsc = 15, fpsr = 2674341018862399, psr = 19872948576272, gp = 2305843009214327112, ndirty = 0, cfm = 9223372036854776975, iip = 2305843009213701936, ifa = 18446744073709488367, isr = 549755813936}, mc_preserved = {unat = 0, gr4 = 2131440464, gr5 = 2147314880, gr6 = 1423, gr7 = 2135720224, br1 = 2145524112, br2 = 9223372041147674448, br3 = 9223372041147674368, br4 = 9223372041147674464, br5 = 2146730000, lc = 0, __spare = 0}, : (partially shown) If you interpret the data then new_thread is clobbered by a ucontext: (gdb) p /x *(ucontext_t*)new_thread $6 = {uc_sigmask = {__bits = {0xfffefeff, 0xffffffff, 0xffffffff, 0xffffffff}}, uc_mcontext = {mc_flags = 0x0, _reserved_ = 0x0, mc_special = {sp = 0x9fffffffffffeaa0, unat = 0x0, rp = 0x2000000000001e00, pr = 0x4041, pfs = 0xc00000000000048f, bspstore = 0x8000000000000130, rnat = 0x0, __spare = 0x0, tp = 0x20000000000b1ad0, rsc = 0xf, fpsr = 0x9804c8a70033f, psr = 0x0, gp = 0x2000000000081d40, ndirty = 0x20000000000a8460, cfm = 0x0, iip = 0x0, ifa = 0x20000000000bc000, isr = 0x20000000000bc000}, mc_preserved = {unat = 0x0, gr4 = 0x7f0b3350, gr5 = 0x7ffd6cc0, gr6 = 0x58f, gr7 = 0x7f4c8120, br1 = 0x7fe21990, br2 = 0x80000000ffe06f50, br3 = 0x80000000ffe06f00, br4 = 0x80000000ffe06f60, br5 = 0x7ff48010, lc = 0x0, __spare = 0x0}, mc_preserved_fp = {fr2 = {fpr_bits = {0x0 <repeats 16 times>}, : (partially shown) mc_flags = 0, which means that the context is synchronous, and with ifa and isr set to new_thread, it looks like there may have been an async context before that (those are the registers we use to tell the kernel which value (=isr) to write at what location (=ifa) when we restore an async context. The sync has a valid thread pointer in it (tp=0x20000000000b1ad0): (gdb) p /x *(struct tcb*)(((ucontext_t*)new_thread)->uc_mcontext.mc_special.tp - (sizeof(struct tcb) - sizeof(struct ia64_tp))) $13 = {tcb_tmbx = {tm_context = {uc_sigmask = {__bits = {0x0, 0x0, 0x0, 0x0}}, uc_mcontext = {mc_flags = 0x3, _reserved_ = 0x0, mc_special = { sp = 0x9fffffffffffeaa0, unat = 0x0, rp = 0x2000000000001e70, pr = 0x8041, pfs = 0xc00000000000048f, bspstore = 0x8000000000000160, rnat = 0x0, __spare = 0x0, tp = 0x20000000000b1ad0, rsc = 0xf, fpsr = 0x9804c8a70033f, psr = 0x1213080a6010, gp = 0x200000000009a948, ndirty = 0x0, cfm = 0x800000000000048f, iip = 0x2000000000001f30, ifa = 0xffffffffffff08ef, isr = 0x8000000030}, mc_preserved = { unat = 0x0, gr4 = 0x7f0b3350, gr5 = 0x7ffd6cc0, gr6 = 0x58f, gr7 = 0x7f4c8120, br1 = 0x7fe21990, br2 = 0x80000000ffe06f50, br3 = 0x80000000ffe06f00, br4 = 0x80000000ffe06f60, br5 = 0x7ff48010, lc = 0x0, __spare = 0x0}, mc_preserved_fp = { fr2 = {fpr_bits = {0x0 <repeats 16 times>}, fpr_flt = 0x00000000000000000000000000000000}, fr3 = {fpr_bits = { 0x0 <repeats 16 times>}, : (partially shown) The mailbox has an async context in it caused by what looks like a page fault (faulting address = 0xffffffffffff08ef). The instruction that caused the page fault is at 0x2000000000001f30. Note that mc_flags = 3, which tells us it is indeed an async context and that the high FP registers have been saved in the context. (gdb) l *0x2000000000001f30 0x2000000000001f30 is in _pthread_create (/nfs/freebsd/5.x/src/lib/libpthread/thread/thr_create.c:202). 197 _kse_critical_leave(crit); 198 199 new_thread->tcb->tcb_tmbx.tm_udata = new_thread; 200 new_thread->tcb->tcb_tmbx.tm_context.uc_sigmask = 201 new_thread->sigmask; 202 new_thread->tcb->tcb_tmbx.tm_context.uc_stack.ss_size = 203 new_thread->attr.stacksize_attr; 204 new_thread->tcb->tcb_tmbx.tm_context.uc_stack.ss_sp = 205 new_thread->attr.stackaddr_attr; 206 makecontext(&new_thread->tcb->tcb_tmbx.tm_context, So, the page fault was really the SIGILL. This indicates that the internal consistency is reliable. Yet, our newly created thread structure is clobbered: (gdb) l 193 188 189 /* 190 * Initialize the machine context. 191 * Enter a critical region to get consistent context. 192 */ 193 crit = _kse_critical_enter(); 194 THR_GETCONTEXT(&new_thread->tcb->tcb_tmbx.tm_context); 195 /* Initialize the thread for signals: */ 196 new_thread->sigmask = curthread->sigmask; 197 _kse_critical_leave(crit); This suddenly looks very suspicious... (gdb) p /x curthread->sigmask $17 = {__bits = {0xfffefeff, 0xffffffff, 0xffffffff, 0xffffffff}} (gdb) p /x *new_thread $18 = {tcb = 0xfffffffffffefeff, magic = 0xffffffff, name = 0x0, uniqueid = 0x0, tle = {tqe_next = 0x9fffffffffffeaa0, tqe_prev = 0x0}, -- Marcel Moolenaar USPA: A-39004 marcel@xcllnt.net
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20031115193039.GA55917>