From owner-freebsd-amd64@FreeBSD.ORG Thu Sep 29 20:51:49 2005
Date: Thu, 29 Sep 2005 21:51:48 +0100 (BST)
From: Robert Watson
To: Rob Watt
Cc: freebsd-hackers@FreeBSD.org, mikep@hudson-trading.com,
    freebsd-amd64@FreeBSD.org, Jason Carroll
Subject: Re: freebsd-5.4-stable panics
Message-ID: <20050929212738.A34322@fledge.watson.org>
In-Reply-To: <20050929160945.A65402@daemon.mistermishap.net>
List-Id: Porting FreeBSD to the AMD64 platform

On Thu, 29 Sep 2005, Rob Watt wrote:

> On Thu, 29 Sep 2005, Robert Watson wrote:
>
>> Could you dump the contents of *td and *td->td_proc for me?  I'm quite
>> interested to know what the value in td->td_proc->p_state is, among
>> other things.  Could I also have you generate a dump of the KSE group
>> structures in td->td_proc->p_ksegrps and the threads in
>> td->td_proc->p_threads?
>
> I've attached a file with many of the values you have asked for.  We
> looked at some of the threads referenced by td->td_proc->p_threads, but
> we weren't sure we were walking the list correctly.  Do you have any tips
> for walking those thread lists?
>
>> Could you tell me if the program named by p->p_comm is linked against a
>> threading library?  If it's a custom app, you may already know, and if
>> not, you can run ldd on the application to see what it is linked
>> against.
>
> The program named by p->p_comm is linked against the pthreads library.

This seems to be enough information to at least track this down a bit:
td_ksegrp is NULL, rather than a corrupt value, which suggests that the
thread is incompletely initialized.  Other hints that this is the case are
that td_critnest is 1 (as it is set when the thread is allocated), and
that the state is TDS_INACTIVE.  Some other fields are set, though, such
as td_oncpu, which is normally initialized to NOCPU.

> (kgdb) p *td
> $1 = {td_proc = 0xffffff004aa9f000, td_ksegrp = 0x0,
>   td_plist = {tqe_next = 0xffffff00b4798000, tqe_prev = 0xffffff00a97ae010},
>   td_kglist = {tqe_next = 0xffffff00b4798000, tqe_prev = 0xffffff00a97ae020},
>   td_slpq = {tqe_next = 0x0, tqe_prev = 0xffffff001fac7c10},
>   td_lockq = {tqe_next = 0xffffff00a97ae000, tqe_prev = 0xffffffffb6797a70},
>   td_runq = {tqe_next = 0x0, tqe_prev = 0xffffffff80608180},
>   td_selq = {tqh_first = 0x0, tqh_last = 0xffffff00633112c0},
>   td_sleepqueue = 0xffffff00382b0400, td_turnstile = 0xffffff00c1712900,
>   td_umtxq = 0xffffff00d1207080, td_tid = 100253, td_flags = 16777216,
>   td_inhibitors = 0, td_pflags = 128, td_dupfd = 0, td_wchan = 0x0,
>   td_wmesg = 0x0, td_lastcpu = 2 '\002', td_oncpu = 2 '\002',
>   td_owepreempt = 0 '\0', td_locks = 0, td_blocked = 0x0, td_ithd = 0x0,
>   td_lockname = 0x0, td_contested = {lh_first = 0x0}, td_sleeplocks = 0x0,
>   td_intr_nesting_level = 0, td_pinned = 0, td_mailbox = 0x0,
>   td_ucred = 0xffffff00ad18f200, td_standin = 0x0, td_upcall = 0x0,
>   td_sticks = 0, td_uuticks = 0, td_usticks = 0, td_intrval = 0,
>   td_oldsigmask = {__bits = {0, 0, 0, 0}},
>   td_sigmask = {__bits = {4294967295, 4294967295, 4294967295, 4294967295}},
>   td_siglist = {__bits = {0, 0, 0, 0}}, td_generation = 14,
>   td_sigstk = {ss_sp = 0x0, ss_size = 0, ss_flags = 0}, td_kflags = 0,
>   td_xsig = 0, td_profil_addr = 0, td_profil_ticks = 0,
>   td_base_pri = 182 '\uffff', td_priority = 182 '\uffff',
>   td_pcb = 0xffffffffb68dcd10, td_state = TDS_INACTIVE,
>   td_retval = {1, 29309280},
>   td_slpcallout = {c_links = {sle = {sle_next = 0x0},
>       tqe = {tqe_next = 0x0, tqe_prev = 0xffffff001fac7d80}},
>     c_time = 55907602, c_arg = 0xffffff0063311260,
>     c_func = 0xffffffff802e32a0, c_mtx = 0x0, c_flags = 16},
>   td_frame = 0xffffffffb68dcc40, td_kstack_obj = 0xffffff0087f93d20,
>   td_kstack = 18446744072477315072, td_kstack_pages = 4,
>   td_altkstack_obj = 0x0, td_altkstack = 0, td_altkstack_pages = 0,
>   td_critnest = 1, td_md = {md_spinlock_count = 1, md_saved_flags = 582},
>   td_sched = 0xffffff0063311488}

I'm not familiar with the internals of the thread and KSE life cycle here,
so I think we'll need to look to those more familiar with this to
understand which of two things may be going on:

(1) Is td_ksegrp != NULL an invariant for a connected thread, one that
    kern_proc relies on but that the thread code fails to maintain
    safely?

(2) Is td_ksegrp sometimes legitimately left NULL as part of the thread
    life cycle, with kern_proc incorrectly assuming that it is never NULL
    for a connected thread?

This suggests a possible work-around: simply test td_ksegrp for NULL in
kern_proc to avoid the panic, while we work out whether an invariant is
being violated (or incorrectly assumed).  The latter might require some
serious thinking and a non-trivial solution.  Something like the following
might work in the meantime:

Index: kern_proc.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_proc.c,v
retrieving revision 1.231
diff -u -r1.231 kern_proc.c
--- kern_proc.c	27 Sep 2005 18:03:15 -0000	1.231
+++ kern_proc.c	29 Sep 2005 20:50:33 -0000
@@ -882,6 +882,8 @@
 		} else {
 			_PHOLD(p);
 			FOREACH_THREAD_IN_PROC(p, td) {
+				if (td->td_ksegrp == NULL)
+					continue;
 				fill_kinfo_thread(td, &kinfo_proc);
 				PROC_UNLOCK(p);
 				error = SYSCTL_OUT(req, (caddr_t)&kinfo_proc,

I'm going to forward your e-mail to the threads@ list and see if anyone
there wants to talk some more about this.  If you don't mind testing the
above patch to see whether it is a workable work-around, we may want to
think about getting it committed in the meantime.

Thanks,

Robert N M Watson
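
The shape of the work-around, and of walking a process's thread list in
general, can be sketched in userland C.  This is only an illustration under
stated assumptions: struct ksegrp, struct thread, struct proc and
count_reportable_threads() below are cut-down, hypothetical stand-ins, not
the kernel's definitions.  Only the <sys/queue.h> TAILQ macros are real;
the field names td_plist, td_ksegrp and p_threads are borrowed from the
dump and the patch above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a KSE group; its contents do not matter here. */
struct ksegrp {
	int kg_dummy;
};

/* Cut-down stand-in for struct thread: only the fields the sketch needs. */
struct thread {
	TAILQ_ENTRY(thread) td_plist;	/* linkage on the process's thread list */
	struct ksegrp *td_ksegrp;	/* NULL models a thread not yet attached */
	int td_tid;
};

/* Cut-down stand-in for struct proc: just the head of the thread list. */
struct proc {
	TAILQ_HEAD(, thread) p_threads;
};

/*
 * Walk every thread linked on p->p_threads and count those that have a
 * KSE group, skipping any whose td_ksegrp is still NULL.  The skip is the
 * same test the patch adds before fill_kinfo_thread().
 */
int
count_reportable_threads(struct proc *p)
{
	struct thread *td;
	int n = 0;

	assert(p != NULL);
	TAILQ_FOREACH(td, &p->p_threads, td_plist) {
		if (td->td_ksegrp == NULL)
			continue;	/* incompletely initialized: do not touch */
		n++;
	}
	return (n);
}
```

The same traversal can be done by hand in kgdb, assuming p_threads is a
TAILQ linked through td_plist as the dump suggests: start at
td->td_proc->p_threads.tqh_first and follow td_plist.tqe_next from each
thread until it is NULL.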