From: Antoine Pelisse
To: freebsd-hackers@freebsd.org, Robert Watson
Date: Fri, 30 Sep 2005 16:25:33 +0100
Subject: freebsd-5.4-stable panics
Message-ID: <61c746830509300825s5ad197fbt908267d54f1b7b8f@mail.gmail.com>
In-Reply-To: <61c746830509300824g2f368d26pcc500403fe319b3b@mail.gmail.com>

On 9/30/05, John Baldwin wrote:
> On Friday 30 September 2005 05:24 am, Antoine Pelisse wrote:
> > Hi Robert,
> > I don't think your patch is correct, the total linked list can be
> > broken while the lock is released, thus just passing the link may not
> > be enough.
> > I have submitted a PR[1] for this a month ago but nobody took care of
> > it yet.
> > Regards,
> > Antoine Pelisse
> >
> > [1] http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/84684
>
> I think this patch looks ok. Robert, can you get the original panic on
> this thread tested against this patch?

I had a small program that could reproduce this panic within 10 seconds;
it basically created empty threads while calling kvm_getprocs() at the
same time. With the patch applied, the program no longer panicked. IIRC,
the panic is also reproducible on RELENG_6 and HEAD.

> On 9/29/05, Robert Watson wrote:
> > > On Thu, 29 Sep 2005, Rob Watt wrote:
> > > > On Thu, 29 Sep 2005, Robert Watson wrote:
> > > >> Could you dump the contents of *td and *td->td_proc for me? I'm
> > > >> quite interested to know what the value in td->td_proc->p_state
> > > >> is, among other things. If I could also have you generate a dump
> > > >> of the KSE group structures in td->td_proc->p_ksegrps and the
> > > >> threads in td->td_proc->p_threads.
> > > >
> > > > I've attached a file with many of the values you have asked for.
> > > > We looked at some of the threads referenced by
> > > > td->td_proc->p_threads, but we weren't sure we were walking the
> > > > list correctly. Do you have any tips for walking those thread
> > > > lists?
> > > >
> > > >> Could you tell me if the program named by p->p_comm is linked
> > > >> against a threading library?
> > > >> If it's a custom app, you may already know, and if not, you can
> > > >> run ldd on the application to see what it is linked against.
> > > >
> > > > The program named by p->p_comm is linked against the pthreads
> > > > library.
> > >
> > > This seems to be enough information to at least track this down a
> > > bit: td_ksegrp is NULL, rather than a corrupt value, which suggests
> > > that the thread is incompletely initialized. Other hints that this
> > > is the case are that td_critnest is 1 (as is set when it is
> > > allocated), and the state is TDS_INACTIVE. Some other fields are set
> > > though, such as td_oncpu, which is normally initialized to NOCPU.
> > >
> > > > (kgdb) p *td
> > > > $1 = {td_proc = 0xffffff004aa9f000, td_ksegrp = 0x0,
> > > >   td_plist = {tqe_next = 0xffffff00b4798000,
> > > >     tqe_prev = 0xffffff00a97ae010},
> > > >   td_kglist = {tqe_next = 0xffffff00b4798000,
> > > >     tqe_prev = 0xffffff00a97ae020},
> > > >   td_slpq = {tqe_next = 0x0, tqe_prev = 0xffffff001fac7c10},
> > > >   td_lockq = {tqe_next = 0xffffff00a97ae000,
> > > >     tqe_prev = 0xffffffffb6797a70},
> > > >   td_runq = {tqe_next = 0x0, tqe_prev = 0xffffffff80608180},
> > > >   td_selq = {tqh_first = 0x0, tqh_last = 0xffffff00633112c0},
> > > >   td_sleepqueue = 0xffffff00382b0400,
> > > >   td_turnstile = 0xffffff00c1712900,
> > > >   td_umtxq = 0xffffff00d1207080,
> > > >   td_tid = 100253, td_flags = 16777216, td_inhibitors = 0,
> > > >   td_pflags = 128, td_dupfd = 0, td_wchan = 0x0, td_wmesg = 0x0,
> > > >   td_lastcpu = 2 '\002', td_oncpu = 2 '\002',
> > > >   td_owepreempt = 0 '\0', td_locks = 0, td_blocked = 0x0,
> > > >   td_ithd = 0x0, td_lockname = 0x0,
> > > >   td_contested = {lh_first = 0x0}, td_sleeplocks = 0x0,
> > > >   td_intr_nesting_level = 0, td_pinned = 0, td_mailbox = 0x0,
> > > >   td_ucred = 0xffffff00ad18f200,
> > > >   td_standin = 0x0, td_upcall = 0x0,
> > > >   td_sticks = 0, td_uuticks = 0, td_usticks = 0, td_intrval = 0,
> > > >   td_oldsigmask = {__bits = {0, 0, 0, 0}},
> > > >   td_sigmask = {__bits = {4294967295, 4294967295, 4294967295,
> > > >     4294967295}},
> > > >   td_siglist = {__bits = {0, 0, 0, 0}}, td_generation = 14,
> > > >   td_sigstk = {ss_sp = 0x0, ss_size = 0, ss_flags = 0},
> > > >   td_kflags = 0, td_xsig = 0, td_profil_addr = 0,
> > > >   td_profil_ticks = 0, td_base_pri = 182 '\uffff',
> > > >   td_priority = 182 '\uffff', td_pcb = 0xffffffffb68dcd10,
> > > >   td_state = TDS_INACTIVE, td_retval = {1, 29309280},
> > > >   td_slpcallout = {c_links = {sle = {sle_next = 0x0},
> > > >       tqe = {tqe_next = 0x0, tqe_prev = 0xffffff001fac7d80}},
> > > >     c_time = 55907602, c_arg = 0xffffff0063311260,
> > > >     c_func = 0xffffffff802e32a0, c_mtx = 0x0, c_flags = 16},
> > > >   td_frame = 0xffffffffb68dcc40,
> > > >   td_kstack_obj = 0xffffff0087f93d20,
> > > >   td_kstack = 18446744072477315072, td_kstack_pages = 4,
> > > >   td_altkstack_obj = 0x0, td_altkstack = 0,
> > > >   td_altkstack_pages = 0, td_critnest = 1,
> > > >   td_md = {md_spinlock_count = 1, md_saved_flags = 582},
> > > >   td_sched = 0xffffff0063311488}
> > >
> > > I'm not familiar with the internals of the thread and KSE life cycle
> > > here, so I think we'll need to look to those more familiar with this
> > > to understand which of two things may be going on:
> > >
> > > (1) Is the fact that td_ksegrp != NULL an invariant for a connected
> > > thread, and that kern_proc is relying on that but the thread code is
> > > failing to implement it safely?
> > >
> > > (2) Is td_ksegrp sometimes left legitimately as NULL as part of the
> > > thread life cycle, and that kern_proc incorrectly assumes that it is
> > > never NULL when hooked up to a thread?
> > >
> > > This suggests a possible work-around of simply testing td_ksegrp for
> > > NULL in kern_proc in order to avoid this, while attempting to resolve
> > > whether an invariant is violated (or incorrectly assumed), which
> > > might require some serious thinking and a solution that is
> > > non-trivial. Something like the following might work in the mean
> > > time:
> > >
> > > Index: kern_proc.c
> > > ===================================================================
> > > RCS file: /home/ncvs/src/sys/kern/kern_proc.c,v
> > > retrieving revision 1.231
> > > diff -u -r1.231 kern_proc.c
> > > --- kern_proc.c 27 Sep 2005 18:03:15 -0000 1.231
> > > +++ kern_proc.c 29 Sep 2005 20:50:33 -0000
> > > @@ -882,6 +882,8 @@
> > >  	} else {
> > >  		_PHOLD(p);
> > >  		FOREACH_THREAD_IN_PROC(p, td) {
> > > +			if (td->td_ksegrp == NULL)
> > > +				continue;
> > >  			fill_kinfo_thread(td, &kinfo_proc);
> > >  			PROC_UNLOCK(p);
> > >  			error = SYSCTL_OUT(req, (caddr_t)&kinfo_proc,
> > >
> > > I'm going to forward off your e-mail to the threads@ list and see if
> > > anyone there wants to talk some more about this. If you don't mind
> > > testing the above patch to see if this is a workable work-around, we
> > > may want to think about getting it committed in the mean time.
> > >
> > > Thanks,
> > >
> > > Robert N M Watson
> > > _______________________________________________
> > > freebsd-hackers@freebsd.org mailing list
> > > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> > > To unsubscribe, send any mail to
> > > "freebsd-hackers-unsubscribe@freebsd.org"
>
> --
> John Baldwin <>< http://www.FreeBSD.org/~jhb/
> "Power Users Use the Power to Serve" = http://www.FreeBSD.org
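
The locking concern raised at the top of this message (the thread list
can change while the process lock is dropped inside the loop) can be
illustrated outside the kernel. What follows is only a user-space sketch
under stated assumptions. It is not the patch from the PR and it is not
kernel code: fake_proc, fake_thread, fake_ksegrp, thread_info and
snapshot_threads() are hypothetical names, a pthread mutex stands in for
PROC_LOCK, and the list uses the TAILQ macros from <sys/queue.h>. It
copies what it needs from every thread in one pass with the lock held,
skips threads whose td_ksegrp is still NULL (the work-around in the
quoted diff), and leaves any slow copy-out for after the unlock.

```c
/*
 * User-space sketch only. Every struct and function here is a simplified,
 * hypothetical stand-in and not the real FreeBSD kernel definition.
 */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

struct fake_ksegrp {
	int kg_id;
};

struct fake_thread {
	int td_tid;
	struct fake_ksegrp *td_ksegrp;     /* NULL while half-initialized */
	TAILQ_ENTRY(fake_thread) td_plist; /* linkage on the process list */
};

struct fake_proc {
	pthread_mutex_t p_mtx;             /* stands in for PROC_LOCK */
	TAILQ_HEAD(, fake_thread) p_threads;
};

/* Plain-data copy of the fields a reader wants to export. */
struct thread_info {
	int ti_tid;
	int ti_kgid;
};

static void
fake_proc_init(struct fake_proc *p)
{
	pthread_mutex_init(&p->p_mtx, NULL);
	TAILQ_INIT(&p->p_threads);
}

static void
fake_proc_add_thread(struct fake_proc *p, struct fake_thread *td)
{
	pthread_mutex_lock(&p->p_mtx);
	TAILQ_INSERT_TAIL(&p->p_threads, td, td_plist);
	pthread_mutex_unlock(&p->p_mtx);
}

/*
 * Walk the whole list with the lock held and copy at most max entries
 * into out[]. The lock is never dropped mid-walk, so no list pointer is
 * followed after another thread had a chance to unlink it. Returns the
 * number of entries written.
 */
static size_t
snapshot_threads(struct fake_proc *p, struct thread_info *out, size_t max)
{
	struct fake_thread *td;
	size_t n = 0;

	pthread_mutex_lock(&p->p_mtx);
	TAILQ_FOREACH(td, &p->p_threads, td_plist) {
		if (td->td_ksegrp == NULL)
			continue;          /* not fully set up yet: skip */
		if (n == max)
			break;             /* caller's buffer is full */
		out[n].ti_tid = td->td_tid;
		out[n].ti_kgid = td->td_ksegrp->kg_id;
		n++;
	}
	pthread_mutex_unlock(&p->p_mtx);
	return (n);
}
```

A caller would pass a buffer sized for the expected thread count and do
the slow output from the snapshot once the lock is released. The
trade-off is a bounded buffer, where the quoted loop instead unlocks
before each SYSCTL_OUT().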