From owner-freebsd-sparc64@FreeBSD.ORG  Tue Oct 18 17:27:24 2011
Date: Tue, 18 Oct 2011 19:27:18 +0200
From: Marius Strobl
To: Peter Jeremy
Cc: freebsd-sparc64@freebsd.org
Subject: Re: 'make -j16 universe' gives SIReset
Message-ID: <20111018172718.GT39118@alchemy.franken.de>
In-Reply-To: <20111018042646.GA18863@server.vk2pj.dyndns.org>
References: <20110817094541.GJ48988@alchemy.franken.de>
	<20110830152725.GA28552@alchemy.franken.de>
	<20110831212458.GA25926@server.vk2pj.dyndns.org>
	<20110902153206.GR40781@alchemy.franken.de>
	<20111006120411.GA903@alchemy.franken.de>
	<20111011030529.GA4093@server.vk2pj.dyndns.org>
	<20111011205543.GA81376@alchemy.franken.de>
	<20111013035648.GA54190@server.vk2pj.dyndns.org>
	<20111013184224.GG39118@alchemy.franken.de>
	<20111018042646.GA18863@server.vk2pj.dyndns.org>
List-Id: Porting FreeBSD to the Sparc

On Tue, Oct 18, 2011 at 03:26:46PM +1100, Peter Jeremy wrote:
> On 2011-Oct-13 20:42:25 +0200, Marius Strobl wrote:
> >On Thu, Oct 13, 2011 at 02:56:48PM +1100, Peter Jeremy wrote:
> >> Unfortunately, I can't get a crashdump because dumpon(8) doesn't like
> >> my Solaris swap partitions:
> >> GEOM_PART: Partition 'da0b' not suitable for kernel dumps (wrong type?)
> >> GEOM_PART: Partition 'da6b' not suitable for kernel dumps (wrong type?)
> >> No suitable dump device was found.
> >>
> >> I did write a patch for that but took it out during some earlier
> >> testing to get back to stock code.  It looks like I didn't PR it
> >> either, so I will do that when I get some time.
>
> I've resurrected that patch (and will send-pr it later).
>
> >Hrm, this backtrace seems impossible as vmtotal() explicitly locks
> >the object before calling vm_object_clear_flag().  A crash dump of
> >this panic really would be interesting.
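
(For reference, the code in question lives in sys/vm/vm_meter.c; the
loop looks roughly like the following.  This is a simplified sketch
from memory rather than the verbatim source, and the exact handling
of a failed trylock in particular may differ:

	/*
	 * Walk the list of all VM objects; vm_object_list_mtx is held
	 * across the whole loop and each object is trylock'ed before
	 * its flags are touched, as vm_object_clear_flag() asserts
	 * that the object lock is owned by the calling thread.
	 */
	mtx_lock(&vm_object_list_mtx);
	TAILQ_FOREACH(object, &vm_object_list, object_list) {
		if (!VM_OBJECT_TRYLOCK(object))
			continue;
		vm_object_clear_flag(object, OBJ_ACTIVE);
		VM_OBJECT_UNLOCK(object);
	}
	mtx_unlock(&vm_object_list_mtx);

So the assertion should only ever be reached with the object lock
held.)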
>
> I've reproduced the same panic and got a crashdump (2 hours for
> the dump and another hour for the savecore):
> VNASSERT failed
> panic: mutex vm object not owned at /usr/src/sys/vm/vm_object.c:281
> cpuid = 7
>
> #10 0x00000000c04ffbf4 in panic (fmt=0xc0a906d0 "mutex %s not owned at %s:%d") at /usr/src/sys/kern/kern_shutdown.c:599
> #11 0x00000000c04eb1b8 in _mtx_assert (m=0xfffff8b29d750ca8, what=0x4, file=0xc0ac6c00 "/usr/src/sys/vm/vm_object.c", line=0x119) at /usr/src/sys/kern/kern_mutex.c:706
> #12 0x00000000c07f4b0c in vm_object_clear_flag (object=0xfffff8b29d750ca8, bits=0x4) at /usr/src/sys/vm/vm_object.c:281
> #13 0x00000000c07f1dac in vmtotal (oidp=0xc0ba9be8, arg1=0x0, arg2=0x30, req=0xef8a54e0) at /usr/src/sys/vm/vm_meter.c:121
> #14 0x00000000c050c13c in sysctl_root (oidp=Variable "oidp" is not available.) at /usr/src/sys/kern/kern_sysctl.c:1509
> #15 0x00000000c050c434 in userland_sysctl (td=0x0, name=0xef8a5628, namelen=0x2, old=0x0, oldlenp=Variable "oldlenp" is not available.) at /usr/src/sys/kern/kern_sysctl.c:1619
> #16 0x00000000c050c858 in sys___sysctl (td=0xfffff8a2e3ef48c0, uap=0xef8a5768) at /usr/src/sys/kern/kern_sysctl.c:1545
> #17 0x00000000c086ba00 in syscall (tf=Variable "tf" is not available.) at subr_syscall.c:131
> #18 0x00000000c0098e60 in tl0_intr ()
>
> (kgdb) p *object
> $1 = {
>   mtx = {
>     lock_object = {
>       lo_name = 0xc0a9a308 "vm object",
>       lo_flags = 0x1430000,
>       lo_data = 0x0,
>       lo_witness = 0xfff85180
>     },
>     mtx_lock = 0xfffff8a0112d75e0
>   },
>   ...
> }
> (kgdb) p *object->mtx->lock_object->lo_witness
> $3 = {
>   w_name = "standard object", '\0' ,
>   w_index = 0xa3,
>   w_class = 0xc0b82e88,
>   w_list = {
>     stqe_next = 0xfff85100
>   },
>   w_typelist = {
>     stqe_next = 0xfff85100
>   },
>   w_hash_next = 0x0,
>   w_file = 0xc0ac6388 "/usr/src/sys/vm/vm_meter.c",
>   w_line = 0x71,
>   w_refcount = 0x53718,
>   w_num_ancestors = 0xe,
>   w_num_descendants = 0xe,
>   w_ddb_level = 0x0,
>   w_displayed = 0x1,
>   w_reversed = 0x0
> }
> (kgdb) p vm_object_list_mtx
> $4 = {
>   lock_object = {
>     lo_name = 0xc0ac6e30 "vm object_list",
>     lo_flags = 0x1030000,
>     lo_data = 0x0,
>     lo_witness = 0xfff81d80
>   },
>   mtx_lock = 0xfffff8a2e3ef48c2
> }
> (kgdb) p *vm_object_list_mtx.lock_object.lo_witness
> $6 = {
>   w_name = "vm object_list", '\0' ,
>   w_index = 0x3b,
>   w_class = 0xc0b82e88,
>   w_list = {
>     stqe_next = 0xfff81d00
>   },
>   w_typelist = {
>     stqe_next = 0xfff81d00
>   },
>   w_hash_next = 0x0,
>   w_file = 0xc0ac6388 "/usr/src/sys/vm/vm_meter.c",
>   w_line = 0x6f,
>   w_refcount = 0x1,
>   w_num_ancestors = 0xf,
>   w_num_descendants = 0x0,
>   w_ddb_level = 0x0,
>   w_displayed = 0x1,
>   w_reversed = 0x0
> }
>
> The witness information looks correct but I notice that vm_object_list_mtx
> is owned by a different thread to the vm_object that triggers the panic.
>
> The panic says it occurred on CPU 7:
> (kgdb) p cpuid_to_pcpu[7]->pc_curthread
> $21 = (struct thread *) 0xfffff8a2e3ef48c0
> which matches the vm_object_list_mtx.
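
(As an aside, the owner can be read straight out of the lock word:
apart from the low flag bits -- MTX_FLAGMASK, i.e. MTX_RECURSED,
MTX_CONTESTED and MTX_UNOWNED -- mtx_lock is just the owning thread
pointer.  Assuming the usual flag values, something like the
following; the kgdb value number here is made up:

	(kgdb) p/x 0xfffff8a2e3ef48c2 & ~0x7
	$1 = 0xfffff8a2e3ef48c0

That is the curthread of CPU 7 with MTX_CONTESTED set, which matches
what you found.)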
>
> My initial thought was a locking glitch but, looking through
> cpuid_to_pcpu[], the vm_object's lock doesn't match any running thread:
>
> (kgdb) p cpuid_to_pcpu[0]->pc_curthread
> $14 = (struct thread *) 0xfffff8a2e3008000
> (kgdb) p cpuid_to_pcpu[1]->pc_curthread
> $15 = (struct thread *) 0xfffff8a2aae7c8c0
> (kgdb) p cpuid_to_pcpu[2]->pc_curthread
> $16 = (struct thread *) 0xfffff8a0112acd20
> (kgdb) p cpuid_to_pcpu[3]->pc_curthread
> $17 = (struct thread *) 0xfffff8a0112ac8c0
> (kgdb) p cpuid_to_pcpu[4]->pc_curthread
> $18 = (struct thread *) 0xfffff8a2aae7da40
> (kgdb) p cpuid_to_pcpu[5]->pc_curthread
> $19 = (struct thread *) 0xfffff8a2aa2a6460
> (kgdb) p cpuid_to_pcpu[6]->pc_curthread
> $20 = (struct thread *) 0xfffff8a2e3148d20
> (kgdb) p cpuid_to_pcpu[7]->pc_curthread
> $21 = (struct thread *) 0xfffff8a2e3ef48c0
> (kgdb) p cpuid_to_pcpu[8]->pc_curthread
> $22 = (struct thread *) 0xfffff8d32cfa0460
> (kgdb) p cpuid_to_pcpu[9]->pc_curthread
> $23 = (struct thread *) 0xfffff8a0112b3a40
> (kgdb) p cpuid_to_pcpu[10]->pc_curthread
> $24 = (struct thread *) 0xfffff8a2a8f77180
> (kgdb) p cpuid_to_pcpu[11]->pc_curthread
> $25 = (struct thread *) 0xfffff8a2e3ef1a40
> (kgdb) p cpuid_to_pcpu[12]->pc_curthread
> $26 = (struct thread *) 0xfffff8a2e319e8c0
> (kgdb) p cpuid_to_pcpu[13]->pc_curthread
> $27 = (struct thread *) 0xfffff8a2e3c30d20
> (kgdb) p cpuid_to_pcpu[14]->pc_curthread
> $28 = (struct thread *) 0xfffff8a0112b2460
> (kgdb) p cpuid_to_pcpu[15]->pc_curthread
> $29 = (struct thread *) 0xfffff8c1f78cb180
>
> Some rummaging around says that the object is locked by yarrow:
> (kgdb) p ((struct thread *) 0xfffff8a0112d75e0)->td_proc.p_comm
> $35 = "yarrow", '\0' 
>
> At this stage, I'm not sure where to go next.
>

Hrm, AFAICT this would mean that the _mtx_obtain_lock(), which boils
down to an atomic_cmpset_acq_ptr(), in _mtx_trylock() didn't work as
expected.  I currently can't think of a good reason why that could
happen, though; the assembly generated for that code also looks just
fine.  Have you run the workload which is triggering this before?  It
would be interesting to know whether it also happens with SCHED_4BSD
on current sources, as well as with pre-r226054 and pre-r225889
sources, if the machine previously survived that load.  Have you
enabled PREEMPTION by chance?  The other thing that worries me is that
this could be a silicon bug, especially since that machine also has
that issue of issuing stale vector interrupts, along with a state in
which it traps even on locked TLB entries, which isn't mentioned in
the public errata ...
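
For reference, that trylock path boils down to something like the
following (a simplified sketch of sys/kern/kern_mutex.c and
sys/sys/mutex.h of this vintage, from memory, not the verbatim
source):

	int
	_mtx_trylock(struct mtx *m, int opts, const char *file, int line)
	{
		int rval;

		...
		/*
		 * _mtx_obtain_lock(m, tid) is a single compare-and-swap
		 * of the lock word from MTX_UNOWNED to the current thread:
		 * atomic_cmpset_acq_ptr(&m->mtx_lock, MTX_UNOWNED, tid).
		 */
		rval = _mtx_obtain_lock(m, (uintptr_t)curthread);
		if (rval != 0)
			WITNESS_LOCK(&m->lock_object, opts | LOP_EXCLUSIVE |
			    LOP_TRYLOCK, file, line);
		...
		return (rval);
	}

If that cmpset can report success while the lock word still holds
another thread, as your dump seems to show, the mtx_assert() in
vm_object_clear_flag() would fire exactly the way you are seeing it.

Marius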