From owner-freebsd-sparc64@FreeBSD.ORG Sat Jul 2 19:22:39 2011
Message-ID: <4E0F6B8D.8000500@rice.edu>
Date: Sat, 02 Jul 2011 14:03:41 -0500
From: Alan Cox
To: Marius Strobl
Cc: Peter Jeremy, "alc@freebsd.org", freebsd-sparc64@freebsd.org
Subject: Re: 'make -j16 universe' gives SIReset
References: <20110608224801.GB35494@alchemy.franken.de>
 <20110613235144.GA12470@server.vk2pj.dyndns.org>
 <20110615233445.GZ7064@alchemy.franken.de>
 <20110619220033.GA61397@server.vk2pj.dyndns.org>
 <20110622100524.GO14797@alchemy.franken.de>
 <20110629025433.GA48145@server.vk2pj.dyndns.org>
 <20110629175444.GH14797@alchemy.franken.de>
 <20110629220010.GA53017@pjdesk.au.alcatel-lucent.com>
 <20110629223008.GL14797@alchemy.franken.de>
 <20110630221752.GG65891@pjdesk.au.alcatel-lucent.com>
 <20110702002325.GS14797@alchemy.franken.de>
In-Reply-To: <20110702002325.GS14797@alchemy.franken.de>
List-Id: Porting FreeBSD to the Sparc

On 07/01/2011 19:23, Marius Strobl wrote:
> On Fri, Jul 01, 2011 at 08:17:52AM +1000, Peter Jeremy wrote:
>> [Moving back on-list]
>>
>> On 2011-Jun-30 06:30:08 +0800, Marius Strobl wrote:
>>> On Thu, Jun 30, 2011 at 08:00:10AM +1000, Peter Jeremy wrote:
>>>> On 2011-Jun-29 19:54:44 +0200, Marius Strobl wrote:
>>>>> On Wed, Jun 29, 2011 at 12:54:33PM +1000, Peter Jeremy wrote:
>>>>>> My V890 has been running "make -j32 buildworld" in a loop for a
>>>>>> week now without problems so I think that was the problem.
>>>> OTOH, a V440 that has been running similar load for a similar period
>>>> died overnight with:
>>>>
>>>> panic: uma_small_alloc: free page still has mappings!
>>>> VNASSERT failed
>>>> cpuid = 3
>>>> 0xfffff800079643c0: KDB: enter: panic
>> ...
>>>> I'm fairly sure that is the same kernel but will double-check and
>>>> investigate that panic further.
>> FWIW, that kernel didn't have the latest patchset (adding Zeus support).
> That shouldn't make a difference; the later version only adds the
> SPARC64 bits as you already noticed and adjusts the boot loader to
> compile again. I made no changes to the existing parts apart from
> fixing a comment. Besides I see no connection between fixing the
> gross user TLB flushing and the below problem so far.
>
>>> Ok, this appears to be an unrelated problem though. Alan, do you
>>> have an idea what could be causing this?
>> I managed to get the same panic (though different traceback) on the
>> V890 after about an hour of pho@'s stress test with INCARNATIONS=150:
>>
>> panic: uma_small_alloc: free page still has mappings!
>> cpuid = 1
>> KDB: enter: panic
>> [ thread pid 142 tid 100196 ]
>> Stopped at kdb_enter+0x80: ta %xcc, 1
>> db> where
>> Tracing pid 142 tid 100196 td 0xfffff8a016ace880
>> panic() at panic+0x20c
>> uma_small_alloc() at uma_small_alloc+0xe8
>> keg_alloc_slab() at keg_alloc_slab+0xc8
>> keg_fetch_slab() at keg_fetch_slab+0x218
>> zone_fetch_slab() at zone_fetch_slab+0x44
>> uma_zalloc_arg() at uma_zalloc_arg+0x60c
>> m_getm2() at m_getm2+0x134
>> m_uiotombuf() at m_uiotombuf+0x4c
>> sosend_generic() at sosend_generic+0x420
>> sosend() at sosend+0x2c
>> soo_write() at soo_write+0x3c
>> dofilewrite() at dofilewrite+0x7c
>> kern_writev() at kern_writev+0x38
>> write() at write+0x4c
>> syscallenter() at syscallenter+0x270
>> syscall() at syscall+0x74
>> -- syscall (4, FreeBSD ELF64, write) %o7=0x101db4 --
>> userland() at 0x405936c8
>> user trace: trap %o7=0x101db4
>> pc 0x405936c8, sp 0x7fdffffd8a1
>> pc 0x101f44, sp 0x7fdffffd9a1
>> pc 0x104604, sp 0x7fdffffda81
>> pc 0x1046f0, sp 0x7fdffffdb51
>> pc 0x104994, sp 0x7fdffffdc21
>> pc 0x104d90, sp 0x7fdffffdd01
>> pc 0x101610, sp 0x7fdffffde41
>> pc 0x4020cff4, sp 0x7fdffffdf01
>> done
>> db>
>>
>> I've got a crashdump on the V440 but discovered that gdb reports
>> "GDB can't read core files on this machine." so it isn't much use.
>> Any suggestions on how to debug this?
> The VM and its interaction with the MD code are beyond me, I hope
> Alan can chime in here. Reading through the code I see a possible
> path which could lead to this though; tsb_tte_enter(), which is
> the only place where TD_PV ever is set and also only in case of
> managed pages, always calls pmap_cache_enter(), which together
> with pmap_cache_remove() does the page color handling.
> In pmap_remove_all() however, pmap_cache_remove() is only called for
> managed pages, so for unmanaged pages we might miss the removal
> of the mapping from the color used. I've no idea though if
> this actually is relevant, i.e. whether the VM ever calls
> pmap_remove_all() for unmanaged pages.

In HEAD, it does not. Other architectures have an assertion forbidding
pmap_remove_all() calls on unmanaged pages. (Btw, I'm happy to add this
assertion to sparc64's pmap if you like.)

In older versions, calling pmap_remove_all() on unmanaged pages is
expected to be a harmless NOP that's just a waste of cycles. With
unmanaged pages, it is expected that pmap_remove() is used to destroy
mappings before the page is freed.

For years, vm_page_free{,_toq}() has asserted that the page has no
managed mappings:

        if ((m->flags & PG_UNMANAGED) == 0) {
                vm_page_lock_assert(m, MA_OWNED);
                KASSERT(!pmap_page_is_mapped(m),
                    ("vm_page_free_toq: freeing mapped page %p", m));
        }

As a debugging aid, you might want to add an additional check here on
colors.

> ... Tentatively I'd say it
> doesn't, in which case the only solution I see is to exclude
> unmanaged pages from the page color handling and caching, which
> I don't know whether it's safe (besides impacting performance).
> Unfortunately, with my gear I can't reproduce this. Could you
> please try the below patch? I've no idea whether it's correct
> but might give another datapoint.
>
> Marius
>
> Index: pmap.c
> ===================================================================
> --- pmap.c (revision 223705)
> +++ pmap.c (working copy)
> @@ -1382,21 +1385,21 @@ pmap_remove_all(vm_page_t m)
>       vm_page_lock_queues();
>       for (tp = TAILQ_FIRST(&m->md.tte_list); tp != NULL; tp = tpn) {
>               tpn = TAILQ_NEXT(tp, tte_link);
> -             if ((tp->tte_data & TD_PV) == 0)
> -                     continue;
>               pm = TTE_GET_PMAP(tp);
>               va = TTE_GET_VA(tp);
>               PMAP_LOCK(pm);
>               if ((tp->tte_data & TD_WIRED) != 0)
>                       pm->pm_stats.wired_count--;
> -             if ((tp->tte_data & TD_REF) != 0)
> -                     vm_page_flag_set(m, PG_REFERENCED);
> -             if ((tp->tte_data & TD_W) != 0)
> -                     vm_page_dirty(m);
> +             if ((tp->tte_data & TD_PV) != 0) {
> +                     if ((tp->tte_data & TD_REF) != 0)
> +                             vm_page_flag_set(m, PG_REFERENCED);
> +                     if ((tp->tte_data & TD_W) != 0)
> +                             vm_page_dirty(m);
> +                     pm->pm_stats.resident_count--;
> +             }
>               tp->tte_data &= ~TD_V;
>               tlb_page_demap(pm, va);
>               TAILQ_REMOVE(&m->md.tte_list, tp, tte_link);
> -             pm->pm_stats.resident_count--;
>               pmap_cache_remove(m, va);
>               TTE_ZERO(tp);
>               PMAP_UNLOCK(pm);
>
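
The extra color check suggested above as a debugging aid could look
roughly like the following user-space sketch. It is an illustration
only, not kernel code and not the sparc64 pmap itself: struct
model_page, MODEL_DCACHE_COLORS and the model_*() functions are
hypothetical names, the assumption is that the color handling keeps a
per-color count of mappings for each page, and assert() stands in for
a KASSERT in the free path. The point is that the check looks at the
color counts whether or not the page is managed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: two data-cache colors per page. */
#define MODEL_DCACHE_COLORS 2

/* Hypothetical stand-in for the per-page machine-dependent state. */
struct model_page {
	uint32_t colors[MODEL_DCACHE_COLORS];	/* live mappings per color */
};

/* Model of entering a mapping: count it against its cache color. */
static void
model_cache_enter(struct model_page *m, unsigned color)
{
	m->colors[color % MODEL_DCACHE_COLORS]++;
}

/* Model of removing a mapping: drop the count for its cache color. */
static void
model_cache_remove(struct model_page *m, unsigned color)
{
	uint32_t *c = &m->colors[color % MODEL_DCACHE_COLORS];

	if (*c > 0)
		(*c)--;
}

/* Return the first color that still has mappings, or -1 if none do. */
static int
model_leaked_color(const struct model_page *m)
{
	int i;

	for (i = 0; i < MODEL_DCACHE_COLORS; i++)
		if (m->colors[i] != 0)
			return (i);
	return (-1);
}

/*
 * Model of the additional check at free time.  Unlike the existing
 * managed-mappings assertion, it does not look at a managed flag at
 * all, so a missed removal on an unmanaged page is caught here
 * rather than at the next allocation of the page.
 */
static void
model_page_free(struct model_page *m)
{
	assert(model_leaked_color(m) == -1);
}
```

In this model, a mapping that is entered but never removed makes
model_page_free() trip at the point where the page is freed, which
narrows down who leaked the mapping; a matched enter/remove pair
passes silently.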