Date: Sat, 2 Jul 2011 02:23:25 +0200
From: Marius Strobl
To: Peter Jeremy
Cc: "alc@freebsd.org", freebsd-sparc64@freebsd.org
Subject: Re: 'make -j16 universe' gives SIReset
Message-ID: <20110702002325.GS14797@alchemy.franken.de>
In-Reply-To: <20110630221752.GG65891@pjdesk.au.alcatel-lucent.com>

On Fri, Jul 01, 2011 at 08:17:52AM +1000, Peter Jeremy wrote:
> [Moving back on-list]
>
> On 2011-Jun-30 06:30:08 +0800, Marius Strobl wrote:
> >On Thu, Jun 30, 2011 at 08:00:10AM +1000, Peter Jeremy wrote:
> >> On 2011-Jun-29 19:54:44 +0200, Marius Strobl wrote:
> >> >On Wed, Jun 29, 2011 at 12:54:33PM +1000, Peter Jeremy wrote:
> >> >> My V890 has been running "make -j32 buildworld" in a loop for a
> >> >> week now without problems so I think that was the problem.
> >>
> >> OTOH, a V440 that has been running similar load for a similar period
> >> died overnight with:
> >>
> >> panic: uma_small_alloc: free page still has mappings!
> >> VNASSERT failed
> >> cpuid = 3
> >> 0xfffff800079643c0: KDB: enter: panic
> ...
> >> I'm fairly sure that is the same kernel but will double-check and
> >> investigate that panic further.
>
> FWIW, that kernel didn't have the latest patchset (adding Zeus support).

That shouldn't make a difference; the later version only adds the
SPARC64 bits as you already noticed and adjusts the boot loader to
compile again. I made no changes to the existing parts apart from
fixing a comment. Besides, so far I see no connection between fixing
the gross user TLB flushing and the problem below.

>
> >Ok, this appears to be an unrelated problem though. Alan, do you
> >have an idea what could be causing this?
>
> I managed to get the same panic (though different traceback) on the
> V890 after about an hour of pho@'s stress test with INCARNATIONS=150:
>
> panic: uma_small_alloc: free page still has mappings!
> cpuid = 1
> KDB: enter: panic
> [ thread pid 142 tid 100196 ]
> Stopped at kdb_enter+0x80: ta %xcc, 1
> db> where
> Tracing pid 142 tid 100196 td 0xfffff8a016ace880
> panic() at panic+0x20c
> uma_small_alloc() at uma_small_alloc+0xe8
> keg_alloc_slab() at keg_alloc_slab+0xc8
> keg_fetch_slab() at keg_fetch_slab+0x218
> zone_fetch_slab() at zone_fetch_slab+0x44
> uma_zalloc_arg() at uma_zalloc_arg+0x60c
> m_getm2() at m_getm2+0x134
> m_uiotombuf() at m_uiotombuf+0x4c
> sosend_generic() at sosend_generic+0x420
> sosend() at sosend+0x2c
> soo_write() at soo_write+0x3c
> dofilewrite() at dofilewrite+0x7c
> kern_writev() at kern_writev+0x38
> write() at write+0x4c
> syscallenter() at syscallenter+0x270
> syscall() at syscall+0x74
> -- syscall (4, FreeBSD ELF64, write) %o7=0x101db4 --
> userland() at 0x405936c8
> user trace: trap %o7=0x101db4
> pc 0x405936c8, sp 0x7fdffffd8a1
> pc 0x101f44, sp 0x7fdffffd9a1
> pc 0x104604, sp 0x7fdffffda81
> pc 0x1046f0, sp 0x7fdffffdb51
> pc 0x104994, sp 0x7fdffffdc21
> pc 0x104d90, sp 0x7fdffffdd01
> pc 0x101610, sp 0x7fdffffde41
> pc 0x4020cff4, sp 0x7fdffffdf01
> done
> db>
>
> I've got a crashdump on the V440 but discovered that gdb reports
> "GDB can't read core files on this machine." so it isn't much use.
> Any suggestions on how to debug this?

The VM and its interaction with the MD code are beyond me; I hope Alan
can chime in here. Reading through the code, though, I see a possible
path which could lead to this. tsb_tte_enter(), which is the only place
where TD_PV is ever set (and only for managed pages), always calls
pmap_cache_enter(), which together with pmap_cache_remove() does the
page color handling. In pmap_remove_all(), however, pmap_cache_remove()
is only called for managed pages, so for unmanaged pages we might miss
the removal of the mapping from the color used. I've no idea though
whether this is actually relevant, i.e. whether the VM ever calls
pmap_remove_all() for unmanaged pages.
Tentatively I'd say it doesn't, in which case the only solution I see
is to exclude unmanaged pages from the page color handling and caching,
though I don't know whether that is safe (besides impacting
performance). Unfortunately, I can't reproduce this with my gear. Could
you please try the patch below? I've no idea whether it's correct, but
it might give another data point.

Marius

Index: pmap.c
===================================================================
--- pmap.c	(revision 223705)
+++ pmap.c	(working copy)
@@ -1382,21 +1385,21 @@ pmap_remove_all(vm_page_t m)
 	vm_page_lock_queues();
 	for (tp = TAILQ_FIRST(&m->md.tte_list); tp != NULL; tp = tpn) {
 		tpn = TAILQ_NEXT(tp, tte_link);
-		if ((tp->tte_data & TD_PV) == 0)
-			continue;
 		pm = TTE_GET_PMAP(tp);
 		va = TTE_GET_VA(tp);
 		PMAP_LOCK(pm);
 		if ((tp->tte_data & TD_WIRED) != 0)
 			pm->pm_stats.wired_count--;
-		if ((tp->tte_data & TD_REF) != 0)
-			vm_page_flag_set(m, PG_REFERENCED);
-		if ((tp->tte_data & TD_W) != 0)
-			vm_page_dirty(m);
+		if ((tp->tte_data & TD_PV) != 0) {
+			if ((tp->tte_data & TD_REF) != 0)
+				vm_page_flag_set(m, PG_REFERENCED);
+			if ((tp->tte_data & TD_W) != 0)
+				vm_page_dirty(m);
+			pm->pm_stats.resident_count--;
+		}
 		tp->tte_data &= ~TD_V;
 		tlb_page_demap(pm, va);
 		TAILQ_REMOVE(&m->md.tte_list, tp, tte_link);
-		pm->pm_stats.resident_count--;
 		pmap_cache_remove(m, va);
 		TTE_ZERO(tp);
 		PMAP_UNLOCK(pm);
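
For illustration only, the suspected path can be sketched as a small
user-space model; this is not kernel code. Everything named toy_* is
hypothetical, the flag values are arbitrary rather than the real sparc64
TTE layout, and the single color_mappings counter is an assumed
simplification of whatever pmap_cache_enter() and pmap_cache_remove()
actually track. The sketch only shows how skipping entries without TD_PV
could leave color accounting behind, and how tearing down every entry
would avoid that.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits; arbitrary values, not the real TTE layout. */
#define TD_V	0x1u	/* mapping is valid */
#define TD_PV	0x2u	/* mapping belongs to a managed page */
#define TD_W	0x4u	/* mapping was written through */

/* Hypothetical stand-in for a TTE: only the flag word is modeled. */
struct toy_tte {
	unsigned int data;
};

/* Hypothetical stand-in for a page: one counter models color bookkeeping. */
struct toy_page {
	int color_mappings;	/* mappings accounted to the page's color */
	int dirty;
};

/* Assumption: entering a mapping always updates the color accounting. */
static void
toy_enter(struct toy_page *m, struct toy_tte *tp, int managed,
    unsigned int extra)
{
	tp->data = TD_V | extra | (managed ? TD_PV : 0u);
	m->color_mappings++;
}

static void
toy_cache_remove(struct toy_page *m)
{
	assert(m->color_mappings > 0);
	m->color_mappings--;
}

/*
 * Toy removal loop. With skip_unmanaged != 0 it mirrors the unpatched
 * logic (entries without TD_PV are skipped entirely); with
 * skip_unmanaged == 0 it mirrors the patched logic (every entry is torn
 * down and only the managed-page bookkeeping is conditional).
 */
static void
toy_remove_all(struct toy_page *m, struct toy_tte *tp, size_t n,
    int skip_unmanaged)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (skip_unmanaged && (tp[i].data & TD_PV) == 0)
			continue;
		if ((tp[i].data & TD_PV) != 0 && (tp[i].data & TD_W) != 0)
			m->dirty = 1;
		tp[i].data = 0;
		toy_cache_remove(m);
	}
}

/* Enter one managed and one unmanaged mapping, remove all, report leftovers. */
int
toy_leftover(int skip_unmanaged)
{
	struct toy_page m = { 0, 0 };
	struct toy_tte t[2];

	toy_enter(&m, &t[0], 1, TD_W);
	toy_enter(&m, &t[1], 0, 0u);
	toy_remove_all(&m, t, 2, skip_unmanaged);
	return (m.color_mappings);
}
```

In this model toy_leftover(1) leaves one mapping accounted to the color
while toy_leftover(0) leaves none. Whether the real code can reach the
equivalent state depends on whether the VM ever calls pmap_remove_all()
for unmanaged pages, which remains the open question.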