From owner-freebsd-hackers@FreeBSD.ORG Fri Apr 18 23:53:16 2003
Message-ID: <3EA0F1E0.9D0165B4@mindspring.com>
Date: Fri, 18 Apr 2003 23:51:12 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Don Lewis
cc: hackers@FreeBSD.org
Subject: Re: Repeated similar panics on -STABLE
References: <200304190335.h3J3ZhXB015986@gw.catspoiler.org>

Don Lewis wrote:
> On 18 Apr, Terry Lambert wrote:
> > Since the thread occurred on -hackers, thought I'd post here, too.
> >
> > The thread (see subject) around 02 Apr 2003 demonstrated a panic
> > in kern_malloc.c in malloc() as a result of the increased default
> > KVA space in the 4.x branch.
> >
> > I just posted a patch to -stable.
> >
> > If there is a committer interested in fixing this problem for
> > people who don't read -stable, they may want to grab it and
> > commit it.
>
> The patch looks ok, but I don't understand the failure mechanism.
> If kbp->kb_next is initially NULL, then kmem_malloc() should get
> called.  It doesn't look possible for kmem_malloc() to return NULL
> in the !M_NOWAIT case with the kmem_map.  It looks like
> kmem_malloc() will either retry or panic.

It's hard to see.  But it can happen.

The way it happens is to take a fault that causes freep to be NULL
in the case that kbp->kb_last == kbp->kb_next.  We know it happens,
because the "va" dereference fails in the same place on several
machines, and va is set to kbp->kb_next immediately before the
failure.

If you hear hoofbeats, look for horses, but once you've eliminated
the possibility of horses, well, zebras are a good guess.  8-).

Basically, if I allocate 1 page and put it on the freelist, and it
gets grabbed by something else in the meantime, then cp <= va is
true, and the refill loop breaks out with kbp->kb_next = NULL.

This can happen for zalloci() zones, for which there are page
mappings, but no backing pages (mbufs, whatever).  All that's
required is that you run out of physical RAM before you run out of
kmem_map -- a distinct possibility with a large KVA and a stair-step
function in the scaling, or even if you just follow the Handbook and
increase the number of mbufs/nmbclusters.

What the patch basically does is force the caller to sleep until
it's possible to satisfy the request (yeah, this could be "never",
but it's more likely to be a transient condition).
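For reference, the path in question looks roughly like this; I'm
paraphrasing the 4.x kern_malloc.c from memory, so treat the names
and details as approximate, and the retry loop at the end is what
the patch does in spirit, not the literal diff:

	/* malloc(size, type, flags), 4.x style, heavily abridged: */
	s = splmem();
	if (kbp->kb_next == NULL) {
		/* Bucket empty: get pages, carve them into objects. */
		va = (caddr_t)kmem_malloc(kmem_map,
		    (vm_size_t)ctob(npg), flags);
		/* ... carve loop chains the objects onto kbp ... */
	}
	va = kbp->kb_next;				/* can be NULL here   */
	kbp->kb_next = ((struct freelist *)va)->next;	/* NULL deref: trap 12 */

	/*
	 * Spirit of the fix: sleep and retry instead of dereferencing
	 * a NULL kb_next (the wmesg name here is illustrative).
	 */
	while (kbp->kb_next == NULL && (flags & M_NOWAIT) == 0) {
		tsleep((caddr_t)kbp, PSWP, "kmemwt", hz / 10);
		/* ... then re-attempt the kmem_malloc() refill ... */
	}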
It didn't happen with the smaller KVA, because the ratio of page
mapping overhead to pages, versus the users of zalloci(), happened
to keep things happy.  Plus, you have to have a pretty big load to
see it.

> I don't see how the kbp list could be refreshed and re-emptied as
> you suggested in a previous message, because we're protected by
> splmem(), except if kmem_malloc() blocks, and that can only happen
> before we put the newly allocated memory on the kbp list.
>
> Depending on how much you believe the line number reported by gdb,
> the trap is caused by
>	va = kbp->kb_next;
> and not the following line:
>	kbp->kb_next = ((struct freelist *)va)->next;
> which would tend to make me think that kbp was somehow getting
> stomped on.
>
> Something else that bothers me is
>
> >> fault virtual address = 0x5cdd8000
>
> If the trap is being caused by va being NULL, I'd expect the fault
> address to be 0x12, because the next member of struct freelist is
> at offset 12, at least for the i386 architecture.

No; the problem is that kbp->kb_next is NULL, not kbp itself.
Consider what happens if "cp -= allocsize" reaches zero: no matter
what va is, that's going to drop out of the "for(;;)" loop (there's
a sketch of that loop at the end of this message).

FWIW, -current has a similar case (no patch, sorry) because of the
new allocator, which can return NULL for any request if it runs out
of map space.

This happens because zalloci() *formerly* staked out space for each
zone, and merely failed to provide backing pages.  Now, with the new
allocator, the zone pages are demand-allocated; they're type-stable,
but they aren't region-stable any more.  Because of this, it's
possible for a zone grow to fail before the administrative limit on
the zone is reached, and so the allocation can fail.

There's a heck of a lot of code that depends on zalloci() never
being able to return NULL.  I've fixed two places for different
people so far, and told people to file bugs for about three others.
Basically, it comes down to tuning and load conditions that used to
be safe in 5.x, but aren't any more.

IMO, the new allocator has a number of problems like this, and the
correct place to fix them is probably the allocator itself.  That
can't mean making the allocation hang the caller until it can
proceed, though (that's OK to do in malloc(), because it's OK to
hang in the M_WAITOK case, and code using M_NOWAIT already expects
NULL; the 4.x fix is therefore pretty easy to do in one place).

My personal suggestion for 5.x, if you want to see people quit
complaining about "Trap 12" or switching over to 4.8 and giving 5.x
a "pass", would be to provide kernel mappings for all of the
physical RAM in the machine up front, and then allocate from that by
taking over the mappings.  That way, there's no such thing as
"kmem_map too small" or a trap 12 as a result of a zalloci() zone
grow failure: if there's no RAM, that doesn't mean there's no
mapping.  It also saves all the mapping work later.  I've suggested
this before.  8-).

Actually, if you want another patch for 4.x, I can save a good
10%-15% of the performance overhead in zalloci() and zalloc() with a
couple of trivial changes (unnecessary math is expensive,
particularly if it's the same math each time).

It's also possible to save a large amount of RAM in the zalloc() and
zalloci() code: once you know that the RAM is type-stable, if you
are willing to do a little better job of fit-to-page, it's pretty
easy to save an average of 64 bytes per object.  That's also a patch
for 4.x.
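Here is the refill loop I mentioned above, again paraphrased from
the 4.x kern_malloc.c from memory rather than copied from the
source; the point is that the exit is driven purely by the
"cp <= va" comparison, regardless of what va itself is:

	/*
	 * After kmem_malloc() returns a run of pages at va, the run
	 * is carved into allocsize-byte objects, chained top down.
	 */
	savedlist = kbp->kb_next;		/* often NULL */
	kbp->kb_next = cp = va + (npg * PAGE_SIZE) - allocsize;
	for (;;) {
		freep = (struct freelist *)cp;
		if (cp <= va)
			break;			/* last object in the run */
		cp -= allocsize;
		freep->next = cp;
	}
	freep->next = savedlist;		/* chain can end in NULL */

With a one-page, one-object run, the very first "cp <= va" test is
true, and kbp->kb_next ends up pointing at a single object whose
next pointer is the (possibly NULL) savedlist.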
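The -current fixes are mostly one-line NULL checks at the call
sites.  Purely for illustration, the in_pcballoc()-style pattern
looks like this (not necessarily one of the spots I actually fixed):

	/* zalloci() can now return NULL; callers have to cope. */
	struct inpcb *inp;

	inp = (struct inpcb *)zalloci(pcbinfo->ipi_zone);
	if (inp == NULL)
		return (ENOBUFS);	/* fail the caller, not trap 12 */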
Those zalloc()/zalloci() patches aren't really a high priority,
though, unless you are trying to squeeze everything you can out of
the box, and are dealing with incredible RAM pressure to do it...

-- Terry