From owner-freebsd-hackers@FreeBSD.ORG Fri Apr 18 23:53:16 2003
Message-ID: <3EA0F1E0.9D0165B4@mindspring.com>
Date: Fri, 18 Apr 2003 23:51:12 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Don Lewis
cc: hackers@FreeBSD.org
Subject: Re: Repeated similar panics on -STABLE
References: <200304190335.h3J3ZhXB015986@gw.catspoiler.org>

Don Lewis wrote:
> On 18 Apr, Terry Lambert wrote:
> > Since the thread occurred on -hackers, thought I'd post here, too.
> >
> > The thread (see subject) around 02 Apr 2003 demonstrated a panic
> > in kern_malloc.c in malloc() as a result of the increased default
> > KVA space in the 4.x branch.
> >
> > I just posted a patch to -stable.
> >
> > If there is a committer interested in fixing this problem for
> > people who don't read -stable, they may want to grab it and
> > commit it.
>
> The patch looks ok, but I don't understand the failure mechanism.
> If kbp->kb_next is initially NULL, then kmem_malloc() should get
> called.  It doesn't look possible for kmem_malloc() to return NULL
> in the !M_NOWAIT case with the kmem_map.  It looks like
> kmem_malloc() will either retry or panic.

It's hard to see.  But it can happen.

The way it happens is to take a fault that causes freep to be NULL
in the case that kbp->kb_last == kbp->kb_next.  We know it happens,
because the "va" dereference fails in the same place on several
machines, and va is set to kbp->kb_next immediately before the
failure.

If you hear hoofbeats, look for horses, but once you've eliminated
the possibility of horses, well, zebras are a good guess.  8-).

Basically, if I allocate 1 page and put it on the freelist, and it
gets grabbed by something else in the meantime, then cp <= va is
true, and the refill loop breaks out with kbp->kb_next = NULL.

This can happen for zalloci() zones, for which there are page
mappings, but no backing pages (mbufs, whatever).  All that's
required is that you run out of physical RAM before you run out of
kmem_map -- a distinct possibility with a large KVA and a stair-step
function in the scaling, or even if you just follow the Handbook and
increase the number of mbufs/nmbclusters.

What the patch basically does is force the caller to sleep until
it's possible to satisfy the request (yeah, this could be "never",
but it's more likely to be a transient condition).
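For reference, the path in question looks roughly like this; I'm
paraphrasing the 4.x kern_malloc.c from memory, so treat the names
and details as approximate, and the retry loop at the end is what
the patch does in spirit, not the literal diff:

	/* malloc(size, type, flags), 4.x style, heavily abridged: */
	s = splmem();
	if (kbp->kb_next == NULL) {
		/* Bucket empty: get pages, carve them into objects. */
		va = (caddr_t)kmem_malloc(kmem_map,
		    (vm_size_t)ctob(npg), flags);
		/* ... carve loop chains the objects onto kbp ... */
	}
	va = kbp->kb_next;				/* can be NULL here   */
	kbp->kb_next = ((struct freelist *)va)->next;	/* NULL deref: trap 12 */

	/*
	 * Spirit of the fix: sleep and retry instead of dereferencing
	 * a NULL kb_next (the wmesg name here is illustrative).
	 */
	while (kbp->kb_next == NULL && (flags & M_NOWAIT) == 0) {
		tsleep((caddr_t)kbp, PSWP, "kmemwt", hz / 10);
		/* ... then re-attempt the kmem_malloc() refill ... */
	}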
It didn't happen with the smaller KVA, because the ratio of page
mapping overhead to pages, versus the users of zalloci(), happened
to keep things happy.  Plus, you have to have a pretty big load to
see it.

> I don't see how the kbp list could be refreshed and re-emptied as
> you suggested in a previous message, because we're protected by
> splmem(), except if kmem_malloc() blocks, and that can only happen
> before we put the newly allocated memory on the kbp list.
>
> Depending on how much you believe the line number reported by gdb,
> the trap is caused by
>	va = kbp->kb_next;
> and not the following line:
>	kbp->kb_next = ((struct freelist *)va)->next;
> which would tend to make me think that kbp was somehow getting
> stomped on.
>
> Something else that bothers me is
>
> >> fault virtual address = 0x5cdd8000
>
> If the trap is being caused by va being NULL, I'd expect the fault
> address to be 0x12, because the next member of struct freelist is
> at offset 12, at least for the i386 architecture.

No; the problem is that kbp->kb_next is NULL, not kbp itself.
Consider what happens if "cp -= allocsize" reaches zero: no matter
what va is, that's going to drop out of the "for(;;)" loop (there's
a sketch of that loop at the end of this message).

FWIW, -current has a similar case (no patch, sorry) because of the
new allocator, which can return NULL for any request if it runs out
of map space.

This happens because zalloci() *formerly* staked out space for each
zone, and merely failed to provide backing pages.  Now, with the new
allocator, the zone pages are demand-allocated; they're type-stable,
but they aren't region-stable any more.  Because of this, it's
possible for a zone grow to fail before the administrative limit on
the zone is reached, and so the allocation can fail.

There's a heck of a lot of code that depends on zalloci() never
being able to return NULL.  I've fixed two places for different
people so far, and told people to file bugs for about three others.
Basically, it comes down to tuning and load conditions that used to
be safe in 5.x, but aren't any more.

IMO, the new allocator has a number of problems like this, and the
correct place to fix them is probably the allocator itself.  That
can't mean making the allocation hang the caller until it can
proceed, though (that's OK to do in malloc(), because it's OK to
hang in the M_WAITOK case, and code using M_NOWAIT already expects
NULL; the 4.x fix is therefore pretty easy to do in one place).

My personal suggestion for 5.x, if you want to see people quit
complaining about "Trap 12" or switching over to 4.8 and giving 5.x
a "pass", would be to provide kernel mappings for all of the
physical RAM in the machine up front, and then allocate from that by
taking over the mappings.  That way, there's no such thing as
"kmem_map too small" or a trap 12 as a result of a zalloci() zone
grow failure: if there's no RAM, that doesn't mean there's no
mapping.  It also saves all the mapping work later.  I've suggested
this before.  8-).

Actually, if you want another patch for 4.x, I can save a good
10%-15% of the performance overhead in zalloci() and zalloc() with a
couple of trivial changes (unnecessary math is expensive,
particularly if it's the same math each time).

It's also possible to save a large amount of RAM in the zalloc() and
zalloci() code: once you know that the RAM is type-stable, if you
are willing to do a little better job of fit-to-page, it's pretty
easy to save an average of 64 bytes per object.  That's also a patch
for 4.x.
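Here is the refill loop I mentioned above, again paraphrased from
the 4.x kern_malloc.c from memory rather than copied from the
source; the point is that the exit is driven purely by the
"cp <= va" comparison, regardless of what va itself is:

	/*
	 * After kmem_malloc() returns a run of pages at va, the run
	 * is carved into allocsize-byte objects, chained top down.
	 */
	savedlist = kbp->kb_next;		/* often NULL */
	kbp->kb_next = cp = va + (npg * PAGE_SIZE) - allocsize;
	for (;;) {
		freep = (struct freelist *)cp;
		if (cp <= va)
			break;			/* last object in the run */
		cp -= allocsize;
		freep->next = cp;
	}
	freep->next = savedlist;		/* chain can end in NULL */

With a one-page, one-object run, the very first "cp <= va" test is
true, and kbp->kb_next ends up pointing at a single object whose
next pointer is the (possibly NULL) savedlist.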
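The -current fixes are mostly one-line NULL checks at the call
sites.  Purely for illustration, the in_pcballoc()-style pattern
looks like this (not necessarily one of the spots I actually fixed):

	/* zalloci() can now return NULL; callers have to cope. */
	struct inpcb *inp;

	inp = (struct inpcb *)zalloci(pcbinfo->ipi_zone);
	if (inp == NULL)
		return (ENOBUFS);	/* fail the caller, not trap 12 */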
Those zalloc()/zalloci() patches aren't really a high priority,
though, unless you are trying to squeeze everything you can out of
the box, and are dealing with incredible RAM pressure to do it...

-- Terry