Date:      Wed, 26 Mar 2003 14:05:24 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        andykinney@advantagecom.net
Cc:        freebsd-hackers@FreeBSD.ORG
Subject:   Re: shared mem and panics when out of PV Entries
Message-ID:  <3E822424.C0D6E8E9@mindspring.com>
References:  <3E815E80.18738.3AC0E20@localhost>

Andrew Kinney wrote:
> On 25 Mar 2003, at 19:28, Terry Lambert wrote:
> > Basically, you don't really care about pv_entry_t's, you care
> > about KVA space, and running out of it.
> >
> > In a previous posting, you suggested increasing KVA_PAGES fixed
> > the problem, but caused a pthreads problem.
> 
> Will running out of KVA space indirectly cause PV Entries to hit its
> limit as shown in sysctl vm.zone?

Yes.  The UMA code preallocates only a small set of entries
in a zone, and then allocates more on an as-needed basis.

Previously, the zalloci() code was used (the "i" stands for
"interrupt").  This allocator preestablished page mappings
(but not necessarily pages) for every object that could be
allocated in the zone.

The zalloci() approach had the benefit of preallocating
mappings, so you could not run out.  The UMA approach has the
disadvantage that you can now run out unexpectedly, if something
else puts pressure on the number of page mappings and exhausts
them first.

The main reason this happens is that zone allocations are
type-stable: once memory is allocated to a particular type,
it stays dedicated to that type.

By giving a larger "minimum", you are effectively reverting
to the old behaviour of defining a "maximum", by allocating
higher than any possible usage.

Note that you can still run out, if you go over this number,
and that running out of some things is fatal.
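
(For concreteness: the "minimum" here is the number of entries
pushed onto the zone's free list when the pmap code initializes.
From memory, the 5.x pmap_init() does roughly the following; the
exact calls may differ, so treat it as a sketch:

	initial_pvs = vm_page_array_size;
	if (initial_pvs < MINPV)
		initial_pvs = MINPV;		/* compile-time floor */
	pvzone = uma_zcreate("PV ENTRY", sizeof(struct pv_entry),
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR,
	    UMA_ZONE_VM | UMA_ZONE_NOFREE);
	uma_prealloc(pvzone, initial_pvs);	/* raising MINPV raises this */

Everything past that preallocation is allocated on demand, which
is where the new failure mode comes from.)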

Basically, when the new allocation policies went in, the
code they were applied to was not checked for low-end
failure cases, so there are some introduced bugs that are
slowly being beaten out of old code that never before had
to deal with an allocation failure under normal conditions.

The code changes I posted only work around the introduced
bugs in this one case; I expect that if you push your hernia
in with a girdle, it will pop out somewhere else.  But at
least you will be doing valuable work, identifying where the
introduced bugs live.  8-).


> To my knowledge, I've never seen a panic on this system
> directly resulting from running out of KVA space.  They've
> all been traced back to running out of available PV Entries.

But ask yourself "Why did the allocation of new PV Entries
fail this time?".  The answer is that you ran out of page
mappings for the new page you wanted to allocate to contain
the new entries.

As I said, it's an introduced bug, and a side effect of the
change in zone allocation policy implementation.  The patch I
posted lets you work around it by pushing the number of
preallocated "PV Entries" above the number you will ever need,
at the expense of maybe running out of page mappings somewhere
else.

Technically, the code should have been changed to attempt to
prereserve all necessary mappings on a fork(), and, if that
was not possible, to fail the fork().  Probably this would
require counted lists, so you could lock, see how many were
free, unlock, attempt to allocate blocks until enough more
were free, relock, verify the count, and then complete the
operation and unlock.
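
(Purely to illustrate that pattern, here is a sketch in ordinary
user-space C; none of these names exist in the kernel, it is just
the lock / check the count / grow / re-verify / commit-or-fail
shape described above:

	#include <pthread.h>

	static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
	static int pool_free;	/* items sitting on the free list */
	static int pool_total;	/* items ever added to the pool */
	#define POOL_MAX 1024	/* hard ceiling, stands in for KVA limits */

	/* Try to add n items; returns how many were actually added. */
	static int
	pool_grow(int n)
	{
		int added;

		pthread_mutex_lock(&pool_lock);
		added = (pool_total + n > POOL_MAX) ?
		    POOL_MAX - pool_total : n;
		pool_total += added;
		pool_free += added;
		pthread_mutex_unlock(&pool_lock);
		return (added);
	}

	/* Reserve 'needed' items up front, or fail without taking any. */
	int
	pool_reserve(int needed)
	{
		for (;;) {
			pthread_mutex_lock(&pool_lock);
			if (pool_free >= needed) {
				pool_free -= needed;	/* commit under the lock */
				pthread_mutex_unlock(&pool_lock);
				return (0);
			}
			pthread_mutex_unlock(&pool_lock);
			if (pool_grow(needed) == 0)
				return (-1);	/* can't grow: fail the fork() */
		}
	}

A fork() that went through something like pool_reserve() and had
it fail could return ENOMEM to the parent, instead of leaving a
later get_pv_entry() to blow up.)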


> I'm invariably hitting the panic in pmap_insert_entry() and I only get
> the panic when I run out of available PV Entries.  I've seen nothing
> to indicate that running out of KVA space is causing the panics,
> though I'm still learning the ropes of the BSD memory management
> code and recognize that there are many interactions with different
> portions of the memory management code that could have
> unforeseen results.

You need to look at the traceback, and the function the panic
is actually called from.  Neither pmap_insert_entry() nor
get_pv_entry() calls panic directly.

Once you understand who is calling whom, you can understand
why panic is called rather than an error merely being returned
to the caller.  Basically, it boils down to the caller being
unable to accept an allocation failure.


> Regarding the other thread you mentioned, increasing
> KVA_PAGES was just a way to make it possible to squeeze a
> higher PV Entry limit out of the system because it would allow a
> higher value for PMAP_SHPGPERPROC while still allowing the
> system to boot.  I have not determined if it "fixed the problem"
> because I had to revert to an old kernel when MySQL wigged out
> on boot, apparently due to the threading issue in 4.7 that shows up
> with increased KVA_PAGES.  I never got a chance to increase
> PMAP_SHPGPERPROC after increasing KVA_PAGES because
> MySQL is an important service on this system and I had to get it
> back up and running.

Yes.  I also suggested how to crank up the initial number of
pv_entry_t's in the first place, so that the allocation failure
won't happen in the code path that calls panic() in the event of
an allocation failure it's not expecting.  8-).
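
(For reference, the limit those knobs feed is computed in
pmap_init2(); quoting loosely from memory, so the details may
differ slightly:

	pv_entry_max = shpgperproc * maxproc + vm_page_array_size;
	TUNABLE_INT_FETCH("vm.pmap.pv_entries", &pv_entry_max);
	pv_entry_high_water = 9 * (pv_entry_max / 10);

shpgperproc defaults to PMAP_SHPGPERPROC and can itself be
overridden with the vm.pmap.shpgperproc loader tunable.)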

The MySQL problem is the threads mailbox issue.  You can fix it
the way I suggested, so you can use a larger KVA_PAGES, without
the threading issue showing up.  I just didn't go into detail on
the lines of code to change to do it, but it's conceptually very
easy.

Frankly, I would suggest using FreeBSD pthreads, instead.

> > What you meant to say is that it caused a Linux threads kernel
> > module mailbox location problem for the user space Linux threads
> > library.  In other words, it's because you are using the Linux
> > threads implementation, that you have this problem, not
> > FreeBSD's pthreads.
> 
> I may have misspoken in the previous thread about pthreads having
> a problem when KVA_PAGES was increased.  I was referencing a
> previous thread in which the author stated pthreads had a problem
> when KVA_PAGES was increased and had assumed that the
> author knew what he was talking about.  At any rate, this was
> apparently patched and included into the RELENG_4 tree after 4.7-
> RELEASE.

It's not a FreeBSD problem, is what I was saying here.  I just
chose an ironic way to say it; sorry.  8-).

> Also worth reiterating is that resource usage by Apache is the
> source of the panics.  The version I'm using is 1.3.27, so it doesn't
> even make use of threading, at least not like Apache 2.0.  I would
> just switch to Apache 2.0, but it doesn't support all the modules we
> need yet.  Threads were only an issue with MySQL when
> KVA_PAGES>256, which doesn't appear to be related to the
> panics happening while KVA_PAGES=256.

Resource usage should *NEVER* cause a panic.  *NEVER*.  The
worst it should cause is load shedding (denial of service for
new processes while being overworked by old processes).


> > In any case, the problem you are having is because the uma_zalloc()
> > (UMA) allocator is feeling KVA space pressure.
> >
> > One way to move this pressure somewhere else, rather than dealing with
> > it in an area which results in a panic on you because the code was not
> > properly retrofit for the limitations of UMA, is to decide to
> > preallocate the UMA region used for the "PV ENTRY" zone.
> 
> I haven't read up on that section of the source, but I'll go do so now
> and determine if the changes you suggested would help in this
> case.  I know in some other posts you're a strong advocate for
> mapping all physical RAM into KVA right up front rather than
> messing around with some subset of physical RAM getting
> mapped into KVA.  That approach seems to make sense, at least
> for large memory systems, if I understand all the dynamics of the
> situation correctly.

Likely they will move the problem to some other code path, and
we can do this again, until at some point, instead of a panic,
you will get an Apache log message telling you the fork failed.

That's the ideal situation (IMO): no user space process, no matter
how terrible, should be able to panic the kernel, no matter how
high the load average goes.

At that point, you will be safe from the global denial of service
that you get following the panic, which is (IMO) "good enough".
If you need more after that point, you can throw resources at the
problem ("just add machines").


> > The way to do this is to modify /usr/src/sys/i386/i386/pmap.c
> > at about line 122, where it says:
> >
> >  #define MINPV 2048
> 
> If I read the code correctly in pmap.c, MINPV just guarantees that
> the system will have at least *some* PV Entries available by
> preallocating the KVA (28 bytes each on my system) for those PV
> Entries specified by MINPV. See the section of
> /usr/src/sys/i386/i386/pmap.c labelled "init the pv free list".  I'm not
> certain it makes a lot of sense to preallocate KVA space for
> 11,113,502 PV Entries when we don't appear to be completely
> KVA starved.

It's to ensure an allocation of a PV Entry *DOES NOT FAIL*, even
when you *are* completely starved.  That's the whole point: to
move the KVA space pressure off onto some *other* system that
*also* does not preallocate its resources, and expect that that
failure will turn into a benign "Apache: fork failed" log message,
instead of being a panic.

We want to get rid of the panic.

Once we know what works around something, we can fix it later,
but the important part is to know what the something *is*, first.

Make sense?


> As I understand it (and as you seem to have suggested),
> increasing MINPV would only be useful if we were running out of
> KVA due to other KVA consumers (like buffers, cache, mbuf
> clusters, and etc.) before we could get enough PV Entries on the
> "free" list. I don't believe that is what's happening here.

It is.  There is no other way the allocation can fail.  See
the traceback you are getting.  Compile the kernel with DDB
and BREAK_TO_DEBUGGER, and then when it blows up, get a
backtrace.
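
For reference, that means the usual pair of kernel config
options, then a config and rebuild; "trace" at the db> prompt
will print the backtrace:

	options		DDB			# kernel debugger
	options		BREAK_TO_DEBUGGER	# console break drops into ddb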



> Here's some sysctl's that are pertinent:
> vm.zone_kmem_kvaspace: 350126080
> vm.kvm_size: 1065353216
> vm.kvm_free: 58720256

The kmem space figure is address space made available, not
memory actually consumed by the kernel.  The kvm_size is the
size of the KVA space (4G minus what's reserved for user space).
The kvm_free is the kernel virtual memory still available.

The name "zone_kmem_kvaspace" is misleading: it doesn't
correspond to "KVA space", which is the address space reserved
for the exclusive use of the kernel.

Also, these are not the values they have after the panic, because
you can't run sysctl after a panic.  8-).  If you could, you would
see that kvm_free is zero, and the other two have not changed.


> vm.zone_kmem_kvaspace indicates (if I understand it correctly)
> that kmem_alloc() allocated about 334MB of KVA at boot.
> vm.kvm_free indicates that KVM is only pressured after the system
> has been running awhile.  The sysctl's above were read after
> running for about 90 minutes after a reboot during non-peak usage
> hours.  At that time, there were 199MB allocated to buffers, 49MB
> allocated to cache, and 353MB wired.  During peak usage, we will
> typically have 199MB allocated to buffers, ~150MB allocated to
> cache, and 500MB to 700MB wired.  If I understand things
> correctly, that would mean we're peaking around the 1GB KVM
> mark and there's probably some recycling of memory used by
> cache to free up KVM for other uses when necessary.

It has to do with page mappings.  You need one page table page
for each 4M you map.  If you are out of pages to map pages, then
it doesn't matter how many pages are free: you get no more
mappings.
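
(The arithmetic on i386 without PAE: one 4K page table page
holds 1024 four-byte PTEs, and 1024 x 4K = 4M of address space
mapped per page table page.)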

Note that there are page mappings for pages in the kernel, and
page mappings for pages in each of the user processes, and that
these pages that map pages aren't swappable, so you can run out
(and you did!).  When you run out, all additional attempts to
allocate memory will fail, because you can't map it, even if you
have it.  KVA_PAGES controls the KVA space size, which is
basically how much total memory you can map into the kernel at
one time.  In your case, this is 1G.
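
For reference, KVA_PAGES is counted in 4M page directory entries
on i386, so, assuming the stock defaults:

	options		KVA_PAGES=256		# default: 256 * 4M = 1G of KVA
	options		KVA_PAGES=512		# 2G of KVA, taken away from user VA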


> However, I don't believe we're putting so much pressure on
> KVA/KVM as to run out of 28 byte chunks for PV Entries to be
> made. Assuming, once again, that I understand things correctly, if
> we were putting that much pressure on KVA/KVM, cache would go
> nearer to zero while the system attempted to make room for those
> 28 byte PV Entries.  Even during peak usage and just prior to
> panic, the system still has over 100MB of cache showing.  I have a
> 'systat -vm' from a few seconds prior to one of the panics that
> showed over 200MB of KVM free.

I have to say you are wrong.

Here's why:

o	If you are not running out of KVA space to map pages
	(pages available to map pages); AND

o	If you are not running out of pages for those pages to
	map (allocable kernel memory); AND

o	You are calling get_pv_entry(); AND

o	Therefore the uma_zalloc() is not failing

THEN

	You aren't getting the panic you are seeing.

BUT

	You are seeing a panic.

So: something has to give: either your beliefs, or the evidence
of your eyes.


> So, I don't think the memory allocation in KVA/KVM associated
> with PV Entries is the culprit of our panics. Here's a copy of one of
> the panics and the trace I did on it.
> 
> Fatal trap 12: page fault while in kernel mode

[ ... ]

> (/usr/src/sys/i386/i386/pmap.c:1636).
> 1635            pv = get_pv_entry();
> 1636            pv->pv_va = va;

get_pv_entry() returned 0, meaning it failed, and the code did
not expect a failure; it expected it to block the process until
it succeeded, or until the planet Earth was orbiting inside the
heliopause of a great, big, Red Giant star.
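
A minimal defensive check at that spot would at least turn the
mystery page fault into an informative panic; just a sketch of
what "expecting failure" would look like, not the code as it
shipped:

	pv = get_pv_entry();
	if (pv == NULL)
		panic("pmap_insert_entry: out of pv entries");
	pv->pv_va = va;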


> Now, unless it does it somewhere else, there is no bounds
> checking on pv_entry_count in that function.

Yes.  Exactly.  It wasn't necessary before, because the
entries were preallocated larger than they would ever need
to be, and now they are allocated on the fly, and it's possible
for the allocation to fail.

SO... your workaround, which I gave you, is to *preallocate*
them onto the free list, so that when you go to get one of them,
uma_zalloc() never has to allocate a new page to satisfy it,
because *there's always one there on the free list*.
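
Concretely, that is the one-line change from the earlier mail,
near line 122 of /usr/src/sys/i386/i386/pmap.c; the exact value
is a judgment call, sized above anything you expect to reach:

	#define MINPV 2048		/* stock value */

becomes something like

	#define MINPV 12000000		/* above any pv_entry count you expect */

at a cost of roughly 28 bytes of preallocated KVA per entry, per
your own numbers.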


> So, when the
> pv_entry_count exceeds the limit on PV Entries (pv_entry_max as
> defined in pmap_init2() in pmap.c), it just panics with a "page not
> present" when it goes to process line 1636 because it is
> impossible for a page to be present for a PV Entry with that
> pv_entry_count number being greater than pv_entry_max as
> defined in pmap_init2() in pmap.c.

You could check this by calling panic there.  Or by setting
vm.pmap.pv_entries large enough that you continue allocating
forever (I suggest max int).

Or by preallocating it, and not worrying about the maximum,
since it will never be enforced (my own suggestion, which reverts
to historical behaviour).


> I suppose, that if nobody is worried about this issue, then a quick
> and dirty way to handle it would be to add bounds checking to
> pv_entry_count in get_pv_entry() and if pv_entry_count is outside
> the bounds, then produce a panic with a more informative
> message.  At least, with a useful panic, the problem would be
> readily identified on other systems and you guys would have a
> better opportunity to see how many other people run into this issue.

No amount of system load should be able to result in a panic.

Period.


> Now, that's my synopsis of the problem, though I'm still a newb
> with regard to my understanding of the BSD memory management
> system.  Based on the information I've given you, do you still think
> this panic was caused by running out of KVA/KVM?  If I'm wrong,
> I'd love to know it so I can revise my understanding of what is going
> on to cause the panic.

I think it was.  If you don't, then set vm.pmap.pv_entries to
an outrageously large value in the boot loader, and see if the
problem goes away (like you think) or if it persists (like I
think).
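
E.g., in /boot/loader.conf (assuming the tunable name matches
your pmap.c):

	vm.pmap.pv_entries="2147483647"		# max int, per the above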

I will hazard a guess: it will not go away, because that tunable
didn't just appear out of thin air, and what's happening is the

	return uma_zalloc(pvzone, M_NOWAIT);

is failing, not because of the administrative limit (which was
there in the zalloci() case, too), but because you are out of
memory for it to allocate.

If you care, extern the pvzone, and instrument failure returns
from uma_zalloc() only for that zone.

In other words, go to /usr/src/sys/vm/uma_core.c, and change
uma_zalloc_internal to:

extern uma_zone_t pvzone;	/* from pmap.c */
static void *
uma_zalloc_internal(uma_zone_t zone, void *udata, int flags)
{
        uma_slab_t slab;
        void *item;

        item = NULL;

        /*
         * This is to stop us from allocating per cpu buckets while we're
         * running out of UMA_BOOT_PAGES.  Otherwise, we would exhaust the
         * boot pages.
         */
                
        if (bucketdisable && zone == bucketzone)
                return (NULL);
          
#ifdef UMA_DEBUG_ALLOC
        printf("INTERNAL: Allocating one item from %s(%p)\n", zone->uz_name, zon
e);
#endif
        ZONE_LOCK(zone);
                
        slab = uma_zone_slab(zone, flags);
        if (slab == NULL) {
                ZONE_UNLOCK(zone);
		if (zone == pvzone)
			panic("pvzone: uma_zone_slab failed");
                return (NULL);
        }

        item = uma_slab_alloc(zone, slab);

        ZONE_UNLOCK(zone);

        if (zone->uz_ctor != NULL)
                zone->uz_ctor(item, zone->uz_size, udata);
        if (flags & M_ZERO)
                bzero(item, zone->uz_size);

        return (item);
}

and uma_slab_alloc to:

static __inline void *
uma_slab_alloc(uma_zone_t zone, uma_slab_t slab)
{
        void *item;
        u_int8_t freei;

        freei = slab->us_firstfree;
        slab->us_firstfree = slab->us_freelist[freei];
        item = slab->us_data + (zone->uz_rsize * freei);

        slab->us_freecount--;
        zone->uz_free--;
#ifdef INVARIANTS
        uma_dbg_alloc(zone, slab, item);
#endif
        /* Move this slab to the full list */
        if (slab->us_freecount == 0) {
                LIST_REMOVE(slab, us_link);
                LIST_INSERT_HEAD(&zone->uz_full_slab, slab, us_link);
		if (item == NULL && zone == pvzone)
			panic( "pvzone: full");
        }

	if (item == NULL && zone == pvzone)
		panic( "pvzone: out of memory");
            
        return (item);
} 



> For now, I've solved the problem by limiting the number of Apache
> processes that are allowed to run based on my calculations of how
> many PV Entries are required by each child process, but it's
> painful to have all that RAM and not be able to put it to use
> because of an issue in the memory management code that shows
> up on large memory systems (>2GB).  IMHO, Apache shouldn't be
> able to crash an OS before it ever starts using swap.

Agreed.  You need to know *why* the panic happens.  See the
above.  I think you can disable the "pvzone: full" panic by
cranking up the limits in the loader, and you will still panic.


> The only reason the problem doesn't show on systems with the
> typical amounts of RAM (2GB or less) is that if those systems ran
> Apache like we do, they'd spiral to a crash as swap usage
> increased and eventually swap was completely filled.

I doubt it.  You aren't swapping at all, and you have gigs and
gigs of swap.  If you would crash at 4G, you would crash at 4G,
even if half of it was swap.

-- Terry


