Date:      Wed, 26 Mar 2003 08:02:08 -0800
From:      "Andrew Kinney" <andykinney@advantagecom.net>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        freebsd-hackers@FreeBSD.ORG
Subject:   Re: shared mem and panics when out of PV Entries
Message-ID:  <3E815E80.18738.3AC0E20@localhost>
In-Reply-To: <3E811E52.198972EB@mindspring.com>

On 25 Mar 2003, at 19:28, Terry Lambert wrote:

> Basically, you don't really care about pv_entry_t's, you care
> about KVA space, and running out of it.
> 
> In a previous posting, you suggested increasing KVA_PAGES fixed
> the problem, but caused a pthreads problem.

Will running out of KVA space indirectly cause the PV Entry count 
to hit its limit as shown in sysctl vm.zone?  To my knowledge, I've 
never seen a panic on this system directly resulting from running 
out of KVA space.  They've all been traced back to running out of 
available PV Entries.

I'm invariably hitting the panic in pmap_insert_entry() and I only 
get the panic when I run out of available PV Entries.  I've seen 
nothing to indicate that running out of KVA space is causing the 
panics, though I'm still learning the ropes of the BSD memory 
management code and recognize that its many interacting pieces 
could have unforeseen results.

Regarding the other thread you mentioned, increasing KVA_PAGES 
was just a way to squeeze a higher PV Entry limit out of the 
system, since it would allow a higher value for 
PMAP_SHPGPERPROC while still letting the system boot.  I have 
not determined whether it "fixed the problem" because I had to 
revert to an old kernel when MySQL wigged out on boot, apparently 
due to the threading issue in 4.7 that shows up with increased 
KVA_PAGES.  I never got a chance to increase PMAP_SHPGPERPROC 
after increasing KVA_PAGES because MySQL is an important service 
on this system and I had to get it back up and running.


> What you meant to say is that it caused a Linux threads kernel
> module mailbox location problem for the user space Linux threads
> library.  In other words, it's because you are using the Linux
> threads implementation, that you have this problem, not  
> FreeBSD's pthreads.

I may have misspoken in the previous thread about pthreads having 
a problem when KVA_PAGES was increased.  I was referencing an 
earlier thread whose author reported that problem, and I assumed 
he knew what he was talking about.  At any rate, this was 
apparently patched and merged into the RELENG_4 tree after 
4.7-RELEASE.  I plan on grabbing RELENG_4_8 once it's officially 
released.  That should give me room to play with KVA_PAGES, if 
necessary, without breaking MySQL.

Also worth reiterating is that resource usage by Apache is the 
source of the panics.  The version I'm using is 1.3.27, so it doesn't 
even make use of threading, at least not like Apache 2.0.  I would 
just switch to Apache 2.0, but it doesn't support all the modules we 
need yet.  Threads were only an issue with MySQL when 
KVA_PAGES>256, which doesn't appear to be related to the 
panics happening while KVA_PAGES=256.


> In any case, the problem you are having is because the uma_zalloc()
> (UMA) allocator is feeling KVA space pressure.
> 
> One way to move this pressure somewhere else, rather than dealing with
> it in an area which results in a panic on you because the code was not
> properly retrofit for the limitations of UMA, is to decide to
> preallocate the UMA region used for the "PV ENTRY" zone.

I haven't read up on that section of the source, but I'll go do so 
now and determine whether the changes you suggested would help 
in this case.  I know from some other posts that you're a strong 
advocate of mapping all physical RAM into KVA right up front 
rather than mapping only a subset of it.  That approach seems to 
make sense, at least for large-memory systems, if I understand 
all the dynamics of the situation correctly.

> The way to do this is to modify /usr/src/sys/i386/i386/pmap.c
> at about line 122, where it says:
> 
>  #define MINPV 2048
> 

If I read the code in pmap.c correctly, MINPV just guarantees that 
the system will have at least *some* PV Entries available by 
preallocating the KVA (28 bytes each on my system) for the number 
of PV Entries specified by MINPV.  See the section of 
/usr/src/sys/i386/i386/pmap.c labelled "init the pv free list".  
I'm not certain it makes a lot of sense to preallocate KVA space 
for 11,113,502 PV Entries when we don't appear to be completely 
KVA starved.
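
Just for scale, here's the back-of-the-envelope arithmetic (plain 
userland C, purely illustrative; the 28-byte size and the 
11,113,502 figure are the numbers from my system quoted above):

#include <stdio.h>

int
main(void)
{
        /* pv_entry_max and sizeof(struct pv_entry) as seen on this box */
        unsigned long long max_pv = 11113502ULL;
        unsigned long long pv_size = 28ULL;
        unsigned long long bytes = max_pv * pv_size;

        printf("%llu PV Entries x %llu bytes = %llu bytes (~%.0f MB)\n",
            max_pv, pv_size, bytes, bytes / (1024.0 * 1024.0));
        return (0);
}

That works out to roughly 297MB of KVA, close to a third of the 
1GB of KVM on this box, tied up before anything needs it.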

As I understand it (and as you seem to have suggested), 
increasing MINPV would only be useful if we were running out of 
KVA due to other KVA consumers (like buffers, cache, mbuf 
clusters, etc.) before we could get enough PV Entries on the 
"free" list. I don't believe that is what's happening here.

Here are some sysctls that are pertinent:
vm.zone_kmem_kvaspace: 350126080
vm.kvm_size: 1065353216
vm.kvm_free: 58720256

vm.zone_kmem_kvaspace indicates (if I understand it correctly) 
that kmem_alloc() allocated about 334MB of KVA at boot.  
vm.kvm_free indicates that KVM is only pressured after the system 
has been running awhile.  The sysctls above were read about 90 
minutes after a reboot, during non-peak usage hours.  At that 
time, there were 199MB allocated to buffers, 49MB allocated to 
cache, and 353MB wired.  During peak usage, we will typically have 
199MB allocated to buffers, ~150MB allocated to cache, and 500MB 
to 700MB wired.  If I understand things correctly, that would mean 
we're peaking around the 1GB KVM mark and there's probably some 
recycling of memory used by cache to free up KVM for other uses 
when necessary.
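
Just to show where those MB figures come from (again plain 
userland C, purely illustrative; the raw byte values are the 
sysctls quoted above):

#include <stdio.h>

int
main(void)
{
        double mb = 1024.0 * 1024.0;

        /* raw byte counts from the sysctls quoted above */
        printf("vm.zone_kmem_kvaspace = %.0f MB\n", 350126080.0 / mb);
        printf("vm.kvm_size           = %.0f MB\n", 1065353216.0 / mb);
        printf("vm.kvm_free           = %.0f MB\n", 58720256.0 / mb);
        return (0);
}

The peak estimate is just 199 + 150 + 700 = 1049MB at the high 
end, which is where "around the 1GB KVM mark" comes from.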

However, I don't believe we're putting so much pressure on 
KVA/KVM as to run out of 28-byte chunks for PV Entries to be 
made. Assuming, once again, that I understand things correctly, if 
we were putting that much pressure on KVA/KVM, cache would drop 
much nearer to zero while the system attempted to make room for 
those 28-byte PV Entries.  Even during peak usage and just prior 
to panic, the system still shows over 100MB of cache.  I have a 
'systat -vm' from a few seconds prior to one of the panics that 
showed over 200MB of KVM free.

So, I don't think the memory allocation in KVA/KVM associated 
with PV Entries is the culprit of our panics. Here's a copy of one of 
the panics and the trace I did on it.

Fatal trap 12: page fault while in kernel mode
mp_lock = 01000002; cpuid = 1; lapic.id = 00000000
fault virtual address   = 0x4
fault code              = supervisor write, page not present
instruction pointer     = 0x8:0xc02292bd
stack pointer           = 0x10:0xed008e0c
frame pointer           = 0x10:0xed008e1c
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 61903 (httpd)
interrupt mask          = net tty bio cam  <- SMP: XXX
trap number             = 12
panic: page fault
mp_lock = 01000002; cpuid = 1; lapic.id = 00000000
boot() called on cpu#1


Instruction pointer trace:
# nm -n /kernel | grep c02292bd
# nm -n /kernel | grep c02292b
# nm -n /kernel | grep c02292
c022929c t pmap_insert_entry

exact line number of instruction:
----------------------------------
(kgdb) l *pmap_insert_entry+0x21
0xc02292bd is in pmap_insert_entry 
(/usr/src/sys/i386/i386/pmap.c:1636).
1631            int s;
1632            pv_entry_t pv;
1633
1634            s = splvm();
1635            pv = get_pv_entry();
1636            pv->pv_va = va;
1637            pv->pv_pmap = pmap;
1638            pv->pv_ptem = mpte;
1639
1640            TAILQ_INSERT_TAIL(&pmap->pm_pvlist, pv, pv_plist);

The instruction pointer is always the same on these panics and is 
almost invariably in an httpd process at the time of the panic.

My interpretation is that it is actually failing on line 1635 of pmap.c 
in get_pv_entry().

Here's the code for get_pv_entry():

static pv_entry_t
get_pv_entry(void)
{
        pv_entry_count++;
        if (pv_entry_high_water &&
                (pv_entry_count > pv_entry_high_water) &&
                (pmap_pagedaemon_waken == 0)) {
                pmap_pagedaemon_waken = 1;
                wakeup (&vm_pages_needed);
        }
        return zalloci(pvzone);
}

Now, unless it happens somewhere else, there is no bounds 
checking on pv_entry_count in that function.  So when 
pv_entry_count exceeds the limit on PV Entries (pv_entry_max, as 
defined in pmap_init2() in pmap.c), the system just panics with a 
"page not present" fault when it goes to process line 1636, 
because no page can be present for a PV Entry once pv_entry_count 
has exceeded pv_entry_max.

I suppose that if nobody is worried about this issue, a quick and 
dirty way to handle it would be to add bounds checking on 
pv_entry_count in get_pv_entry() and, if pv_entry_count is outside 
the bounds, produce a panic with a more informative message.  At 
least with a useful panic, the problem would be readily identified 
on other systems and you guys would have a better opportunity to 
see how many other people run into this issue.
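
Just to illustrate what I mean, here's an untested sketch against 
the get_pv_entry() quoted above (the bounds check and the panic 
message wording are mine, not anything that's in the tree):

static pv_entry_t
get_pv_entry(void)
{
        pv_entry_count++;
        if (pv_entry_high_water &&
                (pv_entry_count > pv_entry_high_water) &&
                (pmap_pagedaemon_waken == 0)) {
                pmap_pagedaemon_waken = 1;
                wakeup (&vm_pages_needed);
        }
        /* Fail loudly here instead of faulting later in pmap_insert_entry(). */
        if (pv_entry_max && pv_entry_count > pv_entry_max)
                panic("get_pv_entry: pv_entry_count %d exceeds pv_entry_max %d; "
                    "consider raising PMAP_SHPGPERPROC",
                    pv_entry_count, pv_entry_max);
        return zalloci(pvzone);
}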

Now, that's my synopsis of the problem, though I'm still a newb 
with regard to my understanding of the BSD memory management 
system.  Based on the information I've given you, do you still think 
this panic was caused by running out of KVA/KVM?  If I'm wrong, 
I'd love to know it so I can revise my understanding of what is going 
on to cause the panic.

For now, I've solved the problem by limiting the number of Apache 
processes that are allowed to run, based on my calculations of how 
many PV Entries are required by each child process, but it's 
painful to have all that RAM and not be able to put it to use 
because of an issue in the memory management code that shows 
up on large-memory systems (>2GB).  IMHO, Apache shouldn't be 
able to crash an OS before it ever starts using swap.

The only reason the problem doesn't show up on systems with 
typical amounts of RAM (2GB or less) is that if those systems ran 
Apache the way we do, they'd spiral into a crash as swap usage 
increased and swap eventually filled completely, long before the 
PV Entry limit came into play.

Sincerely,
Andrew Kinney
President and
Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net


