Date: Tue, 22 Apr 2003 13:09:51 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: "Kevin A. Pieckiel" <kpieckiel-freebsd-hackers@smartrafficenter.org>
Cc: freebsd-hackers@freebsd.org
Subject: Re: maxfiles, file table, descriptors, etc...
Message-ID: <3EA5A18F.5150DB2@mindspring.com>
References: <20030420210118.GA21255@pacer.dmz.smartrafficenter.org> <3EA32ADA.CC05B003@mindspring.com> <20030421113938.GA31530@pacer.dmz.smartrafficenter.org> <3EA43297.44FC6E44@mindspring.com> <20030422165702.GC31530@pacer.dmz.smartrafficenter.org>
"Kevin A. Pieckiel" wrote: [ ... FreeBSD lazy allocation of KVA space for zalloci() use ... ] > What I fail to see is why this scheme is decidedly "better" than > that of the old memory allocator. I understand from the vm source > that uma wants to avoid allocating pools of unused memory for the > kernel--allocating memory on an as needed basis is a logical thing > to do. But losing the guarantee that the allocation routines will > not fail and not adjusting the calling functions of those routines > seems a bit dumb (since, as you state, the kernel panics). I think > this might be a trouble spot for me because of another question.... Eventually, the calling functions will be adjusted, I think. The reason the new code (Jeff's code) is better is that it doesn't task-commit a limited resource. When you compile a kernel with a specific MAXFILES (or set "kern.maxfiles" in the loader) in 4.x, you eat an unrecoverable chunk of the KVA, which is a scarce resource. This is a problem, if you are a general purpose system, since at any point you can't know what resource is going to be in the highest demand, at any given point in time. Even the 5.x has a fault here, in that, once allocated, the memory is type-stable; luckily, however, most systems maintain homogeneous loads over time, so you aren't going to see radical swings between being a scientific computation platform vs. a web server, vs. a shell machine, etc., without reboots in between, as the machine finds itself repurposed. For a platform allocated a specific task, it's also helpful to have the new code. In general, what happens in a platform that has a specific role that is never going to change is that it is manually tuned to that role. This tuning process is complex and time consuming, and requires both a lot of knowledge of the OS, and a lot of domain specific knowledge. Even then, people tend to make mistakes. By allowing the limits to be raised to the point that they are irrelevent, and then modifying the code to allocate resources, as necessary, it provides for a much simpler tuning experience to get from 30% of performance to 90% of performance, without a lot of work (the last 10% is still really hard, and requires domain specific knowledge). > What is the correct way to address this in the new allocator code? There are several ways of doing this. It's probably a good idea to make the kernel code in question insensitive to NULL returns, as a general rule. This helps the code be more resilient to future changes, and it allows immediate relief from high load situations: instead of hanging until a request can be satisfied, the request is failed, and the pressure on the system is reduced. It's the same theory you get from seeing a lot of cars jammed up in front of you, and turning right, instead of heading into the jam with everyone else, and making things worse. It's also probably a good idea to use this as an indicator of where code needs to be refactored. Most of the problems in 5.x are a result of legacy code that should be refactored, that's being locked down, instead. What happens in this case is that locks get held over function call boundaries, but they do not have to come back up over those boundaries in order to be released; e.g. A() locks X, A() calls B(), B() calls C(), C() unlocks X. Every time you see a "lock order reversal" or "LOR" posting to the list, it's either because someone has been confused about "locking code" vs. 
"locking data", or it's because there's a layering abstraction violation that makes some lock acquisition and relese non-reflexive, like this. Probably the easiest way of dealing with this problem is to establish page mappings for all of physical memory, up front, and all of KVA, and then modify the mappings and/or "give them away", instead of trying to allocate new ones when you're in a memory pressure situation. One obvious fix for the zalloci() code would be to modify the order in which page mappings are obtained, when new pages are rquired by a given zone, and then add a second administrative limit to the zone structure. Initially set the administrative limit equal to the hard limit on the zone, when the zone is created, and then if you fail to obtain the page mapping, lower the administrative limit to the current limit. The effect of this would be to cause the zalloci() to fail in a way that it's expected to fail: virtually, "because we have hit our administratively agreed limit", rather than "because we ran out of page mappings". > I can come up with an option or two on my own... such as that to > which I've already alluded: memory allocation routines that once > guaranteed success can no longer be used in such a manner, thus the > calling functions must be altered to take this into account. But > this is certainly not trivial! Yes. This is non-trivial, and it should be done anyway. 8-). See above. > > Basically, everywhere that calls zalloci() > > is at risk of panic'ing under heavy load. > > Am I not getting a point here? I can't find any reference to > zalloci() in the kernel source for 5.x (as of a 07 Apr 2003 cvs > update on HEAD), and such circumstances don't apply to 4.x (which, > of course, is where I DID find them after you mentioned them). The calls have been changed; I should say "everywhere zalloci() has been replaced with something which has a NULL-return semantic". > > Correct. The file descriptors are dynamically allocated; or rather, > > they are allocated incrementally, as needed, and since this is not > > at interrupt time, the standard system malloc() can be used. > > A quick tangent.... when file descriptors are assigned and given to > a running program, are they guaranteed to start from zero (or three > if you don't close stdin, stdout, and stderr)? Or is this a byproduct > of implementation across the realm of Unixes? The descriptor number is an index into the per process open file table. This table *always* starts at 0, but may start with some slots filled in already (usually stdin/stdout/stderr, but really, anything it's parent process didn't have marked "close on exec", and which doesn't force those semantics by failing dup2(), is copied). The place to look for this is: struct proc *p; /* sys/proc.h */ struct filedesc *fdescp; /* sys/filedesc.h */ struct file *fp0; /* sys/file.h */ fdescp = p->p_fd; fp0 = fdescp->fd_ofiles[ 0 /* this is my fd */ ]; The place you see these indices translated are in falloc(), fget(), etc., descriptor manipulation, which is located in the kernel source file /usr/src/sys/kern/kern_descrip.c. For an interesting case study, consider an already open file on which you want to call "fstat" from user space. Then look in /usr/src/sys/kern/kern_descrip.c for the definition of the function "fstat", which implements this system call (the struct "fstat_args" is defined in a block comment above the function, for convenience of the reader). 
For an interesting case study, consider an already open file on which you want to call "fstat" from user space.  Then look in /usr/src/sys/kern/kern_descrip.c for the definition of the function "fstat", which implements this system call (the struct "fstat_args" is defined in a block comment above the function, for convenience of the reader).

It's not commented in detail, but what happens is:

o   You take a trap for the system call via INT 0x80.

o   The system call arguments are converted to a linear set, which is cast to a "struct fstat_args *" by the function entry (from a "void *").

o   A lock is held to prevent reentrancy.

o   fget() translates the index (descriptor) into a "struct file *"; as a side effect, this obtains a reference, so that if someone else tries to close the file out from under you, you hold it open.

o   The fo_ ("file operation") stat is called, which copies the stat information into the stack region "ub", which is a "struct stat".

o   The data in "ub" is copied out into the user process address space, into the buffer whose address argument was supplied to the system call.

o   fdrop() is called to release the reference; if this was the last reference (unlikely, given the specific lock being held here), then the underlying "struct file" is released back to the system, and the file is truly closed.

o   The lock is released.

o   Any error which occurred is returned in %AX, which gets given back to the user program as a -1 return, with errno set to the error.

So, although it's abstracted by fget/fdrop, it's really accessing an allocated linear array of "struct file" entries, and the user space file descriptor is an index into that array.

[ ... per process open file table allocation inefficiencies ... ]

> Now this _IS_ interesting.  I would think circumstances requiring
> 100,000+ files or net connections, though not uncommon, are certainly
> NOT in the vast majority, but would still have a bone to pick with this
> implementation.  For example, a web server--from which most users
> expect (demand?) fast response time--that takes time to expand its
> file table during a connection or request would seem to have
> unreasonable response times.

Yes.  It's one of the things you rewrite when you are trying to get uniform and high performance out of a system.

100,000 net connections is uncommon; until two years ago, no one had really stressed FreeBSD above 32,768 connections, beyond which a credentials bug would cause a kernel panic when enough sockets had been closed.

Even at smaller numbers of open files, though, the allocation causes "lurches" in server behaviour; you can see the dips as inverse spikes from the allocations on "webbench", for example, even for 10,000 and 20,000 connections.

> One would think there is a better way.

It's all about tradeoffs.  One way is to force the table size large, to start with, using dup2().
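A minimal userland sketch of that dup2() trick (the target slot number is an arbitrary example of mine; it assumes the process's descriptor resource limit is high enough for the duplicate to succeed, and that the grown table is kept after the duplicate is closed):

    #include <unistd.h>

    /*
     * Force the per process open file table to be grown once, up
     * front, instead of in steps while the server is under load.
     */
    static void
    pregrow_fd_table(int highfd)
    {
            int fd;

            fd = dup2(0, highfd);           /* e.g. highfd = 9999 */
            if (fd != -1)
                    close(fd);              /* the slot itself isn't needed */
    }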
> How much of an issue is this really?

You mean compared to having to disable PSE and PG_G, or some other performance issues?  Not much of one.  But every little bit hurts.

> Excellent info, Terry.  Thanks for sharing it!

It's not all that great; I'm sure I'll be corrected on some things with regard to 5.x, since it's a moving target, and it's not really possible to state anything authoritatively about it: it will be changed out from under you to address any easy complaints, so by the time someone goes and looks at it, what you've said is not true any more.  8-) 8-).

Basically, I answered because you asked.  I do that a lot, even in private email; this got to the list because you Cc:'ed the list, not because I would have put it there myself if you'd asked in private email.  Lots of things never see the list; some people ask things in private because of competitive advantage, or because I've stated a non-disclosure requirement on a small set of topics.  8-).

-- Terry