Date: Tue, 27 Jan 2004 11:55:56 -0600 (CST)
From: Mike Silbersack <silby@silby.com>
To: Dag-Erling Smørgrav <des@des.no>
Cc: current@freebsd.org
Subject: Re: 'kern.maxpipekva exceeded' messages...
Message-ID: <20040127110040.O4636@odysseus.silby.com>
In-Reply-To: <xzpr7xl8i1a.fsf@dwp.des.no>
References: <20040119233546.S39477@odysseus.silby.com> <xzpznc9tzgz.fsf@dwp.des.no> <xzpr7xl8i1a.fsf@dwp.des.no>
On Tue, 27 Jan 2004, Dag-Erling Smørgrav wrote:

> My problem is not idle pipes; my problem is that the following system
>
> # sysctl kern.ipc.maxpipekva
> kern.ipc.maxpipekva: 8704000
> # sysctl kern.ipc.pipekva
> kern.ipc.pipekva: 393216
>
> runs out of pipe kva every Monday morning when it tries to pipe a
> level 0 dump through ssh.
>
> Is there some way to impose a limit on the memory consumed by a
> single pipe?  I don't care if dump blocks waiting for ssh to push out
> the data, but I do care about the system crashing shortly after
> running out of pipe kva.

There is a limit: no single pipe can grow beyond BIG_PIPE_SIZE, which is
presently defined as 64K.  Well, unless there is a leak, of course. :)

Is it really crashing?  That's not supposed to happen. :(

It occurs to me that the "maxpipekva exceeded" printf may be misleading
and should be moved to pipe_create, so that it is not triggered when
pipespace is called to resize a pipe buffer from pipe_write... it's
possible that with 4K, 16K, and 64K pipes all sharing the same address
space, we're getting fragmentation which causes some large allocations
to fail prematurely.

> Another problem I have is with a system that runs out of pipe kva
> when I create a large number of jails.  I really need a way to find
> out where all that memory goes...

"fstat | grep pipe" should tell you all that you need to know; each
pipe is presently created with buffers of 16K in each direction (until
you reach half usage, at which point new pipes are dropped to 4K).  So,
in general, "fstat | grep pipe | wc -l" * 16384 should add up to
kern.ipc.pipekva.  Pipes which have grown to 64K in size will break
this assumption slightly.

Also note that the property above is an accident; fstat shows both
sides of a pipe, so we're really double-counting each pipe.  However,
each pipe is bidirectional (and few programs take advantage of that),
so fstat's doubling accounts for the fact that we're not counting the
unused direction's buffer. :)

> > If you're interested in working on this right now, I can send you
> > what I had planned to do for #1, it would be a very small amount of
> > code, although it would require a bit of testing to ensure that it
> > does not degrade the performance of pipes by a noticeable amount.
>
> That would be nice.  I have several systems I can test it on.
>
> DES

Ok, what it comes down to is that we account for the space allocated
rather than the space actually used; trying to account for the space
actually used would turn this into a much more complex beast, and I
tried to avoid that.  So, in order to save memory, you'll need to
change how much memory we allocate, and dynamically size the buffer
upward as needed.

The first part of this would be to _not_ allocate a buffer for the
reverse direction of the pipe.  To do this, you'll need to add an extra
argument to pipe_create which tells it whether to call pipespace or
not, and then you'll have to add code to pipe_write which allocates
space on the fly if the reverse direction of the pipe is ever used.
This will net you a 2x memory savings right away with zero cost.
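To make that concrete, it would look roughly like the following.  This
is untested and from memory -- the "backing" argument name and the
exact pipe_create/pipespace signatures in your tree may differ -- so
treat it as a sketch of the idea rather than a patch:

	/*
	 * Sketch: pipe_create grows a flag saying whether to back the
	 * pipe with a buffer; only the forward direction gets one.
	 */
	static int
	pipe_create(struct pipe *cpipe, int backing)
	{
		int error;

		if (backing) {
			/* Allocate the buffer up front, as we do today. */
			error = pipespace(cpipe, PIPE_SIZE);
		} else {
			/*
			 * Reverse direction: leave pipe_buffer.buffer NULL
			 * and let pipe_write allocate it on first use.
			 */
			error = 0;
		}
		return (error);
	}

and then, near the top of pipe_write, something like:

	/* Lazily back the reverse direction the first time it is written. */
	if (wpipe->pipe_buffer.buffer == NULL) {
		if ((error = pipelock(wpipe, 1)) == 0) {
			PIPE_UNLOCK(wpipe);
			pipespace(wpipe, SMALL_PIPE_SIZE);
			PIPE_LOCK(wpipe);
			pipeunlock(wpipe);
		}
	}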
Secondly, you could change pipe_create so that pipespace is always told
to allocate SMALL_PIPE_SIZE pipes.  Then, go into pipe_write and find
the section of code under:

	/*
	 * If it is advantageous to resize the pipe buffer, do
	 * so.
	 */

and rewrite the loop to something more like:

	int tempsize = wpipe->pipe_buffer.size;

	/* Double the size until the write fits or we hit a limit. */
	while ((uio->uio_resid > tempsize) &&
	    (tempsize < BIG_PIPE_SIZE) &&
	    (amountpipekva < maxpipekva / 2)) {
		tempsize *= 2;
	}

	if ((tempsize > wpipe->pipe_buffer.size) &&
	    ((wpipe->pipe_state & PIPE_DIRECTW) == 0) &&
	    (wpipe->pipe_buffer.size <= PIPE_SIZE) &&
	    (wpipe->pipe_buffer.cnt == 0)) {
		if ((error = pipelock(wpipe, 1)) == 0) {
			PIPE_UNLOCK(wpipe);
			pipespace(wpipe, tempsize);
			PIPE_LOCK(wpipe);
			pipeunlock(wpipe);
		}
	}

Note that I took out the amountbigpipe count; if you rewrite everything
to grow dynamically, the bigpipe count can probably be thrown out.  In
fact, you could probably increase BIG_PIPE_SIZE to 128K if that would
improve performance for some application.  On the other hand, maybe 32K
is a better limit... you'd have to do some testing to see how dynamic
resizing affects things, which is why I didn't look into this much.

As far as the implementation of this change goes, it should be
extremely safe; pipespace has been resizing pipes upward for years, so
this should be no different.

Memory savings here: PIPE_SIZE is 16K, SMALL_PIPE_SIZE is PAGE_SIZE (4K
on i386), and BIG_PIPE_SIZE is 64K.  So if all of your pipes are idle,
this would save you 4x the memory (up to the point where you reach half
usage, where *everything* is already allocated at 4K), and it could
also save memory where only 32K buffers are needed but we've been
allocating 64K for some app.

Now, there are a few implementation issues that may affect the
performance you see as a result of the preceding changes:

1.  pipespace can't resize if there is currently any data in the pipe.
I believe that copying over the old data during a resize should be
doable, but I haven't attempted it.  Not allowing resizes may penalize
an application which writes an initial small piece of data, followed
by larger blocks which would warrant a resize.

1a. If you could resize with data currently in the buffer, then you
could also resize _down_, allowing pipes to shrink when memory is
short.  This could be useful as well.

2.  Direct writes cannot be followed by non-direct writes until the
buffer has been emptied.  It seems possible that applications which do
a large write and then a small write may be unnecessarily blocked.
However, changing this behavior would require a large rewrite, and I
would not recommend it unless you can generate statistics which prove
that this is an issue.

3.  Alc has mentioned that direct writes could be optimized a bit more
by not making the direct mapping until the read is performed.  However,
as most pipes never get into direct mode, this is mostly
inconsequential.

Overall, I think that implementing the first two changes from earlier
in this message and #1 above should not take much time at all, and
would provide a substantial memory savings.

Mike "Silby" Silbersack
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20040127110040.O4636>