Date:      Thu, 5 Mar 2020 01:39:06 +0200
From:      Konstantin Belousov <kib@freebsd.org>
To:        Keno Fischer <keno@juliacomputing.com>
Cc:        freebsd-hackers@freebsd.org, Elliot Saba <elliot.saba@juliacomputing.com>
Subject:   Re: FreeBSD Pipe behavior in pipe OOM situations
Message-ID:  <20200304233906.GB98340@kib.kiev.ua>
In-Reply-To: <CABV8kRy2Uu6fZwQR37135LvgUCxYFd6eiNt4NMQLg_jpHq42Lg@mail.gmail.com>
References:  <CABV8kRy2Uu6fZwQR37135LvgUCxYFd6eiNt4NMQLg_jpHq42Lg@mail.gmail.com>

On Wed, Mar 04, 2020 at 04:42:56PM -0500, Keno Fischer wrote:
> Greetings,
> 
> I am debugging intermittent failures we see on the CI system for the Julia
> programming language on FreeBSD, but not elsewhere. The Julia ticket
> for this issue can be found at
> https://github.com/JuliaLang/julia/issues/23143.
> 
> The symptom is an ENOMEM error on a write to a pipe,
> together with the following message in dmesg:
> 
>     kern.ipc.maxpipekva exceeded; see tuning(7)
> 
> Now, as far as I understand it, what's happening here is that FreeBSD has a
> hard limit on the amount of kernel memory that can be used for pipe buffers,
> which we are exceeding by creating too many pipes (not entirely surprising,
> our test suite spawns many processes and uses lots of pipes).
> 
> I understand that we can likely work around this issue by increasing the
> referenced sysctl. However, I am a bit puzzled by the ENOMEM behavior.
> I don't have very much experience with the FreeBSD kernel, but from my
> experience from working on other operating systems,
> I would have expected that either:
> 
> 1) Some minimal buffer is allocated anyway and exempt from such
>     pipe-specific memory limits (e.g. a few bytes of the pipe struct), or,
No.
> 2) The writing process is blocked until pipe buffer space becomes available
>      (e.g. by a different pipe draining and freeing up space), or,
Yes, but only for space inside the already allocated buffer, i.e. the bytes
consumed by the reader, not for the space needed to allocate the buffer itself.

> 3) The writing process is blocked until a reader comes along, at which point
>     the write is performed directly without intermediate kernel buffer.

First, there is a requirement that an atomic write size exists, i.e. writes
of at most PIPE_BUF bytes (see _PC_PIPE_BUF) are guaranteed not to interleave
if they succeed.  Our PIPE_BUF is 512 bytes.

We pre-allocate some buffers at pipe creation, and then might adjust
them at the start of a write. The buffers initially consume only kernel
virtual address space (KVA). Physical memory is instantiated when
touched and can be swapped out (this is somewhat simplified, but details
are not important).

The atomicity requirement means that we must not allocate less than
PIPE_BUF, but since we are using VM interfaces, we make the lowest limit
4K (actually page size). When there is enough space, we might go up
to 64K per pipe, but shrink back down when pipe KVA fills up.

The KVA used for pipe buffers is shared by all pipes in the system, among all
users.  Currently, allocation of pipe buffers does not wait for space; if
there is no space, it fails with ENOMEM.  Waiting for space would mean that
the writer is blocked until some unrelated process does something that
frees a pipe buffer, perhaps closes its pipe.

I think that unexplained blocking (it is very hard to track down such
a state) is worse than the ENOMEM outcome.

> 
> I.e. I would have expected such an OOM situation for pipe buffers to
> degrade pipe performance, but not to have it exposed to the user. Indeed, a
> cursory read of the FreeBSD kernel source seems to reinforce this notion.
> In pipe_create, we see the following comment:
> 
> ```
> /*
> * Note that these functions can fail if pipe map is exhausted
> * (as a result of too many pipes created), but we ignore the
> * error as it is not fatal and could be provoked by
> * unprivileged users. The only consequence is worse performance
> * with given pipe.
> */
> if (amountpipekva > maxpipekva / 2)
>     (void)pipespace_new(pipe, SMALL_PIPE_SIZE);
> else
>     (void)pipespace_new(pipe, PIPE_SIZE);
> ```
This happens at pipe open.  As you see, we might preallocate only
SMALL_PIPE_SIZE (4K) if low on KVA, or not preallocate at all if KVA
is exhausted, hoping that the situation has changed by the time of write(2).

> 
> But then later, in pipe_write, we see:
> ```
> if (wpipe->pipe_buffer.size == 0) {
>     /*
>      * This can only happen for reverse direction use of pipes
>      * in a complete OOM situation.
>      */
>      error = ENOMEM;
> ```
> 
> From my (admittedly limited) understanding of the code, it doesn't
> seem that either comment is accurate. If the pipe buffer allocation
> fails, then `write`s will return `ENOMEM`, even in the forward direction
> (the buffer for the reverse direction isn't allocated by default, but
> as indicated by the first comment, the allocation for the forward
> direction can certainly fail).
Yes, this comment is confusing.  If both the preallocation at pipe
creation time and the allocation at first write fail, we return
ENOMEM.

We reserve 1/64 of the physical memory for pipekva.  It costs nothing
to increase this number initially for 64bit systems because it is only KVA,
but note that eventually this memory will be instantiated with physical
backing pages.  E.g. on my workstation with 32G RAM I see
	kern.ipc.maxpipekva: 534261760 (512M)
and I do not want to make it larger.

What is the amount of memory on the machine where you see ENOMEM?

> 
> I was hoping a FreeBSD kernel developer could shed some light on
> whether the kernel behavior we're experiencing here is indeed expected
> on FreeBSD, or whether it would be expected that the kernel would try
> harder to service the pipe request in such a situation.
> 
> Thanks,
> Keno
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"


