From: Robert Watson
Date: Thu, 21 Sep 2006 14:59:07 +0100 (BST)
To: Andre Oppermann
Cc: gallatin@cs.duke.edu, freebsd-current@freebsd.org, alc@freebsd.org,
    freebsd-net@freebsd.org, Gleb Smirnoff, tegge@freebsd.org
Subject: Re: Much improved sendfile(2) kernel implementation
Message-ID: <20060921145002.K37863@fledge.watson.org>
In-Reply-To: <4512850A.5000107@freebsd.org>
References: <4511B9B1.2000903@freebsd.org>
    <20060921114431.GF27667@FreeBSD.org> <4512850A.5000107@freebsd.org>
List-Id: Discussions about the use of FreeBSD-current

On Thu, 21 Sep 2006, Andre Oppermann wrote:

>> There should be unconditional M_NOWAIT. Oops, the M_DONTWAIT in the current
>> code is incorrect. It is present since rev. 1.171. If the m_uiotombuf()
>> fails the current code returns from syscall without error! Before rev.
>> 1.171, there wasn't m_uiotombuf(), the mbuf header was allocated below,
>> with correct wait argument.
>>
>> The wait argument for m_uiotombuf() should be changed to M_WAITOK, but in a
>> separate commit.
>> This one should be M_WAITOK always. It is M_TRYWAIT (equal to M_WAITOK) in
>> the current code.
>
> The reason why I changed the mbuf allocations with SS_NBIO is the rationale
> of sendfile() and the performance evaluation that was done by alc@ students.
> sendfile() has two flags which control its blocking behavior. Non blocking
> socket (SS_NBIO) and SF_NODISKIO. The latter is necessary because file
> reads or writes are normally not considered to be blocking. The most
> optimal sendfile() usage is with a single process doing accept(), parsing
> and then sendfile that should never ever block on anything. This way the
> main process then can use kqueue for all the socket stuff and it can
> transfer all sends that require disk I/O to a child process or thread to
> provide a context for the read. Meanwhile the main process is free to
> accept further connections and to continue serving existing connections.
> Having sendfile() block in mbuf allocation for the header, on sfbufs or
> anything else is not desirable and must be avoided. I know I'm extending
> the traditional definition of SS_NBIO a bit but it's fully in line with the
> semantics and desired operational behavior of sendfile(). The paper by
> alc@'s students clearly identifies this as the main property of a sendfile
> implementation besides its zero copy nature.

The semantics with regard to waiting are a bit confusing, but the existing
model has a fairly specific meaning that has some benefits. Normally we have
three dispositions for a network I/O operation:

(1) Fully blocking -- the default disposition. The operation may block for
    several reasons, most usually insufficient buffer space/data in the
    socket buffer, insufficient memory for the kernel to perform the
    operation (usually mbufs), or a user space page fault in reading or
    writing the data.

(2) Non-blocking -- SS_NBIO, MSG_NBIO, etc. The operation will not block if
    there is insufficient data/buffer space. Typically, this is aligned with
    select()/poll()/kqueue()'s notion of data or space.

(3) Non-waiting -- MSG_DONTWAIT. The operation will not sleep in kernel for
    any reason, either as part of I/O blocking, or for memory allocation. It
    may still sleep if a page fault occurs, but as kernel senders send using
    pinned kernel memory, this isn't an issue.

There are a few known bugs -- for example, in zero-copy mode, we may block
waiting for an sf_buf with MSG_DONTWAIT set (this used to be the case; I
haven't checked lately).

However, for applications, you typically run in (1) or (2) of the above,
where the notion of blocking is aligned with a notion of buffer space or
data, not with a notion of kernel sleeping. In particular, it has to do with
the definition used by select()/kqueue()/poll(). If you make SS_NBIO sockets
return immediately if there is no memory free for sendfile(), this will be
inconsistent with the normal behavior in which select() returning writable
means that you will be able to write -- so an application that sees the
socket as writable via select() might sit there spinning, performing the I/O
operation, with it repeatedly returning an error saying it wasn't ready.

My feeling is that we should constrain absolutely non-sleeping behavior to
the MSG_DONTWAIT case -- if desired, we could add SF_DONTWAIT to control
whether any sleeping happens at all. SS_NBIO should not return an error in a
limited memory case; it should sleep waiting on memory, as sleeping (mutexes,
memory allocation, ...) is not considered blocking. Blocking should continue
to refer to the socket buffer-related behavior, and specifically sbwait().
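
To make the three dispositions concrete, here is a minimal sketch in C of
which kinds of sleep each one permits under the model described above. The
names (io_disposition, wait_reason, may_sleep, check_model) are hypothetical
and exist only for this illustration; this is a model of the semantics, not
kernel code, and page faults are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration only; not kernel identifiers. */
enum io_disposition {
    IO_BLOCKING,    /* (1) default disposition */
    IO_NONBLOCKING, /* (2) SS_NBIO / MSG_NBIO style */
    IO_NONWAITING   /* (3) MSG_DONTWAIT style */
};

enum wait_reason {
    WAIT_SOCKBUF, /* socket buffer space or data, i.e. sbwait() */
    WAIT_MEMORY   /* kernel memory allocation, e.g. mbufs */
};

/* Returns true if an operation with disposition d may sleep for reason r. */
bool may_sleep(enum io_disposition d, enum wait_reason r)
{
    switch (d) {
    case IO_BLOCKING:
        return true;              /* may wait on anything */
    case IO_NONBLOCKING:
        return r != WAIT_SOCKBUF; /* never waits in sbwait(), may wait on memory */
    case IO_NONWAITING:
        return false;             /* never sleeps for either reason */
    }
    return false;
}

/* Spot-checks the cases the argument above turns on. */
void check_model(void)
{
    /* A non-blocking socket still sleeps on memory, so a socket that
     * select() reports as writable does not fail for lack of mbufs. */
    assert(may_sleep(IO_NONBLOCKING, WAIT_MEMORY));
    assert(!may_sleep(IO_NONBLOCKING, WAIT_SOCKBUF));
    /* Only the non-waiting disposition refuses to sleep on memory. */
    assert(!may_sleep(IO_NONWAITING, WAIT_MEMORY));
}
```

In this model the only difference between (2) and (3) is the WAIT_MEMORY
case, which is exactly the case the proposed SS_NBIO change would alter.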

However, we should fix any bugs in MSG_DONTWAIT for sosend/soreceive (and
hence sendmsg, recvmsg) that cause it to sleep improperly -- I'm not sure if
the zero-copy case still does it wrong, but that's potentially a problem if
we ever support zero-copy send from in kernel space, as sosend/soreceive can
be called while a mutex is held or in network interrupt context, hence
needing the flag.

Robert N M Watson
Computer Laboratory
University of Cambridge