From: Bruce Evans
Date: Tue, 3 Apr 2012 23:49:48 +1000 (EST)
To: Andre Oppermann
Cc: Alexandre Martins, freebsd-current@freebsd.org, bde@freebsd.org
Subject: Re: Potential deadlock on mbuf
Message-ID: <20120403232051.V1450@besplex.bde.org>
In-Reply-To: <4F7ADF5D.1060807@freebsd.org>
References: <201204021821.37437.alexandre.martins@netasq.com> <4F7ADF5D.1060807@freebsd.org>

On Tue, 3 Apr 2012, Andre Oppermann wrote:

> On 02.04.2012 18:21, Alexandre Martins wrote:
>> Dear,
>>
>> I have currently having troubles with a basic socket stress.
>>
>> The socket are setup to use non-blocking I/O.
>>
>> During this stress-test, the kernel is running mbuf exhaustion, the
>> goal is to see system limits.
>>
>> If the program make a write on a socket during this mbuf exhaustion,
>> it become blocked in "write" system call. The status of the process
>> is "zonelimit" and whole network I/O fall in timeout.
>>
>> I have found the root cause of the block :
>> http://svnweb.freebsd.org/base/head/sys/kern/uipc_socket.c?view=markup#l1279
>>
>> So, the question is : Why m_uiotombuf is called with a blocking
>> parameter (M_WAITOK) even if is for a non-blocking socket ?
>>
>> Then, if M_NOWAIT is used, maybe it will be usefull to have an
>> 'ENOMEM' error.

I'm surprised you can even see blocking of malloc(... M_WAITOK).
O_NONBLOCK is mostly for operations that might block for a long time,
but malloc() is not expected to block for long. Regular files are
always so non-blocking that most file systems have no references to
O_NONBLOCK (or FNONBLOCK), but file systems often execute memory
allocation code that can easily block for as long as malloc() does.
When malloc() starts blocking for a long time, lots of things will fail.

> This is a bit of an catch-22 we have here. Trouble is that when
> we return with EAGAIN the next select/poll cycle will tell you
> that this and possibly other sockets are writeable again, when in
> fact they are not due to kernel memory shortage. Then the application
> will tightly loop around the "writeable" non-writeable sockets.
> It's about the interaction of write with O_NONBLOCK and select/poll
> on the socket.

This would be difficult to handle better.

> Do you have any references how other OSes behave, in particular
> Linux?
>
> I've added bde@ as our resident standards compliance expert.
> Hopefully he can give us some more insight on this issue.

Standards won't say what happens at this level of detail.

Blocking for network i/o is still completely broken at levels below
sockets AFAIK. I (and ttcp) mainly wanted it to work for send() of
udp. I saw no problems at the socket level, but driver queues just
filled up and send() returned ENOBUFS.
I wanted either the opposite of O_NONBLOCK (block until !ENOBUFS),
or at least for select() to work for waiting until !ENOBUFS. But
select() doesn't work at all for this. It seemed to work better in
Linux.

Bruce
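
To make the ENOBUFS case above concrete: since select() cannot wait for the driver queue to drain, about all an application can do is sleep and retry. The following is only a minimal sketch under that assumption, not code from this thread. The helper name send_retry_enobufs(), the retry count parameter and the backoff delays are all hypothetical.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <time.h>

/*
 * Hypothetical helper (not from the thread): send one datagram on a
 * connected socket, retrying while send() fails with ENOBUFS.
 * select()/poll() cannot wait for this condition, so the helper
 * sleeps briefly between attempts instead.
 */
ssize_t
send_retry_enobufs(int fd, const void *buf, size_t len, int max_tries)
{
	struct timespec delay = { 0, 1000000 };	/* first sleep: 1 ms */
	ssize_t n;
	int i;

	assert(max_tries > 0);
	for (i = 0; i < max_tries; i++) {
		n = send(fd, buf, len, 0);
		if (n >= 0 || errno != ENOBUFS)
			return (n);	/* success, or an error not handled here */
		nanosleep(&delay, NULL);
		if (delay.tv_nsec < 100000000)
			delay.tv_nsec *= 2;	/* double until past 100 ms */
	}
	errno = ENOBUFS;	/* still no buffer space after max_tries */
	return (-1);
}
```

A caller would use it in place of a bare send(), for example send_retry_enobufs(fd, buf, len, 10), and treat a final -1 with errno set to ENOBUFS as "drop this datagram or give up". This only papers over the problem in userland; it does not give the blocking or select() behaviour wished for above.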