From: Don Lewis
To: bde@zeta.org.au
cc: current@freebsd.org
cc: tjr@freebsd.org
Date: Mon, 16 Jun 2003 04:09:22 -0700 (PDT)
Subject: Re: qmail uses 100% cpu after FreeBSD-5.0 to 5.1 upgrade
In-Reply-To: <20030616200848.U27906@gamplex.bde.org>

On 16 Jun, Bruce Evans wrote:
> On Mon, 16 Jun 2003, Don Lewis wrote:
>
>> On 16 Jun, I wrote:
>> > On 16 Jun, Tim Robbins wrote:
>> >> >>> This looks like a bug in the named pipe code. Reverting
>> >>> sys/fs/fifofs/fifo_vnops.c to the RELENG_5_0 version makes the problem go
>> >>> away. I haven't tracked down exactly what change between RELENG_5_0 and
>> >>> RELENG_5_1 caused the problem.
>> >>
>> >> Looks like revision 1.86 works, but it stops working with 1.87. Moving the
>> >> soclose() calls to fifo_inactive() may have caused it.
>> >
>> > This is an interesting observation, but I'm not sure why it would make a
>> > difference.
>> > I haven't looked at the qmail source, but it looks like it
>> > is doing a non-blocking open on the fifo, calling select() on the fd,
>> > and hoping that select() waits for a writer to open the fifo before
>> > returning with an indication that the descriptor is readable.
>
> In my review of 1.87, I forgot to ask you how atomic the close is with part
> of it moved out to fifo_inactive(). I think it's important that all
> traces of the old open have gone away (as far as applications can tell)
> when the last close returns.

I hadn't taken queued data into consideration. Now that I've looked at
this more closely, there are other problems in both the old and new
code. If a process calls fcntl(fd, F_SETOWN, ...) on one end of the
fifo, that should be undone when that end of the fifo is closed. In the
old implementation, that only happens when both ends of the fifo are
closed and the sockets are deleted.

>> On 5.1-current, select() waits forever, even if the fifo has been opened
>> for writing by another process. Select() only returns when something
>> has actually been written to the fifo, and since this process doesn't
>> read anything from the fifo, it spins on select() forever.
>>
>> If some data is getting written to the fifo, it doesn't look like qmail
>> consumes it, and since fifo_close in 1.87 doesn't destroy the sockets,
>> it looks like the data is hanging around in the fifo while neither end
>> is open, and qmail stumbles across this data when it calls select()
>> after re-opening the fifo.
>>
>> Now there are two questions that I can't answer:
>>
>> Why is my analysis of select() and the SS_CANTRCVMORE flag
>> incorrect in 5.1-current with version 1.87 or 1.88 of
>> fifo_vnops.c.
>
> I think it is correct, assuming that something writes to the fifo.
> Writing might be part of synchronization but actually reading the
> data should not be necessary since the last close must discard the
> data (POSIX spec).

It sure looks to me like SS_CANTRCVMORE is always set when the write end
of the fifo is closed, no matter whether the sockets were freshly
allocated by a fifo_open() call on the read end of the fifo, or because
the last writer closed the write end of the fifo. It sure looks like
select() should immediately return if this flag is set, but it is not
returning ...

Actually, something seems broken. I modified my little test program to
actually read the data, which works just fine, but select() still blocks
when the writer closes the fifo, so there doesn't seem to be a way to
detect the EOF.

>> Why doesn't qmail get stuck in a similar loop in 4.8-stable,
>> since select always returns true for reading on a fifo with no
>> writers?
>
> Don't know. Maybe it uses autoconfig to handle the 4.8 behaviour.
> The 4.8 behaviour is normal compared with the buggy behaviour of
> not discarding data on last close, so applications should handle it
> better :-). Maybe qmail spins under 4.8 too, but only until
> synchronization is achieved.
>
> Bruce
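
The little test program mentioned above is not included in this message.
As a rough sketch of that kind of probe (not the original program, and
written in Python rather than C purely for brevity), the following opens
a fifo non-blocking, calls select() on the read end, and records what
select() and read() report at each stage. The function name probe_fifo,
the file name, and the timeout are illustrative assumptions. The
no-writer and EOF results from select() are expected to differ between
systems, which is exactly the behaviour under discussion.

```python
import os
import select
import tempfile


def probe_fifo(timeout=0.2):
    """Record what select() and read() report on a non-blocking fifo."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "probe.fifo")  # hypothetical file name
        os.mkfifo(path)

        def readable(fd):
            # True if select() marks fd readable within the timeout.
            ready, _, _ = select.select([fd], [], [], timeout)
            return bool(ready)

        rfd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
        try:
            # No writer has opened the fifo yet. Platform-dependent:
            # the thread says 4.8-stable reports "readable" here.
            results["readable_no_writer"] = readable(rfd)

            wfd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
            os.write(wfd, b"x")
            results["readable_with_data"] = readable(rfd)
            results["data"] = os.read(rfd, 16)

            os.close(wfd)
            # Writer gone and queue drained: this is the EOF case that
            # select() is reported not to signal on 5.1-current.
            results["readable_at_eof"] = readable(rfd)
            results["eof_read"] = os.read(rfd, 16)
        finally:
            os.close(rfd)

        # Write without reading, close both ends, then reopen. The last
        # close should discard the queued data.
        rfd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
        wfd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
        os.write(wfd, b"stale")
        os.close(wfd)
        os.close(rfd)
        rfd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
        try:
            results["after_reopen"] = os.read(rfd, 16)
        finally:
            os.close(rfd)
    return results


if __name__ == "__main__":
    for key, value in probe_fifo().items():
        print(key, value)
```

If "after_reopen" comes back non-empty, data survived the last close,
which would match the stale-data theory above. If "readable_at_eof" is
False while "eof_read" is empty, select() is failing to report the EOF
even though read() sees it.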