From owner-freebsd-current@FreeBSD.ORG Sun Mar 10 23:50:40 2013 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 9D42FBDA; Sun, 10 Mar 2013 23:50:40 +0000 (UTC) (envelope-from jilles@stack.nl) Received: from mx1.stack.nl (relay02.stack.nl [IPv6:2001:610:1108:5010::104]) by mx1.freebsd.org (Postfix) with ESMTP id 4DB597D7; Sun, 10 Mar 2013 23:50:40 +0000 (UTC) Received: from snail.stack.nl (snail.stack.nl [IPv6:2001:610:1108:5010::131]) by mx1.stack.nl (Postfix) with ESMTP id F15C83592FE; Mon, 11 Mar 2013 00:50:37 +0100 (CET) Received: by snail.stack.nl (Postfix, from userid 1677) id DAD762848C; Mon, 11 Mar 2013 00:50:37 +0100 (CET) Date: Mon, 11 Mar 2013 00:50:37 +0100 From: Jilles Tjoelker To: Jan Beich Subject: Re: r247839: broken pipe - for top, sudo and ports Message-ID: <20130310235037.GE94667@stack.nl> References: <5135B7E1.3050002@zedat.fu-berlin.de> <5135BBD9.9090009@FreeBSD.org> <51364914.2010104@zedat.fu-berlin.de> <51364E8D.5020608@zedat.fu-berlin.de> <20130306212408.GA15814@stack.nl> <1UDUQ3-0009HU-83@internal.tormail.org> <20130310212256.GD94667@stack.nl> <1UEoh3-000Ly7-Mf@internal.tormail.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1UEoh3-000Ly7-Mf@internal.tormail.org> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Davide Italiano , Dimitry Andric , FreeBSD Current , "Hartmann, O." X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 10 Mar 2013 23:50:40 -0000 On Sun, Mar 10, 2013 at 08:26:03PM -0200, Jan Beich wrote: > Jilles Tjoelker writes: > > On Thu, Mar 07, 2013 at 04:54:01AM -0100, Jan Beich wrote: > >> Jilles Tjoelker writes: > >> > On Tue, Mar 05, 2013 at 08:59:09PM +0100, Hartmann, O. wrote: > >> >> A "truss top" reveals this, is this of help? > >> >> [...] > >> >> stat("/etc/nsswitch.conf",{ mode=-rw-r--r-- > >> >> ,inode=162310,size=1007,blksize=32768 }) = 0 (0x0) > >> >> stat("/etc/nsswitch.conf",{ mode=-rw-r--r-- > >> >> ,inode=162310,size=1007,blksize=32768 }) = 0 (0x0) > >> >> stat("/etc/nsswitch.conf",{ mode=-rw-r--r-- > >> >> ,inode=162310,size=1007,blksize=32768 }) = 0 (0x0) > >> >> stat("/etc/nsswitch.conf",{ mode=-rw-r--r-- > >> >> ,inode=162310,size=1007,blksize=32768 }) = 0 (0x0) > >> >> stat("/etc/nsswitch.conf",{ mode=-rw-r--r-- > >> >> ,inode=162310,size=1007,blksize=32768 }) = 0 (0x0) > >> >> socket(PF_LOCAL,SOCK_STREAM,0) = 4 (0x4) > >> >> connect(4,{ AF_UNIX "/var/run/nscd" },15) = 0 (0x0) > >> >> fcntl(4,F_SETFL,O_NONBLOCK) = 0 (0x0) > >> >> kqueue(0x80183b000,0x80122fc58,0x10,0x80062b308,0x80183b010,0x2) > >> >> = 5 (0x5) > >> >> kevent(5,{0x4,EVFILT_WRITE,EV_ADD,0,0x0,0x0},1,0x0,0,0x0) = 0 (0x0) > >> >> kqueue(0x5,0x7fffffffd2e0,0x1,0x0,0x0,0x0) = 6 (0x6) > >> >> kevent(6,{0x4,EVFILT_READ,EV_ADD,0,0x0,0x0},1,0x0,0,0x0) = 0 (0x0) > >> >> kevent(5,{0x4,EVFILT_WRITE,EV_ADD,1,0x4,0x0},1,0x0,0,0x0) = 0 (0x0) > >> >> kevent(5,0x0,0,{0x4,EVFILT_WRITE,EV_EOF,0,0x2000,0x0},1,0x0) = 1 (0x1) > >> >> sendmsg(0x4,0x7fffffffd290,0x0,0x1,0x1,0x0) ERR#32 'Broken pipe' > >> >> SIGNAL 13 (SIGPIPE) > >> >> process exit, rval = 0 > >> > Apparently there is a bug that causes nscd to close the connection > >> > immediately but even then it is wrong that this terminates the calling > >> > program with SIGPIPE. > >> > The below patch prevents the SIGPIPE but cannot revive the connection to > >> > nscd. This may cause numeric UIDs in top or increase the load on the > >> > directory server. It is compile tested only. > >> [...] > >> The patch seems to fix the issue in a world after r247804. I don't see > >> numeric UIDs in top but without the patch top crashes with SIGPIPE a lot > >> less frequently than sudo or make install (in base/ports) for me. > >> In my case shutting down nscd helped, too. Compared to stock > >> nsswitch.conf I only have "cache" added. > > Can you find what causes nscd to close the connection quickly, such as > > using ktrace? > # single user mode > $ ktrace -p $(pgrep nscd); top -b; ktrace -c; kdump > 71 nscd GIO fd 5 wrote 0 bytes > "" > 71 nscd GIO fd 5 read 32 bytes > 0x0000 0400 0000 0000 0000 ffff 1000 0000 0000 0100 |..................| > 0x0012 0000 0000 0000 0000 0000 0000 0000 |..............| > > 71 nscd RET kevent 1 > 71 nscd CALL accept(0x4,0,0) > 71 nscd RET accept 6 We are in usr.sbin/nscd/nscd.c accept_connection() here. > 71 nscd CALL getsockopt(0x6,0,0x1,0x7fffff9fce28,0x7fffff9fce24) > 71 nscd RET getsockopt 0 Probably getpeereid(). On another note, nscd leaks the file descriptor if this, the below init_query_state() or the below kevent() fails. > 71 nscd CALL kevent(0x5,0x7fffff9fcf00,0x2,0,0,0x7fffff9fcf50) > 71 nscd GIO fd 5 wrote 64 bytes > 0x0000 0600 0000 0000 0000 f9ff 1100 0000 0000 401f |................@.| > 0x0012 0000 401f 0000 40e6 4002 0800 0000 0600 0000 |..@...@.@.........| > 0x0024 0000 0000 ffff 1100 0100 0000 0400 0000 0000 |..................| > 0x0036 0000 40e6 4002 0800 0000 |..@.@.....| > Adding an EVFILT_TIMER and an EVFILT_READ. The data field for the EVFILT_TIMER is a bit strange. I would expect 0x0000000000001f40 (8000 decimal) but it puts instead 0x00001f4000001f40. This does not happen when I run tools/regression/kqueue/kqtest on a stable/9 amd64 machine or on ref10-amd64 which currently runs r247722. On a head (r248047) i386 machine, the data field looks right. > 71 nscd GIO fd 5 read 0 bytes > "" > 71 nscd RET kevent 0 > 71 nscd CALL kevent(0x5,0x7fffff9fcec0,0x1,0,0,0x7fffff9fcee0) > 71 nscd GIO fd 5 wrote 32 bytes > 0x0000 0400 0000 0000 0000 ffff 1100 0000 0000 0000 |..................| > 0x0012 0000 0000 0000 0000 0000 0000 0000 |..............| > Probably registering interest for the next connection. > 71 nscd GIO fd 5 read 0 bytes > "" > 71 nscd RET kevent 0 > 71 nscd CALL kevent(0x5,0,0,0x7fffff9fcec0,0x1,0) > 71 nscd GIO fd 5 wrote 0 bytes > "" > 71 nscd GIO fd 5 read 32 bytes > 0x0000 0600 0000 0000 0000 f9ff 3000 0000 0000 0100 |..........0.......| > 0x0012 0000 0000 0000 40e6 4002 0800 0000 |......@.@.....| > The timer has already expired. This cannot be right. (It cannot be that EVFILT_READ is broken and eight seconds actually passed because the send calls would have worked in that case.) tools/regression/kqueue/kqtest works correctly on the aforementioned r248047 head i386 machine. > 71 nscd RET kevent 1 > 71 nscd CALL close(0x6) > 71 nscd RET close 0 > 71 nscd CALL kevent(0x5,0,0,0x7fffff9fcec0,0x1,0) -- Jilles Tjoelker