From owner-freebsd-net@freebsd.org Tue Jul 26 22:57:10 2016 Return-Path: Delivered-To: freebsd-net@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 071F8BA682C for ; Tue, 26 Jul 2016 22:57:10 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id A6FD414D2; Tue, 26 Jul 2016 22:57:09 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 091771048F19; Wed, 27 Jul 2016 08:57:01 +1000 (AEST) Date: Wed, 27 Jul 2016 08:57:00 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Don Lewis cc: freebsd-net@freebsd.org Subject: Re: IPv6 -> IPv4 fallback broken in serf, kernel bug? In-Reply-To: <201607261559.u6QFxF8a081339@gw.catspoiler.org> Message-ID: <20160727054616.X990@besplex.bde.org> References: <201607261559.u6QFxF8a081339@gw.catspoiler.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=EfU1O6SC c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=wJ6eUo_-nK8opg67eDAA:9 a=9UXwsmE3bJ0M3TOt:21 a=_6LvzLDokv7YOv9B:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 26 Jul 2016 22:57:10 -0000 On Tue, 26 Jul 2016, Don Lewis wrote: > Serf has some code to fall back from IPv4 if an IPv6 and more generally > try different addresses on multi-homed servers if connection attempts > fail, but it does not work properly on recent versions of FreeBSD. I've > tested both recent FreeBSD 10.3-STABLE and HEAD. > > The way that it is supposed to work is that serf creates a socket, sets > it non-blocking, calls connect(), and then passes the fd to poll(). When > the connection attempt fails, it expects to see a POLLERR event. The > POLLERR event handler will then call getsockopt(fd, SOL_SOCKET, > SO_ERROR, &error, ...). If the returned error is ECONNREFUSED or one of > a couple of other errors, then serf will move on to the next address. > > Instead what happens is that serf also(?) sees POLLIN set, which it > processes first by calling read(), which returns an ECONNREFUSED error. > That not a documented error return from read(). FreeBSD still bogusly returns POLLIN (and POLLRDNORM) together with POLLHUP at EOF when there is no data (both set should mean both), and still has the bogus POLLINIGNEOF, but it it almost never returns POLLERR. My regression tests in tools/regression/poll check for not having this bug The only setting of POLLERR in kern is in kqueue_poll() for errors in initialization, and this doesn't set the other flags. The only uses of POLLERR in kern are: - in select(), to turn POLLERR into "set" for any backend that sets it (and there seems to be only 1 backend that sets it) - in vop_stdpoll() and poll_no_poll(), there is inconsistent bogus masking using POLLSTANDARD to obfuscate that standard flags which must be ignored are _not_ masked. So I don't see how you can get POLLIN with POLLERR. > An easy way to test this is to truss svn and attempt to do an http > checkout from a host that has both IPv6 and IPv4 addresses, but is not > listening on port 80. The only connection attempt will be to the IPv6 > address. > > socket(PF_INET6,SOCK_STREAM|SOCK_CLOEXEC,6) = 4 (0x4) > fcntl(4,F_GETFL,) = 2 (0x2) > fcntl(4,F_SETFL,O_NONBLOCK|0x2) = 0 (0x0) > setsockopt(0x4,0x6,0x1,0x7fffffffdda4,0x4) = 0 (0x0) > gettimeofday({ 1469515046.979461 },0x0) = 0 (0x0) > connect(4,{ AF_INET6 [xxxx:xxxx:xxxx:xxxx::xxxx]:80 },28) ERR#36 'Operation now in progress' > gettimeofday({ 1469515046.979614 },0x0) = 0 (0x0) > kevent(3,{ 4,EVFILT_READ,EV_ADD,0x0,0x0,0x805491300 },1,0x0,0,0x0) = 0 (0x0) > kevent(3,{ 4,EVFILT_WRITE,EV_ADD,0x0,0x0,0x805491300 },1,0x0,0,0x0) = 0 (0x0) > kevent(3,0x0,0,{ 4,EVFILT_READ,EV_EOF,NOTE_LOWAT|0x3c,0x0,0x805491300 4,EVFILT_WRITE,EV_EOF,NOTE_LOWAT|0x3c,0x8000,0x805491300 },32,{ 0.500000000 }) = 2 (0x2) I don't see any POLL* there or completely understand the notation or kqueue, but this looks like the poll() bug with POLLIN together with POLLHUP, not POLLIN together with POLLERR. Everything here seems to be correct. Not very good, but good enough here. EV_EOF is set by filt_soread() when SBS_CANTRECVMORE is set. SBS_CANTRECVMORE means hangup, not EOF, and I think there can be readable data from a socket in general but not after a connection error. So this translation is incorrect in general but correct after a connection error. kqueue just can't represent hangup and conflates it with EOF. When filt_soread() sets EV_EOF, it doesn't clear other flags, so NOTE_LOWAT remains set. This happens to be correct. But since NOTE_LOWAT really means low water, you can't use it to determine if (non-null) data can be read. (POSIX is unclear about whether the "data" for select() and poll() is actual data or just EOF.) poll() has almost the opposite problems. It can represent hangup but can't represent EOF. It can represent no data, but this doesn't mean EOF when the file is open. It can't represent low-water. so_poll_generic() starts carefully by setting POLLIN iff soreadable(). soreadable() is true above the watermark. So POLLIN for a socket normally means that (non-null) data above the watermark can be read (without blocking because it is above the watermark). This is correct semantics. But then so_poll_generic() sets POLLIN if it sets POLLHUP. This makes POLLIN worse than useless. A naive reader won't look at POLLHUP, but will trust POLLIN and spin reading at EOF. A non-naive reader will see POLLHUP but can't trust POLLIN then. It must spin reading until read returns EOF, and poll() is useless for avoiding this busy-waiting. Turning off O_NONBLOCK to avoid spinning is unsafe if the EOF is not sticky. Just having watermarks further complicates the idea of what "data" is. Null data is a special case of data that it is too small to be worth reading. It corresponds to a low watermark of 0 or 1. With watermarks, non-null datai below low water should be considered as not being there for the purposes of select() and poll(), but there if you try to read it. POSIX is unclear about this too. kqueue has the opposite problem. It handles watermarks directly, but seems to be missing support for transient EOF. This causes problems for tty devices too. In Net/2, select() basically uses a hard-coded watermark of 1, and this doesn't even work to give tinygrams because read() blocks after select() returns "set" for certain MIN/TIME combinations where the watermark should be MIN. This was fixed in FreeBSD-1, basically by copying the socket code. This was broken in 4.4BSD. This was broken in FreeBSD-2.early by copying 4.4BSD. This was fixed in FreeBSD-2 by restoring fixes. The fixes were refined in FreeBSD-[2-7]. All of the fixes were lost in FreeBSD-8. Most of the fixes are restored in my version. > read(4,0x80549c064,8000) ERR#61 'Connection refused' > kevent(3,{ 4,EVFILT_READ,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) = 0 (0x0) > kevent(3,{ 4,EVFILT_WRITE,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) = 0 (0x0) > kevent(3,{ 4,EVFILT_READ,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) ERR#2 'No such file or directory' > kevent(3,{ 4,EVFILT_WRITE,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) ERR#2 'No such file or directory' > close(4) = 0 (0x0) > close(3) = 0 (0x0) > svn: E170013: Unable to connect to a repository at URL ... > > It looks like it should be possible to patch serf to handle this, but: > * Should POLLIN be set for this event? I think there never was any data, so no for poll(). kqueue just cannot represent the no-data condition. > * What errno value should read() return in this case, if it is > ECONNREFUSED, then that should be documented. Don't know. Bruce