From owner-freebsd-net@freebsd.org Tue Jul 26 23:41:02 2016 Return-Path: Delivered-To: freebsd-net@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B9238BA515E for ; Tue, 26 Jul 2016 23:41:02 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from gw.catspoiler.org (unknown [IPv6:2602:304:b010:ef20::f2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "gw.catspoiler.org", Issuer "gw.catspoiler.org" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 89F8F1BCB for ; Tue, 26 Jul 2016 23:41:02 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.15.2/8.15.2) with ESMTP id u6QNes2t082436; Tue, 26 Jul 2016 16:40:58 -0700 (PDT) (envelope-from truckman@FreeBSD.org) Message-Id: <201607262340.u6QNes2t082436@gw.catspoiler.org> Date: Tue, 26 Jul 2016 16:40:54 -0700 (PDT) From: Don Lewis Subject: Re: IPv6 -> IPv4 fallback broken in serf, kernel bug? To: brde@optusnet.com.au cc: freebsd-net@freebsd.org In-Reply-To: <20160727054616.X990@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 26 Jul 2016 23:41:02 -0000 On 27 Jul, Bruce Evans wrote: > On Tue, 26 Jul 2016, Don Lewis wrote: > >> Serf has some code to fall back from IPv4 if an IPv6 and more generally >> try different addresses on multi-homed servers if connection attempts >> fail, but it does not work properly on recent versions of FreeBSD. I've >> tested both recent FreeBSD 10.3-STABLE and HEAD. >> >> The way that it is supposed to work is that serf creates a socket, sets >> it non-blocking, calls connect(), and then passes the fd to poll(). When >> the connection attempt fails, it expects to see a POLLERR event. The >> POLLERR event handler will then call getsockopt(fd, SOL_SOCKET, >> SO_ERROR, &error, ...). If the returned error is ECONNREFUSED or one of >> a couple of other errors, then serf will move on to the next address. >> >> Instead what happens is that serf also(?) sees POLLIN set, which it >> processes first by calling read(), which returns an ECONNREFUSED error. >> That not a documented error return from read(). > > FreeBSD still bogusly returns POLLIN (and POLLRDNORM) together with > POLLHUP at EOF when there is no data (both set should mean both), and > still has the bogus POLLINIGNEOF, but it it almost never returns POLLERR. > My regression tests in tools/regression/poll check for not having this > bug > > The only setting of POLLERR in kern is in kqueue_poll() for errors in > initialization, and this doesn't set the other flags. > > The only uses of POLLERR in kern are: > - in select(), to turn POLLERR into "set" for any backend that sets it > (and there seems to be only 1 backend that sets it) > - in vop_stdpoll() and poll_no_poll(), there is inconsistent bogus masking > using POLLSTANDARD to obfuscate that standard flags which must be > ignored are _not_ masked. > > So I don't see how you can get POLLIN with POLLERR. > >> An easy way to test this is to truss svn and attempt to do an http >> checkout from a host that has both IPv6 and IPv4 addresses, but is not >> listening on port 80. The only connection attempt will be to the IPv6 >> address. >> >> socket(PF_INET6,SOCK_STREAM|SOCK_CLOEXEC,6) = 4 (0x4) >> fcntl(4,F_GETFL,) = 2 (0x2) >> fcntl(4,F_SETFL,O_NONBLOCK|0x2) = 0 (0x0) >> setsockopt(0x4,0x6,0x1,0x7fffffffdda4,0x4) = 0 (0x0) >> gettimeofday({ 1469515046.979461 },0x0) = 0 (0x0) >> connect(4,{ AF_INET6 [xxxx:xxxx:xxxx:xxxx::xxxx]:80 },28) ERR#36 'Operation now in progress' >> gettimeofday({ 1469515046.979614 },0x0) = 0 (0x0) >> kevent(3,{ 4,EVFILT_READ,EV_ADD,0x0,0x0,0x805491300 },1,0x0,0,0x0) = 0 (0x0) >> kevent(3,{ 4,EVFILT_WRITE,EV_ADD,0x0,0x0,0x805491300 },1,0x0,0,0x0) = 0 (0x0) >> kevent(3,0x0,0,{ 4,EVFILT_READ,EV_EOF,NOTE_LOWAT|0x3c,0x0,0x805491300 4,EVFILT_WRITE,EV_EOF,NOTE_LOWAT|0x3c,0x8000,0x805491300 },32,{ 0.500000000 }) = 2 (0x2) > > I don't see any POLL* there or completely understand the notation or kqueue, > but this looks like the poll() bug with POLLIN together with POLLHUP, not > POLLIN together with POLLERR. I didn't try to decipher out the kqueue stuff. I was thinking that our poll() was using kqueue under the hood, but it turns out that the poll emulation is actually being done by apr. Sigh ... A comment in the emulation code says: /* APR_POLLPRI, APR_POLLERR, and APR_POLLNVAL are not handled by this * implementation. ... double sigh. > Everything here seems to be correct. Not very good, but good enough here. > > EV_EOF is set by filt_soread() when SBS_CANTRECVMORE is set. > SBS_CANTRECVMORE means hangup, not EOF, and I think there can be > readable data from a socket in general but not after a connection > error. So this translation is incorrect in general but correct after > a connection error. kqueue just can't represent hangup and conflates > it with EOF. But should there be a hangup or EOF if we never got connected in the first place? > When filt_soread() sets EV_EOF, it doesn't clear other flags, so > NOTE_LOWAT remains set. This happens to be correct. But since NOTE_LOWAT > really means low water, you can't use it to determine if (non-null) data > can be read. (POSIX is unclear about whether the "data" for select() and > poll() is actual data or just EOF.) > > poll() has almost the opposite problems. It can represent hangup but > can't represent EOF. It can represent no data, but this doesn't mean > EOF when the file is open. It can't represent low-water. > so_poll_generic() starts carefully by setting POLLIN iff soreadable(). > soreadable() is true above the watermark. So POLLIN for a socket > normally means that (non-null) data above the watermark can be read > (without blocking because it is above the watermark). This is correct > semantics. But then so_poll_generic() sets POLLIN if it sets POLLHUP. > This makes POLLIN worse than useless. A naive reader won't look at > POLLHUP, but will trust POLLIN and spin reading at EOF. A non-naive > reader will see POLLHUP but can't trust POLLIN then. It must spin > reading until read returns EOF, and poll() is useless for avoiding > this busy-waiting. Turning off O_NONBLOCK to avoid spinning is unsafe > if the EOF is not sticky. > > Just having watermarks further complicates the idea of what "data" is. > Null data is a special case of data that it is too small to be worth > reading. It corresponds to a low watermark of 0 or 1. With watermarks, > non-null datai below low water should be considered as not being there > for the purposes of select() and poll(), but there if you try to read > it. POSIX is unclear about this too. kqueue has the opposite problem. > It handles watermarks directly, but seems to be missing support for > transient EOF. > > This causes problems for tty devices too. In Net/2, select() basically > uses a hard-coded watermark of 1, and this doesn't even work to give > tinygrams because read() blocks after select() returns "set" for certain > MIN/TIME combinations where the watermark should be MIN. This was fixed > in FreeBSD-1, basically by copying the socket code. This was broken in > 4.4BSD. This was broken in FreeBSD-2.early by copying 4.4BSD. This was > fixed in FreeBSD-2 by restoring fixes. The fixes were refined in > FreeBSD-[2-7]. All of the fixes were lost in FreeBSD-8. Most of the > fixes are restored in my version. > >> read(4,0x80549c064,8000) ERR#61 'Connection refused' >> kevent(3,{ 4,EVFILT_READ,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) = 0 (0x0) >> kevent(3,{ 4,EVFILT_WRITE,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) = 0 (0x0) >> kevent(3,{ 4,EVFILT_READ,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) ERR#2 'No such file or directory' >> kevent(3,{ 4,EVFILT_WRITE,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) ERR#2 'No such file or directory' >> close(4) = 0 (0x0) >> close(3) = 0 (0x0) >> svn: E170013: Unable to connect to a repository at URL ... >> >> It looks like it should be possible to patch serf to handle this, but: >> * Should POLLIN be set for this event? > > I think there never was any data, so no for poll(). kqueue just cannot > represent the no-data condition. > >> * What errno value should read() return in this case, if it is >> ECONNREFUSED, then that should be documented. > > Don't know. > > Bruce