From owner-freebsd-net@freebsd.org  Tue Jul 26 22:57:10 2016
Date: Wed, 27 Jul 2016 08:57:00 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Don Lewis <truckman@freebsd.org>
cc: freebsd-net@freebsd.org
Subject: Re: IPv6 -> IPv4 fallback broken in serf, kernel bug?
In-Reply-To: <201607261559.u6QFxF8a081339@gw.catspoiler.org>
Message-ID: <20160727054616.X990@besplex.bde.org>
References: <201607261559.u6QFxF8a081339@gw.catspoiler.org>

On Tue, 26 Jul 2016, Don Lewis wrote:

> Serf has some code to fall back to IPv4 if an IPv6 connection attempt
> fails, and more generally to try different addresses on multi-homed
> servers if connection attempts fail, but it does not work properly on
> recent versions of FreeBSD. I've tested both recent FreeBSD 10.3-STABLE
> and HEAD.
>
> The way that it is supposed to work is that serf creates a socket, sets
> it non-blocking, calls connect(), and then passes the fd to poll(). When
> the connection attempt fails, it expects to see a POLLERR event.  The
> POLLERR event handler will then call getsockopt(fd, SOL_SOCKET,
> SO_ERROR, &error, ...).  If the returned error is ECONNREFUSED or one of
> a couple of other errors, then serf will move on to the next address.
>
> Instead what happens is that serf also(?) sees POLLIN set, which it
> processes first by calling read(), which returns an ECONNREFUSED error.
> That is not a documented error return from read().

FreeBSD still bogusly returns POLLIN (and POLLRDNORM) together with
POLLHUP at EOF when there is no data (both flags set should mean that
both conditions hold), and still has the bogus POLLINIGNEOF, but it
almost never returns POLLERR.  My regression tests in
tools/regression/poll check for the absence of the POLLIN-with-POLLHUP
bug.

The only setting of POLLERR in kern is in kqueue_poll() for errors in
initialization, and this doesn't set the other flags.

The only uses of POLLERR in kern are:
- in select(), to turn POLLERR into "set" for any backend that sets it
   (and there seems to be only 1 backend that sets it)
- in vop_stdpoll() and poll_no_poll(), there is inconsistent bogus masking
   using POLLSTANDARD to obfuscate that standard flags which must be
   ignored are _not_ masked.

So I don't see how you can get POLLIN with POLLERR.
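
For reference, here is a minimal sketch of the connect-then-poll
fallback described above, written so that it does not depend on POLLERR
ever being reported.  It waits for POLLOUT and then asks SO_ERROR for
the outcome.  The names connect_result() and connect_any() are
hypothetical and are not taken from serf, and the timeout handling is
an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: start a non-blocking connect() and wait for its
 * outcome.  Returns 0 on success, otherwise an errno value (ETIMEDOUT
 * if nothing happened within timeout_ms).
 */
static int
connect_result(int fd, const struct sockaddr *sa, socklen_t salen,
    int timeout_ms)
{
    struct pollfd pfd;
    socklen_t len;
    int error, flags, n;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return (errno);

    if (connect(fd, sa, salen) == 0)
        return (0);             /* connected immediately */
    if (errno != EINPROGRESS)
        return (errno);         /* failed immediately */

    /* Completion, successful or not, makes the socket writable. */
    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n == -1 && errno == EINTR);
    if (n == -1)
        return (errno);
    if (n == 0)
        return (ETIMEDOUT);

    /* Ignore which revents bits are set; SO_ERROR has the result. */
    error = 0;
    len = sizeof(error);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &error, &len) == -1)
        return (errno);
    return (error);
}

/*
 * Hypothetical helper: try each address in turn and move on to the
 * next one when an attempt fails.  Returns a connected (and still
 * non-blocking) fd, or -1 if every address failed.
 */
static int
connect_any(const struct addrinfo *res, int timeout_ms)
{
    const struct addrinfo *ai;
    int fd;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd == -1)
            continue;
        if (connect_result(fd, ai->ai_addr, ai->ai_addrlen,
            timeout_ms) == 0)
            return (fd);
        close(fd);
    }
    return (-1);
}
```

A caller would pass the list returned by getaddrinfo() to
connect_any().  Since the result comes from SO_ERROR and not from
read(), it should not matter whether the kernel reports POLLIN, POLLHUP
or POLLERR for the failed attempt.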

> An easy way to test this is to truss svn and attempt to do an http
> checkout from a host that has both IPv6 and IPv4 addresses, but is not
> listening on port 80.  The only connection attempt will be to the IPv6
> address.
>
> socket(PF_INET6,SOCK_STREAM|SOCK_CLOEXEC,6)	 = 4 (0x4)
> fcntl(4,F_GETFL,)				 = 2 (0x2)
> fcntl(4,F_SETFL,O_NONBLOCK|0x2)			 = 0 (0x0)
> setsockopt(0x4,0x6,0x1,0x7fffffffdda4,0x4)	 = 0 (0x0)
> gettimeofday({ 1469515046.979461 },0x0)		 = 0 (0x0)
> connect(4,{ AF_INET6 [xxxx:xxxx:xxxx:xxxx::xxxx]:80 },28) ERR#36 'Operation now in progress'
> gettimeofday({ 1469515046.979614 },0x0)		 = 0 (0x0)
> kevent(3,{ 4,EVFILT_READ,EV_ADD,0x0,0x0,0x805491300 },1,0x0,0,0x0) = 0 (0x0)
> kevent(3,{ 4,EVFILT_WRITE,EV_ADD,0x0,0x0,0x805491300 },1,0x0,0,0x0) = 0 (0x0)
> kevent(3,0x0,0,{ 4,EVFILT_READ,EV_EOF,NOTE_LOWAT|0x3c,0x0,0x805491300 4,EVFILT_WRITE,EV_EOF,NOTE_LOWAT|0x3c,0x8000,0x805491300 },32,{ 0.500000000 }) = 2 (0x2)

I don't see any POLL* there or completely understand the notation or kqueue,
but this looks like the poll() bug with POLLIN together with POLLHUP, not
POLLIN together with POLLERR.

Everything here seems to be correct.  Not very good, but good enough here.

EV_EOF is set by filt_soread() when SBS_CANTRECVMORE is set.
SBS_CANTRECVMORE means hangup, not EOF, and I think there can be
readable data from a socket in general but not after a connection
error.  So this translation is incorrect in general but correct after
a connection error.  kqueue just can't represent hangup and conflates
it with EOF.

When filt_soread() sets EV_EOF, it doesn't clear other flags, so
NOTE_LOWAT remains set.  This happens to be correct.  But since NOTE_LOWAT
really means low water, you can't use it to determine if (non-null) data
can be read.  (POSIX is unclear about whether the "data" for select() and
poll() is actual data or just EOF.)

poll() has almost the opposite problems.  It can represent hangup but
can't represent EOF.  It can represent no data, but this doesn't mean
EOF when the file is open.  It can't represent low-water.
so_poll_generic() starts carefully by setting POLLIN iff soreadable().
soreadable() is true above the watermark.  So POLLIN for a socket
normally means that (non-null) data above the watermark can be read
(without blocking because it is above the watermark).  This is correct
semantics.  But then so_poll_generic() sets POLLIN if it sets POLLHUP.
This makes POLLIN worse than useless.  A naive reader won't look at
POLLHUP, but will trust POLLIN and spin reading at EOF.  A non-naive
reader will see POLLHUP but can't trust POLLIN then.   It must spin
reading until read returns EOF, and poll() is useless for avoiding
this busy-waiting.  Turning off O_NONBLOCK to avoid spinning is unsafe
if the EOF is not sticky.
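
To make the reader's side of this concrete, here is a minimal sketch of
a reader that trusts neither POLLIN nor POLLHUP and lets read() decide
when EOF has been reached.  The name drain_until_eof() is hypothetical.
The sketch assumes a blocking descriptor; on a non-blocking one the
EAGAIN branch is exactly the spin described above.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: read and discard everything from fd until EOF.
 * Returns the number of bytes read, or -1 on error.
 */
static long
drain_until_eof(int fd)
{
    struct pollfd pfd;
    char buf[512];
    long total = 0;
    ssize_t n;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    for (;;) {
        pfd.revents = 0;
        if (poll(&pfd, 1, -1) == -1) {
            if (errno == EINTR)
                continue;
            return (-1);
        }
        if (pfd.revents & POLLNVAL)
            return (-1);
        /*
         * POLLIN may be set together with POLLHUP even when there
         * is no data, so only the return value of read() is trusted.
         */
        n = read(fd, buf, sizeof(buf));
        if (n == 0)
            return (total);     /* EOF */
        if (n == -1) {
            if (errno == EINTR || errno == EAGAIN)
                continue;       /* EAGAIN: busy-waits if non-blocking */
            return (-1);
        }
        total += n;
    }
}
```

For example, after writing 5 bytes to a pipe and closing the write end,
drain_until_eof() on the read end returns 5.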

Just having watermarks further complicates the idea of what "data" is.
Null data is a special case of data that is too small to be worth
reading.  It corresponds to a low watermark of 0 or 1.  With watermarks,
non-null data below low water should be considered as not being there
for the purposes of select() and poll(), but there if you try to read
it.  POSIX is unclear about this too.  kqueue has the opposite problem.
It handles watermarks directly, but seems to be missing support for
transient EOF.
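
For sockets, the low watermark discussed here is the one that an
application can set with SO_RCVLOWAT.  A minimal sketch follows; the
name set_rcvlowat() is hypothetical.  Whether select() and poll() then
treat data below the mark as absent is platform-dependent, which is the
unclear part.

```c
#include <assert.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: set the receive low watermark of a socket and
 * return the value the kernel reports back, or -1 on error.
 */
static int
set_rcvlowat(int fd, int nbytes)
{
    socklen_t len = sizeof(nbytes);

    assert(nbytes > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &nbytes,
        sizeof(nbytes)) == -1)
        return (-1);
    /* Read the value back, since the kernel may clamp the request. */
    if (getsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &nbytes, &len) == -1)
        return (-1);
    return (nbytes);
}
```

A read() on such a socket can still return fewer bytes than the mark,
for example at EOF or on error, so the caller has to handle short reads
either way.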

This causes problems for tty devices too.  In Net/2, select() basically
uses a hard-coded watermark of 1, and this doesn't even work to give
tinygrams because read() blocks after select() returns "set" for certain
MIN/TIME combinations where the watermark should be MIN.  This was fixed
in FreeBSD-1, basically by copying the socket code.  This was broken in
4.4BSD.  This was broken in FreeBSD-2.early by copying 4.4BSD.  This was
fixed in FreeBSD-2 by restoring fixes.  The fixes were refined in
FreeBSD-[2-7].  All of the fixes were lost in FreeBSD-8.  Most of the
fixes are restored in my version.

> read(4,0x80549c064,8000)			 ERR#61 'Connection refused'
> kevent(3,{ 4,EVFILT_READ,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) = 0 (0x0)
> kevent(3,{ 4,EVFILT_WRITE,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) = 0 (0x0)
> kevent(3,{ 4,EVFILT_READ,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) ERR#2 'No such file or directory'
> kevent(3,{ 4,EVFILT_WRITE,EV_DELETE,0x0,0x0,0x0 },1,0x0,0,0x0) ERR#2 'No such file or directory'
> close(4)					 = 0 (0x0)
> close(3)					 = 0 (0x0)
> svn: E170013: Unable to connect to a repository at URL ...
>
> It looks like it should be possible to patch serf to handle this, but:
>  * Should POLLIN be set for this event?

I think there never was any data, so no for poll().  kqueue just cannot
represent the no-data condition.

>  * What errno value should read() return in this case?  If it is
>    ECONNREFUSED, then that should be documented.

Don't know.

Bruce