Date: Wed, 13 Aug 2008 00:07:43 +1000 (EST)
From: Bruce Evans
To: Bruce Evans
Cc: cvs-src@FreeBSD.org, src-committers@FreeBSD.org, Ed Schouten,
    cvs-all@FreeBSD.org, John Baldwin
Subject: Re: cvs commit: src/sys/dev/io iodev.c
Message-ID: <20080812231130.D760@besplex.bde.org>
In-Reply-To: <20080812014937.E21092@besplex.bde.org>
References: <200808081343.m78DhwYE068477@repoman.freebsd.org>
 <200808081226.32089.jhb@freebsd.org> <20080809130929.P77335@delplex.bde.org>
 <200808091555.25020.jhb@freebsd.org> <20080812014937.E21092@besplex.bde.org>

On Tue, 12 Aug 2008, Bruce Evans wrote:

> On Sat, 9 Aug 2008, John Baldwin wrote:
>> You failed to note that not using D_TRACKCLOSE doesn't actually reliably
>> close devices.
>
> No, I noted that vfs doesn't count last closes right.  Unreliability of
> last-close is a special case of that.  But using D_TRACKCLOSE mostly makes
> things worse.  To avoid the vfs bugs in tracked closes, it is necessary
> for individual drivers to understand vfs's bugs better than vfs does.
>
>> I set D_TRACKCLOSE on bpf because under load at work (lots of
>> concurrent nmap's) bpf devices would sometimes never get fully closed, so
>> you'd have an unopened bpf device that was no longer available.
>
> It is unclear how that helps.  vfs's bugs must be really large for the
> last-close to go completely missing.  kib fixed some of them a few months
> ago (before you changed bpf, I think).  bpf is especially simple since it
> is exclusive-access, so the vfs bugs should be smaller for it.
>
>> D_TRACKCLOSE gives much saner semantics for devices: each open() of an fd
>> calls the 'd_open' method, and the last close of that fd (which could be
>> in another process) calls 'd_close'.
>
> This is insane semantics for most devices, and D_TRACKCLOSE doesn't give
> it.  Most devices don't care about the difference between file descriptors
> and files -- they want d_close to be called only when the last file
> descriptor on the device is closed.  !D_TRACKCLOSE gives this, modulo the
> vfs bugs.
>
> D_TRACKCLOSE doesn't give these semantics: revoke() closes everything
> in 1 step and then calls d_close(), giving an N-1 correspondence between
> d_open()s and d_close()s (modulo the vfs bugs giving a fuzzier
> correspondence).  (IIRC, at least before kib's fixes, it was possible
> for revoke() to cause any of 0, 1 or 2 calls to d_close().  I think 1
> call always gets as far as devfs_close(), but this is only guaranteed
> to reach d_close() with D_TRACKCLOSE.)  This is not a bug, but all
> drivers that depend on the correspondence being 1-1 are broken.  Drivers
> that enforce "exclusive" access like bpf have N=1, so they probably
> work modulo the vfs bugs, and they have a good chance of not being
> affected by the vfs bug that gives an extra devfs_close() and thus an
> extra d_close().
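To make that concrete, here is a minimal sketch of the exclusive-access
pattern for a tracked-close driver.  The foo names, the single global open
flag and the error policy are invented for illustration (this is not code
from the tree); the point is only that with D_TRACKCLOSE the driver has to
write d_close() so that an extra call -- e.g. the possible second
devfs_close() after a revoke() -- is harmless, and must not assume that
opens and closes balance 1-1:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/conf.h>
#include <sys/errno.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx foo_mtx;
MTX_SYSINIT(foo_mtx, &foo_mtx, "foo open flag", MTX_DEF);
static int foo_opened;                  /* driver's own open state; N=1 as for bpf */

static int
foo_open(struct cdev *dev, int oflags, int devtype, struct thread *td)
{
        int error = 0;

        mtx_lock(&foo_mtx);
        if (foo_opened)
                error = EBUSY;          /* enforce exclusive access */
        else
                foo_opened = 1;
        mtx_unlock(&foo_mtx);
        return (error);
}

static int
foo_close(struct cdev *dev, int fflag, int devtype, struct thread *td)
{

        /*
         * Clear the flag unconditionally: an extra d_close() (e.g. the
         * possible second devfs_close() after revoke()) then does no harm,
         * and nothing here assumes that opens and closes balance 1-1.
         */
        mtx_lock(&foo_mtx);
        foo_opened = 0;
        mtx_unlock(&foo_mtx);
        return (0);
}

static struct cdevsw foo_cdevsw = {
        .d_version =    D_VERSION,
        .d_flags =      D_TRACKCLOSE,
        .d_open =       foo_open,
        .d_close =      foo_close,
        .d_name =       "foo",          /* handed to make_dev() at attach time */
};

If the last d_close() goes missing instead (the failure described in the
quoted mail), foo_opened stays set and the device looks busy forever; no
amount of clamping inside the driver can recover that, which is why the
counting really belongs in vfs.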
I checked that bpf panics (even under UP) due to the obvious bugs in its
d_close():

# Generate lots of network activity using something like:
sysctl net.inet.icmp.icmplim=0; ping -fq localhost &

# Race to panic eventually:
while :; do tcpdump -i lo0 & sleep 0.001; revoke /dev/bpf0; done

Most or all device drivers have obvious bugs in their d_close(); bpf is
just a bit easier to understand and more likely to cause a panic than most
device drivers, since it is simple and frees resources.  A panic is very
likely when si_drv1 is freed, and si_drv1 is only locked accidentally.

% static int
% bpfclose(struct cdev *dev, int flags, int fmt, struct thread *td)
% {
%         struct bpf_d *d = dev->si_drv1;
%
%         BPFD_LOCK(d);
%         if (d->bd_state == BPF_WAITING)
%                 callout_stop(&d->bd_callout);
%         d->bd_state = BPF_IDLE;
%         BPFD_UNLOCK(d);
%         funsetown(&d->bd_sigio);
%         mtx_lock(&bpf_mtx);
%         if (d->bd_bif)
%                 bpf_detachd(d);
%         mtx_unlock(&bpf_mtx);
%         selwakeuppri(&d->bd_sel, PRINET);
% #ifdef MAC
%         mac_bpfdesc_destroy(d);
% #endif /* MAC */
%         knlist_destroy(&d->bd_sel.si_note);
%         bpf_freed(d);
%         dev->si_drv1 = NULL;
%         free(d, M_BPF);

Lots of possibly-in-use data structures are destroyed here.

%
%         return (0);
% }
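For comparison, here is a rough sketch of one driver-local mitigation:
reference-count the per-open data so that d_close() cannot free it while a
reader is still sleeping on it, and have sleepers re-check a dying flag
when they wake up.  The foo names, the softc layout and the ENXIO policy
are invented for illustration and this is not bpf code; note that it does
nothing for the unsynchronized si_drv1 fetch at the top of d_read(), so at
best it shrinks the windows rather than closing them:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/conf.h>
#include <sys/errno.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/malloc.h>
#include <sys/refcount.h>

static MALLOC_DEFINE(M_FOO, "foosc", "foo per-open state");

struct foo_softc {
        struct mtx      sc_mtx;
        u_int           sc_refs;        /* one for the open file, one per sleeper */
        int             sc_dying;       /* set by d_close(), checked after wakeups */
};

static void
foo_release(struct foo_softc *sc)
{

        /* Free only when the open file and all sleepers are gone. */
        if (refcount_release(&sc->sc_refs)) {
                mtx_destroy(&sc->sc_mtx);
                free(sc, M_FOO);
        }
}

static int
foo_open(struct cdev *dev, int oflags, int devtype, struct thread *td)
{
        struct foo_softc *sc;

        /* Assume exclusive access (one open at a time), as bpf enforces. */
        sc = malloc(sizeof(*sc), M_FOO, M_WAITOK | M_ZERO);
        mtx_init(&sc->sc_mtx, "foo softc", NULL, MTX_DEF);
        refcount_init(&sc->sc_refs, 1);         /* the open file's reference */
        dev->si_drv1 = sc;
        return (0);
}

static int
foo_close(struct cdev *dev, int fflag, int devtype, struct thread *td)
{
        struct foo_softc *sc = dev->si_drv1;

        mtx_lock(&sc->sc_mtx);
        sc->sc_dying = 1;               /* sleepers see this when they wake up */
        wakeup(sc);
        mtx_unlock(&sc->sc_mtx);
        dev->si_drv1 = NULL;            /* does NOT close the fetch race in d_read() */
        foo_release(sc);                /* drop the open file's reference */
        return (0);
}

static int
foo_read(struct cdev *dev, struct uio *uio, int ioflag)
{
        struct foo_softc *sc = dev->si_drv1;    /* still unsynchronized, as in bpf */
        int error;

        if (sc == NULL)
                return (ENXIO);
        refcount_acquire(&sc->sc_refs);         /* keep sc alive across the sleep */
        mtx_lock(&sc->sc_mtx);
        /* Dummy wait standing in for "sleep until a buffer fills". */
        error = msleep(sc, &sc->sc_mtx, PCATCH, "foord", hz);
        if (sc->sc_dying)
                error = ENXIO;                  /* closed or revoked while we slept */
        mtx_unlock(&sc->sc_mtx);
        foo_release(sc);                        /* may be the last reference now */
        return (error);
}

bpf's read path shows where the current code actually loses the race: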
% static int
% bpfread(struct cdev *dev, struct uio *uio, int ioflag)
% {
%         struct bpf_d *d = dev->si_drv1;

There is no locking for si_drv1 here and elsewhere.  There used to be
locking -- long ago, there was implicit locking for non-SMP non-preemptive
kernels.  Then there was Giant locking.  d can be either NULL or garbage
here.  (It is null if revoke() just cleared si_drv1 and there is an
implicit or accidental memory barrier that makes the change visible here.
It is garbage if revoke() has started destroying it but hasn't yet set
si_drv1 to NULL when we read si_drv1, or if revoke() starts destroying it
after we read si_drv1.)

%         int timed_out;
%         int error;
%
%         /*
%          * Restrict application to use a buffer the same size as
%          * as kernel buffers.
%          */
%         if (uio->uio_resid != d->bd_bufsize)
%                 return (EINVAL);

If we're lucky after losing the race, then d is null and we get a null
pointer panic here.  I didn't observe this.

%
%         BPFD_LOCK(d);

If we're unlucky after losing the race, then we start corrupting *d here.

%                 error = msleep(d, &d->bd_mtx, PRINET|PCATCH,
%                     "bpf", d->bd_rtout);

Since we block, we implemented panic(9) even with implicit or explicit
Giant locking.  The sleep gives a very large race window...

%                 if (error == EINTR || error == ERESTART) {
%                         BPFD_UNLOCK(d);
%                         return (error);
%                 }

...however, we now have locking for d, and bpfclose() blocks on this, so I
think we don't implement panic(9) here.  We have a broken revoke() instead
-- it should do a forced close without waiting, but instead it waits for
the unlock in the above or later.

%                 if (error == EWOULDBLOCK) {
%                         /*
%                          * On a timeout, return what's in the buffer,
%                          * which may be nothing.  If there is something
%                          * in the store buffer, we can rotate the buffers.
%                          */
%                         if (d->bd_hbuf)
%                                 /*
%                                  * We filled up the buffer in between
%                                  * getting the timeout and arriving
%                                  * here, so we don't need to rotate.
%                                  */
%                                 break;
%
%                         if (d->bd_slen == 0) {
%                                 BPFD_UNLOCK(d);
%                                 return (0);
%                         }
%                         ROTATE_BUFFERS(d);
%                         break;
%                 }
%         }
%         /*
%          * At this point, we know we have something in the hold slot.
%          */
%         BPFD_UNLOCK(d);

If bpfclose() was blocked on d, then our d is especially likely to become
corrupt just after here.

%
%         /*
%          * Move data from hold buffer into user space.
%          * We know the entire buffer is transferred since
%          * we checked above that the read buffer is bpf_bufsize bytes.
%          *
%          * XXXRW: More synchronization needed here: what if a second thread
%          * issues a read on the same fd at the same time?  Don't want this
%          * getting invalidated.
%          */
%         error = bpf_uiomove(d, d->bd_hbuf, d->bd_hlen, uio);
%
%         BPFD_LOCK(d);
%         d->bd_fbuf = d->bd_hbuf;
%         d->bd_hbuf = NULL;
%         d->bd_hlen = 0;
%         bpf_buf_reclaimed(d);
%         BPFD_UNLOCK(d);
%
%         return (error);
% }

Bruce