Date: Wed, 13 Aug 2008 00:07:43 +1000 (EST)
From: Bruce Evans
To: Bruce Evans
Cc: cvs-src@FreeBSD.org, src-committers@FreeBSD.org, Ed Schouten,
    cvs-all@FreeBSD.org, John Baldwin
Subject: Re: cvs commit: src/sys/dev/io iodev.c
Message-ID: <20080812231130.D760@besplex.bde.org>
In-Reply-To: <20080812014937.E21092@besplex.bde.org>
References: <200808081343.m78DhwYE068477@repoman.freebsd.org>
 <200808081226.32089.jhb@freebsd.org> <20080809130929.P77335@delplex.bde.org>
 <200808091555.25020.jhb@freebsd.org> <20080812014937.E21092@besplex.bde.org>

On Tue, 12 Aug 2008, Bruce Evans wrote:

> On Sat, 9 Aug 2008, John Baldwin wrote:
>> You failed to note that not using D_TRACKCLOSE doesn't actually reliably
>> close devices.
>
> No, I noted that vfs doesn't count last closes right.  Unreliability of
> last-close is a special case of that.  But using D_TRACKCLOSE mostly makes
> things worse.  To avoid the vfs bugs in tracked closes, it is necessary
> for individual drivers to understand vfs's bugs better than vfs does.
>
>> I set D_TRACKCLOSE on bpf because under load at work (lots of
>> concurrent nmap's) bpf devices would sometimes never get fully closed, so
>> you'd have an unopened bpf device that was no longer available.
>
> It is unclear how that helps.  vfs's bugs must be really large for the
> last-close to go completely missing.  kib fixed some of them a few months
> ago (before you changed bpf, I think).  bpf is especially simple since it
> is exclusive-access, so the vfs bugs should be smaller for it.
>
>> D_TRACKCLOSE gives much saner semantics for devices: each open() of an fd
>> calls the 'd_open' method, and the last close of that fd (which could be
>> in another process) calls 'd_close'.
>
> This is insane semantics for most devices, and D_TRACKCLOSE doesn't give
> it.  Most devices don't care about the difference between file descriptors
> and files -- they want d_close to be called only when the last file
> descriptor on the device is closed.  !D_TRACKCLOSE gives this, modulo the
> vfs bugs.
>
> D_TRACKCLOSE doesn't give these semantics: revoke() closes everything
> in 1 step and then calls d_close(), giving an N-1 correspondence between
> d_open()s and d_close()s (modulo the vfs bugs giving a fuzzier
> correspondence).  (IIRC, at least before kib's fixes, it was possible
> for revoke() to cause any of 0, 1 or 2 calls to d_close().  I think 1
> call always gets as far as devfs_close(), but this is only guaranteed
> to reach d_close() with D_TRACKCLOSE.)  This is not a bug, but all
> drivers that depend on the correspondence being 1-1 are broken.  Drivers
> that enforce "exclusive" access like bpf have N=1, so they probably
> work modulo the vfs bugs, and they have a good chance of not being
> affected by the vfs bug that gives an extra devfs_close() and thus an
> extra d_close().
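To make that concrete, here is a minimal sketch of the exclusive-access
pattern for a tracked-close driver.  The foo names, the single global open
flag and the error policy are invented for illustration (this is not code
from the tree); the point is only that with D_TRACKCLOSE the driver has to
write d_close() so that an extra call -- e.g. the possible second
devfs_close() after a revoke() -- is harmless, and must not assume that
opens and closes balance 1-1:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/conf.h>
#include <sys/errno.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx foo_mtx;
MTX_SYSINIT(foo_mtx, &foo_mtx, "foo open flag", MTX_DEF);
static int foo_opened;                  /* driver's own open state; N=1 as for bpf */

static int
foo_open(struct cdev *dev, int oflags, int devtype, struct thread *td)
{
        int error = 0;

        mtx_lock(&foo_mtx);
        if (foo_opened)
                error = EBUSY;          /* enforce exclusive access */
        else
                foo_opened = 1;
        mtx_unlock(&foo_mtx);
        return (error);
}

static int
foo_close(struct cdev *dev, int fflag, int devtype, struct thread *td)
{

        /*
         * Clear the flag unconditionally: an extra d_close() (e.g. the
         * possible second devfs_close() after revoke()) then does no harm,
         * and nothing here assumes that opens and closes balance 1-1.
         */
        mtx_lock(&foo_mtx);
        foo_opened = 0;
        mtx_unlock(&foo_mtx);
        return (0);
}

static struct cdevsw foo_cdevsw = {
        .d_version =    D_VERSION,
        .d_flags =      D_TRACKCLOSE,
        .d_open =       foo_open,
        .d_close =      foo_close,
        .d_name =       "foo",          /* handed to make_dev() at attach time */
};

If the last d_close() goes missing instead (the failure described in the
quoted mail), foo_opened stays set and the device looks busy forever; no
amount of clamping inside the driver can recover that, which is why the
counting really belongs in vfs.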
I checked that bpf panics (even under UP) due to the obvious bugs in its
d_close():

# Generate lots of network activity using something like:
sysctl net.inet.icmp.icmplim=0; ping -fq localhost &

# Race to panic eventually:
while :; do tcpdump -i lo0 & sleep 0.001; revoke /dev/bpf0; done

Most or all device drivers have obvious bugs in their d_close(); bpf is
just a bit easier to understand and more likely to cause a panic than most
device drivers, since it is simple and frees resources.  A panic is very
likely when si_drv1 is freed, and si_drv1 is only locked accidentally.

% static int
% bpfclose(struct cdev *dev, int flags, int fmt, struct thread *td)
% {
%         struct bpf_d *d = dev->si_drv1;
%
%         BPFD_LOCK(d);
%         if (d->bd_state == BPF_WAITING)
%                 callout_stop(&d->bd_callout);
%         d->bd_state = BPF_IDLE;
%         BPFD_UNLOCK(d);
%         funsetown(&d->bd_sigio);
%         mtx_lock(&bpf_mtx);
%         if (d->bd_bif)
%                 bpf_detachd(d);
%         mtx_unlock(&bpf_mtx);
%         selwakeuppri(&d->bd_sel, PRINET);
% #ifdef MAC
%         mac_bpfdesc_destroy(d);
% #endif /* MAC */
%         knlist_destroy(&d->bd_sel.si_note);
%         bpf_freed(d);
%         dev->si_drv1 = NULL;
%         free(d, M_BPF);

Lots of possibly-in-use data structures are destroyed here.

%
%         return (0);
% }
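For comparison, here is a rough sketch of one driver-local mitigation:
reference-count the per-open data so that d_close() cannot free it while a
reader is still sleeping on it, and have sleepers re-check a dying flag
when they wake up.  The foo names, the softc layout and the ENXIO policy
are invented for illustration and this is not bpf code; note that it does
nothing for the unsynchronized si_drv1 fetch at the top of d_read(), so at
best it shrinks the windows rather than closing them:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/conf.h>
#include <sys/errno.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/malloc.h>
#include <sys/refcount.h>

static MALLOC_DEFINE(M_FOO, "foosc", "foo per-open state");

struct foo_softc {
        struct mtx      sc_mtx;
        u_int           sc_refs;        /* one for the open file, one per sleeper */
        int             sc_dying;       /* set by d_close(), checked after wakeups */
};

static void
foo_release(struct foo_softc *sc)
{

        /* Free only when the open file and all sleepers are gone. */
        if (refcount_release(&sc->sc_refs)) {
                mtx_destroy(&sc->sc_mtx);
                free(sc, M_FOO);
        }
}

static int
foo_open(struct cdev *dev, int oflags, int devtype, struct thread *td)
{
        struct foo_softc *sc;

        /* Assume exclusive access (one open at a time), as bpf enforces. */
        sc = malloc(sizeof(*sc), M_FOO, M_WAITOK | M_ZERO);
        mtx_init(&sc->sc_mtx, "foo softc", NULL, MTX_DEF);
        refcount_init(&sc->sc_refs, 1);         /* the open file's reference */
        dev->si_drv1 = sc;
        return (0);
}

static int
foo_close(struct cdev *dev, int fflag, int devtype, struct thread *td)
{
        struct foo_softc *sc = dev->si_drv1;

        mtx_lock(&sc->sc_mtx);
        sc->sc_dying = 1;               /* sleepers see this when they wake up */
        wakeup(sc);
        mtx_unlock(&sc->sc_mtx);
        dev->si_drv1 = NULL;            /* does NOT close the fetch race in d_read() */
        foo_release(sc);                /* drop the open file's reference */
        return (0);
}

static int
foo_read(struct cdev *dev, struct uio *uio, int ioflag)
{
        struct foo_softc *sc = dev->si_drv1;    /* still unsynchronized, as in bpf */
        int error;

        if (sc == NULL)
                return (ENXIO);
        refcount_acquire(&sc->sc_refs);         /* keep sc alive across the sleep */
        mtx_lock(&sc->sc_mtx);
        /* Dummy wait standing in for "sleep until a buffer fills". */
        error = msleep(sc, &sc->sc_mtx, PCATCH, "foord", hz);
        if (sc->sc_dying)
                error = ENXIO;                  /* closed or revoked while we slept */
        mtx_unlock(&sc->sc_mtx);
        foo_release(sc);                        /* may be the last reference now */
        return (error);
}

bpf's read path shows where the current code actually loses the race: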
% static int
% bpfread(struct cdev *dev, struct uio *uio, int ioflag)
% {
%         struct bpf_d *d = dev->si_drv1;

There is no locking for si_drv1 here and elsewhere.  There used to be
locking -- long ago, there was implicit locking for non-SMP non-preemptive
kernels.  Then there was Giant locking.  d can be either NULL or garbage
here.  (It is null if revoke() just cleared si_drv1 and there is an
implicit or accidental memory barrier that makes the change visible here.
It is garbage if revoke() has started destroying it but hasn't yet set
si_drv1 to NULL when we read si_drv1, or if revoke() starts destroying it
after we read si_drv1.)

%         int timed_out;
%         int error;
%
%         /*
%          * Restrict application to use a buffer the same size as
%          * as kernel buffers.
%          */
%         if (uio->uio_resid != d->bd_bufsize)
%                 return (EINVAL);

If we're lucky after losing the race, then d is null and we get a null
pointer panic here.  I didn't observe this.

%
%         BPFD_LOCK(d);

If we're unlucky after losing the race, then we start corrupting *d here.

%                 error = msleep(d, &d->bd_mtx, PRINET|PCATCH,
%                     "bpf", d->bd_rtout);

Since we block, we implemented panic(9) even with implicit or explicit
Giant locking.  The sleep gives a very large race window...

%                 if (error == EINTR || error == ERESTART) {
%                         BPFD_UNLOCK(d);
%                         return (error);
%                 }

...however, we now have locking for d, and bpfclose() blocks on this, so I
think we don't implement panic(9) here.  We have a broken revoke() instead
-- it should do a forced close without waiting, but instead it waits for
the unlock in the above or later.

%                 if (error == EWOULDBLOCK) {
%                         /*
%                          * On a timeout, return what's in the buffer,
%                          * which may be nothing.  If there is something
%                          * in the store buffer, we can rotate the buffers.
%                          */
%                         if (d->bd_hbuf)
%                                 /*
%                                  * We filled up the buffer in between
%                                  * getting the timeout and arriving
%                                  * here, so we don't need to rotate.
%                                  */
%                                 break;
%
%                         if (d->bd_slen == 0) {
%                                 BPFD_UNLOCK(d);
%                                 return (0);
%                         }
%                         ROTATE_BUFFERS(d);
%                         break;
%                 }
%         }
%         /*
%          * At this point, we know we have something in the hold slot.
%          */
%         BPFD_UNLOCK(d);

If bpfclose() was blocked on d, then our d is especially likely to become
corrupt just after here.

%
%         /*
%          * Move data from hold buffer into user space.
%          * We know the entire buffer is transferred since
%          * we checked above that the read buffer is bpf_bufsize bytes.
%          *
%          * XXXRW: More synchronization needed here: what if a second thread
%          * issues a read on the same fd at the same time?  Don't want this
%          * getting invalidated.
%          */
%         error = bpf_uiomove(d, d->bd_hbuf, d->bd_hlen, uio);
%
%         BPFD_LOCK(d);
%         d->bd_fbuf = d->bd_hbuf;
%         d->bd_hbuf = NULL;
%         d->bd_hlen = 0;
%         bpf_buf_reclaimed(d);
%         BPFD_UNLOCK(d);
%
%         return (error);
% }

Bruce