Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 2 Mar 2013 15:11:21 +0000 (UTC)
From:      "Alexander V. Chernikov" <melifaro@FreeBSD.org>
To:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-9@freebsd.org
Subject:   svn commit: r247629 - in stable/9/sys: kern net security/mac
Message-ID:  <201303021511.r22FBLod031840@svn.freebsd.org>

next in thread | raw e-mail | index | archive | help
Author: melifaro
Date: Sat Mar  2 15:11:20 2013
New Revision: 247629
URL: http://svnweb.freebsd.org/changeset/base/247629

Log:
  Merge
  * r233937 - Improve BPF locking model
  * r233938 - Improve performace for writer-only BPF users
  * r233946 - Fix build
  * r235744 - Fix (new) panic on attaching to non-existent interface
  * r235745 - Fix old panic when BPF consumer attaches to destroying interface
  * r235746 - Call bpf_jitter() before acquiring BPF global lock
  * r235747 - Make most BPF ioctls() SMP-safe.
  * r236231 - Fix BPF_JITTER code broken by r235746.
  * r236251 - Fix shim for BIOCSETF to drop all packets buffered on the descriptor.
  * r236261 - Save the previous filter right before we set new one.
  * r236262 - Fix style(9) nits, reduce unnecessary type castings.
  * r236559 - Fix panic introduced by r235745
  * r236806 - Fix typo introduced in r236559.
  
  r233937
    - Improve BPF locking model.
  
    Interface locks and descriptor locks are converted from mutex(9) to rwlock(9).
    This greately improves performance: in most common case we need to acquire 1
    reader lock instead of 2 mutexes.
  
    - Remove filter(descriptor) (reader) lock in bpf_mtap[2]
    This was suggested by glebius@. We protect filter by requesting interface
    writer lock on filter change.
  
    - Cover struct bpf_if under BPF_INTERNAL define. This permits including bpf.h
    without including rwlock stuff. However, this is is temporary solution,
    struct bpf_if should be made opaque for any external caller.
  
  r233938
    - Improve performace for writer-only BPF users.
  
    Linux and Solaris (at least OpenSolaris) has PF_PACKET socket families to send
    raw ethernet frames. The only FreeBSD interface that can be used to send raw
    frames is BPF. As a result, many programs like cdpd, lldpd, various dhcp stuff
    uses BPF only to send data. This leads us to the situation when software like
    cdpd, being run on high-traffic-volume interface significantly reduces overall
    performance since we have to acquire additional locks for every packet.
  
    Here we add sysctl that changes BPF behavior in the following way:
    If program came and opens BPF socket without explicitly specifyin read filter
    we assume it to be write-only and add it to special writer-only per-interface
    list. This makes bpf_peers_present() return 0, so no additional overhead is
    introduced. After filter is supplied, descriptor is added to original
    per-interface list permitting packets to be captured.
  
    Unfortunately, pcap_open_live() sets catch-all filter itself for the purpose
    of setting snap length.
  
    Fortunately, most programs explicitly sets (event catch-all) filter after
    that. tcpdump(1) is a good example.
  
    So a bit hackis approach is taken: we upgrade description only after second
    BIOCSETF is received.
  
    Sysctl is named net.bpf.optimize_writers and is turned off by default.
  
    - While here, document all sysctl variables in bpf.4
  
  r233946
    Fix build broken by r233938.
  
  r235744
    Fix panic on attaching to non-existent interface
  	(introduced by r233937, pointed by hrs@)
    Fix panic on tcpdump being attached to interface being removed
  	(introduced by r233937, pointed by hrs@ and adrian@)
    Protect most of bpf_setf() by BPF global lock
  
    Add several forgotten assertions (thanks to adrian@)
  
    Document current locking model inside bpf.c
    Document EVENTHANDLER(9) usage inside BPF.
  
  r235745
    Fix old panic when BPF consumer attaches to destroying interface.
    'flags' field is added to the end of bpf_if structure. Currently the only
    flag is BPFIF_FLAG_DYING which is set on bpf detach and checked by bpf_attachd()
    Problem can be easily triggered on SMP stable/[89] by the following command
    (sort of):
    'while true; do ifconfig vlan222 create vlan 222 vlandev em0 up ; \
      tcpdump -pi vlan222 & ; ifconfig vlan222 destroy ; done'
  
    Fix possible use-after-free when BPF detaches itself from interface, freeing
    bpf_bif memory, while interface is still UP and there can be routes via this
    interface. Freeing is now delayed till ifnet_departure_event is received via
    eventhandler(9) api.
  
    Convert bpfd rwlock back to mutex due lack of performance gain
    (currently checking if packet matches filter is done without holding bpfd
     lock and we have to acquire write lock if packet matches)
  
  r235746
    Call bpf_jitter() before acquiring BPF global lock due to malloc() being
    used inside bpf_jitter.
  
    Eliminate bpf_buffer_alloc() and allocate BPF buffers on descriptor creation
     and BIOCSBLEN ioctl. This permits us not to allocate buffers inside
     bpf_attachd() which is protected by global lock.
  
  r235747
    Make most BPF ioctls() SMP-safe.
  
  r236559
    Fix panic introduced by r235745. Panic occurs after first packet traverse
    renamed interface.
    Add several comments on locking
  
  r236231
    Fix BPF_JITTER code broken by r235746.
  
  r236251
    Fix 32-bit shim for BIOCSETF to drop all packets buffered on the descriptor
    and reset statistics as it should.
  
  r236261
    - Save the previous filter right before we set new one.
    - Reduce duplicate code and make it little easier to read.
  
  r236262
    Fix style(9) nits, reduce unnecessary type castings, etc., for bpf_setf().
  
  r236806
    Fix typo introduced in r236559.

Modified:
  stable/9/sys/kern/subr_witness.c
  stable/9/sys/net/bpf.c
  stable/9/sys/net/bpf.h
  stable/9/sys/net/bpf_buffer.c
  stable/9/sys/net/bpf_buffer.h
  stable/9/sys/net/bpfdesc.h
  stable/9/sys/security/mac/mac_net.c
Directory Properties:
  stable/9/sys/   (props changed)

Modified: stable/9/sys/kern/subr_witness.c
==============================================================================
--- stable/9/sys/kern/subr_witness.c	Sat Mar  2 14:54:33 2013	(r247628)
+++ stable/9/sys/kern/subr_witness.c	Sat Mar  2 15:11:20 2013	(r247629)
@@ -562,7 +562,7 @@ static struct witness_order_list_entry o
 	 * BPF
 	 */
 	{ "bpf global lock", &lock_class_mtx_sleep },
-	{ "bpf interface lock", &lock_class_mtx_sleep },
+	{ "bpf interface lock", &lock_class_rw },
 	{ "bpf cdev lock", &lock_class_mtx_sleep },
 	{ NULL, NULL },
 	/*

Modified: stable/9/sys/net/bpf.c
==============================================================================
--- stable/9/sys/net/bpf.c	Sat Mar  2 14:54:33 2013	(r247628)
+++ stable/9/sys/net/bpf.c	Sat Mar  2 15:11:20 2013	(r247629)
@@ -43,6 +43,8 @@ __FBSDID("$FreeBSD$");
 
 #include <sys/types.h>
 #include <sys/param.h>
+#include <sys/lock.h>
+#include <sys/rwlock.h>
 #include <sys/systm.h>
 #include <sys/conf.h>
 #include <sys/fcntl.h>
@@ -66,6 +68,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/socket.h>
 
 #include <net/if.h>
+#define	BPF_INTERNAL
 #include <net/bpf.h>
 #include <net/bpf_buffer.h>
 #ifdef BPF_JITTER
@@ -144,6 +147,7 @@ static int		bpf_bpfd_cnt;
 
 static void	bpf_attachd(struct bpf_d *, struct bpf_if *);
 static void	bpf_detachd(struct bpf_d *);
+static void	bpf_detachd_locked(struct bpf_d *);
 static void	bpf_freed(struct bpf_d *);
 static int	bpf_movein(struct uio *, int, struct ifnet *, struct mbuf **,
 		    struct sockaddr *, int *, struct bpf_insn *);
@@ -155,7 +159,7 @@ static void	catchpacket(struct bpf_d *, 
 		    void (*)(struct bpf_d *, caddr_t, u_int, void *, u_int),
 		    struct bintime *);
 static void	reset_d(struct bpf_d *);
-static int	 bpf_setf(struct bpf_d *, struct bpf_program *, u_long cmd);
+static int	bpf_setf(struct bpf_d *, struct bpf_program *, u_long cmd);
 static int	bpf_getdltlist(struct bpf_d *, struct bpf_dltlist *);
 static int	bpf_setdlt(struct bpf_d *, u_int);
 static void	filt_bpfdetach(struct knote *);
@@ -173,6 +177,12 @@ SYSCTL_INT(_net_bpf, OID_AUTO, zerocopy_
 SYSCTL_NODE(_net_bpf, OID_AUTO, stats, CTLFLAG_MPSAFE | CTLFLAG_RW,
     bpf_stats_sysctl, "bpf statistics portal");
 
+static VNET_DEFINE(int, bpf_optimize_writers) = 0;
+#define	V_bpf_optimize_writers VNET(bpf_optimize_writers)
+SYSCTL_VNET_INT(_net_bpf, OID_AUTO, optimize_writers,
+    CTLFLAG_RW, &VNET_NAME(bpf_optimize_writers), 0,
+    "Do not send packets until BPF program is set");
+
 static	d_open_t	bpfopen;
 static	d_read_t	bpfread;
 static	d_write_t	bpfwrite;
@@ -197,6 +207,37 @@ static struct filterops bpfread_filtops 
 	.f_event = filt_bpfread,
 };
 
+eventhandler_tag	bpf_ifdetach_cookie = NULL;
+
+/*
+ * LOCKING MODEL USED BY BPF:
+ * Locks:
+ * 1) global lock (BPF_LOCK). Mutex, used to protect interface addition/removal,
+ * some global counters and every bpf_if reference.
+ * 2) Interface lock. Rwlock, used to protect list of BPF descriptors and their filters.
+ * 3) Descriptor lock. Mutex, used to protect BPF buffers and various structure fields
+ *   used by bpf_mtap code.
+ *
+ * Lock order:
+ *
+ * Global lock, interface lock, descriptor lock
+ *
+ * We have to acquire interface lock before descriptor main lock due to BPF_MTAP[2]
+ * working model. In many places (like bpf_detachd) we start with BPF descriptor
+ * (and we need to at least rlock it to get reliable interface pointer). This
+ * gives us potential LOR. As a result, we use global lock to protect from bpf_if
+ * change in every such place.
+ *
+ * Changing d->bd_bif is protected by 1) global lock, 2) interface lock and
+ * 3) descriptor main wlock.
+ * Reading bd_bif can be protected by any of these locks, typically global lock.
+ *
+ * Changing read/write BPF filter is protected by the same three locks,
+ * the same applies for reading.
+ *
+ * Sleeping in global lock is not allowed due to bpfdetach() using it.
+ */
+
 /*
  * Wrapper functions for various buffering methods.  If the set of buffer
  * modes expands, we will probably want to introduce a switch data structure
@@ -290,7 +331,6 @@ bpf_canfreebuf(struct bpf_d *d)
 static int
 bpf_canwritebuf(struct bpf_d *d)
 {
-
 	BPFD_LOCK_ASSERT(d);
 
 	switch (d->bd_bufmode) {
@@ -569,17 +609,92 @@ bad:
 static void
 bpf_attachd(struct bpf_d *d, struct bpf_if *bp)
 {
+	int op_w;
+
+	BPF_LOCK_ASSERT();
+
+	/*
+	 * Save sysctl value to protect from sysctl change
+	 * between reads
+	 */
+	op_w = V_bpf_optimize_writers;
+
+	if (d->bd_bif != NULL)
+		bpf_detachd_locked(d);
 	/*
-	 * Point d at bp, and add d to the interface's list of listeners.
-	 * Finally, point the driver's bpf cookie at the interface so
-	 * it will divert packets to bpf.
+	 * Point d at bp, and add d to the interface's list.
+	 * Since there are many applicaiotns using BPF for
+	 * sending raw packets only (dhcpd, cdpd are good examples)
+	 * we can delay adding d to the list of active listeners until
+	 * some filter is configured.
 	 */
-	BPFIF_LOCK(bp);
+
+	BPFIF_WLOCK(bp);
+	BPFD_LOCK(d);
+
 	d->bd_bif = bp;
-	LIST_INSERT_HEAD(&bp->bif_dlist, d, bd_next);
+
+	if (op_w != 0) {
+		/* Add to writers-only list */
+		LIST_INSERT_HEAD(&bp->bif_wlist, d, bd_next);
+		/*
+		 * We decrement bd_writer on every filter set operation.
+		 * First BIOCSETF is done by pcap_open_live() to set up
+		 * snap length. After that appliation usually sets its own filter
+		 */
+		d->bd_writer = 2;
+	} else
+		LIST_INSERT_HEAD(&bp->bif_dlist, d, bd_next);
+
+	BPFD_UNLOCK(d);
+	BPFIF_WUNLOCK(bp);
 
 	bpf_bpfd_cnt++;
-	BPFIF_UNLOCK(bp);
+
+	CTR3(KTR_NET, "%s: bpf_attach called by pid %d, adding to %s list",
+	    __func__, d->bd_pid, d->bd_writer ? "writer" : "active");
+
+	if (op_w == 0)
+		EVENTHANDLER_INVOKE(bpf_track, bp->bif_ifp, bp->bif_dlt, 1);
+}
+
+/*
+ * Add d to the list of active bp filters.
+ * Reuqires bpf_attachd() to be called before
+ */
+static void
+bpf_upgraded(struct bpf_d *d)
+{
+	struct bpf_if *bp;
+
+	BPF_LOCK_ASSERT();
+
+	bp = d->bd_bif;
+
+	/*
+	 * Filter can be set several times without specifying interface.
+	 * Mark d as reader and exit.
+	 */
+	if (bp == NULL) {
+		BPFD_LOCK(d);
+		d->bd_writer = 0;
+		BPFD_UNLOCK(d);
+		return;
+	}
+
+	BPFIF_WLOCK(bp);
+	BPFD_LOCK(d);
+
+	/* Remove from writers-only list */
+	LIST_REMOVE(d, bd_next);
+	LIST_INSERT_HEAD(&bp->bif_dlist, d, bd_next);
+	/* Mark d as reader */
+	d->bd_writer = 0;
+
+	BPFD_UNLOCK(d);
+	BPFIF_WUNLOCK(bp);
+
+	CTR2(KTR_NET, "%s: upgrade required by pid %d", __func__, d->bd_pid);
 
 	EVENTHANDLER_INVOKE(bpf_track, bp->bif_ifp, bp->bif_dlt, 1);
 }
@@ -590,26 +705,47 @@ bpf_attachd(struct bpf_d *d, struct bpf_
 static void
 bpf_detachd(struct bpf_d *d)
 {
+	BPF_LOCK();
+	bpf_detachd_locked(d);
+	BPF_UNLOCK();
+}
+
+static void
+bpf_detachd_locked(struct bpf_d *d)
+{
 	int error;
 	struct bpf_if *bp;
 	struct ifnet *ifp;
 
-	bp = d->bd_bif;
-	BPFIF_LOCK(bp);
+	CTR2(KTR_NET, "%s: detach required by pid %d", __func__, d->bd_pid);
+
+	BPF_LOCK_ASSERT();
+
+	/* Check if descriptor is attached */
+	if ((bp = d->bd_bif) == NULL)
+		return;
+
+	BPFIF_WLOCK(bp);
 	BPFD_LOCK(d);
-	ifp = d->bd_bif->bif_ifp;
+
+	/* Save bd_writer value */
+	error = d->bd_writer;
 
 	/*
 	 * Remove d from the interface's descriptor list.
 	 */
 	LIST_REMOVE(d, bd_next);
 
-	bpf_bpfd_cnt--;
+	ifp = bp->bif_ifp;
 	d->bd_bif = NULL;
 	BPFD_UNLOCK(d);
-	BPFIF_UNLOCK(bp);
+	BPFIF_WUNLOCK(bp);
+
+	bpf_bpfd_cnt--;
 
-	EVENTHANDLER_INVOKE(bpf_track, ifp, bp->bif_dlt, 0);
+	/* Call event handler iff d is attached */
+	if (error == 0)
+		EVENTHANDLER_INVOKE(bpf_track, ifp, bp->bif_dlt, 0);
 
 	/*
 	 * Check if this descriptor had requested promiscuous mode.
@@ -648,10 +784,7 @@ bpf_dtor(void *data)
 	d->bd_state = BPF_IDLE;
 	BPFD_UNLOCK(d);
 	funsetown(&d->bd_sigio);
-	mtx_lock(&bpf_mtx);
-	if (d->bd_bif)
-		bpf_detachd(d);
-	mtx_unlock(&bpf_mtx);
+	bpf_detachd(d);
 #ifdef MAC
 	mac_bpfdesc_destroy(d);
 #endif /* MAC */
@@ -671,7 +804,7 @@ static	int
 bpfopen(struct cdev *dev, int flags, int fmt, struct thread *td)
 {
 	struct bpf_d *d;
-	int error;
+	int error, size;
 
 	d = malloc(sizeof(*d), M_BPF, M_WAITOK | M_ZERO);
 	error = devfs_set_cdevpriv(d, bpf_dtor);
@@ -689,14 +822,18 @@ bpfopen(struct cdev *dev, int flags, int
 	d->bd_bufmode = BPF_BUFMODE_BUFFER;
 	d->bd_sig = SIGIO;
 	d->bd_direction = BPF_D_INOUT;
-	d->bd_pid = td->td_proc->p_pid;
+	BPF_PID_REFRESH(d, td);
 #ifdef MAC
 	mac_bpfdesc_init(d);
 	mac_bpfdesc_create(td->td_ucred, d);
 #endif
-	mtx_init(&d->bd_mtx, devtoname(dev), "bpf cdev lock", MTX_DEF);
-	callout_init_mtx(&d->bd_callout, &d->bd_mtx, 0);
-	knlist_init_mtx(&d->bd_sel.si_note, &d->bd_mtx);
+	mtx_init(&d->bd_lock, devtoname(dev), "bpf cdev lock", MTX_DEF);
+	callout_init_mtx(&d->bd_callout, &d->bd_lock, 0);
+	knlist_init_mtx(&d->bd_sel.si_note, &d->bd_lock);
+
+	/* Allocate default buffers */
+	size = d->bd_bufsize;
+	bpf_buffer_ioctl_sblen(d, &size);
 
 	return (0);
 }
@@ -726,7 +863,7 @@ bpfread(struct cdev *dev, struct uio *ui
 	non_block = ((ioflag & O_NONBLOCK) != 0);
 
 	BPFD_LOCK(d);
-	d->bd_pid = curthread->td_proc->p_pid;
+	BPF_PID_REFRESH_CUR(d);
 	if (d->bd_bufmode != BPF_BUFMODE_BUFFER) {
 		BPFD_UNLOCK(d);
 		return (EOPNOTSUPP);
@@ -772,7 +909,7 @@ bpfread(struct cdev *dev, struct uio *ui
 			BPFD_UNLOCK(d);
 			return (EWOULDBLOCK);
 		}
-		error = msleep(d, &d->bd_mtx, PRINET|PCATCH,
+		error = msleep(d, &d->bd_lock, PRINET|PCATCH,
 		     "bpf", d->bd_rtout);
 		if (error == EINTR || error == ERESTART) {
 			BPFD_UNLOCK(d);
@@ -889,8 +1026,9 @@ bpfwrite(struct cdev *dev, struct uio *u
 	if (error != 0)
 		return (error);
 
-	d->bd_pid = curthread->td_proc->p_pid;
+	BPF_PID_REFRESH_CUR(d);
 	d->bd_wcount++;
+	/* XXX: locking required */
 	if (d->bd_bif == NULL) {
 		d->bd_wdcount++;
 		return (ENXIO);
@@ -911,6 +1049,7 @@ bpfwrite(struct cdev *dev, struct uio *u
 	bzero(&dst, sizeof(dst));
 	m = NULL;
 	hlen = 0;
+	/* XXX: bpf_movein() can sleep */
 	error = bpf_movein(uio, (int)d->bd_bif->bif_dlt, ifp,
 	    &m, &dst, &hlen, d->bd_wfilter);
 	if (error) {
@@ -970,7 +1109,7 @@ static void
 reset_d(struct bpf_d *d)
 {
 
-	mtx_assert(&d->bd_mtx, MA_OWNED);
+	BPFD_LOCK_ASSERT(d);
 
 	if ((d->bd_hbuf != NULL) &&
 	    (d->bd_bufmode != BPF_BUFMODE_ZBUF || bpf_canfreebuf(d))) {
@@ -1038,7 +1177,7 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 	 * Refresh PID associated with this descriptor.
 	 */
 	BPFD_LOCK(d);
-	d->bd_pid = td->td_proc->p_pid;
+	BPF_PID_REFRESH(d, td);
 	if (d->bd_state == BPF_WAITING)
 		callout_stop(&d->bd_callout);
 	d->bd_state = BPF_IDLE;
@@ -1090,7 +1229,9 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 	case BIOCGDLTLIST32:
 	case BIOCGRTIMEOUT32:
 	case BIOCSRTIMEOUT32:
+		BPFD_LOCK(d);
 		d->bd_compat32 = 1;
+		BPFD_UNLOCK(d);
 	}
 #endif
 
@@ -1135,7 +1276,9 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 	 * Get buffer len [for read()].
 	 */
 	case BIOCGBLEN:
+		BPFD_LOCK(d);
 		*(u_int *)addr = d->bd_bufsize;
+		BPFD_UNLOCK(d);
 		break;
 
 	/*
@@ -1190,10 +1333,12 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 	 * Get current data link type.
 	 */
 	case BIOCGDLT:
+		BPF_LOCK();
 		if (d->bd_bif == NULL)
 			error = EINVAL;
 		else
 			*(u_int *)addr = d->bd_bif->bif_dlt;
+		BPF_UNLOCK();
 		break;
 
 	/*
@@ -1208,6 +1353,7 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 			list32 = (struct bpf_dltlist32 *)addr;
 			dltlist.bfl_len = list32->bfl_len;
 			dltlist.bfl_list = PTRIN(list32->bfl_list);
+			BPF_LOCK();
 			if (d->bd_bif == NULL)
 				error = EINVAL;
 			else {
@@ -1215,31 +1361,37 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 				if (error == 0)
 					list32->bfl_len = dltlist.bfl_len;
 			}
+			BPF_UNLOCK();
 			break;
 		}
 #endif
 
 	case BIOCGDLTLIST:
+		BPF_LOCK();
 		if (d->bd_bif == NULL)
 			error = EINVAL;
 		else
 			error = bpf_getdltlist(d, (struct bpf_dltlist *)addr);
+		BPF_UNLOCK();
 		break;
 
 	/*
 	 * Set data link type.
 	 */
 	case BIOCSDLT:
+		BPF_LOCK();
 		if (d->bd_bif == NULL)
 			error = EINVAL;
 		else
 			error = bpf_setdlt(d, *(u_int *)addr);
+		BPF_UNLOCK();
 		break;
 
 	/*
 	 * Get interface name.
 	 */
 	case BIOCGETIF:
+		BPF_LOCK();
 		if (d->bd_bif == NULL)
 			error = EINVAL;
 		else {
@@ -1249,13 +1401,16 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 			strlcpy(ifr->ifr_name, ifp->if_xname,
 			    sizeof(ifr->ifr_name));
 		}
+		BPF_UNLOCK();
 		break;
 
 	/*
 	 * Set interface.
 	 */
 	case BIOCSETIF:
+		BPF_LOCK();
 		error = bpf_setif(d, (struct ifreq *)addr);
+		BPF_UNLOCK();
 		break;
 
 	/*
@@ -1338,7 +1493,9 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 	 * Set immediate mode.
 	 */
 	case BIOCIMMEDIATE:
+		BPFD_LOCK(d);
 		d->bd_immediate = *(u_int *)addr;
+		BPFD_UNLOCK(d);
 		break;
 
 	case BIOCVERSION:
@@ -1354,21 +1511,27 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 	 * Get "header already complete" flag
 	 */
 	case BIOCGHDRCMPLT:
+		BPFD_LOCK(d);
 		*(u_int *)addr = d->bd_hdrcmplt;
+		BPFD_UNLOCK(d);
 		break;
 
 	/*
 	 * Set "header already complete" flag
 	 */
 	case BIOCSHDRCMPLT:
+		BPFD_LOCK(d);
 		d->bd_hdrcmplt = *(u_int *)addr ? 1 : 0;
+		BPFD_UNLOCK(d);
 		break;
 
 	/*
 	 * Get packet direction flag
 	 */
 	case BIOCGDIRECTION:
+		BPFD_LOCK(d);
 		*(u_int *)addr = d->bd_direction;
+		BPFD_UNLOCK(d);
 		break;
 
 	/*
@@ -1383,7 +1546,9 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 			case BPF_D_IN:
 			case BPF_D_INOUT:
 			case BPF_D_OUT:
+				BPFD_LOCK(d);
 				d->bd_direction = direction;
+				BPFD_UNLOCK(d);
 				break;
 			default:
 				error = EINVAL;
@@ -1395,7 +1560,9 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 	 * Get packet timestamp format and resolution.
 	 */
 	case BIOCGTSTAMP:
+		BPFD_LOCK(d);
 		*(u_int *)addr = d->bd_tstamp;
+		BPFD_UNLOCK(d);
 		break;
 
 	/*
@@ -1414,26 +1581,38 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 		break;
 
 	case BIOCFEEDBACK:
+		BPFD_LOCK(d);
 		d->bd_feedback = *(u_int *)addr;
+		BPFD_UNLOCK(d);
 		break;
 
 	case BIOCLOCK:
+		BPFD_LOCK(d);
 		d->bd_locked = 1;
+		BPFD_UNLOCK(d);
 		break;
 
 	case FIONBIO:		/* Non-blocking I/O */
 		break;
 
 	case FIOASYNC:		/* Send signal on receive packets */
+		BPFD_LOCK(d);
 		d->bd_async = *(int *)addr;
+		BPFD_UNLOCK(d);
 		break;
 
 	case FIOSETOWN:
+		/*
+		 * XXX: Add some sort of locking here?
+		 * fsetown() can sleep.
+		 */
 		error = fsetown(*(int *)addr, &d->bd_sigio);
 		break;
 
 	case FIOGETOWN:
+		BPFD_LOCK(d);
 		*(int *)addr = fgetown(&d->bd_sigio);
+		BPFD_UNLOCK(d);
 		break;
 
 	/* This is deprecated, FIOSETOWN should be used instead. */
@@ -1454,16 +1633,23 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 
 			if (sig >= NSIG)
 				error = EINVAL;
-			else
+			else {
+				BPFD_LOCK(d);
 				d->bd_sig = sig;
+				BPFD_UNLOCK(d);
+			}
 			break;
 		}
 	case BIOCGRSIG:
+		BPFD_LOCK(d);
 		*(u_int *)addr = d->bd_sig;
+		BPFD_UNLOCK(d);
 		break;
 
 	case BIOCGETBUFMODE:
+		BPFD_LOCK(d);
 		*(u_int *)addr = d->bd_bufmode;
+		BPFD_UNLOCK(d);
 		break;
 
 	case BIOCSETBUFMODE:
@@ -1518,19 +1704,31 @@ bpfioctl(struct cdev *dev, u_long cmd, c
 /*
  * Set d's packet filter program to fp.  If this file already has a filter,
  * free it and replace it.  Returns EINVAL for bogus requests.
+ *
+ * Note we need global lock here to serialize bpf_setf() and bpf_setif() calls
+ * since reading d->bd_bif can't be protected by d or interface lock due to
+ * lock order.
+ *
+ * Additionally, we have to acquire interface write lock due to bpf_mtap() uses
+ * interface read lock to read all filers.
+ *
  */
 static int
 bpf_setf(struct bpf_d *d, struct bpf_program *fp, u_long cmd)
 {
+#ifdef COMPAT_FREEBSD32
+	struct bpf_program fp_swab;
+	struct bpf_program32 *fp32;
+#endif
 	struct bpf_insn *fcode, *old;
-	u_int wfilter, flen, size;
 #ifdef BPF_JITTER
-	bpf_jit_filter *ofunc;
+	bpf_jit_filter *jfunc, *ofunc;
 #endif
-#ifdef COMPAT_FREEBSD32
-	struct bpf_program32 *fp32;
-	struct bpf_program fp_swab;
+	size_t size;
+	u_int flen;
+	int need_upgrade;
 
+#ifdef COMPAT_FREEBSD32
 	switch (cmd) {
 	case BIOCSETF32:
 	case BIOCSETWF32:
@@ -1550,73 +1748,86 @@ bpf_setf(struct bpf_d *d, struct bpf_pro
 		break;
 	}
 #endif
-	if (cmd == BIOCSETWF) {
-		old = d->bd_wfilter;
-		wfilter = 1;
-#ifdef BPF_JITTER
-		ofunc = NULL;
-#endif
-	} else {
-		wfilter = 0;
-		old = d->bd_rfilter;
+
+	fcode = NULL;
 #ifdef BPF_JITTER
-		ofunc = d->bd_bfilter;
+	jfunc = ofunc = NULL;
 #endif
-	}
-	if (fp->bf_insns == NULL) {
-		if (fp->bf_len != 0)
+	need_upgrade = 0;
+
+	/*
+	 * Check new filter validness before acquiring any locks.
+	 * Allocate memory for new filter, if needed.
+	 */
+	flen = fp->bf_len;
+	if (flen > bpf_maxinsns || (fp->bf_insns == NULL && flen != 0))
+		return (EINVAL);
+	size = flen * sizeof(*fp->bf_insns);
+	if (size > 0) {
+		/* We're setting up new filter.  Copy and check actual data. */
+		fcode = malloc(size, M_BPF, M_WAITOK);
+		if (copyin(fp->bf_insns, fcode, size) != 0 ||
+		    !bpf_validate(fcode, flen)) {
+			free(fcode, M_BPF);
 			return (EINVAL);
-		BPFD_LOCK(d);
-		if (wfilter)
-			d->bd_wfilter = NULL;
-		else {
-			d->bd_rfilter = NULL;
-#ifdef BPF_JITTER
-			d->bd_bfilter = NULL;
-#endif
-			if (cmd == BIOCSETF)
-				reset_d(d);
 		}
-		BPFD_UNLOCK(d);
-		if (old != NULL)
-			free((caddr_t)old, M_BPF);
 #ifdef BPF_JITTER
-		if (ofunc != NULL)
-			bpf_destroy_jit_filter(ofunc);
+		/* Filter is copied inside fcode and is perfectly valid. */
+		jfunc = bpf_jitter(fcode, flen);
 #endif
-		return (0);
 	}
-	flen = fp->bf_len;
-	if (flen > bpf_maxinsns)
-		return (EINVAL);
 
-	size = flen * sizeof(*fp->bf_insns);
-	fcode = (struct bpf_insn *)malloc(size, M_BPF, M_WAITOK);
-	if (copyin((caddr_t)fp->bf_insns, (caddr_t)fcode, size) == 0 &&
-	    bpf_validate(fcode, (int)flen)) {
-		BPFD_LOCK(d);
-		if (wfilter)
-			d->bd_wfilter = fcode;
-		else {
-			d->bd_rfilter = fcode;
+	BPF_LOCK();
+
+	/*
+	 * Set up new filter.
+	 * Protect filter change by interface lock.
+	 * Additionally, we are protected by global lock here.
+	 */
+	if (d->bd_bif != NULL)
+		BPFIF_WLOCK(d->bd_bif);
+	BPFD_LOCK(d);
+	if (cmd == BIOCSETWF) {
+		old = d->bd_wfilter;
+		d->bd_wfilter = fcode;
+	} else {
+		old = d->bd_rfilter;
+		d->bd_rfilter = fcode;
 #ifdef BPF_JITTER
-			d->bd_bfilter = bpf_jitter(fcode, flen);
+		ofunc = d->bd_bfilter;
+		d->bd_bfilter = jfunc;
 #endif
-			if (cmd == BIOCSETF)
-				reset_d(d);
+		if (cmd == BIOCSETF)
+			reset_d(d);
+
+		if (fcode != NULL) {
+			/*
+			 * Do not require upgrade by first BIOCSETF
+			 * (used to set snaplen) by pcap_open_live().
+			 */
+			if (d->bd_writer != 0 && --d->bd_writer == 0)
+				need_upgrade = 1;
+			CTR4(KTR_NET, "%s: filter function set by pid %d, "
+			    "bd_writer counter %d, need_upgrade %d",
+			    __func__, d->bd_pid, d->bd_writer, need_upgrade);
 		}
-		BPFD_UNLOCK(d);
-		if (old != NULL)
-			free((caddr_t)old, M_BPF);
+	}
+	BPFD_UNLOCK(d);
+	if (d->bd_bif != NULL)
+		BPFIF_WUNLOCK(d->bd_bif);
+	if (old != NULL)
+		free(old, M_BPF);
 #ifdef BPF_JITTER
-		if (ofunc != NULL)
-			bpf_destroy_jit_filter(ofunc);
+	if (ofunc != NULL)
+		bpf_destroy_jit_filter(ofunc);
 #endif
 
-		return (0);
-	}
-	free((caddr_t)fcode, M_BPF);
-	return (EINVAL);
+	/* Move d to active readers list. */
+	if (need_upgrade)
+		bpf_upgraded(d);
+
+	BPF_UNLOCK();
+	return (0);
 }
 
 /*
@@ -1630,28 +1841,30 @@ bpf_setif(struct bpf_d *d, struct ifreq 
 	struct bpf_if *bp;
 	struct ifnet *theywant;
 
+	BPF_LOCK_ASSERT();
+
 	theywant = ifunit(ifr->ifr_name);
 	if (theywant == NULL || theywant->if_bpf == NULL)
 		return (ENXIO);
 
 	bp = theywant->if_bpf;
 
+	/* Check if interface is not being detached from BPF */
+	BPFIF_RLOCK(bp);
+	if (bp->flags & BPFIF_FLAG_DYING) {
+		BPFIF_RUNLOCK(bp);
+		return (ENXIO);
+	}
+	BPFIF_RUNLOCK(bp);
+
 	/*
 	 * Behavior here depends on the buffering model.  If we're using
 	 * kernel memory buffers, then we can allocate them here.  If we're
 	 * using zero-copy, then the user process must have registered
 	 * buffers by the time we get here.  If not, return an error.
-	 *
-	 * XXXRW: There are locking issues here with multi-threaded use: what
-	 * if two threads try to set the interface at once?
 	 */
 	switch (d->bd_bufmode) {
 	case BPF_BUFMODE_BUFFER:
-		if (d->bd_sbuf == NULL)
-			bpf_buffer_alloc(d);
-		KASSERT(d->bd_sbuf != NULL, ("bpf_setif: bd_sbuf NULL"));
-		break;
-
 	case BPF_BUFMODE_ZBUF:
 		if (d->bd_sbuf == NULL)
 			return (EINVAL);
@@ -1660,15 +1873,8 @@ bpf_setif(struct bpf_d *d, struct ifreq 
 	default:
 		panic("bpf_setif: bufmode %d", d->bd_bufmode);
 	}
-	if (bp != d->bd_bif) {
-		if (d->bd_bif)
-			/*
-			 * Detach if attached to something else.
-			 */
-			bpf_detachd(d);
-
+	if (bp != d->bd_bif)
 		bpf_attachd(d, bp);
-	}
 	BPFD_LOCK(d);
 	reset_d(d);
 	BPFD_UNLOCK(d);
@@ -1696,7 +1902,7 @@ bpfpoll(struct cdev *dev, int events, st
 	 */
 	revents = events & (POLLOUT | POLLWRNORM);
 	BPFD_LOCK(d);
-	d->bd_pid = td->td_proc->p_pid;
+	BPF_PID_REFRESH(d, td);
 	if (events & (POLLIN | POLLRDNORM)) {
 		if (bpf_ready(d))
 			revents |= events & (POLLIN | POLLRDNORM);
@@ -1731,7 +1937,7 @@ bpfkqfilter(struct cdev *dev, struct kno
 	 * Refresh PID associated with this descriptor.
 	 */
 	BPFD_LOCK(d);
-	d->bd_pid = curthread->td_proc->p_pid;
+	BPF_PID_REFRESH_CUR(d);
 	kn->kn_fop = &bpfread_filtops;
 	kn->kn_hook = d;
 	knlist_add(&d->bd_sel.si_note, kn, 1);
@@ -1829,9 +2035,19 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, 
 	int gottime;
 
 	gottime = BPF_TSTAMP_NONE;
-	BPFIF_LOCK(bp);
+
+	BPFIF_RLOCK(bp);
+
 	LIST_FOREACH(d, &bp->bif_dlist, bd_next) {
-		BPFD_LOCK(d);
+		/*
+		 * We are not using any locks for d here because:
+		 * 1) any filter change is protected by interface
+		 * write lock
+		 * 2) destroying/detaching d is protected by interface
+		 * write lock, too
+		 */
+
+		/* XXX: Do not protect counter for the sake of performance. */
 		++d->bd_rcount;
 		/*
 		 * NB: We dont call BPF_CHECK_DIRECTION() here since there is no
@@ -1847,6 +2063,11 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, 
 #endif
 		slen = bpf_filter(d->bd_rfilter, pkt, pktlen, pktlen);
 		if (slen != 0) {
+			/*
+			 * Filter matches. Let's to acquire write lock.
+			 */
+			BPFD_LOCK(d);
+
 			d->bd_fcount++;
 			if (gottime < bpf_ts_quality(d->bd_tstamp))
 				gottime = bpf_gettime(&bt, d->bd_tstamp, NULL);
@@ -1855,10 +2076,10 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, 
 #endif
 				catchpacket(d, pkt, pktlen, slen,
 				    bpf_append_bytes, &bt);
+			BPFD_UNLOCK(d);
 		}
-		BPFD_UNLOCK(d);
 	}
-	BPFIF_UNLOCK(bp);
+	BPFIF_RUNLOCK(bp);
 }
 
 #define	BPF_CHECK_DIRECTION(d, r, i)				\
@@ -1867,6 +2088,7 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, 
 
 /*
  * Incoming linkage from device drivers, when packet is in an mbuf chain.
+ * Locking model is explained in bpf_tap().
  */
 void
 bpf_mtap(struct bpf_if *bp, struct mbuf *m)
@@ -1886,13 +2108,13 @@ bpf_mtap(struct bpf_if *bp, struct mbuf 
 	}
 
 	pktlen = m_length(m, NULL);
-
 	gottime = BPF_TSTAMP_NONE;
-	BPFIF_LOCK(bp);
+
+	BPFIF_RLOCK(bp);
+
 	LIST_FOREACH(d, &bp->bif_dlist, bd_next) {
 		if (BPF_CHECK_DIRECTION(d, m->m_pkthdr.rcvif, bp->bif_ifp))
 			continue;
-		BPFD_LOCK(d);
 		++d->bd_rcount;
 #ifdef BPF_JITTER
 		bf = bpf_jitter_enable != 0 ? d->bd_bfilter : NULL;
@@ -1903,6 +2125,8 @@ bpf_mtap(struct bpf_if *bp, struct mbuf 
 #endif
 		slen = bpf_filter(d->bd_rfilter, (u_char *)m, pktlen, 0);
 		if (slen != 0) {
+			BPFD_LOCK(d);
+
 			d->bd_fcount++;
 			if (gottime < bpf_ts_quality(d->bd_tstamp))
 				gottime = bpf_gettime(&bt, d->bd_tstamp, m);
@@ -1911,10 +2135,10 @@ bpf_mtap(struct bpf_if *bp, struct mbuf 
 #endif
 				catchpacket(d, (u_char *)m, pktlen, slen,
 				    bpf_append_mbuf, &bt);
+			BPFD_UNLOCK(d);
 		}
-		BPFD_UNLOCK(d);
 	}
-	BPFIF_UNLOCK(bp);
+	BPFIF_RUNLOCK(bp);
 }
 
 /*
@@ -1948,14 +2172,17 @@ bpf_mtap2(struct bpf_if *bp, void *data,
 	pktlen += dlen;
 
 	gottime = BPF_TSTAMP_NONE;
-	BPFIF_LOCK(bp);
+
+	BPFIF_RLOCK(bp);
+
 	LIST_FOREACH(d, &bp->bif_dlist, bd_next) {
 		if (BPF_CHECK_DIRECTION(d, m->m_pkthdr.rcvif, bp->bif_ifp))
 			continue;
-		BPFD_LOCK(d);
 		++d->bd_rcount;
 		slen = bpf_filter(d->bd_rfilter, (u_char *)&mb, pktlen, 0);
 		if (slen != 0) {
+			BPFD_LOCK(d);
+
 			d->bd_fcount++;
 			if (gottime < bpf_ts_quality(d->bd_tstamp))
 				gottime = bpf_gettime(&bt, d->bd_tstamp, m);
@@ -1964,10 +2191,10 @@ bpf_mtap2(struct bpf_if *bp, void *data,
 #endif
 				catchpacket(d, (u_char *)&mb, pktlen, slen,
 				    bpf_append_mbuf, &bt);
+			BPFD_UNLOCK(d);
 		}
-		BPFD_UNLOCK(d);
 	}
-	BPFIF_UNLOCK(bp);
+	BPFIF_RUNLOCK(bp);
 }
 
 #undef	BPF_CHECK_DIRECTION
@@ -2206,7 +2433,7 @@ bpf_freed(struct bpf_d *d)
 	}
 	if (d->bd_wfilter != NULL)
 		free((caddr_t)d->bd_wfilter, M_BPF);
-	mtx_destroy(&d->bd_mtx);
+	mtx_destroy(&d->bd_lock);
 }
 
 /*
@@ -2236,15 +2463,16 @@ bpfattach2(struct ifnet *ifp, u_int dlt,
 		panic("bpfattach");
 
 	LIST_INIT(&bp->bif_dlist);
+	LIST_INIT(&bp->bif_wlist);
 	bp->bif_ifp = ifp;
 	bp->bif_dlt = dlt;
-	mtx_init(&bp->bif_mtx, "bpf interface lock", NULL, MTX_DEF);
+	rw_init(&bp->bif_lock, "bpf interface lock");
 	KASSERT(*driverp == NULL, ("bpfattach2: driverp already initialized"));
 	*driverp = bp;
 
-	mtx_lock(&bpf_mtx);
+	BPF_LOCK();
 	LIST_INSERT_HEAD(&bpf_iflist, bp, bif_next);
-	mtx_unlock(&bpf_mtx);
+	BPF_UNLOCK();
 
 	bp->bif_hdrlen = hdrlen;
 
@@ -2253,10 +2481,9 @@ bpfattach2(struct ifnet *ifp, u_int dlt,
 }
 
 /*
- * Detach bpf from an interface.  This involves detaching each descriptor
- * associated with the interface, and leaving bd_bif NULL.  Notify each
- * descriptor as it's detached so that any sleepers wake up and get
- * ENXIO.
+ * Detach bpf from an interface. This involves detaching each descriptor
+ * associated with the interface. Notify each descriptor as it's detached
+ * so that any sleepers wake up and get ENXIO.
  */
 void
 bpfdetach(struct ifnet *ifp)
@@ -2269,31 +2496,45 @@ bpfdetach(struct ifnet *ifp)
 	ndetached = 0;
 #endif

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201303021511.r22FBLod031840>