From: "Alexander V. Chernikov" <melifaro@FreeBSD.org>
To: freebsd-net@freebsd.org
Cc: Gleb Smirnoff
Subject: [PATCH] BPF locking redesign
Date: Mon, 26 Mar 2012 14:59:39 +0400
Message-ID: <4F704C1B.5080709@FreeBSD.org>

Hello list!

There are some patches that can significantly improve forwarding performance when BPF consumers are present. However, some of the changes are a bit hackish and an ABI change is required, so they are split into separate patches I want to discuss. You will probably need to merge r233505 for the patches to apply.

Description: bpf_rwlock

This one is simple and straightforward: we convert the interface and descriptor locks to rwlock(9). Additionally, the per-descriptor (filter) reader lock acquisition in bpf_mtap[2] is removed, as suggested by glebius@. Instead, the filter is protected by taking the interface writer lock on every filter change. This greatly improves performance: in the most common case we acquire one reader lock instead of two mutexes. The internals of the bpf_if structure are now hidden behind the BPF_INTERNAL define, which permits including bpf.h without pulling in the rwlock headers. However, this is a temporary solution; struct bpf_if should be made opaque to any external caller.

Description: bpf_writers

Linux and Solaris (at least OpenSolaris) have the PF_PACKET socket family for sending raw Ethernet frames. The only FreeBSD interface that can be used to send raw frames is BPF. As a result, many programs such as cdpd, lldpd and various DHCP tools use BPF only to send data. This leads to the situation where software like cdpd, when run on a high-traffic-volume interface, significantly reduces overall performance, since we have to acquire additional locks for every packet. Here we add a sysctl that changes BPF behaviour in the following way: if a program opens a BPF descriptor without explicitly specifying a read filter, we assume it to be write-only and add it to a special writer-only per-interface list. bpf_peers_present() then keeps returning 0, so no additional overhead is introduced. Once a filter is supplied, the descriptor is moved to the original per-interface list, permitting packets to be captured. Unfortunately, pcap_open_live() sets a catch-all filter itself for the purpose of setting the snap length. Fortunately, most programs explicitly set a filter (even a catch-all one) after that; tcpdump(1) is a good example. So a somewhat hackish approach is taken: we upgrade the descriptor only after the second BIOCSETF is received. The sysctl is named net.bpf.optimize_writers and is turned off by default.
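To make the target use case concrete, here is a minimal, illustrative sketch (not part of the attached patches) of the kind of write-only consumer the bpf_writers change is aimed at: it opens /dev/bpf, binds it to an interface with BIOCSETIF and writes one raw frame, but never installs a read filter. With net.bpf.optimize_writers=1 such a descriptor would simply stay on the writer-only list, so bpf_peers_present() keeps returning 0 on the hot path. The interface name and frame contents are placeholders:

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

#include <net/if.h>
#include <net/bpf.h>

#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct ifreq ifr;
	unsigned char frame[64] = {
		0xff, 0xff, 0xff, 0xff, 0xff, 0xff,	/* destination: broadcast */
		0x00, 0x00, 0x00, 0x00, 0x00, 0x01,	/* source: dummy, kernel fills it in */
		0x88, 0xb5				/* IEEE local experimental ethertype */
	};
	int fd;

	/* /dev/bpf clones a new descriptor on every open(2). */
	if ((fd = open("/dev/bpf", O_RDWR)) == -1)
		err(1, "open(/dev/bpf)");

	/* Bind the descriptor to an interface; "em0" is just a placeholder. */
	memset(&ifr, 0, sizeof(ifr));
	strlcpy(ifr.ifr_name, "em0", sizeof(ifr.ifr_name));
	if (ioctl(fd, BIOCSETIF, &ifr) == -1)
		err(1, "BIOCSETIF");

	/*
	 * Send one raw frame.  No BIOCSETF is ever issued, so with
	 * net.bpf.optimize_writers=1 this descriptor never leaves the
	 * writer-only list and adds no per-packet capture overhead.
	 */
	if (write(fd, frame, sizeof(frame)) == -1)
		err(1, "write");

	close(fd);
	return (0);
}

A consumer that does issue BIOCSETF twice (e.g. pcap_open_live() followed by pcap_setfilter()) gets upgraded to the active reader list as described above.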
Description: bpf_if_opaque

No patch for this one yet. We could probably do the following: create the bpf_if structure in bpfattach2(), but do not assign it to the supplied pointer. Instead, put the new structure pointer and the interface pointer into some kind of hash (keyed by interface name). When a _reader_ arrives AND sets a valid filter, we check whether the bpf_if needs to be attached to the interface. When the reader disconnects, we set the bpf_if pointer back to NULL. No additional locking is required here: the same struct bpf_if lives as long as the interface exists, so at any point in time the pointer is either NULL or a pointer to that structure. Even if some thread on another CPU sees a stale value, there is no problem: NULL means no filter is set, so we skip BPF entirely; a non-NULL value sends us into BPF_MTAP, which acquires the read lock and determines that no peers are present. As a result, bpf_peers_present(x) simply returns (x), and there is no need to expose struct bpf_if to external users. Additionally, we do not require the interface write lock to be acquired on every interface attach/departure, which can otherwise lead to a small number of packets being dropped on mpd servers. Btw, we could consider switching the rlock/wlock to the _try_ variants to avoid this behaviour. The major drawback of this approach is that it completely breaks the ABI: any pre-compiled network driver carries an inlined bpf_peers_present() which assumes ifp->if_bpf is non-NULL. However, we can introduce bpfattach3(), or some special interface flag (set by if_alloc(), for example), indicating that the driver uses the new API, and retain the original behaviour for old drivers.

-- 
WBR, Alexander

--------------060905060202080703000205
Content-Type: text/plain; name="bpf_writers.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="bpf_writers.diff"

>From 2ca09cf74ef63fbde0ccede4a7e883c6f0add51d Mon Sep 17 00:00:00 2001
From: "Alexander V. Chernikov"
Date: Mon, 26 Mar 2012 14:58:13 +0400
Subject: [PATCH 2/2] Optimize BPF writers

---
 share/man/man4/bpf.4 |   31 +++++++++++++---
 sys/net/bpf.c        |   97 ++++++++++++++++++++++++++++++++++++++++++++++----
 sys/net/bpf.h        |    1 +
 sys/net/bpfdesc.h    |    1 +
 4 files changed, 120 insertions(+), 10 deletions(-)

diff --git a/share/man/man4/bpf.4 b/share/man/man4/bpf.4
index b9d6742..44bed39 100644
--- a/share/man/man4/bpf.4
+++ b/share/man/man4/bpf.4
@@ -952,10 +952,33 @@ array initializers:
 .Fn BPF_STMT opcode operand
 and
 .Fn BPF_JUMP opcode operand true_offset false_offset .
-.Sh FILES
-.Bl -tag -compact -width /dev/bpf
-.It Pa /dev/bpf
-the packet filter device
+.Sh SYSCTL VARIABLES
+A set of
+.Xr sysctl 8
+variables controls the behaviour of the
+.Nm
+subsystem.
+.Bl -tag -width indent
+.It Va net.bpf.optimize_writers: No 0
+Various programs use BPF to send (but not receive) raw packets
+(cdpd, lldpd, dhcpd and dhcp relays are good examples of such programs).
+They do not need incoming packets to be sent to them. Turning this option on
+causes new BPF consumers to be attached to a write-only interface list until
+the program explicitly specifies a read filter via
+.Cm pcap_setfilter() .
+This removes any performance degradation for high-speed interfaces.
+.It Va net.bpf.stats:
+Binary interface for retrieving general statistics.
+.It Va net.bpf.zerocopy_enable: No 0
+Permits zero-copy to be used with net BPF readers. Use with caution.
+.It Va net.bpf.maxinsns: No 512
+Maximum number of instructions that a BPF program can contain. Use the
+.Xr tcpdump 1
+-d option to determine the approximate number of instructions for any filter.
+.It Va net.bpf.maxbufsize: No 524288
+Maximum buffer size to allocate for packet buffers.
+.It Va net.bpf.bufsize: No 4096
+Default buffer size to allocate for packet buffers.
 .El
 .Sh EXAMPLES
 The following filter is taken from the Reverse ARP Daemon.
diff --git a/sys/net/bpf.c b/sys/net/bpf.c
index c1d4da7..a3a64ad 100644
--- a/sys/net/bpf.c
+++ b/sys/net/bpf.c
@@ -176,6 +176,12 @@ SYSCTL_INT(_net_bpf, OID_AUTO, zerocopy_enable, CTLFLAG_RW,
 static SYSCTL_NODE(_net_bpf, OID_AUTO, stats, CTLFLAG_MPSAFE | CTLFLAG_RW,
     bpf_stats_sysctl, "bpf statistics portal");
 
+static VNET_DEFINE(int, bpf_optimize_writers) = 0;
+#define	V_bpf_optimize_writers VNET(bpf_optimize_writers)
+SYSCTL_VNET_INT(_net_bpf, OID_AUTO, optimize_writers,
+    CTLFLAG_RW, &VNET_NAME(bpf_optimize_writers), 0,
+    "Do not send packets until BPF program is set");
+
 static	d_open_t	bpfopen;
 static	d_read_t	bpfread;
 static	d_write_t	bpfwrite;
@@ -572,17 +578,66 @@ static void
 bpf_attachd(struct bpf_d *d, struct bpf_if *bp)
 {
 	/*
-	 * Point d at bp, and add d to the interface's list of listeners.
-	 * Finally, point the driver's bpf cookie at the interface so
-	 * it will divert packets to bpf.
+	 * Point d at bp, and add d to the interface's list.
+	 * Since there are many applications using BPF for
+	 * sending raw packets only (dhcpd, cdpd are good examples)
+	 * we can delay adding d to the list of active listeners until
+	 * some filter is configured.
 	 */
-	BPFIF_WLOCK(bp);
 	d->bd_bif = bp;
-	LIST_INSERT_HEAD(&bp->bif_dlist, d, bd_next);
+	BPFIF_WLOCK(bp);
+
+	if (V_bpf_optimize_writers != 0) {
+		/* Add to writers-only list */
+		LIST_INSERT_HEAD(&bp->bif_wlist, d, bd_next);
+		/*
+		 * We decrement bd_writer on every filter set operation.
+		 * The first BIOCSETF is done by pcap_open_live() to set up
+		 * the snap length. After that the application usually sets its own filter.
+		 */
+		d->bd_writer = 2;
+	} else
+		LIST_INSERT_HEAD(&bp->bif_dlist, d, bd_next);
+
+	BPFIF_WUNLOCK(bp);
+
+	BPF_LOCK();
 	bpf_bpfd_cnt++;
+	BPF_UNLOCK();
+
+	CTR3(KTR_NET, "%s: bpf_attach called by pid %d, adding to %s list",
+	    __func__, d->bd_pid, d->bd_writer ? "writer" : "active");
+
+	if (V_bpf_optimize_writers == 0)
+		EVENTHANDLER_INVOKE(bpf_track, bp->bif_ifp, bp->bif_dlt, 1);
+}
+
+/*
+ * Add d to the list of active bp filters.
+ * Requires bpf_attachd() to be called first.
+ */
+static void
+bpf_upgraded(struct bpf_d *d)
+{
+	struct bpf_if *bp;
+
+	bp = d->bd_bif;
+
+	BPFIF_WLOCK(bp);
+	BPFD_WLOCK(d);
+
+	/* Remove from writers-only list */
+	LIST_REMOVE(d, bd_next);
+	LIST_INSERT_HEAD(&bp->bif_dlist, d, bd_next);
+	/* Mark d as reader */
+	d->bd_writer = 0;
+
+	BPFD_WUNLOCK(d);
 	BPFIF_WUNLOCK(bp);
 
+	CTR2(KTR_NET, "%s: upgrade required by pid %d", __func__, d->bd_pid);
+
 	EVENTHANDLER_INVOKE(bpf_track, bp->bif_ifp, bp->bif_dlt, 1);
 }
 
@@ -596,12 +651,17 @@ bpf_detachd(struct bpf_d *d)
 	struct bpf_if *bp;
 	struct ifnet *ifp;
 
+	CTR2(KTR_NET, "%s: detach required by pid %d", __func__, d->bd_pid);
+
 	BPF_LOCK_ASSERT();
 
 	bp = d->bd_bif;
 
 	BPFIF_WLOCK(bp);
 	BPFD_WLOCK(d);
 
+	/* Save bd_writer value */
+	error = d->bd_writer;
+
 	/*
 	 * Remove d from the interface's descriptor list.
 	 */
 	LIST_REMOVE(d, bd_next);
@@ -615,7 +675,9 @@ bpf_detachd(struct bpf_d *d)
 	/* We're already protected by global lock. */
 	bpf_bpfd_cnt--;
 
-	EVENTHANDLER_INVOKE(bpf_track, ifp, bp->bif_dlt, 0);
+	/* Call event handler iff d is attached */
+	if (error == 0)
+		EVENTHANDLER_INVOKE(bpf_track, ifp, bp->bif_dlt, 0);
 
 	/*
 	 * Check if this descriptor had requested promiscuous mode.
@@ -1536,6 +1598,7 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp, u_long cmd)
 #ifdef COMPAT_FREEBSD32
 	struct bpf_program32 *fp32;
 	struct bpf_program fp_swab;
+	int need_upgrade = 0;
 
 	if (cmd == BIOCSETWF32 || cmd == BIOCSETF32 || cmd == BIOCSETFNR32) {
 		fp32 = (struct bpf_program32 *)fp;
@@ -1611,6 +1674,16 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp, u_long cmd)
 #endif
 			if (cmd == BIOCSETF)
 				reset_d(d);
+
+			/*
+			 * Do not require upgrade on the first BIOCSETF
+			 * (used by pcap_open_live() to set the snap length).
+			 */
+			if ((d->bd_writer != 0) && (--d->bd_writer == 0))
+				need_upgrade = 1;
+			CTR4(KTR_NET, "%s: filter function set by pid %d, "
+			    "bd_writer counter %d, need_upgrade %d",
+			    __func__, d->bd_pid, d->bd_writer, need_upgrade);
 		}
 		BPFD_WUNLOCK(d);
 		BPFIF_WUNLOCK(d->bd_bif);
@@ -1621,6 +1694,10 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp, u_long cmd)
 				bpf_destroy_jit_filter(ofunc);
 #endif
 
+			/* Move d to active readers list */
+			if (need_upgrade != 0)
+				bpf_upgraded(d);
+
 			return (0);
 		}
 		free((caddr_t)fcode, M_BPF);
@@ -2265,6 +2342,7 @@ bpfattach2(struct ifnet *ifp, u_int dlt, u_int hdrlen, struct bpf_if **driverp)
 		panic("bpfattach");
 
 	LIST_INIT(&bp->bif_dlist);
+	LIST_INIT(&bp->bif_wlist);
 	bp->bif_ifp = ifp;
 	bp->bif_dlt = dlt;
 	rw_init(&bp->bif_lock, "bpf interface lock");
@@ -2520,6 +2598,13 @@ bpf_stats_sysctl(SYSCTL_HANDLER_ARGS)
 	index = 0;
 	LIST_FOREACH(bp, &bpf_iflist, bif_next) {
 		BPFIF_RLOCK(bp);
+		/* Send writers-only first */
+		LIST_FOREACH(bd, &bp->bif_wlist, bd_next) {
+			xbd = &xbdbuf[index++];
+			BPFD_RLOCK(bd);
+			bpfstats_fill_xbpf(xbd, bd);
+			BPFD_RUNLOCK(bd);
+		}
 		LIST_FOREACH(bd, &bp->bif_dlist, bd_next) {
 			xbd = &xbdbuf[index++];
 			BPFD_RLOCK(bd);
diff --git a/sys/net/bpf.h b/sys/net/bpf.h
index 4c4f0c3..01e1bd2 100644
--- a/sys/net/bpf.h
+++ b/sys/net/bpf.h
@@ -1104,6 +1104,7 @@ struct bpf_if {
 	u_int bif_hdrlen;		/* length of link header */
 	struct ifnet *bif_ifp;		/* corresponding interface */
 	struct rwlock bif_lock;		/* interface lock */
+	LIST_HEAD(, bpf_d) bif_wlist;	/* writer-only list */
 #endif
 };
diff --git a/sys/net/bpfdesc.h b/sys/net/bpfdesc.h
index 9ea4522..e11fdc6 100644
--- a/sys/net/bpfdesc.h
+++ b/sys/net/bpfdesc.h
@@ -79,6 +79,7 @@ struct bpf_d {
 	u_char bd_promisc;	/* true if listening promiscuously */
 	u_char bd_state;	/* idle, waiting, or timed out */
 	u_char bd_immediate;	/* true to return on packet arrival */
+	u_char bd_writer;	/* non-zero if d is writer-only */
 	int bd_hdrcmplt;	/* false to fill in src lladdr automatically */
 	int bd_direction;	/* select packet direction */
 	int bd_tstamp;		/* select time stamping function */
-- 
1.7.9.4

--------------060905060202080703000205
Content-Type: text/plain; name="bpf_rwlock.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="bpf_rwlock.diff"

>From 00a39222ce5721e2fc3925657c3b1e69502d59b5 Mon Sep 17 00:00:00 2001
From: "Alexander V.
Chernikov" Date: Mon, 26 Mar 2012 14:57:09 +0400 Subject: [PATCH 1/2] Change mutexes to rwlocks --- sys/kern/subr_witness.c | 4 +- sys/net/bpf.c | 251 +++++++++++++++++++++++++------------------- sys/net/bpf.h | 7 +- sys/net/bpf_buffer.c | 6 +- sys/net/bpf_zerocopy.c | 10 +- sys/net/bpfdesc.h | 23 ++-- sys/security/mac/mac_net.c | 2 + 7 files changed, 180 insertions(+), 123 deletions(-) diff --git a/sys/kern/subr_witness.c b/sys/kern/subr_witness.c index 7853e69..53529b3 100644 --- a/sys/kern/subr_witness.c +++ b/sys/kern/subr_witness.c @@ -563,8 +563,8 @@ static struct witness_order_list_entry order_lists[] = { * BPF */ { "bpf global lock", &lock_class_mtx_sleep }, - { "bpf interface lock", &lock_class_mtx_sleep }, - { "bpf cdev lock", &lock_class_mtx_sleep }, + { "bpf interface lock", &lock_class_rw }, + { "bpf cdev lock", &lock_class_rw }, { NULL, NULL }, /* * NFS server diff --git a/sys/net/bpf.c b/sys/net/bpf.c index cf8f2ec..c1d4da7 100644 --- a/sys/net/bpf.c +++ b/sys/net/bpf.c @@ -43,6 +43,8 @@ __FBSDID("$FreeBSD: head/sys/net/bpf.c 232449 2012-03-03 08:19:18Z jmallett $"); #include #include +#include +#include #include #include #include @@ -66,6 +68,7 @@ __FBSDID("$FreeBSD: head/sys/net/bpf.c 232449 2012-03-03 08:19:18Z jmallett $"); #include #include +#define BPF_INTERNAL #include #include #ifdef BPF_JITTER @@ -207,7 +210,7 @@ bpf_append_bytes(struct bpf_d *d, caddr_t buf, u_int offset, void *src, u_int len) { - BPFD_LOCK_ASSERT(d); + BPFD_WLOCK_ASSERT(d); switch (d->bd_bufmode) { case BPF_BUFMODE_BUFFER: @@ -227,7 +230,7 @@ bpf_append_mbuf(struct bpf_d *d, caddr_t buf, u_int offset, void *src, u_int len) { - BPFD_LOCK_ASSERT(d); + BPFD_WLOCK_ASSERT(d); switch (d->bd_bufmode) { case BPF_BUFMODE_BUFFER: @@ -249,7 +252,7 @@ static void bpf_buf_reclaimed(struct bpf_d *d) { - BPFD_LOCK_ASSERT(d); + BPFD_WLOCK_ASSERT(d); switch (d->bd_bufmode) { case BPF_BUFMODE_BUFFER: @@ -290,7 +293,6 @@ bpf_canfreebuf(struct bpf_d *d) static int bpf_canwritebuf(struct bpf_d *d) { - BPFD_LOCK_ASSERT(d); switch (d->bd_bufmode) { @@ -309,7 +311,7 @@ static void bpf_buffull(struct bpf_d *d) { - BPFD_LOCK_ASSERT(d); + BPFD_WLOCK_ASSERT(d); switch (d->bd_bufmode) { case BPF_BUFMODE_ZBUF: @@ -325,7 +327,7 @@ void bpf_bufheld(struct bpf_d *d) { - BPFD_LOCK_ASSERT(d); + BPFD_WLOCK_ASSERT(d); switch (d->bd_bufmode) { case BPF_BUFMODE_ZBUF: @@ -574,12 +576,12 @@ bpf_attachd(struct bpf_d *d, struct bpf_if *bp) * Finally, point the driver's bpf cookie at the interface so * it will divert packets to bpf. */ - BPFIF_LOCK(bp); + BPFIF_WLOCK(bp); d->bd_bif = bp; LIST_INSERT_HEAD(&bp->bif_dlist, d, bd_next); bpf_bpfd_cnt++; - BPFIF_UNLOCK(bp); + BPFIF_WUNLOCK(bp); EVENTHANDLER_INVOKE(bpf_track, bp->bif_ifp, bp->bif_dlt, 1); } @@ -594,20 +596,24 @@ bpf_detachd(struct bpf_d *d) struct bpf_if *bp; struct ifnet *ifp; + BPF_LOCK_ASSERT(); + bp = d->bd_bif; - BPFIF_LOCK(bp); - BPFD_LOCK(d); - ifp = d->bd_bif->bif_ifp; + BPFIF_WLOCK(bp); + BPFD_WLOCK(d); /* * Remove d from the interface's descriptor list. */ LIST_REMOVE(d, bd_next); - bpf_bpfd_cnt--; + ifp = bp->bif_ifp; d->bd_bif = NULL; - BPFD_UNLOCK(d); - BPFIF_UNLOCK(bp); + BPFD_WUNLOCK(d); + BPFIF_WUNLOCK(bp); + + /* We're already protected by global lock. 
*/ + bpf_bpfd_cnt--; EVENTHANDLER_INVOKE(bpf_track, ifp, bp->bif_dlt, 0); @@ -642,16 +648,16 @@ bpf_dtor(void *data) { struct bpf_d *d = data; - BPFD_LOCK(d); + BPFD_WLOCK(d); if (d->bd_state == BPF_WAITING) callout_stop(&d->bd_callout); d->bd_state = BPF_IDLE; - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); funsetown(&d->bd_sigio); - mtx_lock(&bpf_mtx); + BPF_LOCK(); if (d->bd_bif) bpf_detachd(d); - mtx_unlock(&bpf_mtx); + BPF_UNLOCK(); #ifdef MAC mac_bpfdesc_destroy(d); #endif /* MAC */ @@ -689,14 +695,14 @@ bpfopen(struct cdev *dev, int flags, int fmt, struct thread *td) d->bd_bufmode = BPF_BUFMODE_BUFFER; d->bd_sig = SIGIO; d->bd_direction = BPF_D_INOUT; - d->bd_pid = td->td_proc->p_pid; + BPF_PID_REFRESH(d, td); #ifdef MAC mac_bpfdesc_init(d); mac_bpfdesc_create(td->td_ucred, d); #endif - mtx_init(&d->bd_mtx, devtoname(dev), "bpf cdev lock", MTX_DEF); - callout_init_mtx(&d->bd_callout, &d->bd_mtx, 0); - knlist_init_mtx(&d->bd_sel.si_note, &d->bd_mtx); + rw_init(&d->bd_lock, "bpf cdev lock"); + callout_init_rw(&d->bd_callout, &d->bd_lock, 0); + knlist_init_rw_reader(&d->bd_sel.si_note, &d->bd_lock); return (0); } @@ -725,10 +731,10 @@ bpfread(struct cdev *dev, struct uio *uio, int ioflag) non_block = ((ioflag & O_NONBLOCK) != 0); - BPFD_LOCK(d); - d->bd_pid = curthread->td_proc->p_pid; + BPFD_WLOCK(d); + BPF_PID_REFRESH_CUR(d); if (d->bd_bufmode != BPF_BUFMODE_BUFFER) { - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); return (EOPNOTSUPP); } if (d->bd_state == BPF_WAITING) @@ -764,18 +770,18 @@ bpfread(struct cdev *dev, struct uio *uio, int ioflag) * it before using it again. */ if (d->bd_bif == NULL) { - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); return (ENXIO); } if (non_block) { - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); return (EWOULDBLOCK); } - error = msleep(d, &d->bd_mtx, PRINET|PCATCH, + error = rw_sleep(d, &d->bd_lock, PRINET|PCATCH, "bpf", d->bd_rtout); if (error == EINTR || error == ERESTART) { - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); return (error); } if (error == EWOULDBLOCK) { @@ -793,7 +799,7 @@ bpfread(struct cdev *dev, struct uio *uio, int ioflag) break; if (d->bd_slen == 0) { - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); return (0); } ROTATE_BUFFERS(d); @@ -803,7 +809,7 @@ bpfread(struct cdev *dev, struct uio *uio, int ioflag) /* * At this point, we know we have something in the hold slot. */ - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); /* * Move data from hold buffer into user space. 
@@ -816,12 +822,12 @@ bpfread(struct cdev *dev, struct uio *uio, int ioflag) */ error = bpf_uiomove(d, d->bd_hbuf, d->bd_hlen, uio); - BPFD_LOCK(d); + BPFD_WLOCK(d); d->bd_fbuf = d->bd_hbuf; d->bd_hbuf = NULL; d->bd_hlen = 0; bpf_buf_reclaimed(d); - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); return (error); } @@ -833,7 +839,7 @@ static __inline void bpf_wakeup(struct bpf_d *d) { - BPFD_LOCK_ASSERT(d); + BPFD_WLOCK_ASSERT(d); if (d->bd_state == BPF_WAITING) { callout_stop(&d->bd_callout); d->bd_state = BPF_IDLE; @@ -851,7 +857,7 @@ bpf_timed_out(void *arg) { struct bpf_d *d = (struct bpf_d *)arg; - BPFD_LOCK_ASSERT(d); + BPFD_WLOCK_ASSERT(d); if (callout_pending(&d->bd_callout) || !callout_active(&d->bd_callout)) return; @@ -866,7 +872,7 @@ static int bpf_ready(struct bpf_d *d) { - BPFD_LOCK_ASSERT(d); + BPFD_WLOCK_ASSERT(d); if (!bpf_canfreebuf(d) && d->bd_hlen != 0) return (1); @@ -889,7 +895,7 @@ bpfwrite(struct cdev *dev, struct uio *uio, int ioflag) if (error != 0) return (error); - d->bd_pid = curthread->td_proc->p_pid; + BPF_PID_REFRESH_CUR(d); d->bd_wcount++; if (d->bd_bif == NULL) { d->bd_wdcount++; @@ -937,11 +943,11 @@ bpfwrite(struct cdev *dev, struct uio *uio, int ioflag) CURVNET_SET(ifp->if_vnet); #ifdef MAC - BPFD_LOCK(d); + BPFD_WLOCK(d); mac_bpfdesc_create_mbuf(d, m); if (mc != NULL) mac_bpfdesc_create_mbuf(d, mc); - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); #endif error = (*ifp->if_output)(ifp, m, &dst, NULL); @@ -970,7 +976,7 @@ static void reset_d(struct bpf_d *d) { - mtx_assert(&d->bd_mtx, MA_OWNED); + BPFD_WLOCK_ASSERT(d); if ((d->bd_hbuf != NULL) && (d->bd_bufmode != BPF_BUFMODE_ZBUF || bpf_canfreebuf(d))) { @@ -1037,12 +1043,12 @@ bpfioctl(struct cdev *dev, u_long cmd, caddr_t addr, int flags, /* * Refresh PID associated with this descriptor. */ - BPFD_LOCK(d); - d->bd_pid = td->td_proc->p_pid; + BPFD_WLOCK(d); + BPF_PID_REFRESH(d, td); if (d->bd_state == BPF_WAITING) callout_stop(&d->bd_callout); d->bd_state = BPF_IDLE; - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); if (d->bd_locked == 1) { switch (cmd) { @@ -1108,11 +1114,11 @@ bpfioctl(struct cdev *dev, u_long cmd, caddr_t addr, int flags, { int n; - BPFD_LOCK(d); + BPFD_WLOCK(d); n = d->bd_slen; if (d->bd_hbuf) n += d->bd_hlen; - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); *(int *)addr = n; break; @@ -1163,9 +1169,9 @@ bpfioctl(struct cdev *dev, u_long cmd, caddr_t addr, int flags, * Flush read packet buffer. */ case BIOCFLUSH: - BPFD_LOCK(d); + BPFD_WLOCK(d); reset_d(d); - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); break; /* @@ -1488,15 +1494,15 @@ bpfioctl(struct cdev *dev, u_long cmd, caddr_t addr, int flags, return (EINVAL); } - BPFD_LOCK(d); + BPFD_WLOCK(d); if (d->bd_sbuf != NULL || d->bd_hbuf != NULL || d->bd_fbuf != NULL || d->bd_bif != NULL) { - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); CURVNET_RESTORE(); return (EBUSY); } d->bd_bufmode = *(u_int *)addr; - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); break; case BIOCGETZMAX: @@ -1556,7 +1562,12 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp, u_long cmd) if (fp->bf_insns == NULL) { if (fp->bf_len != 0) return (EINVAL); - BPFD_LOCK(d); + /* + * Protect filter change by interface lock, too. + * The same lock order is used by bpf_detachd(). 
+ */ + BPFIF_WLOCK(d->bd_bif); + BPFD_WLOCK(d); if (wfilter) d->bd_wfilter = NULL; else { @@ -1567,7 +1578,8 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp, u_long cmd) if (cmd == BIOCSETF) reset_d(d); } - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); + BPFIF_WUNLOCK(d->bd_bif); if (old != NULL) free((caddr_t)old, M_BPF); #ifdef BPF_JITTER @@ -1584,7 +1596,12 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp, u_long cmd) fcode = (struct bpf_insn *)malloc(size, M_BPF, M_WAITOK); if (copyin((caddr_t)fp->bf_insns, (caddr_t)fcode, size) == 0 && bpf_validate(fcode, (int)flen)) { - BPFD_LOCK(d); + /* + * Protect filter change by interface lock, too + * The same lock order is used by bpf_detachd(). + */ + BPFIF_WLOCK(d->bd_bif); + BPFD_WLOCK(d); if (wfilter) d->bd_wfilter = fcode; else { @@ -1595,7 +1612,8 @@ bpf_setf(struct bpf_d *d, struct bpf_program *fp, u_long cmd) if (cmd == BIOCSETF) reset_d(d); } - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); + BPFIF_WUNLOCK(d->bd_bif); if (old != NULL) free((caddr_t)old, M_BPF); #ifdef BPF_JITTER @@ -1659,9 +1677,9 @@ bpf_setif(struct bpf_d *d, struct ifreq *ifr) bpf_attachd(d, bp); } - BPFD_LOCK(d); + BPFD_WLOCK(d); reset_d(d); - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); return (0); } @@ -1685,8 +1703,8 @@ bpfpoll(struct cdev *dev, int events, struct thread *td) * Refresh PID associated with this descriptor. */ revents = events & (POLLOUT | POLLWRNORM); - BPFD_LOCK(d); - d->bd_pid = td->td_proc->p_pid; + BPFD_WLOCK(d); + BPF_PID_REFRESH(d, td); if (events & (POLLIN | POLLRDNORM)) { if (bpf_ready(d)) revents |= events & (POLLIN | POLLRDNORM); @@ -1700,7 +1718,7 @@ bpfpoll(struct cdev *dev, int events, struct thread *td) } } } - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); return (revents); } @@ -1720,12 +1738,12 @@ bpfkqfilter(struct cdev *dev, struct knote *kn) /* * Refresh PID associated with this descriptor. */ - BPFD_LOCK(d); - d->bd_pid = curthread->td_proc->p_pid; + BPFD_WLOCK(d); + BPF_PID_REFRESH_CUR(d); kn->kn_fop = &bpfread_filtops; kn->kn_hook = d; knlist_add(&d->bd_sel.si_note, kn, 1); - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); return (0); } @@ -1744,7 +1762,7 @@ filt_bpfread(struct knote *kn, long hint) struct bpf_d *d = (struct bpf_d *)kn->kn_hook; int ready; - BPFD_LOCK_ASSERT(d); + BPFD_WLOCK_ASSERT(d); ready = bpf_ready(d); if (ready) { kn->kn_data = d->bd_slen; @@ -1819,9 +1837,19 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, u_int pktlen) int gottime; gottime = BPF_TSTAMP_NONE; - BPFIF_LOCK(bp); + + BPFIF_RLOCK(bp); + LIST_FOREACH(d, &bp->bif_dlist, bd_next) { - BPFD_LOCK(d); + /* + * We are not using any locks for d here because: + * 1) any filter change is protected by interface + * write lock + * 2) destroying/detaching d is protected by interface + * write lock, too + */ + + /* XXX: Do not protect counter for the sake of performance. */ ++d->bd_rcount; /* * NB: We dont call BPF_CHECK_DIRECTION() here since there is no @@ -1837,6 +1865,11 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, u_int pktlen) #endif slen = bpf_filter(d->bd_rfilter, pkt, pktlen, pktlen); if (slen != 0) { + /* + * Filter matches. Let's to acquire write lock. 
+ */ + BPFD_WLOCK(d); + d->bd_fcount++; if (gottime < bpf_ts_quality(d->bd_tstamp)) gottime = bpf_gettime(&bt, d->bd_tstamp, NULL); @@ -1845,10 +1878,10 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, u_int pktlen) #endif catchpacket(d, pkt, pktlen, slen, bpf_append_bytes, &bt); + BPFD_WUNLOCK(d); } - BPFD_UNLOCK(d); } - BPFIF_UNLOCK(bp); + BPFIF_RUNLOCK(bp); } #define BPF_CHECK_DIRECTION(d, r, i) \ @@ -1857,6 +1890,7 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, u_int pktlen) /* * Incoming linkage from device drivers, when packet is in an mbuf chain. + * Locking model is explained in bpf_tap(). */ void bpf_mtap(struct bpf_if *bp, struct mbuf *m) @@ -1876,13 +1910,13 @@ bpf_mtap(struct bpf_if *bp, struct mbuf *m) } pktlen = m_length(m, NULL); - gottime = BPF_TSTAMP_NONE; - BPFIF_LOCK(bp); + + BPFIF_RLOCK(bp); + LIST_FOREACH(d, &bp->bif_dlist, bd_next) { if (BPF_CHECK_DIRECTION(d, m->m_pkthdr.rcvif, bp->bif_ifp)) continue; - BPFD_LOCK(d); ++d->bd_rcount; #ifdef BPF_JITTER bf = bpf_jitter_enable != 0 ? d->bd_bfilter : NULL; @@ -1893,6 +1927,8 @@ bpf_mtap(struct bpf_if *bp, struct mbuf *m) #endif slen = bpf_filter(d->bd_rfilter, (u_char *)m, pktlen, 0); if (slen != 0) { + BPFD_WLOCK(d); + d->bd_fcount++; if (gottime < bpf_ts_quality(d->bd_tstamp)) gottime = bpf_gettime(&bt, d->bd_tstamp, m); @@ -1901,10 +1937,10 @@ bpf_mtap(struct bpf_if *bp, struct mbuf *m) #endif catchpacket(d, (u_char *)m, pktlen, slen, bpf_append_mbuf, &bt); + BPFD_WUNLOCK(d); } - BPFD_UNLOCK(d); } - BPFIF_UNLOCK(bp); + BPFIF_RUNLOCK(bp); } /* @@ -1938,14 +1974,17 @@ bpf_mtap2(struct bpf_if *bp, void *data, u_int dlen, struct mbuf *m) pktlen += dlen; gottime = BPF_TSTAMP_NONE; - BPFIF_LOCK(bp); + + BPFIF_RLOCK(bp); + LIST_FOREACH(d, &bp->bif_dlist, bd_next) { if (BPF_CHECK_DIRECTION(d, m->m_pkthdr.rcvif, bp->bif_ifp)) continue; - BPFD_LOCK(d); ++d->bd_rcount; slen = bpf_filter(d->bd_rfilter, (u_char *)&mb, pktlen, 0); if (slen != 0) { + BPFD_WLOCK(d); + d->bd_fcount++; if (gottime < bpf_ts_quality(d->bd_tstamp)) gottime = bpf_gettime(&bt, d->bd_tstamp, m); @@ -1954,10 +1993,10 @@ bpf_mtap2(struct bpf_if *bp, void *data, u_int dlen, struct mbuf *m) #endif catchpacket(d, (u_char *)&mb, pktlen, slen, bpf_append_mbuf, &bt); + BPFD_WUNLOCK(d); } - BPFD_UNLOCK(d); } - BPFIF_UNLOCK(bp); + BPFIF_RUNLOCK(bp); } #undef BPF_CHECK_DIRECTION @@ -2049,7 +2088,7 @@ catchpacket(struct bpf_d *d, u_char *pkt, u_int pktlen, u_int snaplen, int do_timestamp; int tstype; - BPFD_LOCK_ASSERT(d); + BPFD_WLOCK_ASSERT(d); /* * Detect whether user space has released a buffer back to us, and if @@ -2196,7 +2235,7 @@ bpf_freed(struct bpf_d *d) } if (d->bd_wfilter != NULL) free((caddr_t)d->bd_wfilter, M_BPF); - mtx_destroy(&d->bd_mtx); + rw_destroy(&d->bd_lock); } /* @@ -2228,13 +2267,13 @@ bpfattach2(struct ifnet *ifp, u_int dlt, u_int hdrlen, struct bpf_if **driverp) LIST_INIT(&bp->bif_dlist); bp->bif_ifp = ifp; bp->bif_dlt = dlt; - mtx_init(&bp->bif_mtx, "bpf interface lock", NULL, MTX_DEF); + rw_init(&bp->bif_lock, "bpf interface lock"); KASSERT(*driverp == NULL, ("bpfattach2: driverp already initialized")); *driverp = bp; - mtx_lock(&bpf_mtx); + BPF_LOCK(); LIST_INSERT_HEAD(&bpf_iflist, bp, bif_next); - mtx_unlock(&bpf_mtx); + BPF_UNLOCK(); bp->bif_hdrlen = hdrlen; @@ -2261,14 +2300,14 @@ bpfdetach(struct ifnet *ifp) /* Find all bpf_if struct's which reference ifp and detach them. 
*/ do { - mtx_lock(&bpf_mtx); + BPF_LOCK(); LIST_FOREACH(bp, &bpf_iflist, bif_next) { if (ifp == bp->bif_ifp) break; } if (bp != NULL) LIST_REMOVE(bp, bif_next); - mtx_unlock(&bpf_mtx); + BPF_UNLOCK(); if (bp != NULL) { #ifdef INVARIANTS @@ -2276,11 +2315,11 @@ bpfdetach(struct ifnet *ifp) #endif while ((d = LIST_FIRST(&bp->bif_dlist)) != NULL) { bpf_detachd(d); - BPFD_LOCK(d); + BPFD_WLOCK(d); bpf_wakeup(d); - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); } - mtx_destroy(&bp->bif_mtx); + rw_destroy(&bp->bif_lock); free(bp, M_BPF); } } while (bp != NULL); @@ -2304,13 +2343,13 @@ bpf_getdltlist(struct bpf_d *d, struct bpf_dltlist *bfl) ifp = d->bd_bif->bif_ifp; n = 0; error = 0; - mtx_lock(&bpf_mtx); + BPF_LOCK(); LIST_FOREACH(bp, &bpf_iflist, bif_next) { if (bp->bif_ifp != ifp) continue; if (bfl->bfl_list != NULL) { if (n >= bfl->bfl_len) { - mtx_unlock(&bpf_mtx); + BPF_UNLOCK(); return (ENOMEM); } error = copyout(&bp->bif_dlt, @@ -2318,7 +2357,7 @@ bpf_getdltlist(struct bpf_d *d, struct bpf_dltlist *bfl) } n++; } - mtx_unlock(&bpf_mtx); + BPF_UNLOCK(); bfl->bfl_len = n; return (error); } @@ -2336,19 +2375,19 @@ bpf_setdlt(struct bpf_d *d, u_int dlt) if (d->bd_bif->bif_dlt == dlt) return (0); ifp = d->bd_bif->bif_ifp; - mtx_lock(&bpf_mtx); + BPF_LOCK(); LIST_FOREACH(bp, &bpf_iflist, bif_next) { if (bp->bif_ifp == ifp && bp->bif_dlt == dlt) break; } - mtx_unlock(&bpf_mtx); + BPF_UNLOCK(); if (bp != NULL) { opromisc = d->bd_promisc; bpf_detachd(d); bpf_attachd(d, bp); - BPFD_LOCK(d); + BPFD_WLOCK(d); reset_d(d); - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); if (opromisc) { error = ifpromisc(bp->bif_ifp, 1); if (error) @@ -2386,22 +2425,22 @@ bpf_zero_counters(void) struct bpf_if *bp; struct bpf_d *bd; - mtx_lock(&bpf_mtx); + BPF_LOCK(); LIST_FOREACH(bp, &bpf_iflist, bif_next) { - BPFIF_LOCK(bp); + BPFIF_RLOCK(bp); LIST_FOREACH(bd, &bp->bif_dlist, bd_next) { - BPFD_LOCK(bd); + BPFD_WLOCK(bd); bd->bd_rcount = 0; bd->bd_dcount = 0; bd->bd_fcount = 0; bd->bd_wcount = 0; bd->bd_wfcount = 0; bd->bd_zcopy = 0; - BPFD_UNLOCK(bd); + BPFD_WUNLOCK(bd); } - BPFIF_UNLOCK(bp); + BPFIF_RUNLOCK(bp); } - mtx_unlock(&bpf_mtx); + BPF_UNLOCK(); } static void @@ -2472,24 +2511,24 @@ bpf_stats_sysctl(SYSCTL_HANDLER_ARGS) if (bpf_bpfd_cnt == 0) return (SYSCTL_OUT(req, 0, 0)); xbdbuf = malloc(req->oldlen, M_BPF, M_WAITOK); - mtx_lock(&bpf_mtx); + BPF_LOCK(); if (req->oldlen < (bpf_bpfd_cnt * sizeof(*xbd))) { - mtx_unlock(&bpf_mtx); + BPF_UNLOCK(); free(xbdbuf, M_BPF); return (ENOMEM); } index = 0; LIST_FOREACH(bp, &bpf_iflist, bif_next) { - BPFIF_LOCK(bp); + BPFIF_RLOCK(bp); LIST_FOREACH(bd, &bp->bif_dlist, bd_next) { xbd = &xbdbuf[index++]; - BPFD_LOCK(bd); + BPFD_RLOCK(bd); bpfstats_fill_xbpf(xbd, bd); - BPFD_UNLOCK(bd); + BPFD_RUNLOCK(bd); } - BPFIF_UNLOCK(bp); + BPFIF_RUNLOCK(bp); } - mtx_unlock(&bpf_mtx); + BPF_UNLOCK(); error = SYSCTL_OUT(req, xbdbuf, index * sizeof(*xbd)); free(xbdbuf, M_BPF); return (error); diff --git a/sys/net/bpf.h b/sys/net/bpf.h index c47ad1d..4c4f0c3 100644 --- a/sys/net/bpf.h +++ b/sys/net/bpf.h @@ -1092,14 +1092,19 @@ SYSCTL_DECL(_net_bpf); /* * Descriptor associated with each attached hardware interface. + * FIXME: this structure is exposed to external callers to speed up + * bpf_peers_present() call. 
However we cover all fields not needed by + * this function via BPF_INTERNAL define */ struct bpf_if { LIST_ENTRY(bpf_if) bif_next; /* list of all interfaces */ LIST_HEAD(, bpf_d) bif_dlist; /* descriptor list */ +#ifdef BPF_INTERNAL u_int bif_dlt; /* link layer type */ u_int bif_hdrlen; /* length of link header */ struct ifnet *bif_ifp; /* corresponding interface */ - struct mtx bif_mtx; /* mutex for interface */ + struct rwlock bif_lock; /* interface lock */ +#endif }; void bpf_bufheld(struct bpf_d *d); diff --git a/sys/net/bpf_buffer.c b/sys/net/bpf_buffer.c index e257960..51b418e 100644 --- a/sys/net/bpf_buffer.c +++ b/sys/net/bpf_buffer.c @@ -184,9 +184,9 @@ bpf_buffer_ioctl_sblen(struct bpf_d *d, u_int *i) { u_int size; - BPFD_LOCK(d); + BPFD_WLOCK(d); if (d->bd_bif != NULL) { - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); return (EINVAL); } size = *i; @@ -195,7 +195,7 @@ bpf_buffer_ioctl_sblen(struct bpf_d *d, u_int *i) else if (size < BPF_MINBUFSIZE) *i = size = BPF_MINBUFSIZE; d->bd_bufsize = size; - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); return (0); } diff --git a/sys/net/bpf_zerocopy.c b/sys/net/bpf_zerocopy.c index 60fd76f..1bf0bd6 100644 --- a/sys/net/bpf_zerocopy.c +++ b/sys/net/bpf_zerocopy.c @@ -515,14 +515,14 @@ bpf_zerocopy_ioctl_rotzbuf(struct thread *td, struct bpf_d *d, struct zbuf *bzh; bzero(bz, sizeof(*bz)); - BPFD_LOCK(d); + BPFD_WLOCK(d); if (d->bd_hbuf == NULL && d->bd_slen != 0) { ROTATE_BUFFERS(d); bzh = (struct zbuf *)d->bd_hbuf; bz->bz_bufa = (void *)bzh->zb_uaddr; bz->bz_buflen = d->bd_hlen; } - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); return (0); } @@ -570,10 +570,10 @@ bpf_zerocopy_ioctl_setzbuf(struct thread *td, struct bpf_d *d, * We only allow buffers to be installed once, so atomically check * that no buffers are currently installed and install new buffers. */ - BPFD_LOCK(d); + BPFD_WLOCK(d); if (d->bd_hbuf != NULL || d->bd_sbuf != NULL || d->bd_fbuf != NULL || d->bd_bif != NULL) { - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); zbuf_free(zba); zbuf_free(zbb); return (EINVAL); @@ -593,6 +593,6 @@ bpf_zerocopy_ioctl_setzbuf(struct thread *td, struct bpf_d *d, * shared management region. 
*/ d->bd_bufsize = bz->bz_buflen - sizeof(struct bpf_zbuf_header); - BPFD_UNLOCK(d); + BPFD_WUNLOCK(d); return (0); } diff --git a/sys/net/bpfdesc.h b/sys/net/bpfdesc.h index d8950b9..9ea4522 100644 --- a/sys/net/bpfdesc.h +++ b/sys/net/bpfdesc.h @@ -87,7 +87,7 @@ struct bpf_d { int bd_sig; /* signal to send upon packet reception */ struct sigio * bd_sigio; /* information for async I/O */ struct selinfo bd_sel; /* bsd select info */ - struct mtx bd_mtx; /* mutex for this descriptor */ + struct rwlock bd_lock; /* per-descriptor lock */ struct callout bd_callout; /* for BPF timeouts with select */ struct label *bd_label; /* MAC label for descriptor */ u_int64_t bd_fcount; /* number of packets which matched filter */ @@ -106,10 +106,19 @@ struct bpf_d { #define BPF_WAITING 1 /* waiting for read timeout in select */ #define BPF_TIMED_OUT 2 /* read timeout has expired in select */ -#define BPFD_LOCK(bd) mtx_lock(&(bd)->bd_mtx) -#define BPFD_UNLOCK(bd) mtx_unlock(&(bd)->bd_mtx) -#define BPFD_LOCK_ASSERT(bd) mtx_assert(&(bd)->bd_mtx, MA_OWNED) +#define BPFD_RLOCK(bd) rw_rlock(&(bd)->bd_lock) +#define BPFD_RUNLOCK(bd) rw_runlock(&(bd)->bd_lock) +#define BPFD_WLOCK(bd) rw_wlock(&(bd)->bd_lock) +#define BPFD_WUNLOCK(bd) rw_wunlock(&(bd)->bd_lock) +#define BPFD_WLOCK_ASSERT(bd) rw_assert(&(bd)->bd_lock, RA_WLOCKED) +#define BPFD_LOCK_ASSERT(bd) rw_assert(&(bd)->bd_lock, RA_LOCKED) +#define BPF_PID_REFRESH(bd, td) (bd)->bd_pid = (td)->td_proc->p_pid +#define BPF_PID_REFRESH_CUR(bd) (bd)->bd_pid = curthread->td_proc->p_pid + +#define BPF_LOCK() mtx_lock(&bpf_mtx) +#define BPF_UNLOCK() mtx_unlock(&bpf_mtx) +#define BPF_LOCK_ASSERT() mtx_assert(&bpf_mtx, MA_OWNED) /* * External representation of the bpf descriptor */ @@ -144,7 +153,9 @@ struct xbpf_d { u_int64_t bd_spare[4]; }; -#define BPFIF_LOCK(bif) mtx_lock(&(bif)->bif_mtx) -#define BPFIF_UNLOCK(bif) mtx_unlock(&(bif)->bif_mtx) +#define BPFIF_RLOCK(bif) rw_rlock(&(bif)->bif_lock) +#define BPFIF_RUNLOCK(bif) rw_runlock(&(bif)->bif_lock) +#define BPFIF_WLOCK(bif) rw_wlock(&(bif)->bif_lock) +#define BPFIF_WUNLOCK(bif) rw_wunlock(&(bif)->bif_lock) #endif diff --git a/sys/security/mac/mac_net.c b/sys/security/mac/mac_net.c index 56ee817..7857749 100644 --- a/sys/security/mac/mac_net.c +++ b/sys/security/mac/mac_net.c @@ -319,6 +319,7 @@ mac_bpfdesc_create_mbuf(struct bpf_d *d, struct mbuf *m) { struct label *label; + /* Assume reader lock is enough. */ BPFD_LOCK_ASSERT(d); if (mac_policy_count == 0) @@ -354,6 +355,7 @@ mac_bpfdesc_check_receive(struct bpf_d *d, struct ifnet *ifp) { int error; + /* Assume reader lock is enough. */ BPFD_LOCK_ASSERT(d); if (mac_policy_count == 0) -- 1.7.9.4 --------------060905060202080703000205--