From owner-svn-src-stable@FreeBSD.ORG Sat Mar 2 15:11:21 2013
Delivered-To: svn-src-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by hub.freebsd.org (Postfix) with ESMTP id 9DC3E9BD; Sat, 2 Mar 2013 15:11:21 +0000 (UTC) (envelope-from melifaro@FreeBSD.org)
Received: from svn.freebsd.org (svn.freebsd.org [IPv6:2001:1900:2254:2068::e6a:0]) by mx1.freebsd.org (Postfix) with ESMTP id 755B9DC; Sat, 2 Mar 2013 15:11:21 +0000 (UTC)
Received: from svn.freebsd.org ([127.0.1.70]) by svn.freebsd.org (8.14.5/8.14.5) with ESMTP id r22FBLp1031843; Sat, 2 Mar 2013 15:11:21 GMT (envelope-from melifaro@svn.freebsd.org)
Received: (from melifaro@localhost) by svn.freebsd.org (8.14.5/8.14.5/Submit) id r22FBLod031840; Sat, 2 Mar 2013 15:11:21 GMT (envelope-from melifaro@svn.freebsd.org)
Message-Id: <201303021511.r22FBLod031840@svn.freebsd.org>
From: "Alexander V. Chernikov"
Date: Sat, 2 Mar 2013 15:11:21 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-9@freebsd.org
Subject: svn commit: r247629 - in stable/9/sys: kern net security/mac
X-SVN-Group: stable-9
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-BeenThere: svn-src-stable@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: SVN commit messages for all the -stable branches of the src tree
X-List-Received-Date: Sat, 02 Mar 2013 15:11:21 -0000

Author: melifaro
Date: Sat Mar  2 15:11:20 2013
New Revision: 247629
URL: http://svnweb.freebsd.org/changeset/base/247629

Log:
  Merge
  * r233937 - Improve BPF locking model
  * r233938 - Improve performance for writer-only BPF users
  * r233946 - Fix build
  * r235744 - Fix (new) panic on attaching to non-existent interface
  * r235745 - Fix old panic when BPF consumer attaches to destroying interface
  * r235746 - Call bpf_jitter() before acquiring
BPF global lock
  * r235747 - Make most BPF ioctls() SMP-safe.
  * r236231 - Fix BPF_JITTER code broken by r235746.
  * r236251 - Fix shim for BIOCSETF to drop all packets buffered on the
    descriptor.
  * r236261 - Save the previous filter right before we set a new one.
  * r236262 - Fix style(9) nits, reduce unnecessary type casts.
  * r236559 - Fix panic introduced by r235745.
  * r236806 - Fix typo introduced in r236559.

  r233937
  Improve the BPF locking model.

  - Interface locks and descriptor locks are converted from mutex(9) to
    rwlock(9). This greatly improves performance: in the most common case
    we need to acquire one reader lock instead of two mutexes.

  - Remove the filter (descriptor) reader lock in bpf_mtap[2], as
    suggested by glebius@. The filter is instead protected by taking the
    interface writer lock on filter change.

  - Cover struct bpf_if under the BPF_INTERNAL define. This permits
    including bpf.h without pulling in the rwlock headers. However, this
    is a temporary solution; struct bpf_if should be made opaque to any
    external caller.

  r233938
  Improve performance for writer-only BPF users.

  Linux and Solaris (at least OpenSolaris) have PF_PACKET socket families
  for sending raw Ethernet frames. The only FreeBSD interface that can be
  used to send raw frames is BPF. As a result, many programs such as cdpd,
  lldpd and various DHCP tools use BPF only to send data. This leads to a
  situation where software like cdpd, run on a high-traffic-volume
  interface, significantly reduces overall performance, since we have to
  acquire additional locks for every packet.

  Here we add a sysctl that changes BPF behavior in the following way: if
  a program opens a BPF socket without explicitly specifying a read
  filter, we assume it to be write-only and add it to a special
  writer-only per-interface list. This makes bpf_peers_present() return 0,
  so no additional overhead is introduced. After a filter is supplied, the
  descriptor is added to the original per-interface list, permitting
  packets to be captured.
Unfortunately, pcap_open_live() sets a catch-all filter itself for the
  purpose of setting the snap length. Fortunately, most programs
  explicitly set a filter (even a catch-all one) after that; tcpdump(1)
  is a good example. So a somewhat hackish approach is taken: we upgrade
  the descriptor only after the second BIOCSETF is received. The sysctl
  is named net.bpf.optimize_writers and is turned off by default.

  - While here, document all sysctl variables in bpf.4.

  r233946
  Fix build broken by r233938.

  r235744
  Fix panic on attaching to a non-existent interface (introduced by
  r233937, pointed out by hrs@).
  Fix panic on tcpdump being attached to an interface being removed
  (introduced by r233937, pointed out by hrs@ and adrian@).
  Protect most of bpf_setf() by the BPF global lock.
  Add several forgotten assertions (thanks to adrian@).
  Document the current locking model inside bpf.c.
  Document EVENTHANDLER(9) usage inside BPF.

  r235745
  Fix an old panic when a BPF consumer attaches to a destroying
  interface. A 'flags' field is added to the end of the bpf_if structure.
  Currently the only flag is BPFIF_FLAG_DYING, which is set on bpf detach
  and checked by bpf_attachd().

  The problem can be easily triggered on SMP stable/[89] by a command
  like (sort of):
  'while true; do ifconfig vlan222 create vlan 222 vlandev em0 up ; \
  tcpdump -pi vlan222 & ; ifconfig vlan222 destroy ; done'

  Fix a possible use-after-free when BPF detaches itself from an
  interface, freeing bpf_bif memory, while the interface is still UP and
  there can be routes via this interface. Freeing is now delayed until
  the ifnet_departure_event is received via the eventhandler(9) API.

  Convert the bpfd rwlock back to a mutex due to lack of performance gain
  (currently, checking whether a packet matches the filter is done
  without holding the bpfd lock, and we have to acquire the write lock if
  the packet matches).

  r235746
  Call bpf_jitter() before acquiring the BPF global lock, since malloc()
  is used inside bpf_jitter().
  Eliminate bpf_buffer_alloc() and allocate BPF buffers on descriptor
  creation and via the BIOCSBLEN ioctl.
This permits us not to allocate buffers inside bpf_attachd(), which is
  protected by the global lock.

  r235747
  Make most BPF ioctls() SMP-safe.

  r236559
  Fix panic introduced by r235745. The panic occurs after the first
  packet traverses the renamed interface.
  Add several comments on locking.

  r236231
  Fix BPF_JITTER code broken by r235746.

  r236251
  Fix the 32-bit shim for BIOCSETF to drop all packets buffered on the
  descriptor and reset statistics, as it should.

  r236261
  - Save the previous filter right before we set a new one.
  - Reduce duplicate code and make it a little easier to read.

  r236262
  Fix style(9) nits, reduce unnecessary type casts, etc., for bpf_setf().

  r236806
  Fix typo introduced in r236559.

Modified:
  stable/9/sys/kern/subr_witness.c
  stable/9/sys/net/bpf.c
  stable/9/sys/net/bpf.h
  stable/9/sys/net/bpf_buffer.c
  stable/9/sys/net/bpf_buffer.h
  stable/9/sys/net/bpfdesc.h
  stable/9/sys/security/mac/mac_net.c
Directory Properties:
  stable/9/sys/   (props changed)

Modified: stable/9/sys/kern/subr_witness.c ============================================================================== --- stable/9/sys/kern/subr_witness.c Sat Mar 2 14:54:33 2013 (r247628) +++ stable/9/sys/kern/subr_witness.c Sat Mar 2 15:11:20 2013 (r247629) @@ -562,7 +562,7 @@ static struct witness_order_list_entry o * BPF */ { "bpf global lock", &lock_class_mtx_sleep }, - { "bpf interface lock", &lock_class_mtx_sleep }, + { "bpf interface lock", &lock_class_rw }, { "bpf cdev lock", &lock_class_mtx_sleep }, { NULL, NULL }, /* Modified: stable/9/sys/net/bpf.c ============================================================================== --- stable/9/sys/net/bpf.c Sat Mar 2 14:54:33 2013 (r247628) +++ stable/9/sys/net/bpf.c Sat Mar 2 15:11:20 2013 (r247629) @@ -43,6 +43,8 @@ __FBSDID("$FreeBSD$"); #include #include +#include +#include #include #include #include @@ -66,6 +68,7 @@ __FBSDID("$FreeBSD$"); #include #include +#define BPF_INTERNAL #include #include #ifdef BPF_JITTER @@ -144,6 +147,7 @@ static int bpf_bpfd_cnt; static void
bpf_attachd(struct bpf_d *, struct bpf_if *); static void bpf_detachd(struct bpf_d *); +static void bpf_detachd_locked(struct bpf_d *); static void bpf_freed(struct bpf_d *); static int bpf_movein(struct uio *, int, struct ifnet *, struct mbuf **, struct sockaddr *, int *, struct bpf_insn *); @@ -155,7 +159,7 @@ static void catchpacket(struct bpf_d *, void (*)(struct bpf_d *, caddr_t, u_int, void *, u_int), struct bintime *); static void reset_d(struct bpf_d *); -static int bpf_setf(struct bpf_d *, struct bpf_program *, u_long cmd); +static int bpf_setf(struct bpf_d *, struct bpf_program *, u_long cmd); static int bpf_getdltlist(struct bpf_d *, struct bpf_dltlist *); static int bpf_setdlt(struct bpf_d *, u_int); static void filt_bpfdetach(struct knote *); @@ -173,6 +177,12 @@ SYSCTL_INT(_net_bpf, OID_AUTO, zerocopy_ SYSCTL_NODE(_net_bpf, OID_AUTO, stats, CTLFLAG_MPSAFE | CTLFLAG_RW, bpf_stats_sysctl, "bpf statistics portal"); +static VNET_DEFINE(int, bpf_optimize_writers) = 0; +#define V_bpf_optimize_writers VNET(bpf_optimize_writers) +SYSCTL_VNET_INT(_net_bpf, OID_AUTO, optimize_writers, + CTLFLAG_RW, &VNET_NAME(bpf_optimize_writers), 0, + "Do not send packets until BPF program is set"); + static d_open_t bpfopen; static d_read_t bpfread; static d_write_t bpfwrite; @@ -197,6 +207,37 @@ static struct filterops bpfread_filtops .f_event = filt_bpfread, }; +eventhandler_tag bpf_ifdetach_cookie = NULL; + +/* + * LOCKING MODEL USED BY BPF: + * Locks: + * 1) global lock (BPF_LOCK). Mutex, used to protect interface addition/removal, + * some global counters and every bpf_if reference. + * 2) Interface lock. Rwlock, used to protect list of BPF descriptors and their filters. + * 3) Descriptor lock. Mutex, used to protect BPF buffers and various structure fields + * used by bpf_mtap code. + * + * Lock order: + * + * Global lock, interface lock, descriptor lock + * + * We have to acquire interface lock before descriptor main lock due to BPF_MTAP[2] + * working model. 
In many places (like bpf_detachd) we start with BPF descriptor + * (and we need to at least rlock it to get reliable interface pointer). This + * gives us potential LOR. As a result, we use global lock to protect from bpf_if + * change in every such place. + * + * Changing d->bd_bif is protected by 1) global lock, 2) interface lock and + * 3) descriptor main wlock. + * Reading bd_bif can be protected by any of these locks, typically global lock. + * + * Changing read/write BPF filter is protected by the same three locks, + * the same applies for reading. + * + * Sleeping in global lock is not allowed due to bpfdetach() using it. + */ + /* * Wrapper functions for various buffering methods. If the set of buffer * modes expands, we will probably want to introduce a switch data structure @@ -290,7 +331,6 @@ bpf_canfreebuf(struct bpf_d *d) static int bpf_canwritebuf(struct bpf_d *d) { - BPFD_LOCK_ASSERT(d); switch (d->bd_bufmode) { @@ -569,17 +609,92 @@ bad: static void bpf_attachd(struct bpf_d *d, struct bpf_if *bp) { + int op_w; + + BPF_LOCK_ASSERT(); + + /* + * Save sysctl value to protect from sysctl change + * between reads + */ + op_w = V_bpf_optimize_writers; + + if (d->bd_bif != NULL) + bpf_detachd_locked(d); /* - * Point d at bp, and add d to the interface's list of listeners. - * Finally, point the driver's bpf cookie at the interface so - * it will divert packets to bpf. + * Point d at bp, and add d to the interface's list. + * Since there are many applicaiotns using BPF for + * sending raw packets only (dhcpd, cdpd are good examples) + * we can delay adding d to the list of active listeners until + * some filter is configured. */ - BPFIF_LOCK(bp); + + BPFIF_WLOCK(bp); + BPFD_LOCK(d); + d->bd_bif = bp; - LIST_INSERT_HEAD(&bp->bif_dlist, d, bd_next); + + if (op_w != 0) { + /* Add to writers-only list */ + LIST_INSERT_HEAD(&bp->bif_wlist, d, bd_next); + /* + * We decrement bd_writer on every filter set operation. 
+ * First BIOCSETF is done by pcap_open_live() to set up + * snap length. After that appliation usually sets its own filter + */ + d->bd_writer = 2; + } else + LIST_INSERT_HEAD(&bp->bif_dlist, d, bd_next); + + BPFD_UNLOCK(d); + BPFIF_WUNLOCK(bp); bpf_bpfd_cnt++; - BPFIF_UNLOCK(bp); + + CTR3(KTR_NET, "%s: bpf_attach called by pid %d, adding to %s list", + __func__, d->bd_pid, d->bd_writer ? "writer" : "active"); + + if (op_w == 0) + EVENTHANDLER_INVOKE(bpf_track, bp->bif_ifp, bp->bif_dlt, 1); +} + +/* + * Add d to the list of active bp filters. + * Reuqires bpf_attachd() to be called before + */ +static void +bpf_upgraded(struct bpf_d *d) +{ + struct bpf_if *bp; + + BPF_LOCK_ASSERT(); + + bp = d->bd_bif; + + /* + * Filter can be set several times without specifying interface. + * Mark d as reader and exit. + */ + if (bp == NULL) { + BPFD_LOCK(d); + d->bd_writer = 0; + BPFD_UNLOCK(d); + return; + } + + BPFIF_WLOCK(bp); + BPFD_LOCK(d); + + /* Remove from writers-only list */ + LIST_REMOVE(d, bd_next); + LIST_INSERT_HEAD(&bp->bif_dlist, d, bd_next); + /* Mark d as reader */ + d->bd_writer = 0; + + BPFD_UNLOCK(d); + BPFIF_WUNLOCK(bp); + + CTR2(KTR_NET, "%s: upgrade required by pid %d", __func__, d->bd_pid); EVENTHANDLER_INVOKE(bpf_track, bp->bif_ifp, bp->bif_dlt, 1); } @@ -590,26 +705,47 @@ bpf_attachd(struct bpf_d *d, struct bpf_ static void bpf_detachd(struct bpf_d *d) { + BPF_LOCK(); + bpf_detachd_locked(d); + BPF_UNLOCK(); +} + +static void +bpf_detachd_locked(struct bpf_d *d) +{ int error; struct bpf_if *bp; struct ifnet *ifp; - bp = d->bd_bif; - BPFIF_LOCK(bp); + CTR2(KTR_NET, "%s: detach required by pid %d", __func__, d->bd_pid); + + BPF_LOCK_ASSERT(); + + /* Check if descriptor is attached */ + if ((bp = d->bd_bif) == NULL) + return; + + BPFIF_WLOCK(bp); BPFD_LOCK(d); - ifp = d->bd_bif->bif_ifp; + + /* Save bd_writer value */ + error = d->bd_writer; /* * Remove d from the interface's descriptor list. 
*/ LIST_REMOVE(d, bd_next); - bpf_bpfd_cnt--; + ifp = bp->bif_ifp; d->bd_bif = NULL; BPFD_UNLOCK(d); - BPFIF_UNLOCK(bp); + BPFIF_WUNLOCK(bp); + + bpf_bpfd_cnt--; - EVENTHANDLER_INVOKE(bpf_track, ifp, bp->bif_dlt, 0); + /* Call event handler iff d is attached */ + if (error == 0) + EVENTHANDLER_INVOKE(bpf_track, ifp, bp->bif_dlt, 0); /* * Check if this descriptor had requested promiscuous mode. @@ -648,10 +784,7 @@ bpf_dtor(void *data) d->bd_state = BPF_IDLE; BPFD_UNLOCK(d); funsetown(&d->bd_sigio); - mtx_lock(&bpf_mtx); - if (d->bd_bif) - bpf_detachd(d); - mtx_unlock(&bpf_mtx); + bpf_detachd(d); #ifdef MAC mac_bpfdesc_destroy(d); #endif /* MAC */ @@ -671,7 +804,7 @@ static int bpfopen(struct cdev *dev, int flags, int fmt, struct thread *td) { struct bpf_d *d; - int error; + int error, size; d = malloc(sizeof(*d), M_BPF, M_WAITOK | M_ZERO); error = devfs_set_cdevpriv(d, bpf_dtor); @@ -689,14 +822,18 @@ bpfopen(struct cdev *dev, int flags, int d->bd_bufmode = BPF_BUFMODE_BUFFER; d->bd_sig = SIGIO; d->bd_direction = BPF_D_INOUT; - d->bd_pid = td->td_proc->p_pid; + BPF_PID_REFRESH(d, td); #ifdef MAC mac_bpfdesc_init(d); mac_bpfdesc_create(td->td_ucred, d); #endif - mtx_init(&d->bd_mtx, devtoname(dev), "bpf cdev lock", MTX_DEF); - callout_init_mtx(&d->bd_callout, &d->bd_mtx, 0); - knlist_init_mtx(&d->bd_sel.si_note, &d->bd_mtx); + mtx_init(&d->bd_lock, devtoname(dev), "bpf cdev lock", MTX_DEF); + callout_init_mtx(&d->bd_callout, &d->bd_lock, 0); + knlist_init_mtx(&d->bd_sel.si_note, &d->bd_lock); + + /* Allocate default buffers */ + size = d->bd_bufsize; + bpf_buffer_ioctl_sblen(d, &size); return (0); } @@ -726,7 +863,7 @@ bpfread(struct cdev *dev, struct uio *ui non_block = ((ioflag & O_NONBLOCK) != 0); BPFD_LOCK(d); - d->bd_pid = curthread->td_proc->p_pid; + BPF_PID_REFRESH_CUR(d); if (d->bd_bufmode != BPF_BUFMODE_BUFFER) { BPFD_UNLOCK(d); return (EOPNOTSUPP); @@ -772,7 +909,7 @@ bpfread(struct cdev *dev, struct uio *ui BPFD_UNLOCK(d); return (EWOULDBLOCK); } - error 
= msleep(d, &d->bd_mtx, PRINET|PCATCH, + error = msleep(d, &d->bd_lock, PRINET|PCATCH, "bpf", d->bd_rtout); if (error == EINTR || error == ERESTART) { BPFD_UNLOCK(d); @@ -889,8 +1026,9 @@ bpfwrite(struct cdev *dev, struct uio *u if (error != 0) return (error); - d->bd_pid = curthread->td_proc->p_pid; + BPF_PID_REFRESH_CUR(d); d->bd_wcount++; + /* XXX: locking required */ if (d->bd_bif == NULL) { d->bd_wdcount++; return (ENXIO); @@ -911,6 +1049,7 @@ bpfwrite(struct cdev *dev, struct uio *u bzero(&dst, sizeof(dst)); m = NULL; hlen = 0; + /* XXX: bpf_movein() can sleep */ error = bpf_movein(uio, (int)d->bd_bif->bif_dlt, ifp, &m, &dst, &hlen, d->bd_wfilter); if (error) { @@ -970,7 +1109,7 @@ static void reset_d(struct bpf_d *d) { - mtx_assert(&d->bd_mtx, MA_OWNED); + BPFD_LOCK_ASSERT(d); if ((d->bd_hbuf != NULL) && (d->bd_bufmode != BPF_BUFMODE_ZBUF || bpf_canfreebuf(d))) { @@ -1038,7 +1177,7 @@ bpfioctl(struct cdev *dev, u_long cmd, c * Refresh PID associated with this descriptor. */ BPFD_LOCK(d); - d->bd_pid = td->td_proc->p_pid; + BPF_PID_REFRESH(d, td); if (d->bd_state == BPF_WAITING) callout_stop(&d->bd_callout); d->bd_state = BPF_IDLE; @@ -1090,7 +1229,9 @@ bpfioctl(struct cdev *dev, u_long cmd, c case BIOCGDLTLIST32: case BIOCGRTIMEOUT32: case BIOCSRTIMEOUT32: + BPFD_LOCK(d); d->bd_compat32 = 1; + BPFD_UNLOCK(d); } #endif @@ -1135,7 +1276,9 @@ bpfioctl(struct cdev *dev, u_long cmd, c * Get buffer len [for read()]. */ case BIOCGBLEN: + BPFD_LOCK(d); *(u_int *)addr = d->bd_bufsize; + BPFD_UNLOCK(d); break; /* @@ -1190,10 +1333,12 @@ bpfioctl(struct cdev *dev, u_long cmd, c * Get current data link type. 
*/ case BIOCGDLT: + BPF_LOCK(); if (d->bd_bif == NULL) error = EINVAL; else *(u_int *)addr = d->bd_bif->bif_dlt; + BPF_UNLOCK(); break; /* @@ -1208,6 +1353,7 @@ bpfioctl(struct cdev *dev, u_long cmd, c list32 = (struct bpf_dltlist32 *)addr; dltlist.bfl_len = list32->bfl_len; dltlist.bfl_list = PTRIN(list32->bfl_list); + BPF_LOCK(); if (d->bd_bif == NULL) error = EINVAL; else { @@ -1215,31 +1361,37 @@ bpfioctl(struct cdev *dev, u_long cmd, c if (error == 0) list32->bfl_len = dltlist.bfl_len; } + BPF_UNLOCK(); break; } #endif case BIOCGDLTLIST: + BPF_LOCK(); if (d->bd_bif == NULL) error = EINVAL; else error = bpf_getdltlist(d, (struct bpf_dltlist *)addr); + BPF_UNLOCK(); break; /* * Set data link type. */ case BIOCSDLT: + BPF_LOCK(); if (d->bd_bif == NULL) error = EINVAL; else error = bpf_setdlt(d, *(u_int *)addr); + BPF_UNLOCK(); break; /* * Get interface name. */ case BIOCGETIF: + BPF_LOCK(); if (d->bd_bif == NULL) error = EINVAL; else { @@ -1249,13 +1401,16 @@ bpfioctl(struct cdev *dev, u_long cmd, c strlcpy(ifr->ifr_name, ifp->if_xname, sizeof(ifr->ifr_name)); } + BPF_UNLOCK(); break; /* * Set interface. */ case BIOCSETIF: + BPF_LOCK(); error = bpf_setif(d, (struct ifreq *)addr); + BPF_UNLOCK(); break; /* @@ -1338,7 +1493,9 @@ bpfioctl(struct cdev *dev, u_long cmd, c * Set immediate mode. */ case BIOCIMMEDIATE: + BPFD_LOCK(d); d->bd_immediate = *(u_int *)addr; + BPFD_UNLOCK(d); break; case BIOCVERSION: @@ -1354,21 +1511,27 @@ bpfioctl(struct cdev *dev, u_long cmd, c * Get "header already complete" flag */ case BIOCGHDRCMPLT: + BPFD_LOCK(d); *(u_int *)addr = d->bd_hdrcmplt; + BPFD_UNLOCK(d); break; /* * Set "header already complete" flag */ case BIOCSHDRCMPLT: + BPFD_LOCK(d); d->bd_hdrcmplt = *(u_int *)addr ? 
1 : 0; + BPFD_UNLOCK(d); break; /* * Get packet direction flag */ case BIOCGDIRECTION: + BPFD_LOCK(d); *(u_int *)addr = d->bd_direction; + BPFD_UNLOCK(d); break; /* @@ -1383,7 +1546,9 @@ bpfioctl(struct cdev *dev, u_long cmd, c case BPF_D_IN: case BPF_D_INOUT: case BPF_D_OUT: + BPFD_LOCK(d); d->bd_direction = direction; + BPFD_UNLOCK(d); break; default: error = EINVAL; @@ -1395,7 +1560,9 @@ bpfioctl(struct cdev *dev, u_long cmd, c * Get packet timestamp format and resolution. */ case BIOCGTSTAMP: + BPFD_LOCK(d); *(u_int *)addr = d->bd_tstamp; + BPFD_UNLOCK(d); break; /* @@ -1414,26 +1581,38 @@ bpfioctl(struct cdev *dev, u_long cmd, c break; case BIOCFEEDBACK: + BPFD_LOCK(d); d->bd_feedback = *(u_int *)addr; + BPFD_UNLOCK(d); break; case BIOCLOCK: + BPFD_LOCK(d); d->bd_locked = 1; + BPFD_UNLOCK(d); break; case FIONBIO: /* Non-blocking I/O */ break; case FIOASYNC: /* Send signal on receive packets */ + BPFD_LOCK(d); d->bd_async = *(int *)addr; + BPFD_UNLOCK(d); break; case FIOSETOWN: + /* + * XXX: Add some sort of locking here? + * fsetown() can sleep. + */ error = fsetown(*(int *)addr, &d->bd_sigio); break; case FIOGETOWN: + BPFD_LOCK(d); *(int *)addr = fgetown(&d->bd_sigio); + BPFD_UNLOCK(d); break; /* This is deprecated, FIOSETOWN should be used instead. */ @@ -1454,16 +1633,23 @@ bpfioctl(struct cdev *dev, u_long cmd, c if (sig >= NSIG) error = EINVAL; - else + else { + BPFD_LOCK(d); d->bd_sig = sig; + BPFD_UNLOCK(d); + } break; } case BIOCGRSIG: + BPFD_LOCK(d); *(u_int *)addr = d->bd_sig; + BPFD_UNLOCK(d); break; case BIOCGETBUFMODE: + BPFD_LOCK(d); *(u_int *)addr = d->bd_bufmode; + BPFD_UNLOCK(d); break; case BIOCSETBUFMODE: @@ -1518,19 +1704,31 @@ bpfioctl(struct cdev *dev, u_long cmd, c /* * Set d's packet filter program to fp. If this file already has a filter, * free it and replace it. Returns EINVAL for bogus requests. 
+ * + * Note we need global lock here to serialize bpf_setf() and bpf_setif() calls + * since reading d->bd_bif can't be protected by d or interface lock due to + * lock order. + * + * Additionally, we have to acquire interface write lock due to bpf_mtap() uses + * interface read lock to read all filers. + * */ static int bpf_setf(struct bpf_d *d, struct bpf_program *fp, u_long cmd) { +#ifdef COMPAT_FREEBSD32 + struct bpf_program fp_swab; + struct bpf_program32 *fp32; +#endif struct bpf_insn *fcode, *old; - u_int wfilter, flen, size; #ifdef BPF_JITTER - bpf_jit_filter *ofunc; + bpf_jit_filter *jfunc, *ofunc; #endif -#ifdef COMPAT_FREEBSD32 - struct bpf_program32 *fp32; - struct bpf_program fp_swab; + size_t size; + u_int flen; + int need_upgrade; +#ifdef COMPAT_FREEBSD32 switch (cmd) { case BIOCSETF32: case BIOCSETWF32: @@ -1550,73 +1748,86 @@ bpf_setf(struct bpf_d *d, struct bpf_pro break; } #endif - if (cmd == BIOCSETWF) { - old = d->bd_wfilter; - wfilter = 1; -#ifdef BPF_JITTER - ofunc = NULL; -#endif - } else { - wfilter = 0; - old = d->bd_rfilter; + + fcode = NULL; #ifdef BPF_JITTER - ofunc = d->bd_bfilter; + jfunc = ofunc = NULL; #endif - } - if (fp->bf_insns == NULL) { - if (fp->bf_len != 0) + need_upgrade = 0; + + /* + * Check new filter validness before acquiring any locks. + * Allocate memory for new filter, if needed. + */ + flen = fp->bf_len; + if (flen > bpf_maxinsns || (fp->bf_insns == NULL && flen != 0)) + return (EINVAL); + size = flen * sizeof(*fp->bf_insns); + if (size > 0) { + /* We're setting up new filter. Copy and check actual data. 
*/ + fcode = malloc(size, M_BPF, M_WAITOK); + if (copyin(fp->bf_insns, fcode, size) != 0 || + !bpf_validate(fcode, flen)) { + free(fcode, M_BPF); return (EINVAL); - BPFD_LOCK(d); - if (wfilter) - d->bd_wfilter = NULL; - else { - d->bd_rfilter = NULL; -#ifdef BPF_JITTER - d->bd_bfilter = NULL; -#endif - if (cmd == BIOCSETF) - reset_d(d); } - BPFD_UNLOCK(d); - if (old != NULL) - free((caddr_t)old, M_BPF); #ifdef BPF_JITTER - if (ofunc != NULL) - bpf_destroy_jit_filter(ofunc); + /* Filter is copied inside fcode and is perfectly valid. */ + jfunc = bpf_jitter(fcode, flen); #endif - return (0); } - flen = fp->bf_len; - if (flen > bpf_maxinsns) - return (EINVAL); - size = flen * sizeof(*fp->bf_insns); - fcode = (struct bpf_insn *)malloc(size, M_BPF, M_WAITOK); - if (copyin((caddr_t)fp->bf_insns, (caddr_t)fcode, size) == 0 && - bpf_validate(fcode, (int)flen)) { - BPFD_LOCK(d); - if (wfilter) - d->bd_wfilter = fcode; - else { - d->bd_rfilter = fcode; + BPF_LOCK(); + + /* + * Set up new filter. + * Protect filter change by interface lock. + * Additionally, we are protected by global lock here. + */ + if (d->bd_bif != NULL) + BPFIF_WLOCK(d->bd_bif); + BPFD_LOCK(d); + if (cmd == BIOCSETWF) { + old = d->bd_wfilter; + d->bd_wfilter = fcode; + } else { + old = d->bd_rfilter; + d->bd_rfilter = fcode; #ifdef BPF_JITTER - d->bd_bfilter = bpf_jitter(fcode, flen); + ofunc = d->bd_bfilter; + d->bd_bfilter = jfunc; #endif - if (cmd == BIOCSETF) - reset_d(d); + if (cmd == BIOCSETF) + reset_d(d); + + if (fcode != NULL) { + /* + * Do not require upgrade by first BIOCSETF + * (used to set snaplen) by pcap_open_live(). 
+ */ + if (d->bd_writer != 0 && --d->bd_writer == 0) + need_upgrade = 1; + CTR4(KTR_NET, "%s: filter function set by pid %d, " + "bd_writer counter %d, need_upgrade %d", + __func__, d->bd_pid, d->bd_writer, need_upgrade); } - BPFD_UNLOCK(d); - if (old != NULL) - free((caddr_t)old, M_BPF); + } + BPFD_UNLOCK(d); + if (d->bd_bif != NULL) + BPFIF_WUNLOCK(d->bd_bif); + if (old != NULL) + free(old, M_BPF); #ifdef BPF_JITTER - if (ofunc != NULL) - bpf_destroy_jit_filter(ofunc); + if (ofunc != NULL) + bpf_destroy_jit_filter(ofunc); #endif - return (0); - } - free((caddr_t)fcode, M_BPF); - return (EINVAL); + /* Move d to active readers list. */ + if (need_upgrade) + bpf_upgraded(d); + + BPF_UNLOCK(); + return (0); } /* @@ -1630,28 +1841,30 @@ bpf_setif(struct bpf_d *d, struct ifreq struct bpf_if *bp; struct ifnet *theywant; + BPF_LOCK_ASSERT(); + theywant = ifunit(ifr->ifr_name); if (theywant == NULL || theywant->if_bpf == NULL) return (ENXIO); bp = theywant->if_bpf; + /* Check if interface is not being detached from BPF */ + BPFIF_RLOCK(bp); + if (bp->flags & BPFIF_FLAG_DYING) { + BPFIF_RUNLOCK(bp); + return (ENXIO); + } + BPFIF_RUNLOCK(bp); + /* * Behavior here depends on the buffering model. If we're using * kernel memory buffers, then we can allocate them here. If we're * using zero-copy, then the user process must have registered * buffers by the time we get here. If not, return an error. - * - * XXXRW: There are locking issues here with multi-threaded use: what - * if two threads try to set the interface at once? */ switch (d->bd_bufmode) { case BPF_BUFMODE_BUFFER: - if (d->bd_sbuf == NULL) - bpf_buffer_alloc(d); - KASSERT(d->bd_sbuf != NULL, ("bpf_setif: bd_sbuf NULL")); - break; - case BPF_BUFMODE_ZBUF: if (d->bd_sbuf == NULL) return (EINVAL); @@ -1660,15 +1873,8 @@ bpf_setif(struct bpf_d *d, struct ifreq default: panic("bpf_setif: bufmode %d", d->bd_bufmode); } - if (bp != d->bd_bif) { - if (d->bd_bif) - /* - * Detach if attached to something else. 
- */ - bpf_detachd(d); - + if (bp != d->bd_bif) bpf_attachd(d, bp); - } BPFD_LOCK(d); reset_d(d); BPFD_UNLOCK(d); @@ -1696,7 +1902,7 @@ bpfpoll(struct cdev *dev, int events, st */ revents = events & (POLLOUT | POLLWRNORM); BPFD_LOCK(d); - d->bd_pid = td->td_proc->p_pid; + BPF_PID_REFRESH(d, td); if (events & (POLLIN | POLLRDNORM)) { if (bpf_ready(d)) revents |= events & (POLLIN | POLLRDNORM); @@ -1731,7 +1937,7 @@ bpfkqfilter(struct cdev *dev, struct kno * Refresh PID associated with this descriptor. */ BPFD_LOCK(d); - d->bd_pid = curthread->td_proc->p_pid; + BPF_PID_REFRESH_CUR(d); kn->kn_fop = &bpfread_filtops; kn->kn_hook = d; knlist_add(&d->bd_sel.si_note, kn, 1); @@ -1829,9 +2035,19 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, int gottime; gottime = BPF_TSTAMP_NONE; - BPFIF_LOCK(bp); + + BPFIF_RLOCK(bp); + LIST_FOREACH(d, &bp->bif_dlist, bd_next) { - BPFD_LOCK(d); + /* + * We are not using any locks for d here because: + * 1) any filter change is protected by interface + * write lock + * 2) destroying/detaching d is protected by interface + * write lock, too + */ + + /* XXX: Do not protect counter for the sake of performance. */ ++d->bd_rcount; /* * NB: We dont call BPF_CHECK_DIRECTION() here since there is no @@ -1847,6 +2063,11 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, #endif slen = bpf_filter(d->bd_rfilter, pkt, pktlen, pktlen); if (slen != 0) { + /* + * Filter matches. Let's to acquire write lock. + */ + BPFD_LOCK(d); + d->bd_fcount++; if (gottime < bpf_ts_quality(d->bd_tstamp)) gottime = bpf_gettime(&bt, d->bd_tstamp, NULL); @@ -1855,10 +2076,10 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, #endif catchpacket(d, pkt, pktlen, slen, bpf_append_bytes, &bt); + BPFD_UNLOCK(d); } - BPFD_UNLOCK(d); } - BPFIF_UNLOCK(bp); + BPFIF_RUNLOCK(bp); } #define BPF_CHECK_DIRECTION(d, r, i) \ @@ -1867,6 +2088,7 @@ bpf_tap(struct bpf_if *bp, u_char *pkt, /* * Incoming linkage from device drivers, when packet is in an mbuf chain. + * Locking model is explained in bpf_tap(). 
*/ void bpf_mtap(struct bpf_if *bp, struct mbuf *m) @@ -1886,13 +2108,13 @@ bpf_mtap(struct bpf_if *bp, struct mbuf } pktlen = m_length(m, NULL); - gottime = BPF_TSTAMP_NONE; - BPFIF_LOCK(bp); + + BPFIF_RLOCK(bp); + LIST_FOREACH(d, &bp->bif_dlist, bd_next) { if (BPF_CHECK_DIRECTION(d, m->m_pkthdr.rcvif, bp->bif_ifp)) continue; - BPFD_LOCK(d); ++d->bd_rcount; #ifdef BPF_JITTER bf = bpf_jitter_enable != 0 ? d->bd_bfilter : NULL; @@ -1903,6 +2125,8 @@ bpf_mtap(struct bpf_if *bp, struct mbuf #endif slen = bpf_filter(d->bd_rfilter, (u_char *)m, pktlen, 0); if (slen != 0) { + BPFD_LOCK(d); + d->bd_fcount++; if (gottime < bpf_ts_quality(d->bd_tstamp)) gottime = bpf_gettime(&bt, d->bd_tstamp, m); @@ -1911,10 +2135,10 @@ bpf_mtap(struct bpf_if *bp, struct mbuf #endif catchpacket(d, (u_char *)m, pktlen, slen, bpf_append_mbuf, &bt); + BPFD_UNLOCK(d); } - BPFD_UNLOCK(d); } - BPFIF_UNLOCK(bp); + BPFIF_RUNLOCK(bp); } /* @@ -1948,14 +2172,17 @@ bpf_mtap2(struct bpf_if *bp, void *data, pktlen += dlen; gottime = BPF_TSTAMP_NONE; - BPFIF_LOCK(bp); + + BPFIF_RLOCK(bp); + LIST_FOREACH(d, &bp->bif_dlist, bd_next) { if (BPF_CHECK_DIRECTION(d, m->m_pkthdr.rcvif, bp->bif_ifp)) continue; - BPFD_LOCK(d); ++d->bd_rcount; slen = bpf_filter(d->bd_rfilter, (u_char *)&mb, pktlen, 0); if (slen != 0) { + BPFD_LOCK(d); + d->bd_fcount++; if (gottime < bpf_ts_quality(d->bd_tstamp)) gottime = bpf_gettime(&bt, d->bd_tstamp, m); @@ -1964,10 +2191,10 @@ bpf_mtap2(struct bpf_if *bp, void *data, #endif catchpacket(d, (u_char *)&mb, pktlen, slen, bpf_append_mbuf, &bt); + BPFD_UNLOCK(d); } - BPFD_UNLOCK(d); } - BPFIF_UNLOCK(bp); + BPFIF_RUNLOCK(bp); } #undef BPF_CHECK_DIRECTION @@ -2206,7 +2433,7 @@ bpf_freed(struct bpf_d *d) } if (d->bd_wfilter != NULL) free((caddr_t)d->bd_wfilter, M_BPF); - mtx_destroy(&d->bd_mtx); + mtx_destroy(&d->bd_lock); } /* @@ -2236,15 +2463,16 @@ bpfattach2(struct ifnet *ifp, u_int dlt, panic("bpfattach"); LIST_INIT(&bp->bif_dlist); + LIST_INIT(&bp->bif_wlist); bp->bif_ifp = ifp; 
bp->bif_dlt = dlt; - mtx_init(&bp->bif_mtx, "bpf interface lock", NULL, MTX_DEF); + rw_init(&bp->bif_lock, "bpf interface lock"); KASSERT(*driverp == NULL, ("bpfattach2: driverp already initialized")); *driverp = bp; - mtx_lock(&bpf_mtx); + BPF_LOCK(); LIST_INSERT_HEAD(&bpf_iflist, bp, bif_next); - mtx_unlock(&bpf_mtx); + BPF_UNLOCK(); bp->bif_hdrlen = hdrlen; @@ -2253,10 +2481,9 @@ bpfattach2(struct ifnet *ifp, u_int dlt, } /* - * Detach bpf from an interface. This involves detaching each descriptor - * associated with the interface, and leaving bd_bif NULL. Notify each - * descriptor as it's detached so that any sleepers wake up and get - * ENXIO. + * Detach bpf from an interface. This involves detaching each descriptor + * associated with the interface. Notify each descriptor as it's detached + * so that any sleepers wake up and get ENXIO. */ void bpfdetach(struct ifnet *ifp) @@ -2269,31 +2496,45 @@ bpfdetach(struct ifnet *ifp) ndetached = 0; #endif *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***