Date:      Thu, 22 Mar 2018 14:16:06 +0000
From:      Ruslan Bukin <ruslan.bukin@cl.cam.ac.uk>
To:        "Jonathan T. Looney" <jtl@FreeBSD.org>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r331347 - in head: etc/mtree include sys/conf sys/dev/tcp_log sys/kern sys/netinet usr.bin/netstat
Message-ID:  <20180322141606.GA4972@bsdpad.com>
In-Reply-To: <201803220940.w2M9e8T4067719@repo.freebsd.org>
References:  <201803220940.w2M9e8T4067719@repo.freebsd.org>

I don't think we have atomic_fetchadd_64 on mips32.
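
One possible shape for a workaround, as a minimal user-space sketch only: where no native 64-bit fetch-and-add exists, the update could be serialized with a lock. The names fallback_fetchadd_64 and fetchadd64_lock are hypothetical, and pthread stands in for whatever kernel lock would actually be used; this is not a proposed patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical fallback; not part of r331347 or of machine/atomic.h. */
static pthread_mutex_t fetchadd64_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Add v to *p and return the value *p held before the addition,
 * which is the return convention of the atomic_fetchadd family.
 */
static uint64_t
fallback_fetchadd_64(volatile uint64_t *p, uint64_t v)
{
	uint64_t old;

	assert(p != NULL);
	pthread_mutex_lock(&fetchadd64_lock);
	old = *p;
	*p = old + v;
	pthread_mutex_unlock(&fetchadd64_lock);
	return (old);
}
```

A single global lock is the simplest choice and would likely be fine for a counter that is only bumped at connection setup; anything hotter would want a different design.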

Ruslan

On Thu, Mar 22, 2018 at 09:40:08AM +0000, Jonathan T. Looney wrote:
> Author: jtl
> Date: Thu Mar 22 09:40:08 2018
> New Revision: 331347
> URL: https://svnweb.freebsd.org/changeset/base/331347
> 
> Log:
>   Add the "TCP Blackbox Recorder" which we discussed at the developer
>   summits at BSDCan and BSDCam in 2017.
>   
>   The TCP Blackbox Recorder allows you to capture events on a TCP connection
>   in a ring buffer. It stores metadata with the event. It optionally stores
>   the TCP header associated with an event (if the event is associated with a
>   packet) and also optionally stores information on the sockets.
>   
>   It supports setting a log ID on a TCP connection and using this to correlate
>   multiple connections that share a common log ID.
>   
>   You can log connections in different modes. If you are doing a coordinated
>   test with a particular connection, you may tell the system to put it in
>   mode 4 (continuous dump). Or, if you just want to monitor for errors, you
>   can put it in mode 1 (ring buffer) and dump all the ring buffers associated
>   with the connection ID when we receive an error signal for that connection
>   ID. You can set a default mode that will be applied to a particular ratio
>   of incoming connections. You can also manually set a mode using a socket
>   option.
>   
>   This commit includes only basic probes. rrs@ has added quite an abundance
>   of probes in his TCP development work. He plans to commit those soon.
>   
>   There are user-space programs which we plan to commit as ports. These read
>   the data from the log device and output pcapng files, and then let you
>   analyze the data (and metadata) in the pcapng files.
>   
>   Reviewed by:	gnn (previous version)
>   Obtained from:	Netflix, Inc.
>   Relnotes:	yes
>   Differential Revision:	https://reviews.freebsd.org/D11085
> 
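For readers unfamiliar with the ring-buffer mode mentioned in the log message, here is a small illustrative sketch of the general idea: keep the most recent events and overwrite the oldest once full. It is an assumption-laden toy, not how tcp_log_buf.c stores events; event_ring, ring_record, ring_dump and RING_SLOTS are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	RING_SLOTS	4	/* hypothetical capacity */

struct event_ring {
	uint32_t	slots[RING_SLOTS];	/* stand-in for event records */
	size_t		next;			/* slot the next event goes in */
	size_t		count;			/* valid events, <= RING_SLOTS */
};

/* Record an event, overwriting the oldest one once the ring is full. */
static void
ring_record(struct event_ring *r, uint32_t ev)
{
	assert(r != NULL);
	r->slots[r->next] = ev;
	r->next = (r->next + 1) % RING_SLOTS;
	if (r->count < RING_SLOTS)
		r->count++;
}

/* Copy the retained events, oldest first, into out; return how many. */
static size_t
ring_dump(const struct event_ring *r, uint32_t *out)
{
	size_t i, start;

	start = (r->next + RING_SLOTS - r->count) % RING_SLOTS;
	for (i = 0; i < r->count; i++)
		out[i] = r->slots[(start + i) % RING_SLOTS];
	return (r->count);
}
```

Dumping only when an error signal arrives, as the log describes for mode 1, would then amount to calling something like ring_dump at that point.
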
> Added:
>   head/sys/dev/tcp_log/
>   head/sys/dev/tcp_log/tcp_log_dev.c   (contents, props changed)
>   head/sys/dev/tcp_log/tcp_log_dev.h   (contents, props changed)
>   head/sys/netinet/tcp_log_buf.c   (contents, props changed)
>   head/sys/netinet/tcp_log_buf.h   (contents, props changed)
> Modified:
>   head/etc/mtree/BSD.include.dist
>   head/include/Makefile
>   head/sys/conf/files
>   head/sys/kern/subr_witness.c
>   head/sys/netinet/tcp.h
>   head/sys/netinet/tcp_input.c
>   head/sys/netinet/tcp_output.c
>   head/sys/netinet/tcp_subr.c
>   head/sys/netinet/tcp_timer.c
>   head/sys/netinet/tcp_usrreq.c
>   head/sys/netinet/tcp_var.h
>   head/usr.bin/netstat/inet.c
>   head/usr.bin/netstat/main.c
>   head/usr.bin/netstat/netstat.1
>   head/usr.bin/netstat/netstat.h
> 
> Modified: head/etc/mtree/BSD.include.dist
> ==============================================================================
> --- head/etc/mtree/BSD.include.dist	Thu Mar 22 08:32:39 2018	(r331346)
> +++ head/etc/mtree/BSD.include.dist	Thu Mar 22 09:40:08 2018	(r331347)
> @@ -158,6 +158,8 @@
>          ..
>          speaker
>          ..
> +        tcp_log
> +        ..
>          usb
>          ..
>          vkbd
> 
> Modified: head/include/Makefile
> ==============================================================================
> --- head/include/Makefile	Thu Mar 22 08:32:39 2018	(r331346)
> +++ head/include/Makefile	Thu Mar 22 09:40:08 2018	(r331347)
> @@ -47,7 +47,7 @@ LSUBDIRS=	cam/ata cam/mmc cam/nvme cam/scsi \
>  	dev/hwpmc dev/hyperv \
>  	dev/ic dev/iicbus dev/io dev/lmc dev/mfi dev/mmc dev/nvme \
>  	dev/ofw dev/pbio dev/pci ${_dev_powermac_nvram} dev/ppbus dev/smbus \
> -	dev/speaker dev/vkbd dev/wi \
> +	dev/speaker dev/tcp_log dev/vkbd dev/wi \
>  	fs/devfs fs/fdescfs fs/msdosfs fs/nandfs fs/nfs fs/nullfs \
>  	fs/procfs fs/smbfs fs/udf fs/unionfs \
>  	geom/cache geom/concat geom/eli geom/gate geom/journal geom/label \
> 
> Modified: head/sys/conf/files
> ==============================================================================
> --- head/sys/conf/files	Thu Mar 22 08:32:39 2018	(r331346)
> +++ head/sys/conf/files	Thu Mar 22 09:40:08 2018	(r331347)
> @@ -3161,6 +3161,7 @@ dev/syscons/star/star_saver.c	optional star_saver
>  dev/syscons/syscons.c		optional sc
>  dev/syscons/sysmouse.c		optional sc
>  dev/syscons/warp/warp_saver.c	optional warp_saver
> +dev/tcp_log/tcp_log_dev.c	optional inet | inet6
>  dev/tdfx/tdfx_linux.c		optional tdfx_linux tdfx compat_linux
>  dev/tdfx/tdfx_pci.c		optional tdfx pci
>  dev/ti/if_ti.c			optional ti pci
> @@ -4309,6 +4310,7 @@ netinet/tcp_debug.c		optional tcpdebug
>  netinet/tcp_fastopen.c		optional inet tcp_rfc7413 | inet6 tcp_rfc7413
>  netinet/tcp_hostcache.c		optional inet | inet6
>  netinet/tcp_input.c		optional inet | inet6
> +netinet/tcp_log_buf.c		optional inet | inet6
>  netinet/tcp_lro.c		optional inet | inet6
>  netinet/tcp_output.c		optional inet | inet6
>  netinet/tcp_offload.c		optional tcp_offload inet | tcp_offload inet6
> 
> Added: head/sys/dev/tcp_log/tcp_log_dev.c
> ==============================================================================
> --- /dev/null	00:00:00 1970	(empty, because file is newly added)
> +++ head/sys/dev/tcp_log/tcp_log_dev.c	Thu Mar 22 09:40:08 2018	(r331347)
> @@ -0,0 +1,521 @@
> +/*-
> + * SPDX-License-Identifier: BSD-2-Clause-FreeBSD
> + *
> + * Copyright (c) 2016-2017
> + *	Netflix Inc.  All rights reserved.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + *
> + */
> +
> +#include <sys/cdefs.h>
> +__FBSDID("$FreeBSD$");
> +
> +#include <sys/param.h>
> +#include <sys/conf.h>
> +#include <sys/fcntl.h>
> +#include <sys/filio.h>
> +#include <sys/kernel.h>
> +#include <sys/lock.h>
> +#include <sys/malloc.h>
> +#include <sys/module.h>
> +#include <sys/poll.h>
> +#include <sys/queue.h>
> +#include <sys/refcount.h>
> +#include <sys/mutex.h>
> +#include <sys/selinfo.h>
> +#include <sys/socket.h>
> +#include <sys/socketvar.h>
> +#include <sys/sysctl.h>
> +#include <sys/tree.h>
> +#include <sys/uio.h>
> +#include <machine/atomic.h>
> +#include <sys/counter.h>
> +
> +#include <dev/tcp_log/tcp_log_dev.h>
> +
> +#ifdef TCPLOG_DEBUG_COUNTERS
> +extern counter_u64_t tcp_log_que_read;
> +extern counter_u64_t tcp_log_que_freed;
> +#endif
> +
> +static struct cdev *tcp_log_dev;
> +static struct selinfo tcp_log_sel;
> +
> +static struct log_queueh tcp_log_dev_queue_head = STAILQ_HEAD_INITIALIZER(tcp_log_dev_queue_head);
> +static struct log_infoh tcp_log_dev_reader_head = STAILQ_HEAD_INITIALIZER(tcp_log_dev_reader_head);
> +
> +MALLOC_DEFINE(M_TCPLOGDEV, "tcp_log_dev", "TCP log device data structures");
> +
> +static int	tcp_log_dev_listeners = 0;
> +
> +static struct mtx tcp_log_dev_queue_lock;
> +
> +#define	TCP_LOG_DEV_QUEUE_LOCK()	mtx_lock(&tcp_log_dev_queue_lock)
> +#define	TCP_LOG_DEV_QUEUE_UNLOCK()	mtx_unlock(&tcp_log_dev_queue_lock)
> +#define	TCP_LOG_DEV_QUEUE_LOCK_ASSERT()	mtx_assert(&tcp_log_dev_queue_lock, MA_OWNED)
> +#define	TCP_LOG_DEV_QUEUE_UNLOCK_ASSERT() mtx_assert(&tcp_log_dev_queue_lock, MA_NOTOWNED)
> +#define	TCP_LOG_DEV_QUEUE_REF(tldq)	refcount_acquire(&((tldq)->tldq_refcnt))
> +#define	TCP_LOG_DEV_QUEUE_UNREF(tldq)	refcount_release(&((tldq)->tldq_refcnt))
> +
> +static void	tcp_log_dev_clear_refcount(struct tcp_log_dev_queue *entry);
> +static void	tcp_log_dev_clear_cdevpriv(void *data);
> +static int	tcp_log_dev_open(struct cdev *dev __unused, int flags,
> +    int devtype __unused, struct thread *td __unused);
> +static int	tcp_log_dev_write(struct cdev *dev __unused,
> +    struct uio *uio __unused, int flags __unused);
> +static int	tcp_log_dev_read(struct cdev *dev __unused, struct uio *uio,
> +    int flags __unused);
> +static int	tcp_log_dev_ioctl(struct cdev *dev __unused, u_long cmd,
> +    caddr_t data, int fflag __unused, struct thread *td __unused);
> +static int	tcp_log_dev_poll(struct cdev *dev __unused, int events,
> +    struct thread *td);
> +
> +
> +enum tcp_log_dev_queue_lock_state {
> +	QUEUE_UNLOCKED = 0,
> +	QUEUE_LOCKED,
> +};
> +
> +static struct cdevsw tcp_log_cdevsw = {
> +	.d_version =	D_VERSION,
> +	.d_read =	tcp_log_dev_read,
> +	.d_open =	tcp_log_dev_open,
> +	.d_write =	tcp_log_dev_write,
> +	.d_poll =	tcp_log_dev_poll,
> +	.d_ioctl =	tcp_log_dev_ioctl,
> +#ifdef NOTYET
> +	.d_mmap =	tcp_log_dev_mmap,
> +#endif
> +	.d_name =	"tcp_log",
> +};
> +
> +static __inline void
> +tcp_log_dev_queue_validate_lock(int lockstate)
> +{
> +
> +#ifdef INVARIANTS
> +	switch (lockstate) {
> +	case QUEUE_LOCKED:
> +		TCP_LOG_DEV_QUEUE_LOCK_ASSERT();
> +		break;
> +	case QUEUE_UNLOCKED:
> +		TCP_LOG_DEV_QUEUE_UNLOCK_ASSERT();
> +		break;
> +	default:
> +		kassert_panic("%s:%d: unknown queue lock state", __func__,
> +		    __LINE__);
> +	}
> +#endif
> +}
> +
> +/*
> + * Clear the refcount. If appropriate, it will remove the entry from the
> + * queue and call the destructor.
> + *
> + * This must be called with the queue lock held.
> + */
> +static void
> +tcp_log_dev_clear_refcount(struct tcp_log_dev_queue *entry)
> +{
> +
> +	KASSERT(entry != NULL, ("%s: called with NULL entry", __func__));
> +
> +	TCP_LOG_DEV_QUEUE_LOCK_ASSERT();
> +
> +	if (TCP_LOG_DEV_QUEUE_UNREF(entry)) {
> +#ifdef TCPLOG_DEBUG_COUNTERS
> +		counter_u64_add(tcp_log_que_freed, 1);
> +#endif
> +		/* Remove the entry from the queue and call the destructor. */
> +		STAILQ_REMOVE(&tcp_log_dev_queue_head, entry, tcp_log_dev_queue,
> +		    tldq_queue);
> +		(*entry->tldq_dtor)(entry);
> +	}
> +}
> +
> +static void
> +tcp_log_dev_clear_cdevpriv(void *data)
> +{
> +	struct tcp_log_dev_info *priv;
> +	struct tcp_log_dev_queue *entry, *entry_tmp;
> +
> +	priv = (struct tcp_log_dev_info *)data;
> +	if (priv == NULL)
> +		return;
> +
> +	/*
> +	 * Lock the queue and drop our references. We hold references to all
> +	 * the entries starting with tldi_head (or, if tldi_head == NULL, all
> +	 * entries in the queue).
> +	 * 
> +	 * Because we don't want anyone adding additional things to the queue
> +	 * while we are doing this, we lock the queue.
> +	 */
> +	TCP_LOG_DEV_QUEUE_LOCK();
> +	if (priv->tldi_head != NULL) {
> +		entry = priv->tldi_head;
> +		STAILQ_FOREACH_FROM_SAFE(entry, &tcp_log_dev_queue_head,
> +		    tldq_queue, entry_tmp) {
> +			tcp_log_dev_clear_refcount(entry);
> +		}
> +	}
> +	tcp_log_dev_listeners--;
> +	KASSERT(tcp_log_dev_listeners >= 0,
> +	    ("%s: tcp_log_dev_listeners is unexpectedly negative", __func__));
> +	STAILQ_REMOVE(&tcp_log_dev_reader_head, priv, tcp_log_dev_info,
> +	    tldi_list);
> +	TCP_LOG_DEV_QUEUE_LOCK_ASSERT();
> +	TCP_LOG_DEV_QUEUE_UNLOCK();
> +	free(priv, M_TCPLOGDEV);
> +}
> +
> +static int
> +tcp_log_dev_open(struct cdev *dev __unused, int flags, int devtype __unused,
> +    struct thread *td __unused)
> +{
> +	struct tcp_log_dev_info *priv;
> +	struct tcp_log_dev_queue *entry;
> +	int rv;
> +
> +	/*
> +	 * Ideally, we shouldn't see these because of file system
> +	 * permissions.
> +	 */
> +	if (flags & (FWRITE | FEXEC | FAPPEND | O_TRUNC))
> +		return (ENODEV);
> +
> +	/* Allocate space to hold information about where we are. */
> +	priv = malloc(sizeof(struct tcp_log_dev_info), M_TCPLOGDEV,
> +	    M_ZERO | M_WAITOK);
> +
> +	/* Stash the private data away. */
> +	rv = devfs_set_cdevpriv((void *)priv, tcp_log_dev_clear_cdevpriv);
> +	if (!rv) {
> +		/*
> +		 * Increase the listener count, add this reader to the list, and
> +		 * take references on all current queues.
> +		 */
> +		TCP_LOG_DEV_QUEUE_LOCK();
> +		tcp_log_dev_listeners++;
> +		STAILQ_INSERT_HEAD(&tcp_log_dev_reader_head, priv, tldi_list);
> +		priv->tldi_head = STAILQ_FIRST(&tcp_log_dev_queue_head);
> +		if (priv->tldi_head != NULL)
> +			priv->tldi_cur = priv->tldi_head->tldq_buf;
> +		STAILQ_FOREACH(entry, &tcp_log_dev_queue_head, tldq_queue)
> +			TCP_LOG_DEV_QUEUE_REF(entry);
> +		TCP_LOG_DEV_QUEUE_UNLOCK();
> +	} else {
> +		/* Free the entry. */
> +		free(priv, M_TCPLOGDEV);
> +	}
> +	return (rv);
> +}
> +
> +static int
> +tcp_log_dev_write(struct cdev *dev __unused, struct uio *uio __unused,
> +    int flags __unused)
> +{
> +
> +	return (ENODEV);
> +}
> +
> +static __inline void
> +tcp_log_dev_rotate_bufs(struct tcp_log_dev_info *priv, int *lockstate)
> +{
> +	struct tcp_log_dev_queue *entry;
> +
> +	KASSERT(priv->tldi_head != NULL,
> +	    ("%s:%d: priv->tldi_head unexpectedly NULL",
> +	    __func__, __LINE__));
> +	KASSERT(priv->tldi_head->tldq_buf == priv->tldi_cur,
> +	    ("%s:%d: buffer mismatch (%p vs %p)",
> +	    __func__, __LINE__, priv->tldi_head->tldq_buf,
> +	    priv->tldi_cur));
> +	tcp_log_dev_queue_validate_lock(*lockstate);
> +
> +	if (*lockstate == QUEUE_UNLOCKED) {
> +		TCP_LOG_DEV_QUEUE_LOCK();
> +		*lockstate = QUEUE_LOCKED;
> +	}
> +	entry = priv->tldi_head;
> +	priv->tldi_head = STAILQ_NEXT(entry, tldq_queue);
> +	tcp_log_dev_clear_refcount(entry);
> +	priv->tldi_cur = NULL;
> +}
> +
> +static int
> +tcp_log_dev_read(struct cdev *dev __unused, struct uio *uio, int flags)
> +{
> +	struct tcp_log_common_header *buf;
> +	struct tcp_log_dev_info *priv;
> +	struct tcp_log_dev_queue *entry;
> +	ssize_t len;
> +	int lockstate, rv;
> +
> +	/* Get our private info. */
> +	rv = devfs_get_cdevpriv((void **)&priv);
> +	if (rv)
> +		return (rv);
> +
> +	lockstate = QUEUE_UNLOCKED;
> +
> +	/* Do we need to get a new buffer? */
> +	while (priv->tldi_cur == NULL ||
> +	    priv->tldi_cur->tlch_length <= priv->tldi_off) {
> +		/* Did we somehow forget to rotate? */
> +		KASSERT(priv->tldi_cur == NULL,
> +		    ("%s:%d: tldi_cur is unexpectedly non-NULL", __func__,
> +		    __LINE__));
> +		if (priv->tldi_cur != NULL)
> +			tcp_log_dev_rotate_bufs(priv, &lockstate);
> +
> +		/*
> +		 * Before we start looking at tldi_head, we need a lock on the
> +		 * queue to make sure tldi_head stays stable.
> +		 */
> +		if (lockstate == QUEUE_UNLOCKED) {
> +			TCP_LOG_DEV_QUEUE_LOCK();
> +			lockstate = QUEUE_LOCKED;
> +		}
> +
> +		/* We need the next buffer. Do we have one? */
> +		if (priv->tldi_head == NULL && (flags & FNONBLOCK)) {
> +			rv = EAGAIN;
> +			goto done;
> +		}
> +		if (priv->tldi_head == NULL) {
> +			/* Sleep and wait for more things we can read. */
> +			rv = mtx_sleep(&tcp_log_dev_listeners,
> +			    &tcp_log_dev_queue_lock, PCATCH, "tcplogdev", 0);
> +			if (rv)
> +				goto done;
> +			if (priv->tldi_head == NULL)
> +				continue;
> +		}
> +
> +		/*
> +		 * We have an entry to read. We want to try to create a
> +		 * buffer, if one doesn't already exist.
> +		 */
> +		entry = priv->tldi_head;
> +		if (entry->tldq_buf == NULL) {
> +			TCP_LOG_DEV_QUEUE_LOCK_ASSERT();
> +			buf = (*entry->tldq_xform)(entry);
> +			if (buf == NULL) {
> +				rv = EBUSY;
> +				goto done;
> +			}
> +			entry->tldq_buf = buf;
> +		}
> +
> +		priv->tldi_cur = entry->tldq_buf;
> +		priv->tldi_off = 0;
> +	}
> +
> +	/* Copy what we can from this buffer to the output buffer. */
> +	if (uio->uio_resid > 0) {
> +		/* Drop locks so we can take page faults. */
> +		if (lockstate == QUEUE_LOCKED)
> +			TCP_LOG_DEV_QUEUE_UNLOCK();
> +		lockstate = QUEUE_UNLOCKED;
> +
> +		KASSERT(priv->tldi_cur != NULL,
> +		    ("%s: priv->tldi_cur is unexpectedly NULL", __func__));
> +
> +		/* Copy as much as we can to this uio. */
> +		len = priv->tldi_cur->tlch_length - priv->tldi_off;
> +		if (len > uio->uio_resid)
> +			len = uio->uio_resid;
> +		rv = uiomove(((uint8_t *)priv->tldi_cur) + priv->tldi_off,
> +		    len, uio);
> +		if (rv != 0)
> +			goto done;
> +		priv->tldi_off += len;
> +#ifdef TCPLOG_DEBUG_COUNTERS
> +		counter_u64_add(tcp_log_que_read, len);
> +#endif
> +	}
> +	/* Are we done with this buffer? If so, find the next one. */
> +	if (priv->tldi_off >= priv->tldi_cur->tlch_length) {
> +		KASSERT(priv->tldi_off == priv->tldi_cur->tlch_length,
> +		    ("%s: offset (%ju) exceeds length (%ju)", __func__,
> +		    (uintmax_t)priv->tldi_off,
> +		    (uintmax_t)priv->tldi_cur->tlch_length));
> +		tcp_log_dev_rotate_bufs(priv, &lockstate);
> +	}
> +done:
> +	tcp_log_dev_queue_validate_lock(lockstate);
> +	if (lockstate == QUEUE_LOCKED)
> +		TCP_LOG_DEV_QUEUE_UNLOCK();
> +	return (rv);
> +}
> +
> +static int
> +tcp_log_dev_ioctl(struct cdev *dev __unused, u_long cmd, caddr_t data,
> +    int fflag __unused, struct thread *td __unused)
> +{
> +	struct tcp_log_dev_info *priv;
> +	int rv;
> +
> +	/* Get our private info. */
> +	rv = devfs_get_cdevpriv((void **)&priv);
> +	if (rv)
> +		return (rv);
> +
> +	/*
> +	 * Set things. Here, we are most concerned about the non-blocking I/O
> +	 * flag.
> +	 */
> +	rv = 0;
> +	switch (cmd) {
> +	case FIONBIO:
> +		break;
> +	case FIOASYNC:
> +		if (*(int *)data != 0)
> +			rv = EINVAL;
> +		break;
> +	default:
> +		rv = ENOIOCTL;
> +	}
> +	return (rv);
> +}
> +
> +static int
> +tcp_log_dev_poll(struct cdev *dev __unused, int events, struct thread *td)
> +{
> +	struct tcp_log_dev_info *priv;
> +	int revents;
> +
> +	/*
> +	 * Get our private info. If this fails, claim that all events are
> +	 * ready. That should prod the user to do something that will
> +	 * make the error evident to them.
> +	 */
> +	if (devfs_get_cdevpriv((void **)&priv))
> +		return (events);
> +
> +	revents = 0;
> +	if (events & (POLLIN | POLLRDNORM)) {
> +		/*
> +		 * We can (probably) read right now if we are partway through
> +		 * a buffer or if we are just about to start a buffer.
> +		 * Because we are going to read tldi_head, we should acquire
> +		 * a read lock on the queue.
> +		 */
> +		TCP_LOG_DEV_QUEUE_LOCK();
> +		if ((priv->tldi_head != NULL && priv->tldi_cur == NULL) ||
> +		    (priv->tldi_cur != NULL &&
> +		    priv->tldi_off < priv->tldi_cur->tlch_length))
> +			revents = events & (POLLIN | POLLRDNORM);
> +		else
> +			selrecord(td, &tcp_log_sel);
> +		TCP_LOG_DEV_QUEUE_UNLOCK();
> +	} else {
> +		/*
> +		 * It only makes sense to poll for reading. So, again, prod the
> +		 * user to do something that will make the error of their ways
> +		 * apparent.
> +		 */
> +		revents = events;
> +	}
> +	return (revents);
> +}
> +
> +int
> +tcp_log_dev_add_log(struct tcp_log_dev_queue *entry)
> +{
> +	struct tcp_log_dev_info *priv;
> +	int rv;
> +	bool wakeup_needed;
> +
> +	KASSERT(entry->tldq_buf != NULL || entry->tldq_xform != NULL,
> +	    ("%s: Called with both tldq_buf and tldq_xform set to NULL",
> +	    __func__));
> +	KASSERT(entry->tldq_dtor != NULL,
> +	    ("%s: Called with tldq_dtor set to NULL", __func__));
> +
> +	/* Get a lock on the queue. */
> +	TCP_LOG_DEV_QUEUE_LOCK();
> +
> +	/* If no one is listening, tell the caller to free the resources. */
> +	if (tcp_log_dev_listeners == 0) {
> +		rv = ENXIO;
> +		goto done;
> +	}
> +
> +	/* Add this to the end of the tailq. */
> +	STAILQ_INSERT_TAIL(&tcp_log_dev_queue_head, entry, tldq_queue);
> +
> +	/* Add references for all current listeners. */
> +	refcount_init(&entry->tldq_refcnt, tcp_log_dev_listeners);
> +
> +	/*
> +	 * If any listener is currently stuck on NULL, that means they are
> +	 * waiting. Point their head to this new entry.
> +	 */
> +	wakeup_needed = false;
> +	STAILQ_FOREACH(priv, &tcp_log_dev_reader_head, tldi_list)
> +		if (priv->tldi_head == NULL) {
> +			priv->tldi_head = entry;
> +			wakeup_needed = true;
> +		}
> +
> +	if (wakeup_needed) {
> +		selwakeup(&tcp_log_sel);
> +		wakeup(&tcp_log_dev_listeners);
> +	}
> +
> +	rv = 0;
> +
> +done:
> +	TCP_LOG_DEV_QUEUE_LOCK_ASSERT();
> +	TCP_LOG_DEV_QUEUE_UNLOCK();
> +	return (rv);
> +}
> +
> +static int
> +tcp_log_dev_modevent(module_t mod __unused, int type, void *data __unused)
> +{
> +
> +	/* TODO: Support intelligent unloading. */
> +	switch (type) {
> +	case MOD_LOAD:
> +		if (bootverbose)
> +			printf("tcp_log: tcp_log device\n");
> +		memset(&tcp_log_sel, 0, sizeof(tcp_log_sel));
> +		memset(&tcp_log_dev_queue_lock, 0, sizeof(struct mtx));
> +		mtx_init(&tcp_log_dev_queue_lock, "tcp_log dev",
> +			 "tcp_log device queues", MTX_DEF);
> +		tcp_log_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD,
> +		    &tcp_log_cdevsw, 0, NULL, UID_ROOT, GID_WHEEL, 0400,
> +		    "tcp_log");
> +		break;
> +	default:
> +		return (EOPNOTSUPP);
> +	}
> +
> +	return (0);
> +}
> +
> +DEV_MODULE(tcp_log_dev, tcp_log_dev_modevent, NULL);
> +MODULE_VERSION(tcp_log_dev, 1);
> 
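The reference counting in tcp_log_dev_add_log() and tcp_log_dev_clear_refcount() above can be summarized with a simplified user-space sketch: an entry starts with one reference per current listener, and whichever reader releases the last reference triggers the free. The struct and function names below are hypothetical stand-ins, and a plain integer is used on the assumption that callers hold the queue lock, as the kernel code asserts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct tcp_log_dev_queue. */
struct queue_entry {
	unsigned int	refcnt;	/* readers that still need this entry */
	bool		freed;	/* set where the real code calls tldq_dtor */
};

/* On enqueue, every current listener gets one reference. */
static void
entry_enqueue(struct queue_entry *e, unsigned int listeners)
{
	assert(listeners > 0);
	e->refcnt = listeners;
	e->freed = false;
}

/* A reader is done with the entry; the last one out frees it. */
static bool
entry_release(struct queue_entry *e)
{
	assert(e->refcnt > 0);
	if (--e->refcnt == 0) {
		e->freed = true;
		return (true);
	}
	return (false);
}
```

With two listeners, the first entry_release() returns false and the second returns true, which is the point at which the real code unlinks the entry and calls its destructor.
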
> Added: head/sys/dev/tcp_log/tcp_log_dev.h
> ==============================================================================
> --- /dev/null	00:00:00 1970	(empty, because file is newly added)
> +++ head/sys/dev/tcp_log/tcp_log_dev.h	Thu Mar 22 09:40:08 2018	(r331347)
> @@ -0,0 +1,88 @@
> +/*-
> + * SPDX-License-Identifier: BSD-2-Clause-FreeBSD
> + *
> + * Copyright (c) 2016
> + *	Netflix Inc.  All rights reserved.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + *
> + * $FreeBSD$
> + */
> +
> +#ifndef __tcp_log_dev_h__
> +#define	__tcp_log_dev_h__
> +
> +/*
> + * This is the common header for data streamed from the log device. All
> + * blocks of data need to start with this header.
> + */
> +struct tcp_log_common_header {
> +	uint32_t	tlch_version;	/* Version is specific to type. */
> +	uint32_t	tlch_type;	/* Type of entry(ies) that follow. */
> +	uint64_t	tlch_length;	/* Total length, including header. */
> +} __packed;
> +
> +#define	TCP_LOG_DEV_TYPE_BBR	1	/* black box recorder */
> +
> +#ifdef _KERNEL
> +/*
> + * This is a queue entry. All queue entries need to start with this structure
> + * so the common code can cast them to this structure; however, other modules
> + * are free to include additional data after this structure.
> + *
> + * The elements are explained here:
> + * tldq_queue: used by the common code to maintain this entry's position in the
> + *     queue.
> + * tldq_buf: should be NULL, or a pointer to a chunk of data. The data must be
> + *     as long as the common header indicates.
> + * tldq_xform: If tldq_buf is NULL, the code will call this to create the
> + *     tldq_buf object. The function should *not* directly modify tldq_buf,
> + *     but should return the buffer (which must meet the restrictions
> + *     indicated for tldq_buf).
> + * tldq_dtor: This function is called to free the queue entry. If tldq_buf is
> + *     not NULL, the dtor function must free that, too.
> + * tldq_refcnt: used by the common code to indicate how many readers still need
> + *     this data.
> + */
> +struct tcp_log_dev_queue {
> +	STAILQ_ENTRY(tcp_log_dev_queue) tldq_queue;
> +	struct tcp_log_common_header *tldq_buf;
> +	struct tcp_log_common_header *(*tldq_xform)(struct tcp_log_dev_queue *entry);
> +	void	(*tldq_dtor)(struct tcp_log_dev_queue *entry);
> +	volatile u_int tldq_refcnt;
> +};
> +
> +STAILQ_HEAD(log_queueh, tcp_log_dev_queue);
> +
> +struct tcp_log_dev_info {
> +	STAILQ_ENTRY(tcp_log_dev_info) tldi_list;
> +	struct tcp_log_dev_queue *tldi_head;
> +	struct tcp_log_common_header *tldi_cur;
> +	off_t			tldi_off;
> +};
> +STAILQ_HEAD(log_infoh, tcp_log_dev_info);
> +
> +
> +MALLOC_DECLARE(M_TCPLOGDEV);
> +int tcp_log_dev_add_log(struct tcp_log_dev_queue *entry);
> +#endif /* _KERNEL */
> +#endif /* !__tcp_log_dev_h__ */
> 
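Since every block read from the device starts with struct tcp_log_common_header and tlch_length covers the header too, a user-space consumer could walk a buffer roughly as sketched below. The struct layout is copied from the header above; count_log_blocks is a hypothetical helper, and the handling of short or partial blocks is my own assumption rather than anything the commit specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout copied from tcp_log_dev.h in the diff above. */
struct tcp_log_common_header {
	uint32_t	tlch_version;
	uint32_t	tlch_type;
	uint64_t	tlch_length;	/* Total length, including header. */
} __attribute__((__packed__));

/*
 * Hypothetical helper: count the complete blocks in buf. Returns -1 if
 * a header claims a length shorter than the header itself. A trailing
 * partial block is left uncounted.
 */
static int
count_log_blocks(const uint8_t *buf, size_t len)
{
	struct tcp_log_common_header hdr;
	size_t off = 0;
	int n = 0;

	while (len - off >= sizeof(hdr)) {
		/* memcpy avoids unaligned access into the byte buffer. */
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.tlch_length < sizeof(hdr))
			return (-1);
		if (hdr.tlch_length > len - off)
			break;
		off += (size_t)hdr.tlch_length;
		n++;
	}
	return (n);
}
```

A real reader would dispatch on tlch_type (for example TCP_LOG_DEV_TYPE_BBR) and tlch_version before interpreting the payload that follows each header.
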
> Modified: head/sys/kern/subr_witness.c
> ==============================================================================
> --- head/sys/kern/subr_witness.c	Thu Mar 22 08:32:39 2018	(r331346)
> +++ head/sys/kern/subr_witness.c	Thu Mar 22 09:40:08 2018	(r331347)
> @@ -640,6 +640,14 @@ static struct witness_order_list_entry order_lists[] =
>  	{ "db->db_mtx", &lock_class_sx },
>  	{ NULL, NULL },
>  	/*
> +	 * TCP log locks
> +	 */
> +	{ "TCP ID tree", &lock_class_rw },
> +	{ "tcp log id bucket", &lock_class_mtx_sleep },
> +	{ "tcpinp", &lock_class_rw },
> +	{ "TCP log expireq", &lock_class_mtx_sleep },
> +	{ NULL, NULL },
> +	/*
>  	 * spin locks
>  	 */
>  #ifdef SMP
> 
> Modified: head/sys/netinet/tcp.h
> ==============================================================================
> --- head/sys/netinet/tcp.h	Thu Mar 22 08:32:39 2018	(r331346)
> +++ head/sys/netinet/tcp.h	Thu Mar 22 09:40:08 2018	(r331347)
> @@ -168,6 +168,12 @@ struct tcphdr {
>  #define TCP_NOOPT	8	/* don't use TCP options */
>  #define TCP_MD5SIG	16	/* use MD5 digests (RFC2385) */
>  #define	TCP_INFO	32	/* retrieve tcp_info structure */
> +#define	TCP_LOG		34	/* configure event logging for connection */
> +#define	TCP_LOGBUF	35	/* retrieve event log for connection */
> +#define	TCP_LOGID	36	/* configure log ID to correlate connections */
> +#define	TCP_LOGDUMP	37	/* dump connection log events to device */
> +#define	TCP_LOGDUMPID	38	/* dump events from connections with same ID to
> +				   device */
>  #define	TCP_CONGESTION	64	/* get/set congestion control algorithm */
>  #define	TCP_CCALGOOPT	65	/* get/set cc algorithm specific options */
>  #define	TCP_KEEPINIT	128	/* N, time to establish connection */
> @@ -188,6 +194,9 @@ struct tcphdr {
>  #define	TCPI_OPT_WSCALE		0x04
>  #define	TCPI_OPT_ECN		0x08
>  #define	TCPI_OPT_TOE		0x10
> +
> +/* Maximum length of log ID. */
> +#define TCP_LOG_ID_LEN	64
>  
>  /*
>   * The TCP_INFO socket option comes from the Linux 2.6 TCP API, and permits
> 
> Modified: head/sys/netinet/tcp_input.c
> ==============================================================================
> --- head/sys/netinet/tcp_input.c	Thu Mar 22 08:32:39 2018	(r331346)
> +++ head/sys/netinet/tcp_input.c	Thu Mar 22 09:40:08 2018	(r331347)
> @@ -102,6 +102,7 @@ __FBSDID("$FreeBSD$");
>  #include <netinet6/nd6.h>
>  #include <netinet/tcp.h>
>  #include <netinet/tcp_fsm.h>
> +#include <netinet/tcp_log_buf.h>
>  #include <netinet/tcp_seq.h>
>  #include <netinet/tcp_timer.h>
>  #include <netinet/tcp_var.h>
> @@ -1592,6 +1593,8 @@ tcp_do_segment(struct mbuf *m, struct tcphdr *th, stru
>  	/* Save segment, if requested. */
>  	tcp_pcap_add(th, m, &(tp->t_inpkts));
>  #endif
> +	TCP_LOG_EVENT(tp, th, &so->so_rcv, &so->so_snd, TCP_LOG_IN, 0,
> +	    tlen, NULL, true);
>  
>  	if ((thflags & TH_SYN) && (thflags & TH_FIN) && V_drop_synfin) {
>  		if ((s = tcp_log_addrs(inc, th, NULL, NULL))) {
> 
> Added: head/sys/netinet/tcp_log_buf.c
> ==============================================================================
> --- /dev/null	00:00:00 1970	(empty, because file is newly added)
> +++ head/sys/netinet/tcp_log_buf.c	Thu Mar 22 09:40:08 2018	(r331347)
> @@ -0,0 +1,2480 @@
> +/*-
> + * SPDX-License-Identifier: BSD-2-Clause-FreeBSD
> + *
> + * Copyright (c) 2016-2018
> + *	Netflix Inc.  All rights reserved.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + *
> + */
> +
> +#include <sys/cdefs.h>
> +__FBSDID("$FreeBSD$");
> +
> +#include <sys/param.h>
> +#include <sys/kernel.h>
> +#include <sys/lock.h>
> +#include <sys/malloc.h>
> +#include <sys/mutex.h>
> +#include <sys/queue.h>
> +#include <sys/refcount.h>
> +#include <sys/rwlock.h>
> +#include <sys/socket.h>
> +#include <sys/socketvar.h>
> +#include <sys/sysctl.h>
> +#include <sys/tree.h>
> +#include <sys/counter.h>
> +
> +#include <dev/tcp_log/tcp_log_dev.h>
> +
> +#include <net/if.h>
> +#include <net/if_var.h>
> +#include <net/vnet.h>
> +
> +#include <netinet/in.h>
> +#include <netinet/in_pcb.h>
> +#include <netinet/in_var.h>
> +#include <netinet/tcp_var.h>
> +#include <netinet/tcp_log_buf.h>
> +
> +/* Default expiry time */
> +#define	TCP_LOG_EXPIRE_TIME	((sbintime_t)60 * SBT_1S)
> +
> +/* Max interval at which to run the expiry timer */
> +#define	TCP_LOG_EXPIRE_INTVL	((sbintime_t)5 * SBT_1S)
> +
> +bool	tcp_log_verbose;
> +static uma_zone_t tcp_log_bucket_zone, tcp_log_node_zone, tcp_log_zone;
> +static int	tcp_log_session_limit = TCP_LOG_BUF_DEFAULT_SESSION_LIMIT;
> +static uint32_t	tcp_log_version = TCP_LOG_BUF_VER;
> +RB_HEAD(tcp_log_id_tree, tcp_log_id_bucket);
> +static struct tcp_log_id_tree tcp_log_id_head;
> +static STAILQ_HEAD(, tcp_log_id_node) tcp_log_expireq_head =
> +    STAILQ_HEAD_INITIALIZER(tcp_log_expireq_head);
> +static struct mtx tcp_log_expireq_mtx;
> +static struct callout tcp_log_expireq_callout;
> +static uint64_t tcp_log_auto_ratio = 0;
> +static uint64_t tcp_log_auto_ratio_cur = 0;
> +static uint32_t tcp_log_auto_mode = TCP_LOG_STATE_TAIL;
> +static bool tcp_log_auto_all = false;
> +
> +RB_PROTOTYPE_STATIC(tcp_log_id_tree, tcp_log_id_bucket, tlb_rb, tcp_log_id_cmp)
> +
> +SYSCTL_NODE(_net_inet_tcp, OID_AUTO, bb, CTLFLAG_RW, 0, "TCP Black Box controls");
> +
> +SYSCTL_BOOL(_net_inet_tcp_bb, OID_AUTO, log_verbose, CTLFLAG_RW, &tcp_log_verbose,
> +    0, "Force verbose logging for TCP traces");
> +
> +SYSCTL_INT(_net_inet_tcp_bb, OID_AUTO, log_session_limit,
> +    CTLFLAG_RW, &tcp_log_session_limit, 0,
> +    "Maximum number of events maintained for each TCP session");
> +
> +SYSCTL_UMA_MAX(_net_inet_tcp_bb, OID_AUTO, log_global_limit, CTLFLAG_RW,
> +    &tcp_log_zone, "Maximum number of events maintained for all TCP sessions");
> +
> +SYSCTL_UMA_CUR(_net_inet_tcp_bb, OID_AUTO, log_global_entries, CTLFLAG_RD,
> +    &tcp_log_zone, "Current number of events maintained for all TCP sessions");
> +
> +SYSCTL_UMA_MAX(_net_inet_tcp_bb, OID_AUTO, log_id_limit, CTLFLAG_RW,
> +    &tcp_log_bucket_zone, "Maximum number of log IDs");
> +
> +SYSCTL_UMA_CUR(_net_inet_tcp_bb, OID_AUTO, log_id_entries, CTLFLAG_RD,
> +    &tcp_log_bucket_zone, "Current number of log IDs");
> +
> +SYSCTL_UMA_MAX(_net_inet_tcp_bb, OID_AUTO, log_id_tcpcb_limit, CTLFLAG_RW,
> +    &tcp_log_node_zone, "Maximum number of tcpcbs with log IDs");
> +
> +SYSCTL_UMA_CUR(_net_inet_tcp_bb, OID_AUTO, log_id_tcpcb_entries, CTLFLAG_RD,
> +    &tcp_log_node_zone, "Current number of tcpcbs with log IDs");
> +
> +SYSCTL_U32(_net_inet_tcp_bb, OID_AUTO, log_version, CTLFLAG_RD, &tcp_log_version,
> +    0, "Version of log formats exported");
> +
> +SYSCTL_U64(_net_inet_tcp_bb, OID_AUTO, log_auto_ratio, CTLFLAG_RW,
> +    &tcp_log_auto_ratio, 0, "Do auto capturing for 1 out of N sessions");
> +
> +SYSCTL_U32(_net_inet_tcp_bb, OID_AUTO, log_auto_mode, CTLFLAG_RW,
> +    &tcp_log_auto_mode, TCP_LOG_STATE_HEAD_AUTO,
> +    "Logging mode for auto-selected sessions (default is TCP_LOG_STATE_TAIL)");
> +
> +SYSCTL_BOOL(_net_inet_tcp_bb, OID_AUTO, log_auto_all, CTLFLAG_RW,
> +    &tcp_log_auto_all, false,
> +    "Auto-select from all sessions (rather than just those with IDs)");
> +
> +#ifdef TCPLOG_DEBUG_COUNTERS
> +counter_u64_t tcp_log_queued;
> +counter_u64_t tcp_log_que_fail1;
> +counter_u64_t tcp_log_que_fail2;
> +counter_u64_t tcp_log_que_fail3;
> +counter_u64_t tcp_log_que_fail4;
> +counter_u64_t tcp_log_que_fail5;
> +counter_u64_t tcp_log_que_copyout;
> +counter_u64_t tcp_log_que_read;
> +counter_u64_t tcp_log_que_freed;
> +
> +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, queued, CTLFLAG_RD,
> +    &tcp_log_queued, "Number of entries queued");
> +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail1, CTLFLAG_RD,
> +    &tcp_log_que_fail1, "Number of entries queued but fail 1");
> +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail2, CTLFLAG_RD,
> +    &tcp_log_que_fail2, "Number of entries queued but fail 2");
> +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail3, CTLFLAG_RD,
> +    &tcp_log_que_fail3, "Number of entries queued but fail 3");
> +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail4, CTLFLAG_RD,
> +    &tcp_log_que_fail4, "Number of entries queued but fail 4");
> +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail5, CTLFLAG_RD,
> +    &tcp_log_que_fail5, "Number of entries queued but fail 5");
> +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, copyout, CTLFLAG_RD,
> +    &tcp_log_que_copyout, "Number of entries copied out");
> +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, read, CTLFLAG_RD,
> +    &tcp_log_que_read, "Number of entries read from the queue");
> +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, freed, CTLFLAG_RD,
> +    &tcp_log_que_freed, "Number of entries freed after reading");
> +#endif
> +
> +#ifdef INVARIANTS
> +#define	TCPLOG_DEBUG_RINGBUF
> +#endif
> +
> +struct tcp_log_mem
> +{
> +	STAILQ_ENTRY(tcp_log_mem) tlm_queue;
> +	struct tcp_log_buffer	tlm_buf;
> +	struct tcp_log_verbose	tlm_v;
> +#ifdef TCPLOG_DEBUG_RINGBUF
> +	volatile int		tlm_refcnt;
> +#endif
> +};
> +
> +/* 60 bytes for the header, + 16 bytes for padding */
> +static uint8_t	zerobuf[76];
> +
> +/*
> + * Lock order:
> + * 1. TCPID_TREE
> + * 2. TCPID_BUCKET
> + * 3. INP
> + *
> + * Rules:
> + * A. You need a lock on the Tree to add/remove buckets.
> + * B. You need a lock on the bucket to add/remove nodes from the bucket.
> + * C. To change information in a node, you need the INP lock if the tln_closed
> + *    field is false. Otherwise, you need the bucket lock. (Note that the
> + *    tln_closed field can change at any point, so you need to recheck the
> + *    entry after acquiring the INP lock.)
> + * D. To remove a node from the bucket, you must have that entry locked,
> + *    according to the criteria of Rule C. Also, the node must not be on
> + *    the expiry queue.
> + * E. The exception to C is the expiry queue fields, which are locked by
> + *    the TCPLOG_EXPIREQ lock.
> + *
> + * Buckets have a reference count. Each node is a reference. Further,
> + * other callers may add reference counts to keep a bucket from disappearing.
> + * You can add a reference as long as you own a lock sufficient to keep the
> + * bucket from disappearing. For example, a common use is:
> + *   a. Have a locked INP, but need to lock the TCPID_BUCKET.
> + *   b. Add a refcount on the bucket. (Safe because the INP lock prevents
> + *      the TCPID_BUCKET from going away.)
> + *   c. Drop the INP lock.
> + *   d. Acquire a lock on the TCPID_BUCKET.
> + *   e. Acquire a lock on the INP.
> + *   f. Drop the refcount on the bucket.
> + *      (At this point, the bucket may disappear.)
> + *
> + * Expire queue lock:
> + * You can acquire this with either the bucket or INP lock. Don't reverse it.
> + * When the expire code has committed to freeing a node, it resets the expiry
> + * time to SBT_MAX. That is the signal to everyone else that they should
> + * leave that node alone.
> + */
> +static struct rwlock tcp_id_tree_lock;
> +#define	TCPID_TREE_WLOCK()		rw_wlock(&tcp_id_tree_lock)
> +#define	TCPID_TREE_RLOCK()		rw_rlock(&tcp_id_tree_lock)
> +#define	TCPID_TREE_UPGRADE()		rw_try_upgrade(&tcp_id_tree_lock)
> +#define	TCPID_TREE_WUNLOCK()		rw_wunlock(&tcp_id_tree_lock)
> +#define	TCPID_TREE_RUNLOCK()		rw_runlock(&tcp_id_tree_lock)
> +#define	TCPID_TREE_WLOCK_ASSERT()	rw_assert(&tcp_id_tree_lock, RA_WLOCKED)
> +#define	TCPID_TREE_RLOCK_ASSERT()	rw_assert(&tcp_id_tree_lock, RA_RLOCKED)
> +#define	TCPID_TREE_UNLOCK_ASSERT()	rw_assert(&tcp_id_tree_lock, RA_UNLOCKED)
> +
> +#define	TCPID_BUCKET_LOCK_INIT(tlb)	mtx_init(&((tlb)->tlb_mtx), "tcp log id bucket", NULL, MTX_DEF)
> +#define	TCPID_BUCKET_LOCK_DESTROY(tlb)	mtx_destroy(&((tlb)->tlb_mtx))
> +#define	TCPID_BUCKET_LOCK(tlb)		mtx_lock(&((tlb)->tlb_mtx))
> +#define	TCPID_BUCKET_UNLOCK(tlb)	mtx_unlock(&((tlb)->tlb_mtx))
> +#define	TCPID_BUCKET_LOCK_ASSERT(tlb)	mtx_assert(&((tlb)->tlb_mtx), MA_OWNED)
> +#define	TCPID_BUCKET_UNLOCK_ASSERT(tlb) mtx_assert(&((tlb)->tlb_mtx), MA_NOTOWNED)
> +
> +#define	TCPID_BUCKET_REF(tlb)		refcount_acquire(&((tlb)->tlb_refcnt))
> +#define	TCPID_BUCKET_UNREF(tlb)		refcount_release(&((tlb)->tlb_refcnt))
> +
> +#define	TCPLOG_EXPIREQ_LOCK()		mtx_lock(&tcp_log_expireq_mtx)
> +#define	TCPLOG_EXPIREQ_UNLOCK()		mtx_unlock(&tcp_log_expireq_mtx)
> +
> +SLIST_HEAD(tcp_log_id_head, tcp_log_id_node);
> +
> +struct tcp_log_id_bucket
> +{
> +	/*
> +	 * tlb_id must be first. This lets us use strcmp on
> +	 * (struct tcp_log_id_bucket *) and (char *) interchangeably.
> +	 */
> +	char				tlb_id[TCP_LOG_ID_LEN];
> +	RB_ENTRY(tcp_log_id_bucket)	tlb_rb;
> +	struct tcp_log_id_head		tlb_head;
> +	struct mtx			tlb_mtx;
> +	volatile u_int			tlb_refcnt;
> +};
> +
> +struct tcp_log_id_node
> +{
> +	SLIST_ENTRY(tcp_log_id_node) tln_list;
> +	STAILQ_ENTRY(tcp_log_id_node) tln_expireq; /* Locked by the expireq lock */
> +	sbintime_t		tln_expiretime;	/* Locked by the expireq lock */
> +
> +	/*
> +	 * If INP is NULL, that means the connection has closed. We've
> +	 * saved the connection endpoint information and the log entries
> +	 * in the tln_ie and tln_entries members. We've also saved a pointer
> +	 * to the enclosing bucket here. If INP is not NULL, the information is
> +	 * in the PCB and not here.
> +	 */
> +	struct inpcb		*tln_inp;
> +	struct tcpcb		*tln_tp;
> +	struct tcp_log_id_bucket *tln_bucket;
> +	struct in_endpoints	tln_ie;
> +	struct tcp_log_stailq	tln_entries;
> +	int			tln_count;
> +	volatile int		tln_closed;
> +	uint8_t			tln_af;
> +};
> +
> +enum tree_lock_state {
> +	TREE_UNLOCKED = 0,
> +	TREE_RLOCKED,
> +	TREE_WLOCKED,
> +};
> +
> +/* Do we want to select this session for auto-logging? */
> +static __inline bool
> +tcp_log_selectauto(void)
> +{
> +
> +	/*
> 
> *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
> 
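For readers following the lock-order comment in the quoted diff, steps a-f
(pin the bucket with a reference, drop the INP lock, take the bucket lock,
retake the INP lock, drop the reference) can be sketched in userspace roughly
as below. This is only an illustration under stated assumptions: it uses
pthreads and C11 atomics in place of mtx(9) and refcount(9), and struct
bucket, struct conn, bucket_unref() and relock_in_order() are hypothetical
names, not the committed code.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct tcp_log_id_bucket. */
struct bucket {
	pthread_mutex_t	mtx;		/* plays the role of TCPID_BUCKET */
	atomic_uint	refcnt;		/* plays the role of tlb_refcnt */
};

/* Hypothetical stand-in for a connection protected by the INP lock. */
struct conn {
	pthread_mutex_t	mtx;		/* plays the role of the INP lock */
	struct bucket	*bucket;	/* bucket this connection belongs to */
};

/*
 * Drop one reference. Returns true if it was the last one. The sketch
 * never frees buckets; real code would free when the count reaches zero.
 */
static bool
bucket_unref(struct bucket *b)
{
	return (atomic_fetch_sub(&b->refcnt, 1) == 1);
}

/*
 * Step a: called with c->mtx held. On success, returns true with both
 * b->mtx and c->mtx held, taken in bucket-then-connection order. If the
 * connection changed buckets while its lock was dropped, returns false
 * with only c->mtx held.
 */
static bool
relock_in_order(struct conn *c)
{
	struct bucket *b = c->bucket;
	bool same;

	atomic_fetch_add(&b->refcnt, 1);	/* b. pin the bucket */
	pthread_mutex_unlock(&c->mtx);		/* c. drop the INP lock */
	pthread_mutex_lock(&b->mtx);		/* d. take the bucket lock */
	pthread_mutex_lock(&c->mtx);		/* e. retake the INP lock */

	/* State may have changed while c->mtx was dropped, so recheck. */
	same = (c->bucket == b);
	if (!same)
		pthread_mutex_unlock(&b->mtx);

	(void)bucket_unref(b);			/* f. drop the temporary ref */
	return (same);
}
```

The temporary reference is what lets the lower-order lock be dropped without
the bucket disappearing in between. The recheck after step e mirrors the note
in Rule C that state can change while the lock is not held.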


