Date: Thu, 22 Mar 2018 09:40:08 +0000 (UTC)
From: "Jonathan T. Looney" <jtl@FreeBSD.org>
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r331347 - in head: etc/mtree include sys/conf sys/dev/tcp_log sys/kern sys/netinet usr.bin/netstat
Message-ID: <201803220940.w2M9e8T4067719@repo.freebsd.org>
Author: jtl
Date: Thu Mar 22 09:40:08 2018
New Revision: 331347
URL: https://svnweb.freebsd.org/changeset/base/331347

Log:
  Add the "TCP Blackbox Recorder" which we discussed at the developer
  summits at BSDCan and BSDCam in 2017.

  The TCP Blackbox Recorder allows you to capture events on a TCP connection
  in a ring buffer. It stores metadata with the event. It optionally stores
  the TCP header associated with an event (if the event is associated with a
  packet) and also optionally stores information on the sockets.

  It supports setting a log ID on a TCP connection and using this to
  correlate multiple connections that share a common log ID.

  You can log connections in different modes. If you are doing a coordinated
  test with a particular connection, you may tell the system to put it in
  mode 4 (continuous dump). Or, if you just want to monitor for errors, you
  can put it in mode 1 (ring buffer) and dump all the ring buffers associated
  with the connection ID when we receive an error signal for that connection
  ID. You can set a default mode that will be applied to a particular ratio
  of incoming connections. You can also manually set a mode using a socket
  option.

  This commit includes only basic probes. rrs@ has added quite an abundance
  of probes in his TCP development work. He plans to commit those soon.

  There are user-space programs which we plan to commit as ports. These read
  the data from the log device and output pcapng files, and then let you
  analyze the data (and metadata) in the pcapng files.

  Reviewed by: gnn (previous version)
  Obtained from: Netflix, Inc.
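For readers who want to try the socket-option path mentioned in the log message, here is a minimal user-space sketch. It is not part of r331347. The option numbers and TCP_LOG_ID_LEN are copied from the tcp.h hunk in this commit; the mode values 1 and 4 come from the log message. The names BB_MODE_RING, BB_MODE_CONTINUOUS, bb_logid_fits() and bb_enable() are hypothetical, and the argument types (an int for TCP_LOG, a byte string shorter than TCP_LOG_ID_LEN for TCP_LOGID) are assumptions that should be checked against tcp_usrreq.c.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <errno.h>
#include <string.h>

/* Values from the tcp.h hunk; defined here only if the headers predate it. */
#ifndef TCP_LOG
#define TCP_LOG		34	/* configure event logging for connection */
#endif
#ifndef TCP_LOGID
#define TCP_LOGID	36	/* configure log ID to correlate connections */
#endif
#ifndef TCP_LOG_ID_LEN
#define TCP_LOG_ID_LEN	64	/* maximum length of log ID */
#endif

/* Hypothetical names for the two modes named in the log message. */
enum {
	BB_MODE_RING = 1,	/* "mode 1 (ring buffer)" */
	BB_MODE_CONTINUOUS = 4	/* "mode 4 (continuous dump)" */
};

/*
 * Return 1 if the ID is short enough to pass to TCP_LOGID.
 * Assumption: the kernel wants room for a terminating NUL, so the
 * usable length is TCP_LOG_ID_LEN - 1 bytes.
 */
int
bb_logid_fits(const char *id)
{

	return (strlen(id) < TCP_LOG_ID_LEN);
}

/*
 * Tag a connected TCP socket with a log ID and then set its logging mode.
 * Returns 0 on success, or -1 with errno set.
 */
int
bb_enable(int fd, const char *id, int mode)
{

	if (!bb_logid_fits(id)) {
		errno = EINVAL;
		return (-1);
	}
	if (setsockopt(fd, IPPROTO_TCP, TCP_LOGID, id,
	    (socklen_t)strlen(id)) == -1)
		return (-1);
	return (setsockopt(fd, IPPROTO_TCP, TCP_LOG, &mode, sizeof(mode)));
}
```

A caller monitoring for errors might use bb_enable(fd, "session-42", BB_MODE_RING) right after connect(2), while a coordinated test might pass BB_MODE_CONTINUOUS instead. The ID is set before the mode so that all events can be correlated by ID; that ordering is a choice of this sketch, not a requirement stated in the commit.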
  Relnotes: yes
  Differential Revision: https://reviews.freebsd.org/D11085

Added:
  head/sys/dev/tcp_log/
  head/sys/dev/tcp_log/tcp_log_dev.c (contents, props changed)
  head/sys/dev/tcp_log/tcp_log_dev.h (contents, props changed)
  head/sys/netinet/tcp_log_buf.c (contents, props changed)
  head/sys/netinet/tcp_log_buf.h (contents, props changed)
Modified:
  head/etc/mtree/BSD.include.dist
  head/include/Makefile
  head/sys/conf/files
  head/sys/kern/subr_witness.c
  head/sys/netinet/tcp.h
  head/sys/netinet/tcp_input.c
  head/sys/netinet/tcp_output.c
  head/sys/netinet/tcp_subr.c
  head/sys/netinet/tcp_timer.c
  head/sys/netinet/tcp_usrreq.c
  head/sys/netinet/tcp_var.h
  head/usr.bin/netstat/inet.c
  head/usr.bin/netstat/main.c
  head/usr.bin/netstat/netstat.1
  head/usr.bin/netstat/netstat.h

Modified: head/etc/mtree/BSD.include.dist
==============================================================================
--- head/etc/mtree/BSD.include.dist Thu Mar 22 08:32:39 2018 (r331346)
+++ head/etc/mtree/BSD.include.dist Thu Mar 22 09:40:08 2018 (r331347)
@@ -158,6 +158,8 @@
         ..
         speaker
         ..
+        tcp_log
+        ..
         usb
         ..
vkbd Modified: head/include/Makefile ============================================================================== --- head/include/Makefile Thu Mar 22 08:32:39 2018 (r331346) +++ head/include/Makefile Thu Mar 22 09:40:08 2018 (r331347) @@ -47,7 +47,7 @@ LSUBDIRS= cam/ata cam/mmc cam/nvme cam/scsi \ dev/hwpmc dev/hyperv \ dev/ic dev/iicbus dev/io dev/lmc dev/mfi dev/mmc dev/nvme \ dev/ofw dev/pbio dev/pci ${_dev_powermac_nvram} dev/ppbus dev/smbus \ - dev/speaker dev/vkbd dev/wi \ + dev/speaker dev/tcp_log dev/vkbd dev/wi \ fs/devfs fs/fdescfs fs/msdosfs fs/nandfs fs/nfs fs/nullfs \ fs/procfs fs/smbfs fs/udf fs/unionfs \ geom/cache geom/concat geom/eli geom/gate geom/journal geom/label \ Modified: head/sys/conf/files ============================================================================== --- head/sys/conf/files Thu Mar 22 08:32:39 2018 (r331346) +++ head/sys/conf/files Thu Mar 22 09:40:08 2018 (r331347) @@ -3161,6 +3161,7 @@ dev/syscons/star/star_saver.c optional star_saver dev/syscons/syscons.c optional sc dev/syscons/sysmouse.c optional sc dev/syscons/warp/warp_saver.c optional warp_saver +dev/tcp_log/tcp_log_dev.c optional inet | inet6 dev/tdfx/tdfx_linux.c optional tdfx_linux tdfx compat_linux dev/tdfx/tdfx_pci.c optional tdfx pci dev/ti/if_ti.c optional ti pci @@ -4309,6 +4310,7 @@ netinet/tcp_debug.c optional tcpdebug netinet/tcp_fastopen.c optional inet tcp_rfc7413 | inet6 tcp_rfc7413 netinet/tcp_hostcache.c optional inet | inet6 netinet/tcp_input.c optional inet | inet6 +netinet/tcp_log_buf.c optional inet | inet6 netinet/tcp_lro.c optional inet | inet6 netinet/tcp_output.c optional inet | inet6 netinet/tcp_offload.c optional tcp_offload inet | tcp_offload inet6 Added: head/sys/dev/tcp_log/tcp_log_dev.c ============================================================================== --- /dev/null 00:00:00 1970 (empty, because file is newly added) +++ head/sys/dev/tcp_log/tcp_log_dev.c Thu Mar 22 09:40:08 2018 (r331347) @@ -0,0 +1,521 @@ +/*- + * 
SPDX-License-Identifier: BSD-2-Clause-FreeBSD + * + * Copyright (c) 2016-2017 + * Netflix Inc. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ * + */ + +#include <sys/cdefs.h> +__FBSDID("$FreeBSD$"); + +#include <sys/param.h> +#include <sys/conf.h> +#include <sys/fcntl.h> +#include <sys/filio.h> +#include <sys/kernel.h> +#include <sys/lock.h> +#include <sys/malloc.h> +#include <sys/module.h> +#include <sys/poll.h> +#include <sys/queue.h> +#include <sys/refcount.h> +#include <sys/mutex.h> +#include <sys/selinfo.h> +#include <sys/socket.h> +#include <sys/socketvar.h> +#include <sys/sysctl.h> +#include <sys/tree.h> +#include <sys/uio.h> +#include <machine/atomic.h> +#include <sys/counter.h> + +#include <dev/tcp_log/tcp_log_dev.h> + +#ifdef TCPLOG_DEBUG_COUNTERS +extern counter_u64_t tcp_log_que_read; +extern counter_u64_t tcp_log_que_freed; +#endif + +static struct cdev *tcp_log_dev; +static struct selinfo tcp_log_sel; + +static struct log_queueh tcp_log_dev_queue_head = STAILQ_HEAD_INITIALIZER(tcp_log_dev_queue_head); +static struct log_infoh tcp_log_dev_reader_head = STAILQ_HEAD_INITIALIZER(tcp_log_dev_reader_head); + +MALLOC_DEFINE(M_TCPLOGDEV, "tcp_log_dev", "TCP log device data structures"); + +static int tcp_log_dev_listeners = 0; + +static struct mtx tcp_log_dev_queue_lock; + +#define TCP_LOG_DEV_QUEUE_LOCK() mtx_lock(&tcp_log_dev_queue_lock) +#define TCP_LOG_DEV_QUEUE_UNLOCK() mtx_unlock(&tcp_log_dev_queue_lock) +#define TCP_LOG_DEV_QUEUE_LOCK_ASSERT() mtx_assert(&tcp_log_dev_queue_lock, MA_OWNED) +#define TCP_LOG_DEV_QUEUE_UNLOCK_ASSERT() mtx_assert(&tcp_log_dev_queue_lock, MA_NOTOWNED) +#define TCP_LOG_DEV_QUEUE_REF(tldq) refcount_acquire(&((tldq)->tldq_refcnt)) +#define TCP_LOG_DEV_QUEUE_UNREF(tldq) refcount_release(&((tldq)->tldq_refcnt)) + +static void tcp_log_dev_clear_refcount(struct tcp_log_dev_queue *entry); +static void tcp_log_dev_clear_cdevpriv(void *data); +static int tcp_log_dev_open(struct cdev *dev __unused, int flags, + int devtype __unused, struct thread *td __unused); +static int tcp_log_dev_write(struct cdev *dev __unused, + struct uio *uio __unused, int flags __unused); +static 
int tcp_log_dev_read(struct cdev *dev __unused, struct uio *uio, + int flags __unused); +static int tcp_log_dev_ioctl(struct cdev *dev __unused, u_long cmd, + caddr_t data, int fflag __unused, struct thread *td __unused); +static int tcp_log_dev_poll(struct cdev *dev __unused, int events, + struct thread *td); + + +enum tcp_log_dev_queue_lock_state { + QUEUE_UNLOCKED = 0, + QUEUE_LOCKED, +}; + +static struct cdevsw tcp_log_cdevsw = { + .d_version = D_VERSION, + .d_read = tcp_log_dev_read, + .d_open = tcp_log_dev_open, + .d_write = tcp_log_dev_write, + .d_poll = tcp_log_dev_poll, + .d_ioctl = tcp_log_dev_ioctl, +#ifdef NOTYET + .d_mmap = tcp_log_dev_mmap, +#endif + .d_name = "tcp_log", +}; + +static __inline void +tcp_log_dev_queue_validate_lock(int lockstate) +{ + +#ifdef INVARIANTS + switch (lockstate) { + case QUEUE_LOCKED: + TCP_LOG_DEV_QUEUE_LOCK_ASSERT(); + break; + case QUEUE_UNLOCKED: + TCP_LOG_DEV_QUEUE_UNLOCK_ASSERT(); + break; + default: + kassert_panic("%s:%d: unknown queue lock state", __func__, + __LINE__); + } +#endif +} + +/* + * Clear the refcount. If appropriate, it will remove the entry from the + * queue and call the destructor. + * + * This must be called with the queue lock held. + */ +static void +tcp_log_dev_clear_refcount(struct tcp_log_dev_queue *entry) +{ + + KASSERT(entry != NULL, ("%s: called with NULL entry", __func__)); + + TCP_LOG_DEV_QUEUE_LOCK_ASSERT(); + + if (TCP_LOG_DEV_QUEUE_UNREF(entry)) { +#ifdef TCPLOG_DEBUG_COUNTERS + counter_u64_add(tcp_log_que_freed, 1); +#endif + /* Remove the entry from the queue and call the destructor. */ + STAILQ_REMOVE(&tcp_log_dev_queue_head, entry, tcp_log_dev_queue, + tldq_queue); + (*entry->tldq_dtor)(entry); + } +} + +static void +tcp_log_dev_clear_cdevpriv(void *data) +{ + struct tcp_log_dev_info *priv; + struct tcp_log_dev_queue *entry, *entry_tmp; + + priv = (struct tcp_log_dev_info *)data; + if (priv == NULL) + return; + + /* + * Lock the queue and drop our references. 
We hold references to all + * the entries starting with tldi_head (or, if tldi_head == NULL, all + * entries in the queue). + * + * Because we don't want anyone adding addition things to the queue + * while we are doing this, we lock the queue. + */ + TCP_LOG_DEV_QUEUE_LOCK(); + if (priv->tldi_head != NULL) { + entry = priv->tldi_head; + STAILQ_FOREACH_FROM_SAFE(entry, &tcp_log_dev_queue_head, + tldq_queue, entry_tmp) { + tcp_log_dev_clear_refcount(entry); + } + } + tcp_log_dev_listeners--; + KASSERT(tcp_log_dev_listeners >= 0, + ("%s: tcp_log_dev_listeners is unexpectedly negative", __func__)); + STAILQ_REMOVE(&tcp_log_dev_reader_head, priv, tcp_log_dev_info, + tldi_list); + TCP_LOG_DEV_QUEUE_LOCK_ASSERT(); + TCP_LOG_DEV_QUEUE_UNLOCK(); + free(priv, M_TCPLOGDEV); +} + +static int +tcp_log_dev_open(struct cdev *dev __unused, int flags, int devtype __unused, + struct thread *td __unused) +{ + struct tcp_log_dev_info *priv; + struct tcp_log_dev_queue *entry; + int rv; + + /* + * Ideally, we shouldn't see these because of file system + * permissions. + */ + if (flags & (FWRITE | FEXEC | FAPPEND | O_TRUNC)) + return (ENODEV); + + /* Allocate space to hold information about where we are. */ + priv = malloc(sizeof(struct tcp_log_dev_info), M_TCPLOGDEV, + M_ZERO | M_WAITOK); + + /* Stash the private data away. */ + rv = devfs_set_cdevpriv((void *)priv, tcp_log_dev_clear_cdevpriv); + if (!rv) { + /* + * Increase the listener count, add this reader to the list, and + * take references on all current queues. + */ + TCP_LOG_DEV_QUEUE_LOCK(); + tcp_log_dev_listeners++; + STAILQ_INSERT_HEAD(&tcp_log_dev_reader_head, priv, tldi_list); + priv->tldi_head = STAILQ_FIRST(&tcp_log_dev_queue_head); + if (priv->tldi_head != NULL) + priv->tldi_cur = priv->tldi_head->tldq_buf; + STAILQ_FOREACH(entry, &tcp_log_dev_queue_head, tldq_queue) + TCP_LOG_DEV_QUEUE_REF(entry); + TCP_LOG_DEV_QUEUE_UNLOCK(); + } else { + /* Free the entry. 
*/ + free(priv, M_TCPLOGDEV); + } + return (rv); +} + +static int +tcp_log_dev_write(struct cdev *dev __unused, struct uio *uio __unused, + int flags __unused) +{ + + return (ENODEV); +} + +static __inline void +tcp_log_dev_rotate_bufs(struct tcp_log_dev_info *priv, int *lockstate) +{ + struct tcp_log_dev_queue *entry; + + KASSERT(priv->tldi_head != NULL, + ("%s:%d: priv->tldi_head unexpectedly NULL", + __func__, __LINE__)); + KASSERT(priv->tldi_head->tldq_buf == priv->tldi_cur, + ("%s:%d: buffer mismatch (%p vs %p)", + __func__, __LINE__, priv->tldi_head->tldq_buf, + priv->tldi_cur)); + tcp_log_dev_queue_validate_lock(*lockstate); + + if (*lockstate == QUEUE_UNLOCKED) { + TCP_LOG_DEV_QUEUE_LOCK(); + *lockstate = QUEUE_LOCKED; + } + entry = priv->tldi_head; + priv->tldi_head = STAILQ_NEXT(entry, tldq_queue); + tcp_log_dev_clear_refcount(entry); + priv->tldi_cur = NULL; +} + +static int +tcp_log_dev_read(struct cdev *dev __unused, struct uio *uio, int flags) +{ + struct tcp_log_common_header *buf; + struct tcp_log_dev_info *priv; + struct tcp_log_dev_queue *entry; + ssize_t len; + int lockstate, rv; + + /* Get our private info. */ + rv = devfs_get_cdevpriv((void **)&priv); + if (rv) + return (rv); + + lockstate = QUEUE_UNLOCKED; + + /* Do we need to get a new buffer? */ + while (priv->tldi_cur == NULL || + priv->tldi_cur->tlch_length <= priv->tldi_off) { + /* Did we somehow forget to rotate? */ + KASSERT(priv->tldi_cur == NULL, + ("%s:%d: tldi_cur is unexpectedly non-NULL", __func__, + __LINE__)); + if (priv->tldi_cur != NULL) + tcp_log_dev_rotate_bufs(priv, &lockstate); + + /* + * Before we start looking at tldi_head, we need a lock on the + * queue to make sure tldi_head stays stable. + */ + if (lockstate == QUEUE_UNLOCKED) { + TCP_LOG_DEV_QUEUE_LOCK(); + lockstate = QUEUE_LOCKED; + } + + /* We need the next buffer. Do we have one? 
*/ + if (priv->tldi_head == NULL && (flags & FNONBLOCK)) { + rv = EAGAIN; + goto done; + } + if (priv->tldi_head == NULL) { + /* Sleep and wait for more things we can read. */ + rv = mtx_sleep(&tcp_log_dev_listeners, + &tcp_log_dev_queue_lock, PCATCH, "tcplogdev", 0); + if (rv) + goto done; + if (priv->tldi_head == NULL) + continue; + } + + /* + * We have an entry to read. We want to try to create a + * buffer, if one doesn't already exist. + */ + entry = priv->tldi_head; + if (entry->tldq_buf == NULL) { + TCP_LOG_DEV_QUEUE_LOCK_ASSERT(); + buf = (*entry->tldq_xform)(entry); + if (buf == NULL) { + rv = EBUSY; + goto done; + } + entry->tldq_buf = buf; + } + + priv->tldi_cur = entry->tldq_buf; + priv->tldi_off = 0; + } + + /* Copy what we can from this buffer to the output buffer. */ + if (uio->uio_resid > 0) { + /* Drop locks so we can take page faults. */ + if (lockstate == QUEUE_LOCKED) + TCP_LOG_DEV_QUEUE_UNLOCK(); + lockstate = QUEUE_UNLOCKED; + + KASSERT(priv->tldi_cur != NULL, + ("%s: priv->tldi_cur is unexpectedly NULL", __func__)); + + /* Copy as much as we can to this uio. */ + len = priv->tldi_cur->tlch_length - priv->tldi_off; + if (len > uio->uio_resid) + len = uio->uio_resid; + rv = uiomove(((uint8_t *)priv->tldi_cur) + priv->tldi_off, + len, uio); + if (rv != 0) + goto done; + priv->tldi_off += len; +#ifdef TCPLOG_DEBUG_COUNTERS + counter_u64_add(tcp_log_que_read, len); +#endif + } + /* Are we done with this buffer? If so, find the next one. 
*/ + if (priv->tldi_off >= priv->tldi_cur->tlch_length) { + KASSERT(priv->tldi_off == priv->tldi_cur->tlch_length, + ("%s: offset (%ju) exceeds length (%ju)", __func__, + (uintmax_t)priv->tldi_off, + (uintmax_t)priv->tldi_cur->tlch_length)); + tcp_log_dev_rotate_bufs(priv, &lockstate); + } +done: + tcp_log_dev_queue_validate_lock(lockstate); + if (lockstate == QUEUE_LOCKED) + TCP_LOG_DEV_QUEUE_UNLOCK(); + return (rv); +} + +static int +tcp_log_dev_ioctl(struct cdev *dev __unused, u_long cmd, caddr_t data, + int fflag __unused, struct thread *td __unused) +{ + struct tcp_log_dev_info *priv; + int rv; + + /* Get our private info. */ + rv = devfs_get_cdevpriv((void **)&priv); + if (rv) + return (rv); + + /* + * Set things. Here, we are most concerned about the non-blocking I/O + * flag. + */ + rv = 0; + switch (cmd) { + case FIONBIO: + break; + case FIOASYNC: + if (*(int *)data != 0) + rv = EINVAL; + break; + default: + rv = ENOIOCTL; + } + return (rv); +} + +static int +tcp_log_dev_poll(struct cdev *dev __unused, int events, struct thread *td) +{ + struct tcp_log_dev_info *priv; + int revents; + + /* + * Get our private info. If this fails, claim that all events are + * ready. That should prod the user to do something that will + * make the error evident to them. + */ + if (devfs_get_cdevpriv((void **)&priv)) + return (events); + + revents = 0; + if (events & (POLLIN | POLLRDNORM)) { + /* + * We can (probably) read right now if we are partway through + * a buffer or if we are just about to start a buffer. + * Because we are going to read tldi_head, we should acquire + * a read lock on the queue. + */ + TCP_LOG_DEV_QUEUE_LOCK(); + if ((priv->tldi_head != NULL && priv->tldi_cur == NULL) || + (priv->tldi_cur != NULL && + priv->tldi_off < priv->tldi_cur->tlch_length)) + revents = events & (POLLIN | POLLRDNORM); + else + selrecord(td, &tcp_log_sel); + TCP_LOG_DEV_QUEUE_UNLOCK(); + } else { + /* + * It only makes sense to poll for reading. 
So, again, prod the + * user to do something that will make the error of their ways + * apparent. + */ + revents = events; + } + return (revents); +} + +int +tcp_log_dev_add_log(struct tcp_log_dev_queue *entry) +{ + struct tcp_log_dev_info *priv; + int rv; + bool wakeup_needed; + + KASSERT(entry->tldq_buf != NULL || entry->tldq_xform != NULL, + ("%s: Called with both tldq_buf and tldq_xform set to NULL", + __func__)); + KASSERT(entry->tldq_dtor != NULL, + ("%s: Called with tldq_dtor set to NULL", __func__)); + + /* Get a lock on the queue. */ + TCP_LOG_DEV_QUEUE_LOCK(); + + /* If no one is listening, tell the caller to free the resources. */ + if (tcp_log_dev_listeners == 0) { + rv = ENXIO; + goto done; + } + + /* Add this to the end of the tailq. */ + STAILQ_INSERT_TAIL(&tcp_log_dev_queue_head, entry, tldq_queue); + + /* Add references for all current listeners. */ + refcount_init(&entry->tldq_refcnt, tcp_log_dev_listeners); + + /* + * If any listener is currently stuck on NULL, that means they are + * waiting. Point their head to this new entry. + */ + wakeup_needed = false; + STAILQ_FOREACH(priv, &tcp_log_dev_reader_head, tldi_list) + if (priv->tldi_head == NULL) { + priv->tldi_head = entry; + wakeup_needed = true; + } + + if (wakeup_needed) { + selwakeup(&tcp_log_sel); + wakeup(&tcp_log_dev_listeners); + } + + rv = 0; + +done: + TCP_LOG_DEV_QUEUE_LOCK_ASSERT(); + TCP_LOG_DEV_QUEUE_UNLOCK(); + return (rv); +} + +static int +tcp_log_dev_modevent(module_t mod __unused, int type, void *data __unused) +{ + + /* TODO: Support intelligent unloading. 
*/ + switch (type) { + case MOD_LOAD: + if (bootverbose) + printf("tcp_log: tcp_log device\n"); + memset(&tcp_log_sel, 0, sizeof(tcp_log_sel)); + memset(&tcp_log_dev_queue_lock, 0, sizeof(struct mtx)); + mtx_init(&tcp_log_dev_queue_lock, "tcp_log dev", + "tcp_log device queues", MTX_DEF); + tcp_log_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD, + &tcp_log_cdevsw, 0, NULL, UID_ROOT, GID_WHEEL, 0400, + "tcp_log"); + break; + default: + return (EOPNOTSUPP); + } + + return (0); +} + +DEV_MODULE(tcp_log_dev, tcp_log_dev_modevent, NULL); +MODULE_VERSION(tcp_log_dev, 1); Added: head/sys/dev/tcp_log/tcp_log_dev.h ============================================================================== --- /dev/null 00:00:00 1970 (empty, because file is newly added) +++ head/sys/dev/tcp_log/tcp_log_dev.h Thu Mar 22 09:40:08 2018 (r331347) @@ -0,0 +1,88 @@ +/*- + * SPDX-License-Identifier: BSD-2-Clause-FreeBSD + * + * Copyright (c) 2016 + * Netflix Inc. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * + * $FreeBSD$ + */ + +#ifndef __tcp_log_dev_h__ +#define __tcp_log_dev_h__ + +/* + * This is the common header for data streamed from the log device. All + * blocks of data need to start with this header. + */ +struct tcp_log_common_header { + uint32_t tlch_version; /* Version is specific to type. */ + uint32_t tlch_type; /* Type of entry(ies) that follow. */ + uint64_t tlch_length; /* Total length, including header. */ +} __packed; + +#define TCP_LOG_DEV_TYPE_BBR 1 /* black box recorder */ + +#ifdef _KERNEL +/* + * This is a queue entry. All queue entries need to start with this structure + * so the common code can cast them to this structure; however, other modules + * are free to include additional data after this structure. + * + * The elements are explained here: + * tldq_queue: used by the common code to maintain this entry's position in the + * queue. + * tldq_buf: should be NULL, or a pointer to a chunk of data. The data must be + * as long as the common header indicates. + * tldq_xform: If tldq_buf is NULL, the code will call this to create the + * the tldq_buf object. The function should *not* directly modify tldq_buf, + * but should return the buffer (which must meet the restrictions + * indicated for tldq_buf). + * tldq_dtor: This function is called to free the queue entry. If tldq_buf is + * not NULL, the dtor function must free that, too. 
+ * tldq_refcnt: used by the common code to indicate how many readers still need + * this data. + */ +struct tcp_log_dev_queue { + STAILQ_ENTRY(tcp_log_dev_queue) tldq_queue; + struct tcp_log_common_header *tldq_buf; + struct tcp_log_common_header *(*tldq_xform)(struct tcp_log_dev_queue *entry); + void (*tldq_dtor)(struct tcp_log_dev_queue *entry); + volatile u_int tldq_refcnt; +}; + +STAILQ_HEAD(log_queueh, tcp_log_dev_queue); + +struct tcp_log_dev_info { + STAILQ_ENTRY(tcp_log_dev_info) tldi_list; + struct tcp_log_dev_queue *tldi_head; + struct tcp_log_common_header *tldi_cur; + off_t tldi_off; +}; +STAILQ_HEAD(log_infoh, tcp_log_dev_info); + + +MALLOC_DECLARE(M_TCPLOGDEV); +int tcp_log_dev_add_log(struct tcp_log_dev_queue *entry); +#endif /* _KERNEL */ +#endif /* !__tcp_log_dev_h__ */ Modified: head/sys/kern/subr_witness.c ============================================================================== --- head/sys/kern/subr_witness.c Thu Mar 22 08:32:39 2018 (r331346) +++ head/sys/kern/subr_witness.c Thu Mar 22 09:40:08 2018 (r331347) @@ -640,6 +640,14 @@ static struct witness_order_list_entry order_lists[] = { "db->db_mtx", &lock_class_sx }, { NULL, NULL }, /* + * TCP log locks + */ + { "TCP ID tree", &lock_class_rw }, + { "tcp log id bucket", &lock_class_mtx_sleep }, + { "tcpinp", &lock_class_rw }, + { "TCP log expireq", &lock_class_mtx_sleep }, + { NULL, NULL }, + /* * spin locks */ #ifdef SMP Modified: head/sys/netinet/tcp.h ============================================================================== --- head/sys/netinet/tcp.h Thu Mar 22 08:32:39 2018 (r331346) +++ head/sys/netinet/tcp.h Thu Mar 22 09:40:08 2018 (r331347) @@ -168,6 +168,12 @@ struct tcphdr { #define TCP_NOOPT 8 /* don't use TCP options */ #define TCP_MD5SIG 16 /* use MD5 digests (RFC2385) */ #define TCP_INFO 32 /* retrieve tcp_info structure */ +#define TCP_LOG 34 /* configure event logging for connection */ +#define TCP_LOGBUF 35 /* retrieve event log for connection */ +#define TCP_LOGID 
36 /* configure log ID to correlate connections */ +#define TCP_LOGDUMP 37 /* dump connection log events to device */ +#define TCP_LOGDUMPID 38 /* dump events from connections with same ID to + device */ #define TCP_CONGESTION 64 /* get/set congestion control algorithm */ #define TCP_CCALGOOPT 65 /* get/set cc algorithm specific options */ #define TCP_KEEPINIT 128 /* N, time to establish connection */ @@ -188,6 +194,9 @@ struct tcphdr { #define TCPI_OPT_WSCALE 0x04 #define TCPI_OPT_ECN 0x08 #define TCPI_OPT_TOE 0x10 + +/* Maximum length of log ID. */ +#define TCP_LOG_ID_LEN 64 /* * The TCP_INFO socket option comes from the Linux 2.6 TCP API, and permits Modified: head/sys/netinet/tcp_input.c ============================================================================== --- head/sys/netinet/tcp_input.c Thu Mar 22 08:32:39 2018 (r331346) +++ head/sys/netinet/tcp_input.c Thu Mar 22 09:40:08 2018 (r331347) @@ -102,6 +102,7 @@ __FBSDID("$FreeBSD$"); #include <netinet6/nd6.h> #include <netinet/tcp.h> #include <netinet/tcp_fsm.h> +#include <netinet/tcp_log_buf.h> #include <netinet/tcp_seq.h> #include <netinet/tcp_timer.h> #include <netinet/tcp_var.h> @@ -1592,6 +1593,8 @@ tcp_do_segment(struct mbuf *m, struct tcphdr *th, stru /* Save segment, if requested. */ tcp_pcap_add(th, m, &(tp->t_inpkts)); #endif + TCP_LOG_EVENT(tp, th, &so->so_rcv, &so->so_snd, TCP_LOG_IN, 0, + tlen, NULL, true); if ((thflags & TH_SYN) && (thflags & TH_FIN) && V_drop_synfin) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) { Added: head/sys/netinet/tcp_log_buf.c ============================================================================== --- /dev/null 00:00:00 1970 (empty, because file is newly added) +++ head/sys/netinet/tcp_log_buf.c Thu Mar 22 09:40:08 2018 (r331347) @@ -0,0 +1,2480 @@ +/*- + * SPDX-License-Identifier: BSD-2-Clause-FreeBSD + * + * Copyright (c) 2016-2018 + * Netflix Inc. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ * + */ + +#include <sys/cdefs.h> +__FBSDID("$FreeBSD$"); + +#include <sys/param.h> +#include <sys/kernel.h> +#include <sys/lock.h> +#include <sys/malloc.h> +#include <sys/mutex.h> +#include <sys/queue.h> +#include <sys/refcount.h> +#include <sys/rwlock.h> +#include <sys/socket.h> +#include <sys/socketvar.h> +#include <sys/sysctl.h> +#include <sys/tree.h> +#include <sys/counter.h> + +#include <dev/tcp_log/tcp_log_dev.h> + +#include <net/if.h> +#include <net/if_var.h> +#include <net/vnet.h> + +#include <netinet/in.h> +#include <netinet/in_pcb.h> +#include <netinet/in_var.h> +#include <netinet/tcp_var.h> +#include <netinet/tcp_log_buf.h> + +/* Default expiry time */ +#define TCP_LOG_EXPIRE_TIME ((sbintime_t)60 * SBT_1S) + +/* Max interval at which to run the expiry timer */ +#define TCP_LOG_EXPIRE_INTVL ((sbintime_t)5 * SBT_1S) + +bool tcp_log_verbose; +static uma_zone_t tcp_log_bucket_zone, tcp_log_node_zone, tcp_log_zone; +static int tcp_log_session_limit = TCP_LOG_BUF_DEFAULT_SESSION_LIMIT; +static uint32_t tcp_log_version = TCP_LOG_BUF_VER; +RB_HEAD(tcp_log_id_tree, tcp_log_id_bucket); +static struct tcp_log_id_tree tcp_log_id_head; +static STAILQ_HEAD(, tcp_log_id_node) tcp_log_expireq_head = + STAILQ_HEAD_INITIALIZER(tcp_log_expireq_head); +static struct mtx tcp_log_expireq_mtx; +static struct callout tcp_log_expireq_callout; +static uint64_t tcp_log_auto_ratio = 0; +static uint64_t tcp_log_auto_ratio_cur = 0; +static uint32_t tcp_log_auto_mode = TCP_LOG_STATE_TAIL; +static bool tcp_log_auto_all = false; + +RB_PROTOTYPE_STATIC(tcp_log_id_tree, tcp_log_id_bucket, tlb_rb, tcp_log_id_cmp) + +SYSCTL_NODE(_net_inet_tcp, OID_AUTO, bb, CTLFLAG_RW, 0, "TCP Black Box controls"); + +SYSCTL_BOOL(_net_inet_tcp_bb, OID_AUTO, log_verbose, CTLFLAG_RW, &tcp_log_verbose, + 0, "Force verbose logging for TCP traces"); + +SYSCTL_INT(_net_inet_tcp_bb, OID_AUTO, log_session_limit, + CTLFLAG_RW, &tcp_log_session_limit, 0, + "Maximum number of events maintained for each TCP 
session"); + +SYSCTL_UMA_MAX(_net_inet_tcp_bb, OID_AUTO, log_global_limit, CTLFLAG_RW, + &tcp_log_zone, "Maximum number of events maintained for all TCP sessions"); + +SYSCTL_UMA_CUR(_net_inet_tcp_bb, OID_AUTO, log_global_entries, CTLFLAG_RD, + &tcp_log_zone, "Current number of events maintained for all TCP sessions"); + +SYSCTL_UMA_MAX(_net_inet_tcp_bb, OID_AUTO, log_id_limit, CTLFLAG_RW, + &tcp_log_bucket_zone, "Maximum number of log IDs"); + +SYSCTL_UMA_CUR(_net_inet_tcp_bb, OID_AUTO, log_id_entries, CTLFLAG_RD, + &tcp_log_bucket_zone, "Current number of log IDs"); + +SYSCTL_UMA_MAX(_net_inet_tcp_bb, OID_AUTO, log_id_tcpcb_limit, CTLFLAG_RW, + &tcp_log_node_zone, "Maximum number of tcpcbs with log IDs"); + +SYSCTL_UMA_CUR(_net_inet_tcp_bb, OID_AUTO, log_id_tcpcb_entries, CTLFLAG_RD, + &tcp_log_node_zone, "Current number of tcpcbs with log IDs"); + +SYSCTL_U32(_net_inet_tcp_bb, OID_AUTO, log_version, CTLFLAG_RD, &tcp_log_version, + 0, "Version of log formats exported"); + +SYSCTL_U64(_net_inet_tcp_bb, OID_AUTO, log_auto_ratio, CTLFLAG_RW, + &tcp_log_auto_ratio, 0, "Do auto capturing for 1 out of N sessions"); + +SYSCTL_U32(_net_inet_tcp_bb, OID_AUTO, log_auto_mode, CTLFLAG_RW, + &tcp_log_auto_mode, TCP_LOG_STATE_HEAD_AUTO, + "Logging mode for auto-selected sessions (default is TCP_LOG_STATE_HEAD_AUTO)"); + +SYSCTL_BOOL(_net_inet_tcp_bb, OID_AUTO, log_auto_all, CTLFLAG_RW, + &tcp_log_auto_all, false, + "Auto-select from all sessions (rather than just those with IDs)"); + +#ifdef TCPLOG_DEBUG_COUNTERS +counter_u64_t tcp_log_queued; +counter_u64_t tcp_log_que_fail1; +counter_u64_t tcp_log_que_fail2; +counter_u64_t tcp_log_que_fail3; +counter_u64_t tcp_log_que_fail4; +counter_u64_t tcp_log_que_fail5; +counter_u64_t tcp_log_que_copyout; +counter_u64_t tcp_log_que_read; +counter_u64_t tcp_log_que_freed; + +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, queued, CTLFLAG_RD, + &tcp_log_queued, "Number of entries queued"); +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, 
fail1, CTLFLAG_RD, + &tcp_log_que_fail1, "Number of entries queued but fail 1"); +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail2, CTLFLAG_RD, + &tcp_log_que_fail2, "Number of entries queued but fail 2"); +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail3, CTLFLAG_RD, + &tcp_log_que_fail3, "Number of entries queued but fail 3"); +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail4, CTLFLAG_RD, + &tcp_log_que_fail4, "Number of entries queued but fail 4"); +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, fail5, CTLFLAG_RD, + &tcp_log_que_fail5, "Number of entries queued but fail 5"); +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, copyout, CTLFLAG_RD, + &tcp_log_que_copyout, "Number of entries copied out"); +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, read, CTLFLAG_RD, + &tcp_log_que_read, "Number of entries read from the queue"); +SYSCTL_COUNTER_U64(_net_inet_tcp_bb, OID_AUTO, freed, CTLFLAG_RD, + &tcp_log_que_freed, "Number of entries freed after reading"); +#endif + +#ifdef INVARIANTS +#define TCPLOG_DEBUG_RINGBUF +#endif + +struct tcp_log_mem +{ + STAILQ_ENTRY(tcp_log_mem) tlm_queue; + struct tcp_log_buffer tlm_buf; + struct tcp_log_verbose tlm_v; +#ifdef TCPLOG_DEBUG_RINGBUF + volatile int tlm_refcnt; +#endif +}; + +/* 60 bytes for the header, + 16 bytes for padding */ +static uint8_t zerobuf[76]; + +/* + * Lock order: + * 1. TCPID_TREE + * 2. TCPID_BUCKET + * 3. INP + * + * Rules: + * A. You need a lock on the Tree to add/remove buckets. + * B. You need a lock on the bucket to add/remove nodes from the bucket. + * C. To change information in a node, you need the INP lock if the tln_closed + * field is false. Otherwise, you need the bucket lock. (Note that the + * tln_closed field can change at any point, so you need to recheck the + * entry after acquiring the INP lock.) + * D. To remove a node from the bucket, you must have that entry locked, + * according to the criteria of Rule C. Also, the node must not be on + * the expiry queue. + * E.
The exception to C is the expiry queue fields, which are locked by + * the TCPLOG_EXPIREQ lock. + * + * Buckets have a reference count. Each node is a reference. Further, + * other callers may add reference counts to keep a bucket from disappearing. + * You can add a reference as long as you own a lock sufficient to keep the + * bucket from disappearing. For example, a common use is: + * a. Have a locked INP, but need to lock the TCPID_BUCKET. + * b. Add a refcount on the bucket. (Safe because the INP lock prevents + * the TCPID_BUCKET from going away.) + * c. Drop the INP lock. + * d. Acquire a lock on the TCPID_BUCKET. + * e. Acquire a lock on the INP. + * f. Drop the refcount on the bucket. + * (At this point, the bucket may disappear.) + * + * Expire queue lock: + * You can acquire this with either the bucket or INP lock. Don't reverse it. + * When the expire code has committed to freeing a node, it resets the expiry + * time to SBT_MAX. That is the signal to everyone else that they should + * leave that node alone. 
+ */ +static struct rwlock tcp_id_tree_lock; +#define TCPID_TREE_WLOCK() rw_wlock(&tcp_id_tree_lock) +#define TCPID_TREE_RLOCK() rw_rlock(&tcp_id_tree_lock) +#define TCPID_TREE_UPGRADE() rw_try_upgrade(&tcp_id_tree_lock) +#define TCPID_TREE_WUNLOCK() rw_wunlock(&tcp_id_tree_lock) +#define TCPID_TREE_RUNLOCK() rw_runlock(&tcp_id_tree_lock) +#define TCPID_TREE_WLOCK_ASSERT() rw_assert(&tcp_id_tree_lock, RA_WLOCKED) +#define TCPID_TREE_RLOCK_ASSERT() rw_assert(&tcp_id_tree_lock, RA_RLOCKED) +#define TCPID_TREE_UNLOCK_ASSERT() rw_assert(&tcp_id_tree_lock, RA_UNLOCKED) + +#define TCPID_BUCKET_LOCK_INIT(tlb) mtx_init(&((tlb)->tlb_mtx), "tcp log id bucket", NULL, MTX_DEF) +#define TCPID_BUCKET_LOCK_DESTROY(tlb) mtx_destroy(&((tlb)->tlb_mtx)) +#define TCPID_BUCKET_LOCK(tlb) mtx_lock(&((tlb)->tlb_mtx)) +#define TCPID_BUCKET_UNLOCK(tlb) mtx_unlock(&((tlb)->tlb_mtx)) +#define TCPID_BUCKET_LOCK_ASSERT(tlb) mtx_assert(&((tlb)->tlb_mtx), MA_OWNED) +#define TCPID_BUCKET_UNLOCK_ASSERT(tlb) mtx_assert(&((tlb)->tlb_mtx), MA_NOTOWNED) + +#define TCPID_BUCKET_REF(tlb) refcount_acquire(&((tlb)->tlb_refcnt)) +#define TCPID_BUCKET_UNREF(tlb) refcount_release(&((tlb)->tlb_refcnt)) + +#define TCPLOG_EXPIREQ_LOCK() mtx_lock(&tcp_log_expireq_mtx) +#define TCPLOG_EXPIREQ_UNLOCK() mtx_unlock(&tcp_log_expireq_mtx) + +SLIST_HEAD(tcp_log_id_head, tcp_log_id_node); + +struct tcp_log_id_bucket +{ + /* + * tlb_id must be first. This lets us use strcmp on + * (struct tcp_log_id_bucket *) and (char *) interchangeably. + */ + char tlb_id[TCP_LOG_ID_LEN]; + RB_ENTRY(tcp_log_id_bucket) tlb_rb; + struct tcp_log_id_head tlb_head; + struct mtx tlb_mtx; + volatile u_int tlb_refcnt; +}; + +struct tcp_log_id_node +{ + SLIST_ENTRY(tcp_log_id_node) tln_list; + STAILQ_ENTRY(tcp_log_id_node) tln_expireq; /* Locked by the expireq lock */ + sbintime_t tln_expiretime; /* Locked by the expireq lock */ + + /* + * If INP is NULL, that means the connection has closed. 
We've + * saved the connection endpoint information and the log entries + * in the tln_ie and tln_entries members. We've also saved a pointer + * to the enclosing bucket here. If INP is not NULL, the information is + * in the PCB and not here. + */ + struct inpcb *tln_inp; + struct tcpcb *tln_tp; + struct tcp_log_id_bucket *tln_bucket; + struct in_endpoints tln_ie; + struct tcp_log_stailq tln_entries; + int tln_count; + volatile int tln_closed; + uint8_t tln_af; +}; + +enum tree_lock_state { + TREE_UNLOCKED = 0, + TREE_RLOCKED, + TREE_WLOCKED, +}; + +/* Do we want to select this session for auto-logging? */ +static __inline bool +tcp_log_selectauto(void) +{ + + /* *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
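The comment on struct tcp_log_id_bucket above says that tlb_id must be the first member so that a string comparison can treat a bucket pointer and a plain (char *) interchangeably. The following is a minimal user-space sketch of that idea, not the committed code. The names demo_bucket, demo_id_cmp, demo_bucket_init and DEMO_LOG_ID_LEN are hypothetical stand-ins, and the sketch uses the bounded strncmp() where the comment mentions strcmp.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length; the real TCP_LOG_ID_LEN is defined elsewhere. */
#define DEMO_LOG_ID_LEN 64

/* Simplified, hypothetical stand-in for struct tcp_log_id_bucket. */
struct demo_bucket {
	char db_id[DEMO_LOG_ID_LEN];	/* must remain the first member */
	unsigned int db_refcnt;
};

/* Fail the build if someone moves the ID away from offset 0. */
_Static_assert(offsetof(struct demo_bucket, db_id) == 0,
    "db_id must be the first member");

/*
 * Compare two IDs. Each argument may point either at a demo_bucket or
 * at a NUL-terminated string: because db_id sits at offset 0, a pointer
 * to the bucket is also a pointer to its ID.
 */
static int
demo_id_cmp(const void *a, const void *b)
{

	return (strncmp((const char *)a, (const char *)b, DEMO_LOG_ID_LEN));
}

/* Zero the bucket and copy in the ID, always leaving it NUL-terminated. */
static void
demo_bucket_init(struct demo_bucket *db, const char *id)
{

	memset(db, 0, sizeof(*db));
	strncpy(db->db_id, id, DEMO_LOG_ID_LEN - 1);
	db->db_refcnt = 1;
}
```

With this layout, a lookup keyed by a bare string can reuse the same comparison function that orders buckets in a tree, without building a temporary bucket first. The static assertion is an addition of this sketch; it turns the "must be first" comment into a compile-time check.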