Date: Thu, 8 Jun 2017 21:30:34 +0000 (UTC)
From: Gleb Smirnoff <glebius@FreeBSD.org>
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r319722 - in head: sys/cam/ctl sys/dev/iscsi sys/kern sys/netgraph sys/netgraph/bluetooth/socket sys/netinet sys/ofed/drivers/infiniband/core sys/ofed/drivers/infiniband/ulp/sdp sys/rpc...
Message-ID: <201706082130.v58LUY0j095589@repo.freebsd.org>
Author: glebius
Date: Thu Jun 8 21:30:34 2017
New Revision: 319722
URL: https://svnweb.freebsd.org/changeset/base/319722

Log:
  Listening sockets improvements.

  o Separate fields of struct socket that belong to listening from
    fields that belong to normal dataflow, and unionize them. This
    shrinks the structure a bit.
    - Take out selinfo's from the socket buffers into the socket. The
      first reason is to support the braindamaged scenario when a socket
      is added to kevent(2) and then listen(2) is cast on it. The second
      reason is that there is a future plan to make socket buffers
      pluggable, so that for a dataflow socket a socket buffer can be
      changed, and in this case we also want to keep the same selinfos
      through the lifetime of a socket.
    - Remove struct so_accf. Since listening stuff no longer affects
      struct socket size, just move its fields into the listening part
      of the union.
    - Provide sol_upcall field and enforce that so_upcall_set() may be
      called only on a dataflow socket, which has buffers, and for
      listening sockets provide solisten_upcall_set().

  o Remove ACCEPT_LOCK() global.
    - Add a mutex to socket, to be used instead of socket buffer lock to
      lock fields of struct socket that don't belong to a socket buffer.
    - Allow to acquire two socket locks, but the first one must belong
      to a listening socket.
    - Make soref()/sorele() use atomic(9). This allows in some
      situations to do soref() without owning the socket lock. There is
      room for improvement here: it is possible to make sorele() also
      lock optionally.
    - Most protocols aren't touched by this change, except UNIX local
      sockets. See below for more information.

  o Reduce copy-and-paste in kernel modules that accept connections from
    listening sockets: provide function solisten_dequeue(), and use it
    in the following modules: ctl(4), iscsi(4), ng_btsocket(4),
    ng_ksocket(4), infiniband, rpc.

  o UNIX local sockets.
    - Removal of ACCEPT_LOCK() global uncovered several races in the
      UNIX local sockets. Most races exist around spawning a new socket,
      when we are connecting to a local listening socket. To cover them,
      we need to hold locks on both PCBs when spawning a third one. This
      means holding them across sonewconn(). This creates a LOR between
      pcb locks and unp_list_lock.
    - To fix the new LOR, abandon the global unp_list_lock in favor of
      global unp_link_lock. Indeed, separating these two locks didn't
      provide us any extra parallelism in the UNIX sockets.
    - Now a call into uipc_attach() may happen with unp_link_lock held
      if we are accepting, or without unp_link_lock if we are just
      creating a socket.
    - Another problem in UNIX sockets is that uipc_close() basically did
      nothing for a listening socket. The vnode remained opened for
      connections. This is fixed by removing the vnode in uipc_close().
      Maybe the right way would be to do it for all sockets (not only
      listening), simply move the vnode teardown from uipc_detach() to
      uipc_close()?

  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D9770

Modified:
  head/sys/cam/ctl/ctl_ha.c
  head/sys/dev/iscsi/icl_soft_proxy.c
  head/sys/kern/sys_socket.c
  head/sys/kern/uipc_accf.c
  head/sys/kern/uipc_debug.c
  head/sys/kern/uipc_sockbuf.c
  head/sys/kern/uipc_socket.c
  head/sys/kern/uipc_syscalls.c
  head/sys/kern/uipc_usrreq.c
  head/sys/netgraph/bluetooth/socket/ng_btsocket_l2cap.c
  head/sys/netgraph/bluetooth/socket/ng_btsocket_rfcomm.c
  head/sys/netgraph/bluetooth/socket/ng_btsocket_sco.c
  head/sys/netgraph/ng_ksocket.c
  head/sys/netinet/sctp_input.c
  head/sys/netinet/sctp_syscalls.c
  head/sys/netinet/sctp_sysctl.c
  head/sys/netinet/sctp_usrreq.c
  head/sys/netinet/tcp_subr.c
  head/sys/netinet/tcp_syncache.c
  head/sys/netinet/tcp_timewait.c
  head/sys/ofed/drivers/infiniband/core/iwcm.c
  head/sys/ofed/drivers/infiniband/ulp/sdp/sdp_main.c
  head/sys/rpc/svc_vc.c
  head/sys/sys/sockbuf.h
  head/sys/sys/socket.h
  head/sys/sys/socketvar.h
  head/usr.bin/netstat/inet.c

Modified: head/sys/cam/ctl/ctl_ha.c
============================================================================== --- head/sys/cam/ctl/ctl_ha.c Thu Jun 8 21:06:18 2017 (r319721) +++ head/sys/cam/ctl/ctl_ha.c Thu Jun 8 21:30:34 2017 (r319722) @@ -458,45 +458,20 @@ out: static int ctl_ha_accept(struct ha_softc *softc) { - struct socket *so; + struct socket *lso, *so; struct sockaddr *sap; int error; - ACCEPT_LOCK(); - if (softc->ha_lso->so_rcv.sb_state & SBS_CANTRCVMORE) - softc->ha_lso->so_error = ECONNABORTED; - if (softc->ha_lso->so_error) { - error = softc->ha_lso->so_error; - softc->ha_lso->so_error = 0; - ACCEPT_UNLOCK(); + lso = softc->ha_lso; + SOLISTEN_LOCK(lso); + error = solisten_dequeue(lso, &so, 0); + if (error == EWOULDBLOCK) + return (error); + if (error) { printf("%s: socket error %d\n", __func__, error); goto out; } - so = TAILQ_FIRST(&softc->ha_lso->so_comp); - if (so == NULL) { - ACCEPT_UNLOCK(); - return (EWOULDBLOCK); - } - KASSERT(!(so->so_qstate & SQ_INCOMP), ("accept1: so SQ_INCOMP")); - KASSERT(so->so_qstate & SQ_COMP, ("accept1: so not SQ_COMP")); - /* - * Before changing the flags on the socket, we have to bump the - * reference count. Otherwise, if the protocol calls sofree(), - * the socket will be released due to a zero refcount. 
- */ - SOCK_LOCK(so); /* soref() and so_state update */ - soref(so); /* file descriptor reference */ - - TAILQ_REMOVE(&softc->ha_lso->so_comp, so, so_list); - softc->ha_lso->so_qlen--; - so->so_state |= SS_NBIO; - so->so_qstate &= ~SQ_COMP; - so->so_head = NULL; - - SOCK_UNLOCK(so); - ACCEPT_UNLOCK(); - sap = NULL; error = soaccept(so, &sap); if (error != 0) { @@ -556,9 +531,6 @@ ctl_ha_listen(struct ha_softc *softc) printf("%s: REUSEPORT setting failed %d\n", __func__, error); } - SOCKBUF_LOCK(&softc->ha_lso->so_rcv); - soupcall_set(softc->ha_lso, SO_RCV, ctl_ha_lupcall, softc); - SOCKBUF_UNLOCK(&softc->ha_lso->so_rcv); } memcpy(&sa, &softc->ha_peer_in, sizeof(sa)); @@ -572,6 +544,10 @@ ctl_ha_listen(struct ha_softc *softc) printf("%s: solisten() error %d\n", __func__, error); goto out; } + SOLISTEN_LOCK(softc->ha_lso); + softc->ha_lso->so_state |= SS_NBIO; + solisten_upcall_set(softc->ha_lso, ctl_ha_lupcall, softc); + SOLISTEN_UNLOCK(softc->ha_lso); return (0); out: Modified: head/sys/dev/iscsi/icl_soft_proxy.c ============================================================================== --- head/sys/dev/iscsi/icl_soft_proxy.c Thu Jun 8 21:06:18 2017 (r319721) +++ head/sys/dev/iscsi/icl_soft_proxy.c Thu Jun 8 21:30:34 2017 (r319722) @@ -92,7 +92,6 @@ struct icl_listen_sock { struct icl_listen *ils_listen; struct socket *ils_socket; bool ils_running; - bool ils_disconnecting; int ils_id; }; @@ -184,7 +183,9 @@ icl_listen_free(struct icl_listen *il) while (ils->ils_running) { ICL_DEBUG("waiting for accept thread to terminate"); sx_xunlock(&il->il_lock); - ils->ils_disconnecting = true; + SOLISTEN_LOCK(ils->ils_socket); + ils->ils_socket->so_error = ENOTCONN; + SOLISTEN_UNLOCK(ils->ils_socket); wakeup(&ils->ils_socket->so_timeo); pause("icl_unlisten", 1 * hz); sx_xlock(&il->il_lock); @@ -200,9 +201,9 @@ icl_listen_free(struct icl_listen *il) } /* - * XXX: Doing accept in a separate thread in each socket might not be the best way - * to do stuff, but it's pretty 
clean and debuggable - and you probably won't - * have hundreds of listening sockets anyway. + * XXX: Doing accept in a separate thread in each socket might not be the + * best way to do stuff, but it's pretty clean and debuggable - and you + * probably won't have hundreds of listening sockets anyway. */ static void icl_accept_thread(void *arg) @@ -218,55 +219,22 @@ icl_accept_thread(void *arg) ils->ils_running = true; for (;;) { - ACCEPT_LOCK(); - while (TAILQ_EMPTY(&head->so_comp) && head->so_error == 0 && ils->ils_disconnecting == false) { - if (head->so_rcv.sb_state & SBS_CANTRCVMORE) { - head->so_error = ECONNABORTED; - break; - } - error = msleep(&head->so_timeo, &accept_mtx, PSOCK | PCATCH, - "accept", 0); - if (error) { - ACCEPT_UNLOCK(); - ICL_WARN("msleep failed with error %d", error); - continue; - } - if (ils->ils_disconnecting) { - ACCEPT_UNLOCK(); - ICL_DEBUG("terminating"); - ils->ils_running = false; - kthread_exit(); - return; - } + SOLISTEN_LOCK(head); + error = solisten_dequeue(head, &so, 0); + if (error == ENOTCONN) { + /* + * XXXGL: ENOTCONN is our mark from icl_listen_free(). + * Neither socket code, nor msleep(9) may return it. + */ + ICL_DEBUG("terminating"); + ils->ils_running = false; + kthread_exit(); + return; } - if (head->so_error) { - error = head->so_error; - head->so_error = 0; - ACCEPT_UNLOCK(); - ICL_WARN("socket error %d", error); + if (error) { + ICL_WARN("solisten_dequeue error %d", error); continue; } - so = TAILQ_FIRST(&head->so_comp); - KASSERT(so != NULL, ("NULL so")); - KASSERT(!(so->so_qstate & SQ_INCOMP), ("accept1: so SQ_INCOMP")); - KASSERT(so->so_qstate & SQ_COMP, ("accept1: so not SQ_COMP")); - - /* - * Before changing the flags on the socket, we have to bump the - * reference count. Otherwise, if the protocol calls sofree(), - * the socket will be released due to a zero refcount. 
- */ - SOCK_LOCK(so); /* soref() and so_state update */ - soref(so); /* file descriptor reference */ - - TAILQ_REMOVE(&head->so_comp, so, so_list); - head->so_qlen--; - so->so_state |= (head->so_state & SS_NBIO); - so->so_qstate &= ~SQ_COMP; - so->so_head = NULL; - - SOCK_UNLOCK(so); - ACCEPT_UNLOCK(); sa = NULL; error = soaccept(so, &sa); Modified: head/sys/kern/sys_socket.c ============================================================================== --- head/sys/kern/sys_socket.c Thu Jun 8 21:06:18 2017 (r319721) +++ head/sys/kern/sys_socket.c Thu Jun 8 21:30:34 2017 (r319722) @@ -170,32 +170,36 @@ soo_ioctl(struct file *fp, u_long cmd, void *data, str break; case FIOASYNC: - /* - * XXXRW: This code separately acquires SOCK_LOCK(so) and - * SOCKBUF_LOCK(&so->so_rcv) even though they are the same - * mutex to avoid introducing the assumption that they are - * the same. - */ if (*(int *)data) { SOCK_LOCK(so); so->so_state |= SS_ASYNC; + if (SOLISTENING(so)) { + so->sol_sbrcv_flags |= SB_ASYNC; + so->sol_sbsnd_flags |= SB_ASYNC; + } else { + SOCKBUF_LOCK(&so->so_rcv); + so->so_rcv.sb_flags |= SB_ASYNC; + SOCKBUF_UNLOCK(&so->so_rcv); + SOCKBUF_LOCK(&so->so_snd); + so->so_snd.sb_flags |= SB_ASYNC; + SOCKBUF_UNLOCK(&so->so_snd); + } SOCK_UNLOCK(so); - SOCKBUF_LOCK(&so->so_rcv); - so->so_rcv.sb_flags |= SB_ASYNC; - SOCKBUF_UNLOCK(&so->so_rcv); - SOCKBUF_LOCK(&so->so_snd); - so->so_snd.sb_flags |= SB_ASYNC; - SOCKBUF_UNLOCK(&so->so_snd); } else { SOCK_LOCK(so); so->so_state &= ~SS_ASYNC; + if (SOLISTENING(so)) { + so->sol_sbrcv_flags &= ~SB_ASYNC; + so->sol_sbsnd_flags &= ~SB_ASYNC; + } else { + SOCKBUF_LOCK(&so->so_rcv); + so->so_rcv.sb_flags &= ~SB_ASYNC; + SOCKBUF_UNLOCK(&so->so_rcv); + SOCKBUF_LOCK(&so->so_snd); + so->so_snd.sb_flags &= ~SB_ASYNC; + SOCKBUF_UNLOCK(&so->so_snd); + } SOCK_UNLOCK(so); - SOCKBUF_LOCK(&so->so_rcv); - so->so_rcv.sb_flags &= ~SB_ASYNC; - SOCKBUF_UNLOCK(&so->so_rcv); - SOCKBUF_LOCK(&so->so_snd); - so->so_snd.sb_flags &= ~SB_ASYNC; - 
SOCKBUF_UNLOCK(&so->so_snd); } break; @@ -706,7 +710,6 @@ soaio_process_sb(struct socket *so, struct sockbuf *sb sb->sb_flags &= ~SB_AIO_RUNNING; SOCKBUF_UNLOCK(sb); - ACCEPT_LOCK(); SOCK_LOCK(so); sorele(so); } Modified: head/sys/kern/uipc_accf.c ============================================================================== --- head/sys/kern/uipc_accf.c Thu Jun 8 21:06:18 2017 (r319721) +++ head/sys/kern/uipc_accf.c Thu Jun 8 21:30:34 2017 (r319722) @@ -173,13 +173,13 @@ accept_filt_getopt(struct socket *so, struct sockopt * error = EINVAL; goto out; } - if ((so->so_options & SO_ACCEPTFILTER) == 0) { + if (so->sol_accept_filter == NULL) { error = EINVAL; goto out; } - strcpy(afap->af_name, so->so_accf->so_accept_filter->accf_name); - if (so->so_accf->so_accept_filter_str != NULL) - strcpy(afap->af_arg, so->so_accf->so_accept_filter_str); + strcpy(afap->af_name, so->sol_accept_filter->accf_name); + if (so->sol_accept_filter_str != NULL) + strcpy(afap->af_arg, so->sol_accept_filter_str); out: SOCK_UNLOCK(so); if (error == 0) @@ -193,31 +193,57 @@ accept_filt_setopt(struct socket *so, struct sockopt * { struct accept_filter_arg *afap; struct accept_filter *afp; - struct so_accf *newaf; - int error = 0; + char *accept_filter_str = NULL; + void *accept_filter_arg = NULL; + int error; /* * Handle the simple delete case first. 
*/ if (sopt == NULL || sopt->sopt_val == NULL) { + struct socket *sp, *sp1; + int wakeup; + SOCK_LOCK(so); if ((so->so_options & SO_ACCEPTCONN) == 0) { SOCK_UNLOCK(so); return (EINVAL); } - if (so->so_accf != NULL) { - struct so_accf *af = so->so_accf; - if (af->so_accept_filter != NULL && - af->so_accept_filter->accf_destroy != NULL) { - af->so_accept_filter->accf_destroy(so); - } - if (af->so_accept_filter_str != NULL) - free(af->so_accept_filter_str, M_ACCF); - free(af, M_ACCF); - so->so_accf = NULL; + if (so->sol_accept_filter == NULL) { + SOCK_UNLOCK(so); + return (0); } + if (so->sol_accept_filter->accf_destroy != NULL) + so->sol_accept_filter->accf_destroy(so); + if (so->sol_accept_filter_str != NULL) + free(so->sol_accept_filter_str, M_ACCF); + so->sol_accept_filter = NULL; + so->sol_accept_filter_arg = NULL; + so->sol_accept_filter_str = NULL; so->so_options &= ~SO_ACCEPTFILTER; - SOCK_UNLOCK(so); + + /* + * Move from incomplete queue to complete only those + * connections, that are blocked by us. + */ + wakeup = 0; + TAILQ_FOREACH_SAFE(sp, &so->sol_incomp, so_list, sp1) { + SOCK_LOCK(sp); + if (sp->so_options & SO_ACCEPTFILTER) { + TAILQ_REMOVE(&so->sol_incomp, sp, so_list); + TAILQ_INSERT_TAIL(&so->sol_comp, sp, so_list); + sp->so_qstate = SQ_COMP; + sp->so_options &= ~SO_ACCEPTFILTER; + so->sol_incqlen--; + so->sol_qlen++; + wakeup = 1; + } + SOCK_UNLOCK(sp); + } + if (wakeup) + solisten_wakeup(so); /* unlocks */ + else + SOLISTEN_UNLOCK(so); return (0); } @@ -238,17 +264,10 @@ accept_filt_setopt(struct socket *so, struct sockopt * free(afap, M_TEMP); return (ENOENT); } - /* - * Allocate the new accept filter instance storage. We may - * have to free it again later if we fail to attach it. If - * attached properly, 'newaf' is NULLed to avoid a free() - * while in use. 
- */ - newaf = malloc(sizeof(*newaf), M_ACCF, M_WAITOK | M_ZERO); if (afp->accf_create != NULL && afap->af_name[0] != '\0') { size_t len = strlen(afap->af_name) + 1; - newaf->so_accept_filter_str = malloc(len, M_ACCF, M_WAITOK); - strcpy(newaf->so_accept_filter_str, afap->af_name); + accept_filter_str = malloc(len, M_ACCF, M_WAITOK); + strcpy(accept_filter_str, afap->af_name); } /* @@ -256,8 +275,8 @@ accept_filt_setopt(struct socket *so, struct sockopt * * without first removing it. */ SOCK_LOCK(so); - if (((so->so_options & SO_ACCEPTCONN) == 0) || - (so->so_accf != NULL)) { + if ((so->so_options & SO_ACCEPTCONN) == 0 || + so->sol_accept_filter != NULL) { error = EINVAL; goto out; } @@ -268,25 +287,20 @@ accept_filt_setopt(struct socket *so, struct sockopt * * can't block. */ if (afp->accf_create != NULL) { - newaf->so_accept_filter_arg = - afp->accf_create(so, afap->af_arg); - if (newaf->so_accept_filter_arg == NULL) { + accept_filter_arg = afp->accf_create(so, afap->af_arg); + if (accept_filter_arg == NULL) { error = EINVAL; goto out; } } - newaf->so_accept_filter = afp; - so->so_accf = newaf; + so->sol_accept_filter = afp; + so->sol_accept_filter_arg = accept_filter_arg; + so->sol_accept_filter_str = accept_filter_str; so->so_options |= SO_ACCEPTFILTER; - newaf = NULL; out: SOCK_UNLOCK(so); - if (newaf != NULL) { - if (newaf->so_accept_filter_str != NULL) - free(newaf->so_accept_filter_str, M_ACCF); - free(newaf, M_ACCF); - } - if (afap != NULL) - free(afap, M_TEMP); + if (accept_filter_str != NULL) + free(accept_filter_str, M_ACCF); + free(afap, M_TEMP); return (error); } Modified: head/sys/kern/uipc_debug.c ============================================================================== --- head/sys/kern/uipc_debug.c Thu Jun 8 21:06:18 2017 (r319721) +++ head/sys/kern/uipc_debug.c Thu Jun 8 21:30:34 2017 (r319722) @@ -448,8 +448,6 @@ db_print_socket(struct socket *so, const char *socketn db_printf(")\n"); db_print_indent(indent); - db_printf("so_qstate: 0x%x 
(", so->so_qstate); - db_print_soqstate(so->so_qstate); db_printf(") "); db_printf("so_pcb: %p ", so->so_pcb); db_printf("so_proto: %p\n", so->so_proto); @@ -458,24 +456,28 @@ db_print_socket(struct socket *so, const char *socketn db_print_protosw(so->so_proto, "so_proto", indent); db_print_indent(indent); - db_printf("so_head: %p ", so->so_head); - db_printf("so_incomp first: %p ", TAILQ_FIRST(&so->so_incomp)); - db_printf("so_comp first: %p\n", TAILQ_FIRST(&so->so_comp)); + if (so->so_options & SO_ACCEPTCONN) { + db_printf("sol_incomp first: %p ", + TAILQ_FIRST(&so->sol_incomp)); + db_printf("sol_comp first: %p\n", TAILQ_FIRST(&so->sol_comp)); + db_printf("sol_qlen: %d ", so->sol_qlen); + db_printf("sol_incqlen: %d ", so->sol_incqlen); + db_printf("sol_qlimit: %d ", so->sol_qlimit); + } else { + db_printf("so_qstate: 0x%x (", so->so_qstate); + db_print_soqstate(so->so_qstate); + db_printf("so_listen: %p ", so->so_listen); + /* so_list skipped */ + db_printf("so_timeo: %d ", so->so_timeo); + db_printf("so_error: %d\n", so->so_error); - db_print_indent(indent); - /* so_list skipped */ - db_printf("so_qlen: %u ", so->so_qlen); - db_printf("so_incqlen: %u ", so->so_incqlen); - db_printf("so_qlimit: %u ", so->so_qlimit); - db_printf("so_timeo: %d ", so->so_timeo); - db_printf("so_error: %d\n", so->so_error); + db_print_indent(indent); + db_printf("so_sigio: %p ", so->so_sigio); + db_printf("so_oobmark: %lu ", so->so_oobmark); - db_print_indent(indent); - db_printf("so_sigio: %p ", so->so_sigio); - db_printf("so_oobmark: %lu ", so->so_oobmark); - - db_print_sockbuf(&so->so_rcv, "so_rcv", indent); - db_print_sockbuf(&so->so_snd, "so_snd", indent); + db_print_sockbuf(&so->so_rcv, "so_rcv", indent); + db_print_sockbuf(&so->so_snd, "so_snd", indent); + } } DB_SHOW_COMMAND(socket, db_show_socket) Modified: head/sys/kern/uipc_sockbuf.c ============================================================================== --- head/sys/kern/uipc_sockbuf.c Thu Jun 8 21:06:18 2017 
(r319721) +++ head/sys/kern/uipc_sockbuf.c Thu Jun 8 21:30:34 2017 (r319722) @@ -314,14 +314,14 @@ sowakeup(struct socket *so, struct sockbuf *sb) SOCKBUF_LOCK_ASSERT(sb); - selwakeuppri(&sb->sb_sel, PSOCK); - if (!SEL_WAITING(&sb->sb_sel)) + selwakeuppri(sb->sb_sel, PSOCK); + if (!SEL_WAITING(sb->sb_sel)) sb->sb_flags &= ~SB_SEL; if (sb->sb_flags & SB_WAIT) { sb->sb_flags &= ~SB_WAIT; wakeup(&sb->sb_acc); } - KNOTE_LOCKED(&sb->sb_sel.si_note, 0); + KNOTE_LOCKED(&sb->sb_sel->si_note, 0); if (sb->sb_upcall != NULL && !(so->so_state & SS_ISDISCONNECTED)) { ret = sb->sb_upcall(so, sb->sb_upcallarg, M_NOWAIT); if (ret == SU_ISCONNECTED) { Modified: head/sys/kern/uipc_socket.c ============================================================================== --- head/sys/kern/uipc_socket.c Thu Jun 8 21:06:18 2017 (r319721) +++ head/sys/kern/uipc_socket.c Thu Jun 8 21:30:34 2017 (r319722) @@ -106,6 +106,7 @@ __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_compat.h" +#include "opt_sctp.h" #include <sys/param.h> #include <sys/systm.h> @@ -154,13 +155,21 @@ __FBSDID("$FreeBSD$"); static int soreceive_rcvoob(struct socket *so, struct uio *uio, int flags); +static void so_rdknl_lock(void *); +static void so_rdknl_unlock(void *); +static void so_rdknl_assert_locked(void *); +static void so_rdknl_assert_unlocked(void *); +static void so_wrknl_lock(void *); +static void so_wrknl_unlock(void *); +static void so_wrknl_assert_locked(void *); +static void so_wrknl_assert_unlocked(void *); static void filt_sordetach(struct knote *kn); static int filt_soread(struct knote *kn, long hint); static void filt_sowdetach(struct knote *kn); static int filt_sowrite(struct knote *kn, long hint); -static int inline hhook_run_socket(struct socket *so, void *hctx, int32_t h_id); static int filt_soempty(struct knote *kn, long hint); +static int inline hhook_run_socket(struct socket *so, void *hctx, int32_t h_id); fo_kqfilter_t soo_kqfilter; static struct filterops 
soread_filtops = { @@ -393,8 +402,16 @@ soalloc(struct vnet *vnet) return (NULL); } + /* + * The socket locking protocol allows to lock 2 sockets at a time, + * however, the first one must be a listening socket. WITNESS lacks + * a feature to change class of an existing lock, so we use DUPOK. + */ + mtx_init(&so->so_lock, "socket", NULL, MTX_DEF | MTX_DUPOK); SOCKBUF_LOCK_INIT(&so->so_snd, "so_snd"); SOCKBUF_LOCK_INIT(&so->so_rcv, "so_rcv"); + so->so_rcv.sb_sel = &so->so_rdsel; + so->so_snd.sb_sel = &so->so_wrsel; sx_init(&so->so_snd.sb_sx, "so_snd_sx"); sx_init(&so->so_rcv.sb_sx, "so_rcv_sx"); TAILQ_INIT(&so->so_snd.sb_aiojobq); @@ -450,9 +467,6 @@ sodealloc(struct socket *so) if (so->so_snd.sb_hiwat) (void)chgsbsize(so->so_cred->cr_uidinfo, &so->so_snd.sb_hiwat, 0, RLIM_INFINITY); - /* remove accept filter if one is present. */ - if (so->so_accf != NULL) - accept_filt_setopt(so, NULL); #ifdef MAC mac_socket_destroy(so); #endif @@ -460,10 +474,16 @@ sodealloc(struct socket *so) crfree(so->so_cred); khelp_destroy_osd(&so->osd); - sx_destroy(&so->so_snd.sb_sx); - sx_destroy(&so->so_rcv.sb_sx); - SOCKBUF_LOCK_DESTROY(&so->so_snd); - SOCKBUF_LOCK_DESTROY(&so->so_rcv); + if (SOLISTENING(so)) { + if (so->sol_accept_filter != NULL) + accept_filt_setopt(so, NULL); + } else { + sx_destroy(&so->so_snd.sb_sx); + sx_destroy(&so->so_rcv.sb_sx); + SOCKBUF_LOCK_DESTROY(&so->so_snd); + SOCKBUF_LOCK_DESTROY(&so->so_rcv); + } + mtx_destroy(&so->so_lock); uma_zfree(socket_zone, so); } @@ -506,8 +526,6 @@ socreate(int dom, struct socket **aso, int type, int p if (so == NULL) return (ENOBUFS); - TAILQ_INIT(&so->so_incomp); - TAILQ_INIT(&so->so_comp); so->so_type = type; so->so_cred = crhold(cred); if ((prp->pr_domain->dom_family == PF_INET) || @@ -520,9 +538,10 @@ socreate(int dom, struct socket **aso, int type, int p #ifdef MAC mac_socket_create(cred, so); #endif - knlist_init_mtx(&so->so_rcv.sb_sel.si_note, SOCKBUF_MTX(&so->so_rcv)); - knlist_init_mtx(&so->so_snd.sb_sel.si_note, 
SOCKBUF_MTX(&so->so_snd)); - so->so_count = 1; + knlist_init(&so->so_rdsel.si_note, so, so_rdknl_lock, so_rdknl_unlock, + so_rdknl_assert_locked, so_rdknl_assert_unlocked); + knlist_init(&so->so_wrsel.si_note, so, so_wrknl_lock, so_wrknl_unlock, + so_wrknl_assert_locked, so_wrknl_assert_unlocked); /* * Auto-sizing of socket buffers is managed by the protocols and * the appropriate flags must be set in the pru_attach function. @@ -531,12 +550,10 @@ socreate(int dom, struct socket **aso, int type, int p error = (*prp->pr_usrreqs->pru_attach)(so, proto, td); CURVNET_RESTORE(); if (error) { - KASSERT(so->so_count == 1, ("socreate: so_count %d", - so->so_count)); - so->so_count = 0; sodealloc(so); return (error); } + soref(so); *aso = so; return (0); } @@ -564,11 +581,11 @@ sonewconn(struct socket *head, int connstatus) static int overcount; struct socket *so; - int over; + u_int over; - ACCEPT_LOCK(); - over = (head->so_qlen > 3 * head->so_qlimit / 2); - ACCEPT_UNLOCK(); + SOLISTEN_LOCK(head); + over = (head->sol_qlen > 3 * head->sol_qlimit / 2); + SOLISTEN_UNLOCK(head); #ifdef REGRESSION if (regression_sonewconn_earlytest && over) { #else @@ -580,15 +597,15 @@ sonewconn(struct socket *head, int connstatus) log(LOG_DEBUG, "%s: pcb %p: Listen queue overflow: " "%i already in queue awaiting acceptance " "(%d occurrences)\n", - __func__, head->so_pcb, head->so_qlen, overcount); + __func__, head->so_pcb, head->sol_qlen, overcount); overcount = 0; } return (NULL); } - VNET_ASSERT(head->so_vnet != NULL, ("%s:%d so_vnet is NULL, head=%p", - __func__, __LINE__, head)); + VNET_ASSERT(head->so_vnet != NULL, ("%s: so %p vnet is NULL", + __func__, head)); so = soalloc(head->so_vnet); if (so == NULL) { log(LOG_DEBUG, "%s: pcb %p: New socket allocation failure: " @@ -596,11 +613,8 @@ sonewconn(struct socket *head, int connstatus) __func__, head->so_pcb); return (NULL); } - if ((head->so_options & SO_ACCEPTFILTER) != 0) - connstatus = 0; - so->so_head = head; + so->so_listen = head; 
so->so_type = head->so_type; - so->so_options = head->so_options &~ SO_ACCEPTCONN; so->so_linger = head->so_linger; so->so_state = head->so_state | SS_NOFDREF; so->so_fibnum = head->so_fibnum; @@ -609,10 +623,12 @@ sonewconn(struct socket *head, int connstatus) #ifdef MAC mac_socket_newconn(head, so); #endif - knlist_init_mtx(&so->so_rcv.sb_sel.si_note, SOCKBUF_MTX(&so->so_rcv)); - knlist_init_mtx(&so->so_snd.sb_sel.si_note, SOCKBUF_MTX(&so->so_snd)); + knlist_init(&so->so_rdsel.si_note, so, so_rdknl_lock, so_rdknl_unlock, + so_rdknl_assert_locked, so_rdknl_assert_unlocked); + knlist_init(&so->so_wrsel.si_note, so, so_wrknl_lock, so_wrknl_unlock, + so_wrknl_assert_locked, so_wrknl_assert_unlocked); VNET_SO_ASSERT(head); - if (soreserve(so, head->so_snd.sb_hiwat, head->so_rcv.sb_hiwat)) { + if (soreserve(so, head->sol_sbsnd_hiwat, head->sol_sbrcv_hiwat)) { sodealloc(so); log(LOG_DEBUG, "%s: pcb %p: soreserve() failed\n", __func__, head->so_pcb); @@ -624,32 +640,24 @@ sonewconn(struct socket *head, int connstatus) __func__, head->so_pcb); return (NULL); } - so->so_rcv.sb_lowat = head->so_rcv.sb_lowat; - so->so_snd.sb_lowat = head->so_snd.sb_lowat; - so->so_rcv.sb_timeo = head->so_rcv.sb_timeo; - so->so_snd.sb_timeo = head->so_snd.sb_timeo; - so->so_rcv.sb_flags |= head->so_rcv.sb_flags & SB_AUTOSIZE; - so->so_snd.sb_flags |= head->so_snd.sb_flags & SB_AUTOSIZE; + so->so_rcv.sb_lowat = head->sol_sbrcv_lowat; + so->so_snd.sb_lowat = head->sol_sbsnd_lowat; + so->so_rcv.sb_timeo = head->sol_sbrcv_timeo; + so->so_snd.sb_timeo = head->sol_sbsnd_timeo; + so->so_rcv.sb_flags |= head->sol_sbrcv_flags & SB_AUTOSIZE; + so->so_snd.sb_flags |= head->sol_sbsnd_flags & SB_AUTOSIZE; + + SOLISTEN_LOCK(head); + if (head->sol_accept_filter != NULL) + connstatus = 0; so->so_state |= connstatus; - ACCEPT_LOCK(); - /* - * The accept socket may be tearing down but we just - * won a race on the ACCEPT_LOCK. 
- * However, if sctp_peeloff() is called on a 1-to-many - * style socket, the SO_ACCEPTCONN doesn't need to be set. - */ - if (!(head->so_options & SO_ACCEPTCONN) && - ((head->so_proto->pr_protocol != IPPROTO_SCTP) || - (head->so_type != SOCK_SEQPACKET))) { - SOCK_LOCK(so); - so->so_head = NULL; - sofree(so); /* NB: returns ACCEPT_UNLOCK'ed. */ - return (NULL); - } + so->so_options = head->so_options & ~SO_ACCEPTCONN; + soref(head); /* A socket on (in)complete queue refs head. */ if (connstatus) { - TAILQ_INSERT_TAIL(&head->so_comp, so, so_list); - so->so_qstate |= SQ_COMP; - head->so_qlen++; + TAILQ_INSERT_TAIL(&head->sol_comp, so, so_list); + so->so_qstate = SQ_COMP; + head->sol_qlen++; + solisten_wakeup(head); /* unlocks */ } else { /* * Keep removing sockets from the head until there's room for @@ -658,28 +666,86 @@ sonewconn(struct socket *head, int connstatus) * threads and soabort() requires dropping locks, we must * loop waiting for the condition to be true. */ - while (head->so_incqlen > head->so_qlimit) { + while (head->sol_incqlen > head->sol_qlimit) { struct socket *sp; - sp = TAILQ_FIRST(&head->so_incomp); - TAILQ_REMOVE(&head->so_incomp, sp, so_list); - head->so_incqlen--; - sp->so_qstate &= ~SQ_INCOMP; - sp->so_head = NULL; - ACCEPT_UNLOCK(); + + sp = TAILQ_FIRST(&head->sol_incomp); + TAILQ_REMOVE(&head->sol_incomp, sp, so_list); + head->sol_incqlen--; + SOCK_LOCK(sp); + sp->so_qstate = SQ_NONE; + sp->so_listen = NULL; + SOCK_UNLOCK(sp); + sorele(head); /* does SOLISTEN_UNLOCK, head stays */ soabort(sp); - ACCEPT_LOCK(); + SOLISTEN_LOCK(head); } - TAILQ_INSERT_TAIL(&head->so_incomp, so, so_list); - so->so_qstate |= SQ_INCOMP; - head->so_incqlen++; + TAILQ_INSERT_TAIL(&head->sol_incomp, so, so_list); + so->so_qstate = SQ_INCOMP; + head->sol_incqlen++; + SOLISTEN_UNLOCK(head); } - ACCEPT_UNLOCK(); - if (connstatus) { - sorwakeup(head); - wakeup_one(&head->so_timeo); + return (so); +} + +#ifdef SCTP +/* + * Socket part of sctp_peeloff(). 
Detach a new socket from an
+ * association. The new socket is returned with a reference.
+ */
+struct socket *
+sopeeloff(struct socket *head)
+{
+	struct socket *so;
+
+	VNET_ASSERT(head->so_vnet != NULL, ("%s:%d so_vnet is NULL, head=%p",
+	    __func__, __LINE__, head));
+	so = soalloc(head->so_vnet);
+	if (so == NULL) {
+		log(LOG_DEBUG, "%s: pcb %p: New socket allocation failure: "
+		    "limit reached or out of memory\n",
+		    __func__, head->so_pcb);
+		return (NULL);
 	}
+	so->so_type = head->so_type;
+	so->so_options = head->so_options;
+	so->so_linger = head->so_linger;
+	so->so_state = (head->so_state & SS_NBIO) | SS_ISCONNECTED;
+	so->so_fibnum = head->so_fibnum;
+	so->so_proto = head->so_proto;
+	so->so_cred = crhold(head->so_cred);
+#ifdef MAC
+	mac_socket_newconn(head, so);
+#endif
+	knlist_init(&so->so_rdsel.si_note, so, so_rdknl_lock, so_rdknl_unlock,
+	    so_rdknl_assert_locked, so_rdknl_assert_unlocked);
+	knlist_init(&so->so_wrsel.si_note, so, so_wrknl_lock, so_wrknl_unlock,
+	    so_wrknl_assert_locked, so_wrknl_assert_unlocked);
+	VNET_SO_ASSERT(head);
+	if (soreserve(so, head->so_snd.sb_hiwat, head->so_rcv.sb_hiwat)) {
+		sodealloc(so);
+		log(LOG_DEBUG, "%s: pcb %p: soreserve() failed\n",
+		    __func__, head->so_pcb);
+		return (NULL);
+	}
+	if ((*so->so_proto->pr_usrreqs->pru_attach)(so, 0, NULL)) {
+		sodealloc(so);
+		log(LOG_DEBUG, "%s: pcb %p: pru_attach() failed\n",
+		    __func__, head->so_pcb);
+		return (NULL);
+	}
+	so->so_rcv.sb_lowat = head->so_rcv.sb_lowat;
+	so->so_snd.sb_lowat = head->so_snd.sb_lowat;
+	so->so_rcv.sb_timeo = head->so_rcv.sb_timeo;
+	so->so_snd.sb_timeo = head->so_snd.sb_timeo;
+	so->so_rcv.sb_flags |= head->so_rcv.sb_flags & SB_AUTOSIZE;
+	so->so_snd.sb_flags |= head->so_snd.sb_flags & SB_AUTOSIZE;
+
+	soref(so);
+	return (so);
 }
+#endif	/* SCTP */
 
 int
 sobind(struct socket *so, struct sockaddr *nam, struct thread *td)
@@ -741,16 +807,140 @@ solisten_proto_check(struct socket *so)
 void
 solisten_proto(struct socket *so, int backlog)
 {
+	int sbrcv_lowat, sbsnd_lowat;
+	u_int sbrcv_hiwat, sbsnd_hiwat;
+	short sbrcv_flags, sbsnd_flags;
+	sbintime_t sbrcv_timeo, sbsnd_timeo;
 
 	SOCK_LOCK_ASSERT(so);
 
+	if (SOLISTENING(so))
+		goto listening;
+
+	/*
+	 * Change this socket to listening state.
+	 */
+	sbrcv_lowat = so->so_rcv.sb_lowat;
+	sbsnd_lowat = so->so_snd.sb_lowat;
+	sbrcv_hiwat = so->so_rcv.sb_hiwat;
+	sbsnd_hiwat = so->so_snd.sb_hiwat;
+	sbrcv_flags = so->so_rcv.sb_flags;
+	sbsnd_flags = so->so_snd.sb_flags;
+	sbrcv_timeo = so->so_rcv.sb_timeo;
+	sbsnd_timeo = so->so_snd.sb_timeo;
+
+	sbdestroy(&so->so_snd, so);
+	sbdestroy(&so->so_rcv, so);
+	sx_destroy(&so->so_snd.sb_sx);
+	sx_destroy(&so->so_rcv.sb_sx);
+	SOCKBUF_LOCK_DESTROY(&so->so_snd);
+	SOCKBUF_LOCK_DESTROY(&so->so_rcv);
+
+#ifdef INVARIANTS
+	bzero(&so->so_rcv,
+	    sizeof(struct socket) - offsetof(struct socket, so_rcv));
+#endif
+
+	so->sol_sbrcv_lowat = sbrcv_lowat;
+	so->sol_sbsnd_lowat = sbsnd_lowat;
+	so->sol_sbrcv_hiwat = sbrcv_hiwat;
+	so->sol_sbsnd_hiwat = sbsnd_hiwat;
+	so->sol_sbrcv_flags = sbrcv_flags;
+	so->sol_sbsnd_flags = sbsnd_flags;
+	so->sol_sbrcv_timeo = sbrcv_timeo;
+	so->sol_sbsnd_timeo = sbsnd_timeo;
+
+	so->sol_qlen = so->sol_incqlen = 0;
+	TAILQ_INIT(&so->sol_incomp);
+	TAILQ_INIT(&so->sol_comp);
+
+	so->sol_accept_filter = NULL;
+	so->sol_accept_filter_arg = NULL;
+	so->sol_accept_filter_str = NULL;
+
+	so->so_options |= SO_ACCEPTCONN;
+
+listening:
 	if (backlog < 0 || backlog > somaxconn)
 		backlog = somaxconn;
-	so->so_qlimit = backlog;
-	so->so_options |= SO_ACCEPTCONN;
+	so->sol_qlimit = backlog;
 }
 
 /*
+ * Wakeup listeners/subsystems once we have a complete connection.
+ * Enters with lock, returns unlocked.
+ */
+void
+solisten_wakeup(struct socket *sol)
+{
+
+	if (sol->sol_upcall != NULL)
+		(void )sol->sol_upcall(sol, sol->sol_upcallarg, M_NOWAIT);
+	else {
+		selwakeuppri(&sol->so_rdsel, PSOCK);
+		KNOTE_LOCKED(&sol->so_rdsel.si_note, 0);
+	}
+	SOLISTEN_UNLOCK(sol);
+	wakeup_one(&sol->sol_comp);
+}
+
+/*
+ * Return single connection off a listening socket queue. Main consumer of
+ * the function is kern_accept4(). Some modules, that do their own accept
+ * management also use the function.
+ *
+ * Listening socket must be locked on entry and is returned unlocked on
+ * return.
+ * The flags argument is set of accept4(2) flags and ACCEPT4_INHERIT.
+ */
+int
+solisten_dequeue(struct socket *head, struct socket **ret, int flags)
+{
+	struct socket *so;
+	int error;
+
+	SOLISTEN_LOCK_ASSERT(head);
+
+	while (!(head->so_state & SS_NBIO) && TAILQ_EMPTY(&head->sol_comp) &&
+	    head->so_error == 0) {
+		error = msleep(&head->sol_comp, &head->so_lock, PSOCK | PCATCH,
+		    "accept", 0);
+		if (error != 0) {
+			SOLISTEN_UNLOCK(head);
+			return (error);
+		}
+	}
+	if (head->so_error) {
+		error = head->so_error;
+		head->so_error = 0;
+		SOLISTEN_UNLOCK(head);
+		return (error);
+	}
+	if ((head->so_state & SS_NBIO) && TAILQ_EMPTY(&head->sol_comp)) {
+		SOLISTEN_UNLOCK(head);
+		return (EWOULDBLOCK);
+	}
+	so = TAILQ_FIRST(&head->sol_comp);
+	SOCK_LOCK(so);
+	KASSERT(so->so_qstate == SQ_COMP,
+	    ("%s: so %p not SQ_COMP", __func__, so));
+	soref(so);
+	head->sol_qlen--;
+	so->so_qstate = SQ_NONE;
+	so->so_listen = NULL;
+	TAILQ_REMOVE(&head->sol_comp, so, so_list);
+	if (flags & ACCEPT4_INHERIT)
+		so->so_state |= (head->so_state & SS_NBIO);
+	else
+		so->so_state |= (flags & SOCK_NONBLOCK) ? SS_NBIO : 0;
+	SOCK_UNLOCK(so);
+	sorele(head);
+
+	*ret = so;
+	return (0);
+}
+
+/*
  * Evaluate the reference count and named references on a socket; if no
  * references remain, free it. This should be called whenever a reference is
  * released, such as in sorele(), but also when named reference flags are
@@ -774,44 +964,62 @@ void
 sofree(struct socket *so)
 {
 	struct protosw *pr = so->so_proto;
-	struct socket *head;
 
-	ACCEPT_LOCK_ASSERT();
 	SOCK_LOCK_ASSERT(so);
 
 	if ((so->so_state & SS_NOFDREF) == 0 || so->so_count != 0 ||
-	    (so->so_state & SS_PROTOREF) || (so->so_qstate & SQ_COMP)) {
+	    (so->so_state & SS_PROTOREF) || (so->so_qstate == SQ_COMP)) {
 		SOCK_UNLOCK(so);
-		ACCEPT_UNLOCK();
 		return;
 	}
 
-	head = so->so_head;
-	if (head != NULL) {
-		KASSERT((so->so_qstate & SQ_COMP) != 0 ||
-		    (so->so_qstate & SQ_INCOMP) != 0,
-		    ("sofree: so_head != NULL, but neither SQ_COMP nor "
-		    "SQ_INCOMP"));
-		KASSERT((so->so_qstate & SQ_COMP) == 0 ||
-		    (so->so_qstate & SQ_INCOMP) == 0,
-		    ("sofree: so->so_qstate is SQ_COMP and also SQ_INCOMP"));
-		TAILQ_REMOVE(&head->so_incomp, so, so_list);
-		head->so_incqlen--;
-		so->so_qstate &= ~SQ_INCOMP;
-		so->so_head = NULL;
+	if (!SOLISTENING(so) && so->so_qstate == SQ_INCOMP) {
+		struct socket *sol;
+
+		sol = so->so_listen;
+		KASSERT(sol, ("%s: so %p on incomp of NULL", __func__, so));
+
+		/*
+		 * To solve race between close of a listening socket and
+		 * a socket on its incomplete queue, we need to lock both.
+		 * The order is first listening socket, then regular.
+		 * Since we don't have SS_NOFDREF neither SS_PROTOREF, this
+		 * function and the listening socket are the only pointers
+		 * to so. To preserve so and sol, we reference both and then
+		 * relock.
+		 * After relock the socket may not move to so_comp since it
+		 * doesn't have PCB already, but it may be removed from
+		 * so_incomp. If that happens, we share responsiblity on
+		 * freeing the socket, but soclose() has already removed

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
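
The calling contract of solisten_dequeue() shown in the diff above (listener locked on entry, unlocked on every return path, pending error reported once, EWOULDBLOCK for a non-blocking listener with an empty queue) can be sketched in userland. This is only an illustrative model, not kernel code: struct listener, struct conn, listener_init(), listener_enqueue() and listener_dequeue() are hypothetical names, and a pthread mutex and condition variable stand in for the socket mutex and msleep()/wakeup_one().

```c
/* Userland model only; none of these names exist in the FreeBSD kernel. */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct conn {				/* hypothetical completed connection */
	struct conn *next;
	int id;
};

struct listener {			/* hypothetical stand-in for a listening socket */
	pthread_mutex_t lock;		/* plays the role of the per-socket mutex */
	pthread_cond_t ready;		/* plays the role of the sleep channel */
	struct conn *head, *tail;	/* completed-connection queue */
	int qlen;
	int error;			/* pending error, reported once */
	bool nonblock;
};

void
listener_init(struct listener *l, bool nonblock)
{
	pthread_mutex_init(&l->lock, NULL);
	pthread_cond_init(&l->ready, NULL);
	l->head = l->tail = NULL;
	l->qlen = 0;
	l->error = 0;
	l->nonblock = nonblock;
}

/* Append a completed connection and wake one waiter. */
void
listener_enqueue(struct listener *l, struct conn *c)
{
	pthread_mutex_lock(&l->lock);
	c->next = NULL;
	if (l->tail != NULL)
		l->tail->next = c;
	else
		l->head = c;
	l->tail = c;
	l->qlen++;
	pthread_cond_signal(&l->ready);
	pthread_mutex_unlock(&l->lock);
}

/*
 * Must be called with l->lock held; every return path releases it,
 * mirroring the "locked on entry, unlocked on return" rule.
 */
int
listener_dequeue(struct listener *l, struct conn **ret)
{
	struct conn *c;
	int error;

	/* Sleep only if blocking, nothing is queued and no error is pending. */
	while (!l->nonblock && l->head == NULL && l->error == 0)
		pthread_cond_wait(&l->ready, &l->lock);
	if (l->error != 0) {
		error = l->error;
		l->error = 0;		/* the error is consumed by this caller */
		pthread_mutex_unlock(&l->lock);
		return (error);
	}
	if (l->head == NULL) {		/* only reachable for a non-blocking listener */
		pthread_mutex_unlock(&l->lock);
		return (EWOULDBLOCK);
	}
	c = l->head;
	l->head = c->next;
	if (l->head == NULL)
		l->tail = NULL;
	assert(l->qlen > 0);
	l->qlen--;
	pthread_mutex_unlock(&l->lock);

	*ret = c;
	return (0);
}
```

A caller takes the lock, calls listener_dequeue(), and must not unlock afterwards, whatever the return value. The model leaves out reference counting (soref()/sorele()), the per-connection lock and the accept4(2) flag handling that the real function performs.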