From owner-freebsd-performance@FreeBSD.ORG Sat May 6 14:16:50 2006
Date: Sat, 6 May 2006 15:16:48 +0100 (BST)
From: Robert Watson
To: current@FreeBSD.org
Cc: performance@FreeBSD.org
Subject: Fine-grained locking for POSIX local sockets (UNIX domain sockets)
Message-ID: <20060506150622.C17611@fledge.watson.org>

Dear all,

Attached, please find a patch implementing more fine-grained locking for
the POSIX local socket subsystem (UNIX domain socket subsystem).  In the
current implementation, we use a single global subsystem lock to protect
all local IPC over the PF_LOCAL socket type.  This has low overhead, but
can result in significant contention, especially for workloads like MySQL,
which push a great deal of data through UNIX domain sockets and involve
many parties.  The hope is that by improving granularity, we can reduce
contention sufficiently to overcome the added cost of increased locking
overhead (a side-effect of greater granularity).

At a high level, here are the changes made:

- Introduced a mutex for each UNIX domain socket protocol control block
  (unpcb).

- Used the global lock as a guard lock around simultaneous locking of two
  unpcbs, which have no defined lock order.

- Introduced new states associated with "binding" and "connecting" of
  sockets -- this corrects several long-standing race conditions in this
  code relating to blocking file system lookups, especially where sockets
  are shared between processes or threads.

Certain critical paths do still acquire the global lock, since they involve
multiple pcbs (send, receive), but the scope is now reduced.  I hope to
further refine these changes to eliminate the acquisition of the global
lock in these critical cases.

In local measurements, I have observed a 0% change in MySQL performance on
uniprocessor systems, and on a dual-processor system I have observed a
4%-5% performance improvement with two client MySQL threads.

I'm looking for two things:

(1) Stability and correctness testing.  Does this break your applications
    and/or kernel?

(2) Performance assessment.  Does this change performance for critical
    applications that use UNIX domain sockets, such as MySQL, especially
    with respect to CPU count?

I hope to get this committed to the 7.x tree in the next week or so if
there is sufficient positive feedback on performance, and sufficiently
little negative feedback on stability and correctness.
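To make the guard-lock idiom concrete, here is a minimal user-space sketch
of the pattern in pthreads terms.  This is an illustration only, not code
from the patch, and the names (struct obj, guard, transfer) are invented:
a path that must hold two per-object locks, which have no defined order,
first takes a global guard mutex, so no two threads can each hold one
object lock while waiting for the other's.

#include <pthread.h>
#include <stdio.h>

struct obj {
	pthread_mutex_t lock;	/* plays the role of the per-unpcb unp_mtx */
	int cc;			/* some per-object state */
};

static pthread_mutex_t guard = PTHREAD_MUTEX_INITIALIZER;

/* Move "n" units from one object to its peer, locking both. */
static void
transfer(struct obj *a, struct obj *b, int n)
{
	pthread_mutex_lock(&guard);	/* serialize multi-lock acquisition */
	pthread_mutex_lock(&a->lock);	/* a/b order deliberately undefined */
	pthread_mutex_lock(&b->lock);
	a->cc -= n;
	b->cc += n;
	pthread_mutex_unlock(&b->lock);
	pthread_mutex_unlock(&a->lock);
	pthread_mutex_unlock(&guard);
}

int
main(void)
{
	struct obj x = { PTHREAD_MUTEX_INITIALIZER, 10 };
	struct obj y = { PTHREAD_MUTEX_INITIALIZER, 0 };

	transfer(&x, &y, 5);
	printf("%d %d\n", x.cc, y.cc);
	return (0);
}

Paths that only ever lock a single object need not touch the guard; in the
patch, it is the operations that may lock more than one unpcb at a time
(send, receive, connect) that hold the global UNIX domain socket lock for
this reason.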
Thanks, Robert N M Watson --- //depot/vendor/freebsd/src/sys/kern/uipc_usrreq.c 2006/04/24 19:10:17 +++ //depot/user/rwatson/proto/src/sys/kern/uipc_usrreq.c 2006/05/06 12:44:20 @@ -88,32 +88,99 @@ struct mbuf *unp_addsockcred(struct thread *, struct mbuf *); /* - * Currently, UNIX domain sockets are protected by a single subsystem lock, - * which covers global data structures and variables, the contents of each - * per-socket unpcb structure, and the so_pcb field in sockets attached to - * the UNIX domain. This provides for a moderate degree of paralellism, as - * receive operations on UNIX domain sockets do not need to acquire the - * subsystem lock. Finer grained locking to permit send() without acquiring - * a global lock would be a logical next step. + * Both send and receive buffers are allocated PIPSIZ bytes of buffering + * for stream sockets, although the total for sender and receiver is + * actually only PIPSIZ. + * Datagram sockets really use the sendspace as the maximum datagram size, + * and don't really want to reserve the sendspace. Their recvspace should + * be large enough for at least one max-size datagram plus address. + */ +#ifndef PIPSIZ +#define PIPSIZ 8192 +#endif +static u_long unpst_sendspace = PIPSIZ; +static u_long unpst_recvspace = PIPSIZ; +static u_long unpdg_sendspace = 2*1024; /* really max datagram size */ +static u_long unpdg_recvspace = 4*1024; + +static int unp_rights; /* file descriptors in flight */ + +SYSCTL_DECL(_net_local_stream); +SYSCTL_ULONG(_net_local_stream, OID_AUTO, sendspace, CTLFLAG_RW, + &unpst_sendspace, 0, ""); +SYSCTL_ULONG(_net_local_stream, OID_AUTO, recvspace, CTLFLAG_RW, + &unpst_recvspace, 0, ""); +SYSCTL_DECL(_net_local_dgram); +SYSCTL_ULONG(_net_local_dgram, OID_AUTO, maxdgram, CTLFLAG_RW, + &unpdg_sendspace, 0, ""); +SYSCTL_ULONG(_net_local_dgram, OID_AUTO, recvspace, CTLFLAG_RW, + &unpdg_recvspace, 0, ""); +SYSCTL_DECL(_net_local); +SYSCTL_INT(_net_local, OID_AUTO, inflight, CTLFLAG_RD, &unp_rights, 0, ""); + +/* + * Locking and synchronization: + * + * A global UNIX domain socket mutex protects all global variables in the + * implementation, as well as the linked lists tracking the set of allocated + * UNIX domain sockets. These variables/fields may be read lockless using + * atomic operations if stale values are permissible; otherwise the global + * mutex is required to read or read-modify-write. The global mutex also + * serves to prevent deadlock when multiple PCB locks may be acquired at once + * (see below). Finally, the global mutex protects uncounted references from + * vnodes to sockets bound to those vnodes: to safely dereference the + * v_socket pointer, the global mutex must be held while a full reference is + * acquired. + * + * UNIX domain sockets each have one unpcb PCB associated with them from + * pru_attach() to pru_detach() via the so_pcb pointer. The validity of that + * reference is an invariant for the lifetime of the socket, so no lock is + * required to dereference the so_pcb pointer if a valid socket reference is + * held. + * + * Each PCB has a back-pointer to its socket, unp_socket. This pointer may + * only be safely dereferenced as long as a valid reference to the PCB is + * held. Typically, this reference will be from the socket, or from another + * PCB when the referring PCB's lock is held (in order that the reference not + * be invalidated during use). In particular, to follow + * unp->unp_conn->unp_socket, you need unlock the lock on unp, not unp_conn. 
+ * + * Fields of PCBs are locked using a per-unpcb lock, unp_mtx. Individual + * atomic reads without the lock may be performed "lockless", but more + * complex reads and read-modify-writes require the mutex to be held. No + * lock order is defined between PCB locks -- multiple PCB locks may be + * acquired at the same time only when holding the global UNIX domain socket + * mutex, which prevents deadlocks. To prevent inter-PCB references from + * becoming invalid, the lock protecting the reference must be held for the + * lifetime of use of the reference. * - * The UNIX domain socket lock preceds all socket layer locks, including the - * socket lock and socket buffer lock, permitting UNIX domain socket code to - * call into socket support routines without releasing its locks. + * Blocking with UNIX domain sockets is a tricky issue: unlike most network + * protocols, bind() is a non-atomic operation, and connect() requires + * potential sleeping in the protocol, due to potentially waiting on local or + * distributed file systems. We try to separate "lookup" operations, which + * may sleep, and the IPC operations themselves, which typically can occur + * with relative atomicity as locks can be held over the entire operation. * - * Some caution is required in areas where the UNIX domain socket code enters - * VFS in order to create or find rendezvous points. This results in - * dropping of the UNIX domain socket subsystem lock, acquisition of the - * Giant lock, and potential sleeping. This increases the chances of races, - * and exposes weaknesses in the socket->protocol API by offering poor - * failure modes. + * Another tricky issue is simultaneous multi-threaded or multi-process + * access to a single UNIX domain socket. These are handled by the flags + * UNP_CONNECTING and UNP_BINDING. 
*/ -static struct mtx unp_mtx; -#define UNP_LOCK_INIT() \ - mtx_init(&unp_mtx, "unp", NULL, MTX_DEF) -#define UNP_LOCK() mtx_lock(&unp_mtx) -#define UNP_UNLOCK() mtx_unlock(&unp_mtx) -#define UNP_LOCK_ASSERT() mtx_assert(&unp_mtx, MA_OWNED) -#define UNP_UNLOCK_ASSERT() mtx_assert(&unp_mtx, MA_NOTOWNED) +static struct mtx unp_global_mtx; + +#define UNP_GLOBAL_LOCK_INIT() mtx_init(&unp_global_mtx, \ + "unp_global_mtx", NULL, MTX_DEF) +#define UNP_GLOBAL_LOCK() mtx_lock(&unp_global_mtx) +#define UNP_GLOBAL_UNLOCK() mtx_unlock(&unp_global_mtx) +#define UNP_GLOBAL_UNLOCK_ASSERT() mtx_assert(&unp_global_mtx, MA_NOTOWNED) +#define UNP_GLOBAL_LOCK_ASSERT() mtx_assert(&unp_global_mtx, MA_OWNED) + +#define UNP_PCB_LOCK_INIT(unp) mtx_init(&(unp)->unp_mtx, \ + "unp_mtx", "unp_mtx", \ + MTX_DUPOK|MTX_DEF|MTX_RECURSE) +#define UNP_PCB_LOCK_DESTROY(unp) mtx_destroy(&(unp)->unp_mtx) +#define UNP_PCB_LOCK(unp) mtx_lock(&(unp)->unp_mtx) +#define UNP_PCB_UNLOCK(unp) mtx_unlock(&(unp)->unp_mtx) +#define UNP_PCB_LOCK_ASSERT(unp) mtx_assert(&(unp)->unp_mtx, MA_OWNED) /* * Garbage collection of cyclic file descriptor/socket references occurs @@ -123,12 +190,10 @@ */ static struct task unp_gc_task; -static int unp_attach(struct socket *); static void unp_detach(struct unpcb *); -static int unp_bind(struct unpcb *,struct sockaddr *, struct thread *); static int unp_connect(struct socket *,struct sockaddr *, struct thread *); static int unp_connect2(struct socket *so, struct socket *so2, int); -static void unp_disconnect(struct unpcb *); +static void unp_disconnect(struct unpcb *unp, struct unpcb *unp2); static void unp_shutdown(struct unpcb *); static void unp_drop(struct unpcb *, int); static void unp_gc(__unused void *, int); @@ -137,8 +202,6 @@ static void unp_discard(struct file *); static void unp_freerights(struct file **, int); static int unp_internalize(struct mbuf **, struct thread *); -static int unp_listen(struct socket *, struct unpcb *, int, - struct thread *); static void uipc_abort(struct socket *so) @@ -147,83 +210,238 @@ unp = sotounpcb(so); KASSERT(unp != NULL, ("uipc_abort: unp == NULL")); - UNP_LOCK(); + + UNP_GLOBAL_LOCK(); + UNP_PCB_LOCK(unp); unp_drop(unp, ECONNABORTED); unp_detach(unp); - UNP_UNLOCK_ASSERT(); + UNP_GLOBAL_UNLOCK_ASSERT(); } static int uipc_accept(struct socket *so, struct sockaddr **nam) { - struct unpcb *unp; + struct unpcb *unp, *unp2; const struct sockaddr *sa; /* - * Pass back name of connected socket, - * if it was bound and we are still connected - * (our peer may have closed already!). + * Pass back name of connected socket, if it was bound and we are + * still connected (our peer may have closed already!). 
*/ unp = sotounpcb(so); KASSERT(unp != NULL, ("uipc_accept: unp == NULL")); + *nam = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK); - UNP_LOCK(); - if (unp->unp_conn != NULL && unp->unp_conn->unp_addr != NULL) + UNP_GLOBAL_LOCK(); + UNP_PCB_LOCK(unp); + unp2 = unp->unp_conn; + if (unp->unp_conn != NULL && unp->unp_conn->unp_addr != NULL) { + UNP_PCB_LOCK(unp2); sa = (struct sockaddr *) unp->unp_conn->unp_addr; - else + bcopy(sa, *nam, sa->sa_len); + UNP_PCB_UNLOCK(unp2); + } else { sa = &sun_noname; - bcopy(sa, *nam, sa->sa_len); - UNP_UNLOCK(); + bcopy(sa, *nam, sa->sa_len); + } + UNP_PCB_UNLOCK(unp); + UNP_GLOBAL_UNLOCK(); return (0); } static int uipc_attach(struct socket *so, int proto, struct thread *td) { + u_long sendspace, recvspace; + struct unpcb *unp; + int error; + + KASSERT(so->so_pcb == NULL, ("uipc_attach: so_pcb != NULL")); + if (so->so_snd.sb_hiwat == 0 || so->so_rcv.sb_hiwat == 0) { + switch (so->so_type) { + case SOCK_STREAM: + sendspace = unpst_sendspace; + recvspace = unpst_recvspace; + break; + + case SOCK_DGRAM: + sendspace = unpdg_sendspace; + recvspace = unpdg_recvspace; + break; + + default: + panic("uipc_attach"); + } + error = soreserve(so, sendspace, recvspace); + if (error) + return (error); + } + unp = uma_zalloc(unp_zone, M_WAITOK | M_ZERO); + if (unp == NULL) + return (ENOBUFS); + LIST_INIT(&unp->unp_refs); + UNP_PCB_LOCK_INIT(unp); + unp->unp_socket = so; + so->so_pcb = unp; + + UNP_GLOBAL_LOCK(); + unp->unp_gencnt = ++unp_gencnt; + unp_count++; + LIST_INSERT_HEAD(so->so_type == SOCK_DGRAM ? &unp_dhead + : &unp_shead, unp, unp_link); + UNP_GLOBAL_UNLOCK(); - return (unp_attach(so)); + return (0); } static int uipc_bind(struct socket *so, struct sockaddr *nam, struct thread *td) { + struct sockaddr_un *soun = (struct sockaddr_un *)nam; + struct vnode *vp; + struct mount *mp; + struct vattr vattr; + int error, namelen; + struct nameidata nd; struct unpcb *unp; - int error; + char *buf; unp = sotounpcb(so); KASSERT(unp != NULL, ("uipc_bind: unp == NULL")); - UNP_LOCK(); - error = unp_bind(unp, nam, td); - UNP_UNLOCK(); + + namelen = soun->sun_len - offsetof(struct sockaddr_un, sun_path); + if (namelen <= 0) + return (EINVAL); + + /* + * We don't allow simultaneous bind() calls on a single UNIX domain + * socket, so flag in-progress operations, and return an error if an + * operation is already in progress. + * + * Historically, we have not allowed a socket to be rebound, so this + * also returns an error. Not allowing re-binding certainly + * simplifies the implementation and avoids a great many possible + * failure modes. 
+ */ + UNP_PCB_LOCK(unp); + if (unp->unp_vnode != NULL) { + UNP_PCB_UNLOCK(unp); + return (EINVAL); + } + if (unp->unp_flags & UNP_BINDING) { + UNP_PCB_UNLOCK(unp); + return (EALREADY); + } + unp->unp_flags |= UNP_BINDING; + UNP_PCB_UNLOCK(unp); + + buf = malloc(namelen + 1, M_TEMP, M_WAITOK); + strlcpy(buf, soun->sun_path, namelen + 1); + + mtx_lock(&Giant); +restart: + mtx_assert(&Giant, MA_OWNED); + NDINIT(&nd, CREATE, NOFOLLOW | LOCKPARENT | SAVENAME, UIO_SYSSPACE, + buf, td); +/* SHOULD BE ABLE TO ADOPT EXISTING AND wakeup() ALA FIFO's */ + error = namei(&nd); + if (error) + goto error; + vp = nd.ni_vp; + if (vp != NULL || vn_start_write(nd.ni_dvp, &mp, V_NOWAIT) != 0) { + NDFREE(&nd, NDF_ONLY_PNBUF); + if (nd.ni_dvp == vp) + vrele(nd.ni_dvp); + else + vput(nd.ni_dvp); + if (vp != NULL) { + vrele(vp); + error = EADDRINUSE; + goto error; + } + error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH); + if (error) + goto error; + goto restart; + } + VATTR_NULL(&vattr); + vattr.va_type = VSOCK; + vattr.va_mode = (ACCESSPERMS & ~td->td_proc->p_fd->fd_cmask); +#ifdef MAC + error = mac_check_vnode_create(td->td_ucred, nd.ni_dvp, &nd.ni_cnd, + &vattr); +#endif + if (error == 0) { + VOP_LEASE(nd.ni_dvp, td, td->td_ucred, LEASE_WRITE); + error = VOP_CREATE(nd.ni_dvp, &nd.ni_vp, &nd.ni_cnd, &vattr); + } + NDFREE(&nd, NDF_ONLY_PNBUF); + vput(nd.ni_dvp); + if (error) { + vn_finished_write(mp); + goto error; + } + vp = nd.ni_vp; + ASSERT_VOP_LOCKED(vp, "uipc_bind"); + soun = (struct sockaddr_un *)sodupsockaddr(nam, M_WAITOK); + + /* + * XXXRW: handle race against another consumer also frobbing + * v_socket? Or not. + */ + UNP_GLOBAL_LOCK(); + UNP_PCB_LOCK(unp); + vp->v_socket = unp->unp_socket; + unp->unp_vnode = vp; + unp->unp_addr = soun; + unp->unp_flags &= ~UNP_BINDING; + UNP_PCB_UNLOCK(unp); + UNP_GLOBAL_UNLOCK(); + VOP_UNLOCK(vp, 0, td); + vn_finished_write(mp); + mtx_unlock(&Giant); + free(buf, M_TEMP); + return (0); + +error: + UNP_PCB_LOCK(unp); + unp->unp_flags &= ~UNP_BINDING; + UNP_PCB_UNLOCK(unp); + mtx_unlock(&Giant); + free(buf, M_TEMP); return (error); } static int uipc_connect(struct socket *so, struct sockaddr *nam, struct thread *td) { - struct unpcb *unp; int error; KASSERT(td == curthread, ("uipc_connect: td != curthread")); - unp = sotounpcb(so); - KASSERT(unp != NULL, ("uipc_connect: unp == NULL")); - UNP_LOCK(); + + UNP_GLOBAL_LOCK(); error = unp_connect(so, nam, td); - UNP_UNLOCK(); + UNP_GLOBAL_UNLOCK(); return (error); } int uipc_connect2(struct socket *so1, struct socket *so2) { - struct unpcb *unp; + struct unpcb *unp, *unp2; int error; - unp = sotounpcb(so1); + UNP_GLOBAL_LOCK(); + unp = so1->so_pcb; KASSERT(unp != NULL, ("uipc_connect2: unp == NULL")); - UNP_LOCK(); + UNP_PCB_LOCK(unp); + unp2 = so2->so_pcb; + KASSERT(unp2 != NULL, ("uipc_connect2: unp2 == NULL")); + UNP_PCB_LOCK(unp2); error = unp_connect2(so1, so2, PRU_CONNECT2); - UNP_UNLOCK(); + UNP_PCB_UNLOCK(unp2); + UNP_PCB_UNLOCK(unp); + UNP_GLOBAL_UNLOCK(); return (error); } @@ -236,21 +454,31 @@ unp = sotounpcb(so); KASSERT(unp != NULL, ("uipc_detach: unp == NULL")); - UNP_LOCK(); + + UNP_GLOBAL_LOCK(); + UNP_PCB_LOCK(unp); unp_detach(unp); - UNP_UNLOCK_ASSERT(); + UNP_GLOBAL_UNLOCK_ASSERT(); } static int uipc_disconnect(struct socket *so) { - struct unpcb *unp; + struct unpcb *unp, *unp2; unp = sotounpcb(so); KASSERT(unp != NULL, ("uipc_disconnect: unp == NULL")); - UNP_LOCK(); - unp_disconnect(unp); - UNP_UNLOCK(); + + UNP_GLOBAL_LOCK(); + UNP_PCB_LOCK(unp); + unp2 = unp->unp_conn; + if (unp2 != NULL) { + 
UNP_PCB_LOCK(unp2); + unp_disconnect(unp, unp2); + UNP_PCB_UNLOCK(unp2); + } + UNP_PCB_UNLOCK(unp); + UNP_GLOBAL_UNLOCK(); return (0); } @@ -262,81 +490,105 @@ unp = sotounpcb(so); KASSERT(unp != NULL, ("uipc_listen: unp == NULL")); - UNP_LOCK(); + + UNP_PCB_LOCK(unp); if (unp->unp_vnode == NULL) { - UNP_UNLOCK(); + UNP_PCB_UNLOCK(unp); return (EINVAL); } - error = unp_listen(so, unp, backlog, td); - UNP_UNLOCK(); + + SOCK_LOCK(so); + error = solisten_proto_check(so); + if (error == 0) { + cru2x(td->td_ucred, &unp->unp_peercred); + unp->unp_flags |= UNP_HAVEPCCACHED; + solisten_proto(so, backlog); + } + SOCK_UNLOCK(so); + UNP_PCB_UNLOCK(unp); return (error); } static int uipc_peeraddr(struct socket *so, struct sockaddr **nam) { - struct unpcb *unp; + struct unpcb *unp, *unp2; const struct sockaddr *sa; unp = sotounpcb(so); KASSERT(unp != NULL, ("uipc_peeraddr: unp == NULL")); + *nam = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK); - UNP_LOCK(); - if (unp->unp_conn != NULL && unp->unp_conn->unp_addr!= NULL) - sa = (struct sockaddr *) unp->unp_conn->unp_addr; - else { - /* - * XXX: It seems that this test always fails even when - * connection is established. So, this else clause is - * added as workaround to return PF_LOCAL sockaddr. - */ + UNP_PCB_LOCK(unp); + /* + * XXX: It seems that this test always fails even when connection is + * established. So, this else clause is added as workaround to + * return PF_LOCAL sockaddr. + */ + unp2 = unp->unp_conn; + if (unp2 != NULL) { + UNP_PCB_LOCK(unp2); + if (unp2->unp_addr != NULL) + sa = (struct sockaddr *) unp->unp_conn->unp_addr; + else + sa = &sun_noname; + bcopy(sa, *nam, sa->sa_len); + UNP_PCB_UNLOCK(unp2); + } else { sa = &sun_noname; + bcopy(sa, *nam, sa->sa_len); } - bcopy(sa, *nam, sa->sa_len); - UNP_UNLOCK(); + UNP_PCB_UNLOCK(unp); return (0); } static int uipc_rcvd(struct socket *so, int flags) { - struct unpcb *unp; + struct unpcb *unp, *unp2; struct socket *so2; u_long newhiwat; unp = sotounpcb(so); KASSERT(unp != NULL, ("uipc_rcvd: unp == NULL")); - UNP_LOCK(); - switch (so->so_type) { - case SOCK_DGRAM: + + if (so->so_type == SOCK_DGRAM) panic("uipc_rcvd DGRAM?"); - /*NOTREACHED*/ - case SOCK_STREAM: - if (unp->unp_conn == NULL) - break; - so2 = unp->unp_conn->unp_socket; - SOCKBUF_LOCK(&so2->so_snd); - SOCKBUF_LOCK(&so->so_rcv); - /* - * Adjust backpressure on sender - * and wakeup any waiting to write. - */ - so2->so_snd.sb_mbmax += unp->unp_mbcnt - so->so_rcv.sb_mbcnt; - unp->unp_mbcnt = so->so_rcv.sb_mbcnt; - newhiwat = so2->so_snd.sb_hiwat + unp->unp_cc - - so->so_rcv.sb_cc; - (void)chgsbsize(so2->so_cred->cr_uidinfo, &so2->so_snd.sb_hiwat, - newhiwat, RLIM_INFINITY); - unp->unp_cc = so->so_rcv.sb_cc; - SOCKBUF_UNLOCK(&so->so_rcv); - sowwakeup_locked(so2); - break; + if (so->so_type != SOCK_STREAM) + panic("uipc_rcvd unknown socktype"); - default: - panic("uipc_rcvd unknown socktype"); + /* + * Adjust backpressure on sender and wakeup any waiting to write. + * + * The consistency requirements here are a bit complex: we must + * acquire the lock for our own unpcb in order to prevent it from + * disconnecting while in use, changing the unp_conn peer. We do not + * need unp2's lock, since the unp2->unp_socket pointer will remain + * static as long as the unp2 pcb is valid, which it will be until we + * release unp's lock to allow a disconnect. We do need socket + * mutexes for both socket endpoints since we manipulate fields in + * both; we hold both locks at once since we access both + * simultaneously. 
+ */ + UNP_PCB_LOCK(unp); + unp2 = unp->unp_conn; + if (unp2 == NULL) { + UNP_PCB_UNLOCK(unp); + return (0); } - UNP_UNLOCK(); + so2 = unp2->unp_socket; + SOCKBUF_LOCK(&so2->so_snd); + SOCKBUF_LOCK(&so->so_rcv); + so2->so_snd.sb_mbmax += unp->unp_mbcnt - so->so_rcv.sb_mbcnt; + unp->unp_mbcnt = so->so_rcv.sb_mbcnt; + newhiwat = so2->so_snd.sb_hiwat + unp->unp_cc - so->so_rcv.sb_cc; + (void)chgsbsize(so2->so_cred->cr_uidinfo, &so2->so_snd.sb_hiwat, + newhiwat, RLIM_INFINITY); + unp->unp_cc = so->so_rcv.sb_cc; + SOCKBUF_UNLOCK(&so->so_rcv); + sowwakeup_locked(so2); + UNP_PCB_UNLOCK(unp); return (0); } @@ -346,13 +598,14 @@ uipc_send(struct socket *so, int flags, struct mbuf *m, struct sockaddr *nam, struct mbuf *control, struct thread *td) { - int error = 0; - struct unpcb *unp; + struct unpcb *unp, *unp2; struct socket *so2; u_long newhiwat; + int error = 0; unp = sotounpcb(so); KASSERT(unp != NULL, ("uipc_send: unp == NULL")); + if (flags & PRUS_OOB) { error = EOPNOTSUPP; goto release; @@ -361,32 +614,38 @@ if (control != NULL && (error = unp_internalize(&control, td))) goto release; - UNP_LOCK(); + UNP_GLOBAL_LOCK(); + UNP_PCB_LOCK(unp); switch (so->so_type) { case SOCK_DGRAM: { const struct sockaddr *from; + unp2 = unp->unp_conn; if (nam != NULL) { - if (unp->unp_conn != NULL) { + if (unp2 != NULL) { error = EISCONN; break; } + UNP_PCB_UNLOCK(unp); error = unp_connect(so, nam, td); + UNP_PCB_LOCK(unp); if (error) break; + unp2 = unp->unp_conn; } else { - if (unp->unp_conn == NULL) { + if (unp2 == NULL) { error = ENOTCONN; break; } } - so2 = unp->unp_conn->unp_socket; + UNP_PCB_LOCK(unp2); + so2 = unp2->unp_socket; if (unp->unp_addr != NULL) from = (struct sockaddr *)unp->unp_addr; else from = &sun_noname; - if (unp->unp_conn->unp_flags & UNP_WANTCRED) + if (unp2->unp_flags & UNP_WANTCRED) control = unp_addsockcred(td, control); SOCKBUF_LOCK(&so2->so_rcv); if (sbappendaddr_locked(&so2->so_rcv, from, m, control)) { @@ -398,19 +657,22 @@ error = ENOBUFS; } if (nam != NULL) - unp_disconnect(unp); + unp_disconnect(unp, unp2); + UNP_PCB_UNLOCK(unp2); break; } case SOCK_STREAM: /* Connect if not connected yet. */ /* - * Note: A better implementation would complain - * if not equal to the peer's address. + * Note: A better implementation would complain if not equal + * to the peer's address. */ if ((so->so_state & SS_ISCONNECTED) == 0) { if (nam != NULL) { + UNP_PCB_UNLOCK(unp); error = unp_connect(so, nam, td); + UNP_PCB_LOCK(unp); if (error) break; /* XXX */ } else { @@ -419,22 +681,34 @@ } } + /* + * Lock order here has to be handled carefully: we hold the + * global lock, so acquiring two unpcb locks is OK. We must + * acquire both before acquiring any socket mutexes. We must + * also acquire the local socket send mutex before the remote + * socket receive mutex. The only tricky thing is making + * sure to acquire the unp2 lock before the local socket send + * lock, or we will experience deadlocks. + */ + unp2 = unp->unp_conn; + KASSERT(unp2 != NULL, + ("uipc_send connected but no connection?")); + UNP_PCB_LOCK(unp2); SOCKBUF_LOCK(&so->so_snd); if (so->so_snd.sb_state & SBS_CANTSENDMORE) { SOCKBUF_UNLOCK(&so->so_snd); + UNP_PCB_UNLOCK(unp2); error = EPIPE; break; } - if (unp->unp_conn == NULL) - panic("uipc_send connected but no connection?"); - so2 = unp->unp_conn->unp_socket; + so2 = unp2->unp_socket; SOCKBUF_LOCK(&so2->so_rcv); - if (unp->unp_conn->unp_flags & UNP_WANTCRED) { + if (unp2->unp_flags & UNP_WANTCRED) { /* * Credentials are passed only once on * SOCK_STREAM. 
*/ - unp->unp_conn->unp_flags &= ~UNP_WANTCRED; + unp2->unp_flags &= ~UNP_WANTCRED; control = unp_addsockcred(td, control); } /* @@ -445,19 +719,19 @@ if (control != NULL) { if (sbappendcontrol_locked(&so2->so_rcv, m, control)) control = NULL; - } else { + } else sbappend_locked(&so2->so_rcv, m); - } so->so_snd.sb_mbmax -= so2->so_rcv.sb_mbcnt - unp->unp_conn->unp_mbcnt; - unp->unp_conn->unp_mbcnt = so2->so_rcv.sb_mbcnt; + unp2->unp_mbcnt = so2->so_rcv.sb_mbcnt; newhiwat = so->so_snd.sb_hiwat - - (so2->so_rcv.sb_cc - unp->unp_conn->unp_cc); + (so2->so_rcv.sb_cc - unp2->unp_cc); (void)chgsbsize(so->so_cred->cr_uidinfo, &so->so_snd.sb_hiwat, newhiwat, RLIM_INFINITY); SOCKBUF_UNLOCK(&so->so_snd); - unp->unp_conn->unp_cc = so2->so_rcv.sb_cc; + unp2->unp_cc = so2->so_rcv.sb_cc; sorwakeup_locked(so2); + UNP_PCB_UNLOCK(unp2); m = NULL; break; @@ -473,7 +747,8 @@ socantsendmore(so); unp_shutdown(unp); } - UNP_UNLOCK(); + UNP_PCB_UNLOCK(unp); + UNP_GLOBAL_UNLOCK(); if (control != NULL && error != 0) unp_dispose(control); @@ -489,22 +764,28 @@ static int uipc_sense(struct socket *so, struct stat *sb) { - struct unpcb *unp; + struct unpcb *unp, *unp2; struct socket *so2; unp = sotounpcb(so); KASSERT(unp != NULL, ("uipc_sense: unp == NULL")); - UNP_LOCK(); + + UNP_GLOBAL_LOCK(); + UNP_PCB_LOCK(unp); sb->st_blksize = so->so_snd.sb_hiwat; - if (so->so_type == SOCK_STREAM && unp->unp_conn != NULL) { - so2 = unp->unp_conn->unp_socket; + unp2 = unp->unp_conn; + if (so->so_type == SOCK_STREAM && unp2 != NULL) { + UNP_PCB_LOCK(unp2); + so2 = unp2->unp_socket; sb->st_blksize += so2->so_rcv.sb_cc; + UNP_PCB_UNLOCK(unp2); } sb->st_dev = NODEV; if (unp->unp_ino == 0) unp->unp_ino = (++unp_ino == 0) ? ++unp_ino : unp_ino; sb->st_ino = unp->unp_ino; - UNP_UNLOCK(); + UNP_PCB_UNLOCK(unp); + UNP_GLOBAL_UNLOCK(); return (0); } @@ -515,10 +796,13 @@ unp = sotounpcb(so); KASSERT(unp != NULL, ("uipc_shutdown: unp == NULL")); - UNP_LOCK(); + + UNP_GLOBAL_LOCK(); + UNP_PCB_LOCK(unp); socantsendmore(so); unp_shutdown(unp); - UNP_UNLOCK(); + UNP_PCB_UNLOCK(unp); + UNP_GLOBAL_UNLOCK(); return (0); } @@ -530,14 +814,15 @@ unp = sotounpcb(so); KASSERT(unp != NULL, ("uipc_sockaddr: unp == NULL")); + *nam = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK); - UNP_LOCK(); + UNP_PCB_LOCK(unp); if (unp->unp_addr != NULL) sa = (struct sockaddr *) unp->unp_addr; else sa = &sun_noname; bcopy(sa, *nam, sa->sa_len); - UNP_UNLOCK(); + UNP_PCB_UNLOCK(unp); return (0); } @@ -574,12 +859,13 @@ unp = sotounpcb(so); KASSERT(unp != NULL, ("uipc_ctloutput: unp == NULL")); - UNP_LOCK(); + error = 0; switch (sopt->sopt_dir) { case SOPT_GET: switch (sopt->sopt_name) { case LOCAL_PEERCRED: + UNP_PCB_LOCK(unp); if (unp->unp_flags & UNP_HAVEPC) xu = unp->unp_peercred; else { @@ -588,22 +874,31 @@ else error = EINVAL; } + UNP_PCB_UNLOCK(unp); if (error == 0) error = sooptcopyout(sopt, &xu, sizeof(xu)); break; + case LOCAL_CREDS: + UNP_PCB_LOCK(unp); optval = unp->unp_flags & UNP_WANTCRED ? 1 : 0; + UNP_PCB_UNLOCK(unp); error = sooptcopyout(sopt, &optval, sizeof(optval)); break; + case LOCAL_CONNWAIT: + UNP_PCB_LOCK(unp); optval = unp->unp_flags & UNP_CONNWAIT ? 
1 : 0; + UNP_PCB_UNLOCK(unp); error = sooptcopyout(sopt, &optval, sizeof(optval)); break; + default: error = EOPNOTSUPP; break; } break; + case SOPT_SET: switch (sopt->sopt_name) { case LOCAL_CREDS: @@ -613,19 +908,24 @@ if (error) break; -#define OPTSET(bit) \ - if (optval) \ - unp->unp_flags |= bit; \ - else \ - unp->unp_flags &= ~bit; +#define OPTSET(bit) do { \ + UNP_PCB_LOCK(unp); \ + if (optval) \ + unp->unp_flags |= bit; \ + else \ + unp->unp_flags &= ~bit; \ + UNP_PCB_UNLOCK(unp); \ +} while (0) switch (sopt->sopt_name) { case LOCAL_CREDS: OPTSET(UNP_WANTCRED); break; + case LOCAL_CONNWAIT: OPTSET(UNP_CONNWAIT); break; + default: break; } @@ -636,117 +936,60 @@ break; } break; + default: error = EOPNOTSUPP; break; } - UNP_UNLOCK(); return (error); } -/* - * Both send and receive buffers are allocated PIPSIZ bytes of buffering - * for stream sockets, although the total for sender and receiver is - * actually only PIPSIZ. - * Datagram sockets really use the sendspace as the maximum datagram size, - * and don't really want to reserve the sendspace. Their recvspace should - * be large enough for at least one max-size datagram plus address. - */ -#ifndef PIPSIZ -#define PIPSIZ 8192 -#endif -static u_long unpst_sendspace = PIPSIZ; -static u_long unpst_recvspace = PIPSIZ; -static u_long unpdg_sendspace = 2*1024; /* really max datagram size */ -static u_long unpdg_recvspace = 4*1024; - -static int unp_rights; /* file descriptors in flight */ - -SYSCTL_DECL(_net_local_stream); -SYSCTL_ULONG(_net_local_stream, OID_AUTO, sendspace, CTLFLAG_RW, - &unpst_sendspace, 0, ""); -SYSCTL_ULONG(_net_local_stream, OID_AUTO, recvspace, CTLFLAG_RW, - &unpst_recvspace, 0, ""); -SYSCTL_DECL(_net_local_dgram); -SYSCTL_ULONG(_net_local_dgram, OID_AUTO, maxdgram, CTLFLAG_RW, - &unpdg_sendspace, 0, ""); -SYSCTL_ULONG(_net_local_dgram, OID_AUTO, recvspace, CTLFLAG_RW, - &unpdg_recvspace, 0, ""); -SYSCTL_DECL(_net_local); -SYSCTL_INT(_net_local, OID_AUTO, inflight, CTLFLAG_RD, &unp_rights, 0, ""); - -static int -unp_attach(struct socket *so) -{ - struct unpcb *unp; - int error; - - KASSERT(so->so_pcb == NULL, ("unp_attach: so_pcb != NULL")); - if (so->so_snd.sb_hiwat == 0 || so->so_rcv.sb_hiwat == 0) { - switch (so->so_type) { - - case SOCK_STREAM: - error = soreserve(so, unpst_sendspace, unpst_recvspace); - break; - - case SOCK_DGRAM: - error = soreserve(so, unpdg_sendspace, unpdg_recvspace); - break; - - default: - panic("unp_attach"); - } - if (error) - return (error); - } - unp = uma_zalloc(unp_zone, M_WAITOK | M_ZERO); - if (unp == NULL) - return (ENOBUFS); - LIST_INIT(&unp->unp_refs); - unp->unp_socket = so; - so->so_pcb = unp; - - UNP_LOCK(); - unp->unp_gencnt = ++unp_gencnt; - unp_count++; - LIST_INSERT_HEAD(so->so_type == SOCK_DGRAM ? &unp_dhead - : &unp_shead, unp, unp_link); - UNP_UNLOCK(); - - return (0); -} - static void unp_detach(struct unpcb *unp) { + int local_unp_rights; struct vnode *vp; - int local_unp_rights; + struct unpcb *unp2; - UNP_LOCK_ASSERT(); + UNP_GLOBAL_LOCK_ASSERT(); + UNP_PCB_LOCK_ASSERT(unp); LIST_REMOVE(unp, unp_link); unp->unp_gencnt = ++unp_gencnt; --unp_count; + + /* + * XXXRW: What if v_socket != our soket? + */ if ((vp = unp->unp_vnode) != NULL) { - /* - * XXXRW: should v_socket be frobbed only while holding - * Giant? 
- */ unp->unp_vnode->v_socket = NULL; unp->unp_vnode = NULL; } - if (unp->unp_conn != NULL) - unp_disconnect(unp); + unp2 = unp->unp_conn; + if (unp2 != NULL) { + UNP_PCB_LOCK(unp2); + unp_disconnect(unp, unp2); + UNP_PCB_UNLOCK(unp2); + } + + /* + * We hold the global lock, so it's OK to acquire multiple pcb locks + * at a time. + */ while (!LIST_EMPTY(&unp->unp_refs)) { struct unpcb *ref = LIST_FIRST(&unp->unp_refs); + + UNP_PCB_LOCK(ref); unp_drop(ref, ECONNRESET); + UNP_PCB_UNLOCK(ref); } + UNP_GLOBAL_UNLOCK(); soisdisconnected(unp->unp_socket); unp->unp_socket->so_pcb = NULL; local_unp_rights = unp_rights; - UNP_UNLOCK(); if (unp->unp_addr != NULL) FREE(unp->unp_addr, M_SONAME); + UNP_PCB_LOCK_DESTROY(unp); uma_zfree(unp_zone, unp); if (vp) { int vfslocked; @@ -760,97 +1003,6 @@ } static int -unp_bind(struct unpcb *unp, struct sockaddr *nam, struct thread *td) -{ - struct sockaddr_un *soun = (struct sockaddr_un *)nam; - struct vnode *vp; - struct mount *mp; - struct vattr vattr; - int error, namelen; - struct nameidata nd; - char *buf; - - UNP_LOCK_ASSERT(); - - /* - * XXXRW: This test-and-set of unp_vnode is non-atomic; the - * unlocked read here is fine, but the value of unp_vnode needs - * to be tested again after we do all the lookups to see if the - * pcb is still unbound? - */ - if (unp->unp_vnode != NULL) - return (EINVAL); - - namelen = soun->sun_len - offsetof(struct sockaddr_un, sun_path); - if (namelen <= 0) - return (EINVAL); - - UNP_UNLOCK(); - - buf = malloc(namelen + 1, M_TEMP, M_WAITOK); - strlcpy(buf, soun->sun_path, namelen + 1); - - mtx_lock(&Giant); -restart: - mtx_assert(&Giant, MA_OWNED); - NDINIT(&nd, CREATE, NOFOLLOW | LOCKPARENT | SAVENAME, UIO_SYSSPACE, - buf, td); -/* SHOULD BE ABLE TO ADOPT EXISTING AND wakeup() ALA FIFO's */ - error = namei(&nd); - if (error) - goto done; - vp = nd.ni_vp; - if (vp != NULL || vn_start_write(nd.ni_dvp, &mp, V_NOWAIT) != 0) { - NDFREE(&nd, NDF_ONLY_PNBUF); - if (nd.ni_dvp == vp) - vrele(nd.ni_dvp); - else - vput(nd.ni_dvp); - if (vp != NULL) { - vrele(vp); - error = EADDRINUSE; - goto done; - } - error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH); - if (error) - goto done; - goto restart; - } - VATTR_NULL(&vattr); - vattr.va_type = VSOCK; - vattr.va_mode = (ACCESSPERMS & ~td->td_proc->p_fd->fd_cmask); -#ifdef MAC - error = mac_check_vnode_create(td->td_ucred, nd.ni_dvp, &nd.ni_cnd, - &vattr); -#endif - if (error == 0) { - VOP_LEASE(nd.ni_dvp, td, td->td_ucred, LEASE_WRITE); - error = VOP_CREATE(nd.ni_dvp, &nd.ni_vp, &nd.ni_cnd, &vattr); - } - NDFREE(&nd, NDF_ONLY_PNBUF); - vput(nd.ni_dvp); - if (error) { - vn_finished_write(mp); - goto done; - } - vp = nd.ni_vp; - ASSERT_VOP_LOCKED(vp, "unp_bind"); - soun = (struct sockaddr_un *)sodupsockaddr(nam, M_WAITOK); - UNP_LOCK(); - vp->v_socket = unp->unp_socket; - unp->unp_vnode = vp; - unp->unp_addr = soun; - UNP_UNLOCK(); - VOP_UNLOCK(vp, 0, td); - vn_finished_write(mp); -done: - mtx_unlock(&Giant); - free(buf, M_TEMP); - UNP_LOCK(); - return (error); -} - -static int unp_connect(struct socket *so, struct sockaddr *nam, struct thread *td) { struct sockaddr_un *soun = (struct sockaddr_un *)nam; @@ -862,15 +1014,25 @@ char buf[SOCK_MAXADDRLEN]; struct sockaddr *sa; - UNP_LOCK_ASSERT(); + UNP_GLOBAL_LOCK_ASSERT(); + UNP_GLOBAL_UNLOCK(); unp = sotounpcb(so); KASSERT(unp != NULL, ("unp_connect: unp == NULL")); + len = nam->sa_len - offsetof(struct sockaddr_un, sun_path); if (len <= 0) return (EINVAL); strlcpy(buf, soun->sun_path, len + 1); - UNP_UNLOCK(); + + UNP_PCB_LOCK(unp); + if 
(unp->unp_flags & UNP_CONNECTING) { + UNP_PCB_UNLOCK(unp); + return (EALREADY); + } + unp->unp_flags |= UNP_CONNECTING; + UNP_PCB_UNLOCK(unp); + sa = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK); mtx_lock(&Giant); NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF, UIO_SYSSPACE, buf, td); @@ -892,9 +1054,15 @@ if (error) goto bad; mtx_unlock(&Giant); - UNP_LOCK(); + unp = sotounpcb(so); KASSERT(unp != NULL, ("unp_connect: unp == NULL")); + + /* + * Lock global lock for two reasons: make sure v_socket is stable, + * and to protect simultaneous locking of multiple pcbs. + */ + UNP_GLOBAL_LOCK(); so2 = vp->v_socket; if (so2 == NULL) { error = ECONNREFUSED; @@ -907,14 +1075,18 @@ if (so->so_proto->pr_flags & PR_CONNREQUIRED) { if (so2->so_options & SO_ACCEPTCONN) { /* - * NB: drop locks here so unp_attach is entered + * NB: drop locks here so uipc_attach is entered * w/o locks; this avoids a recursive lock * of the head and holding sleep locks across * a (potentially) blocking malloc. + * + * XXXRW: This is actually a non-blocking alloc. + * Lock order issue is real. Releasing lock here + * may invalidate so2 pointer, however. */ - UNP_UNLOCK(); + UNP_GLOBAL_UNLOCK(); so3 = sonewconn(so2, 0); - UNP_LOCK(); + UNP_GLOBAL_LOCK(); } else so3 = NULL; if (so3 == NULL) { @@ -924,6 +1096,9 @@ unp = sotounpcb(so); unp2 = sotounpcb(so2); unp3 = sotounpcb(so3); + UNP_PCB_LOCK(unp); + UNP_PCB_LOCK(unp2); + UNP_PCB_LOCK(unp3); if (unp2->unp_addr != NULL) { bcopy(unp2->unp_addr, sa, unp2->unp_addr->sun_len); unp3->unp_addr = (struct sockaddr_un *) sa; @@ -941,7 +1116,7 @@ /* * The receiver's (server's) credentials are copied * from the unp_peercred member of socket on which the - * former called listen(); unp_listen() cached that + * former called listen(); uipc_listen() cached that * process's credentials at that time so we can use * them now. 
*/ @@ -952,6 +1127,9 @@ unp->unp_flags |= UNP_HAVEPC; if (unp2->unp_flags & UNP_WANTCRED) unp3->unp_flags |= UNP_WANTCRED; + UNP_PCB_UNLOCK(unp3); + UNP_PCB_UNLOCK(unp2); + UNP_PCB_UNLOCK(unp); #ifdef MAC SOCK_LOCK(so); mac_set_socket_peer_from_socket(so, so3); @@ -961,9 +1139,17 @@ so2 = so3; } + unp = sotounpcb(so); + KASSERT(unp != NULL, ("unp_connect: unp == NULL")); + unp2 = sotounpcb(so2); + KASSERT(unp2 != NULL, ("unp_connect: unp2 == NULL")); + UNP_PCB_LOCK(unp); + UNP_PCB_LOCK(unp2); error = unp_connect2(so, so2, PRU_CONNECT); + UNP_PCB_UNLOCK(unp2); + UNP_PCB_UNLOCK(unp); bad2: - UNP_UNLOCK(); + UNP_GLOBAL_UNLOCK(); mtx_lock(&Giant); bad: mtx_assert(&Giant, MA_OWNED); @@ -971,25 +1157,33 @@ vput(vp); mtx_unlock(&Giant); free(sa, M_SONAME); - UNP_LOCK(); + UNP_GLOBAL_LOCK(); + UNP_PCB_LOCK(unp); + unp->unp_flags &= ~UNP_CONNECTING; + UNP_PCB_UNLOCK(unp); return (error); } static int unp_connect2(struct socket *so, struct socket *so2, int req) { - struct unpcb *unp = sotounpcb(so); + struct unpcb *unp; struct unpcb *unp2; - UNP_LOCK_ASSERT(); + unp = sotounpcb(so); + KASSERT(unp != NULL, ("unp_connect2: unp == NULL")); + unp2 = sotounpcb(so2); + KASSERT(unp2 != NULL, ("unp_connect2: unp2 == NULL")); + + UNP_GLOBAL_LOCK_ASSERT(); + UNP_PCB_LOCK_ASSERT(unp); + UNP_PCB_LOCK_ASSERT(unp2); if (so2->so_type != so->so_type) return (EPROTOTYPE); - unp2 = sotounpcb(so2); - KASSERT(unp2 != NULL, ("unp_connect2: unp2 == NULL")); unp->unp_conn = unp2; + switch (so->so_type) { - case SOCK_DGRAM: LIST_INSERT_HEAD(&unp2->unp_refs, unp, unp_reflink); soisconnected(so); @@ -1012,15 +1206,16 @@ } static void -unp_disconnect(struct unpcb *unp) +unp_disconnect(struct unpcb *unp, struct unpcb *unp2) { - struct unpcb *unp2 = unp->unp_conn; struct socket *so; - UNP_LOCK_ASSERT(); + KASSERT(unp2 != NULL, ("unp_disconnect: unp2 == NULL")); + + UNP_GLOBAL_LOCK_ASSERT(); + UNP_PCB_LOCK_ASSERT(unp); + UNP_PCB_LOCK_ASSERT(unp2); - if (unp2 == NULL) - return; unp->unp_conn = NULL; switch (unp->unp_socket->so_type) { case SOCK_DGRAM: @@ -1039,16 +1234,6 @@ } } -#ifdef notdef -void -unp_abort(struct unpcb *unp) -{ - - unp_detach(unp); - UNP_UNLOCK_ASSERT(); -} -#endif - /* * unp_pcblist() assumes that UNIX domain socket memory is never reclaimed * by the zone (UMA_ZONE_NOFREE), and as such potentially stale pointers @@ -1087,10 +1272,10 @@ * OK, now we're committed to doing something. */ xug = malloc(sizeof(*xug), M_TEMP, M_WAITOK); - UNP_LOCK(); + UNP_GLOBAL_LOCK(); gencnt = unp_gencnt; n = unp_count; - UNP_UNLOCK(); + UNP_GLOBAL_UNLOCK(); xug->xug_len = sizeof *xug; xug->xug_count = n; @@ -1104,23 +1289,34 @@ unp_list = malloc(n * sizeof *unp_list, M_TEMP, M_WAITOK); - UNP_LOCK(); + /* + * XXXRW: Note, this code relies very explicitly in pcb's being type + * stable. + */ + UNP_GLOBAL_LOCK(); for (unp = LIST_FIRST(head), i = 0; unp && i < n; unp = LIST_NEXT(unp, unp_link)) { + UNP_PCB_LOCK(unp); if (unp->unp_gencnt <= gencnt) { if (cr_cansee(req->td->td_ucred, unp->unp_socket->so_cred)) continue; unp_list[i++] = unp; } + UNP_PCB_UNLOCK(unp); } - UNP_UNLOCK(); + UNP_GLOBAL_UNLOCK(); n = i; /* in case we lost some during malloc */ + /* + * XXXRW: The logic below asumes that it is OK to lock a mutex in + * an unpcb that may have been freed. 
+ */ error = 0; xu = malloc(sizeof(*xu), M_TEMP, M_WAITOK | M_ZERO); for (i = 0; i < n; i++) { unp = unp_list[i]; + UNP_PCB_LOCK(unp); if (unp->unp_gencnt <= gencnt) { xu->xu_len = sizeof *xu; xu->xu_unpp = unp; @@ -1138,8 +1334,10 @@ unp->unp_conn->unp_addr->sun_len); bcopy(unp, &xu->xu_unp, sizeof *unp); sotoxsocket(unp->unp_socket, &xu->xu_socket); + UNP_PCB_UNLOCK(unp); error = SYSCTL_OUT(req, xu, sizeof *xu); - } + } else + UNP_PCB_UNLOCK(unp); } free(xu, M_TEMP); if (!error) { @@ -1170,33 +1368,38 @@ static void unp_shutdown(struct unpcb *unp) { + struct unpcb *unp2; struct socket *so; - UNP_LOCK_ASSERT(); + UNP_GLOBAL_LOCK_ASSERT(); + UNP_PCB_LOCK_ASSERT(unp); - if (unp->unp_socket->so_type == SOCK_STREAM && unp->unp_conn && - (so = unp->unp_conn->unp_socket)) - socantrcvmore(so); + unp2 = unp->unp_conn; + if (unp->unp_socket->so_type == SOCK_STREAM && unp2 != NULL) { + so = unp2->unp_socket; + if (so != NULL) + socantrcvmore(so); + } } static void unp_drop(struct unpcb *unp, int errno) { struct socket *so = unp->unp_socket; + struct unpcb *unp2; - UNP_LOCK_ASSERT(); + UNP_GLOBAL_LOCK_ASSERT(); + UNP_PCB_LOCK_ASSERT(unp); so->so_error = errno; - unp_disconnect(unp); -} + unp2 = unp->unp_conn; + if (unp2 == NULL) + return; -#ifdef notdef -void -unp_drain(void) -{ - + UNP_PCB_LOCK(unp2); + unp_disconnect(unp, unp2); + UNP_PCB_UNLOCK(unp2); } -#endif static void unp_freerights(struct file **rp, int fdcount) @@ -1205,7 +1408,6 @@ struct file *fp; for (i = 0; i < fdcount; i++) { - fp = *rp; /* * zero the pointer before calling * unp_discard since it may end up @@ -1213,7 +1415,8 @@ * * XXXRW: This is less true than it used to be. */ - *rp++ = 0; + fp = *rp; + *rp++ = NULL; unp_discard(fp); } } @@ -1233,7 +1436,7 @@ int f; u_int newlen; - UNP_UNLOCK_ASSERT(); + UNP_GLOBAL_UNLOCK_ASSERT(); error = 0; if (controlp != NULL) /* controlp == NULL => free control messages */ @@ -1338,6 +1541,7 @@ void unp_init(void) { + unp_zone = uma_zcreate("unpcb", sizeof(struct unpcb), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE); if (unp_zone == NULL) @@ -1348,7 +1552,7 @@ LIST_INIT(&unp_dhead); LIST_INIT(&unp_shead); TASK_INIT(&unp_gc_task, 0, unp_gc, NULL); - UNP_LOCK_INIT(); + UNP_GLOBAL_LOCK_INIT(); } static int @@ -1368,7 +1572,7 @@ int error, oldfds; u_int newlen; - UNP_UNLOCK_ASSERT(); + UNP_GLOBAL_UNLOCK_ASSERT(); error = 0; *controlp = NULL; @@ -1748,25 +1952,6 @@ unp_scan(m, unp_discard); } -static int -unp_listen(struct socket *so, struct unpcb *unp, int backlog, - struct thread *td) -{ - int error; - - UNP_LOCK_ASSERT(); - - SOCK_LOCK(so); - error = solisten_proto_check(so); - if (error == 0) { - cru2x(td->td_ucred, &unp->unp_peercred); - unp->unp_flags |= UNP_HAVEPCCACHED; - solisten_proto(so, backlog); - } - SOCK_UNLOCK(so); - return (error); -} - static void unp_scan(struct mbuf *m0, void (*op)(struct file *)) { @@ -1819,6 +2004,9 @@ static void unp_mark(struct file *fp) { + + /* XXXRW: Should probably assert file list lock here. 
*/ + if (fp->f_gcflag & FMARK) return; unp_defer++; @@ -1828,11 +2016,12 @@ static void unp_discard(struct file *fp) { - UNP_LOCK(); + + UNP_GLOBAL_LOCK(); FILE_LOCK(fp); fp->f_msgcount--; unp_rights--; FILE_UNLOCK(fp); - UNP_UNLOCK(); + UNP_GLOBAL_UNLOCK(); (void) closef(fp, (struct thread *)NULL); } --- //depot/vendor/freebsd/src/sys/sys/unpcb.h 2005/04/13 00:05:41 +++ //depot/user/rwatson/proto/src/sys/sys/unpcb.h 2006/04/27 20:47:29 @@ -78,6 +78,7 @@ unp_gen_t unp_gencnt; /* generation count of this instance */ int unp_flags; /* flags */ struct xucred unp_peercred; /* peer credentials, if applicable */ + struct mtx unp_mtx; /* mutex */ }; /* @@ -98,6 +99,14 @@ #define UNP_WANTCRED 0x004 /* credentials wanted */ #define UNP_CONNWAIT 0x008 /* connect blocks until accepted */ +/* + * These flags are used to handle non-atomicity in connect() and bind() + * operations on a socket: in particular, to avoid races between multiple + * threads or processes operating simultaneously on the same socket. + */ +#define UNP_CONNECTING 0x010 /* Currently connecting. */ +#define UNP_BINDING 0x020 /* Currently binding. */ + #define sotounpcb(so) ((struct unpcb *)((so)->so_pcb)) /* Hack alert -- this structure depends on . */ From owner-freebsd-performance@FreeBSD.ORG Sat May 6 14:20:53 2006 Return-Path: X-Original-To: performance@FreeBSD.org Delivered-To: freebsd-performance@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 3FFEA16A401; Sat, 6 May 2006 14:20:53 +0000 (UTC) (envelope-from rwatson@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42]) by mx1.FreeBSD.org (Postfix) with ESMTP id A919E43D60; Sat, 6 May 2006 14:20:52 +0000 (GMT) (envelope-from rwatson@FreeBSD.org) Received: from fledge.watson.org (fledge.watson.org [209.31.154.41]) by cyrus.watson.org (Postfix) with ESMTP id 5605246C36; Sat, 6 May 2006 10:20:52 -0400 (EDT) Date: Sat, 6 May 2006 15:20:52 +0100 (BST) From: Robert Watson X-X-Sender: robert@fledge.watson.org To: current@FreeBSD.org In-Reply-To: <20060506150622.C17611@fledge.watson.org> Message-ID: <20060506152016.I17611@fledge.watson.org> References: <20060506150622.C17611@fledge.watson.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: performance@FreeBSD.org Subject: Re: Fine-grained locking for POSIX local sockets (UNIX domain sockets) X-BeenThere: freebsd-performance@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Performance/tuning List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 06 May 2006 14:20:53 -0000 On Sat, 6 May 2006, Robert Watson wrote: > Attached, please find a patch implementing more fine-grained locking for the > POSIX local socket subsystem (UNIX domain socket subsystem). In the current > implementation, we use a single global subsystem lock to protect all local > IPC over the PF_LOCAL socket type. This has low overhead, but can result in > significant contention, especially for workloads like MySQL which push a > great deal of data through UNIX domain sockets, and involve many parties. > The hope is that by improving granularity, we can reduce contention > sufficiently to overcome the added cost of increased locking overhead (a > side-effect of greater granularity). 
> At a high level, here are the changes made:

FYI, you can download a copy unmangled by my and your mail systems at the
following URL:

http://www.watson.org/~robert/freebsd/netperf/20060506a-uds-fine-grain.diff

Robert N M Watson

From owner-freebsd-performance@FreeBSD.ORG Sat May 6 20:32:08 2006
Date: Sat, 06 May 2006 22:32:01 +0200
From: Attila Nagy
To: Robert Watson
Cc: performance@FreeBSD.org, current@FreeBSD.org
Subject: Re: Fine-grained locking for POSIX local sockets (UNIX domain sockets)
Message-ID: <445D07C1.7080403@fsn.hu>
In-Reply-To: <20060506150622.C17611@fledge.watson.org>

Hello,

On 2006. 05. 06. 16:16, Robert Watson wrote:
> In local measurements, I have observed a 0% change in MySQL performance
> on uniprocessor systems, and on a dual-processor system I have observed
> a 4%-5% performance improvement with two client MySQL threads.
Just a quick and by no means rigorous test:
http://people.fsn.hu/~bra/freebsd/20060506a-uds-fine-grain.diff/domsock.png

The machine is a quad core Xeon LV server; the client side is sysbench,
accessing mysql 4.1.8 on a socket.  Heap table, simple test.
--
Attila Nagy                                   e-mail: Attila.Nagy@fsn.hu
Free Software Network (FSN.HU)                phone: +3630 306 6758
http://www.fsn.hu/

From owner-freebsd-performance@FreeBSD.ORG Sat May 6 20:50:36 2006
Date: Sat, 6 May 2006 21:50:35 +0100 (BST)
From: Robert Watson
To: Attila Nagy
Cc: performance@FreeBSD.org, current@FreeBSD.org
Subject: Re: Fine-grained locking for POSIX local sockets (UNIX domain sockets)
Message-ID: <20060506214838.Q46997@fledge.watson.org>
In-Reply-To: <445D07C1.7080403@fsn.hu>

On Sat, 6 May 2006, Attila Nagy wrote:

> On 2006. 05. 06. 16:16, Robert Watson wrote:
>> In local measurements, I have observed a 0% change in MySQL performance
>> on uniprocessor systems, and on a dual-processor system I have observed
>> a 4%-5% performance improvement with two client MySQL threads.
> Just a quick and by no means rigorous test:
> http://people.fsn.hu/~bra/freebsd/20060506a-uds-fine-grain.diff/domsock.png
>
> The machine is a quad core Xeon LV server; the client side is sysbench,
> accessing mysql 4.1.8 on a socket.  Heap table, simple test.

Which threading library is that with, btw?  If libpthread, could you run
the same test with libthr, and vice versa?

I've noticed that libpthread has the interesting property that it greatly
reduces contention on certain locks when those locks are contended from
within a single process, because it avoids having the threads of one
process compete with each other.  The big locks threaded processes hit when
contending with themselves are the file descriptor array lock and signal
delivery related locks.
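For concreteness, the in-process pattern being discussed looks roughly
like the following toy harness (a hypothetical sketch, not the benchmark
used in this thread, which drives MySQL via sysbench and super-smack):
several threads of one process write down a single PF_LOCAL socketpair
while the main thread drains it, so all contention on the socket and file
descriptor locks comes from within one process.

#include <sys/types.h>
#include <sys/socket.h>
#include <err.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 4
#define NWRITES  10000

static int fds[2];			/* fds[0]: reader, fds[1]: writers */

static void *
writer(void *arg)
{
	char buf[512];
	int i;

	(void)arg;
	memset(buf, 'x', sizeof(buf));
	for (i = 0; i < NWRITES; i++)
		if (write(fds[1], buf, sizeof(buf)) < 0)
			err(1, "write");
	return (NULL);
}

int
main(void)
{
	pthread_t threads[NTHREADS];
	char buf[512];
	ssize_t n, total = 0;
	int i;

	if (socketpair(PF_LOCAL, SOCK_STREAM, 0, fds) < 0)
		err(1, "socketpair");
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, writer, NULL);
	/* Drain everything the writers send. */
	while (total < (ssize_t)(NTHREADS * NWRITES * sizeof(buf))) {
		if ((n = read(fds[0], buf, sizeof(buf))) <= 0)
			break;
		total += n;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	printf("moved %zd bytes\n", total);
	return (0);
}

Building this against each threading library in turn and running it on a
kernel with mutex profiling enabled would show where the in-process
contention lands in each case.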
Robert N M Watson

From owner-freebsd-performance@FreeBSD.ORG Sat May 6 22:19:09 2006
Date: Sat, 6 May 2006 18:19:08 -0400
From: Kris Kennaway
To: Robert Watson
Cc: performance@FreeBSD.org, current@FreeBSD.org
Subject: Re: Fine-grained locking for POSIX local sockets (UNIX domain sockets)
Message-ID: <20060506221908.GB51268@xor.obsecurity.org>
In-Reply-To: <20060506150622.C17611@fledge.watson.org>

On Sat, May 06, 2006 at 03:16:48PM +0100, Robert Watson wrote:
>
> Dear all,
>
> Attached, please find a patch implementing more fine-grained locking for
> the POSIX local socket subsystem (UNIX domain socket subsystem).

Dear Sir,

Per your request, please find attached the results of my measurements
using super-smack on a 12-CPU E4500.

super-smack queries/second with n worker threads:

norwatson = without your patch (but with some other local locking patches)
rwatson   = the same, plus your patch

x norwatson-4 + rwatson-4
[ministat distribution plot omitted]
    N        Min        Max     Median        Avg     Stddev
x  10    3067.92    3098.05   3086.945   3084.402  8.8815574
+  10    3245.06     3287.8    3270.52   3270.475  13.241953
Difference at 95.0% confidence
        186.073 +/- 10.5935
        6.03271% +/- 0.343455%
        (Student's t, pooled s = 11.2746)

x norwatson-6 + rwatson-6
[ministat distribution plot omitted]
    N        Min        Max     Median        Avg     Stddev
x  10    3641.11    3693.89   3679.735   3677.083  14.648967
+  10    3672.23    3896.32   3869.415   3845.071  66.826543
Difference at 95.0% confidence
        167.988 +/- 45.4534
        4.56851% +/- 1.23613%
        (Student's t, pooled s = 48.3755)

I.e., in both cases there is a clear net gain in throughput with your
patch.

Without your patch, 6 clients is the optimum client load on this 12-CPU
machine.  At higher loads performance drops, even though formally not all
CPUs are saturated.  This is due to rapidly diverging lock contention (see
below).
x norwatson-8 + rwatson-8
[ministat distribution plot omitted]
    N        Min        Max     Median        Avg     Stddev
x  10    2601.46    2700.26    2650.52   2653.441  30.758034
+  10    2240.86    2516.87   2496.085   2468.468  81.868576
Difference at 95.0% confidence
        -184.973 +/- 58.1052
        -6.97106% +/- 2.1898%
        (Student's t, pooled s = 61.8406)

We see the drop in performance in both cases, indicating that we are in
the "overloaded" regime.  The fact that your patch seems to give worse
performance is puzzling at first sight.  Running mutex profiling (and
keeping only the unp mutex entries and the 10 most contended, for clarity)
shows the following:

norwatson, 8 clients:

debug.mutex.prof.stats:
  max      total    count  avg   cnt_hold  cnt_lock  name
    5         40        9    4          0         3  kern/uipc_usrreq.c:170 (unp)
    8          8        1    8          0         0  vm/uma_core.c:2101 (unpcb)
   13        283       52    5          0         0  vm/uma_core.c:890 (unpcb)
   14       1075      200    5          0         0  vm/uma_core.c:1885 (unpcb)
    4         52       18    2          4         6  kern/uipc_usrreq.c:577 (unp)
    5         39        9    4          4         2  kern/uipc_usrreq.c:534 (unp)
    5         35       11    3          6         6  kern/uipc_usrreq.c:974 (unp)
    5         45       11    4          7         4  kern/uipc_usrreq.c:210 (unp)
  171       1164        9  129          7         2  kern/uipc_usrreq.c:917 (unp)
   14         78       20    3         11   2872481  kern/uipc_usrreq.c:709 (unp)
   70        156       11   14         13         4  kern/uipc_usrreq.c:895 (unp)
   43        581       20   29         24         6  kern/uipc_usrreq.c:239 (unp)
   44        429       18   23         26         8  kern/uipc_usrreq.c:518 (unp)
   55        491       12   40         30        10  kern/uipc_usrreq.c:251 (unp)
...
  449   20000519   320038   62      15158         0  kern/uipc_usrreq.c:431 (so_rcv)
  459   86616085  2880079   30      15699      4944  kern/uipc_usrreq.c:319 (so_snd)
  146    2273360   640315    3      27918     29789  kern/kern_sig.c:1002 (process lock)
  387    3325481   640099    5      38143     47670  kern/kern_descrip.c:420 (filedesc structure)
  150    1881990   640155    2      64111     49033  kern/kern_descrip.c:368 (filedesc structure)
  496   13792853  3685885    3     101692    132480  kern/kern_descrip.c:1988 (filedesc structure)
  207    4061793   551604    7     115427    118242  kern/kern_synch.c:220 (process lock)
  391   10332282  3685885    2     194387    129547  kern/kern_descrip.c:1967 (filedesc structure)
  465   25504709   320042   79    1632192    294498  kern/uipc_usrreq.c:364 (unp)
  470  124263922  2880084   43   13222757   2685853  kern/uipc_usrreq.c:309 (unp)

i.e. there is indeed heavy contention on the unp lock (column 5 counts the
number of times we tried to acquire it and failed because someone else had
the lock) - in fact about 5 times as many contentions as successful
acquisitions.
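To spell that ratio out from the last row of the table: the site
kern/uipc_usrreq.c:309 was acquired successfully 2880084 times (count) but
found already held 13222757 times (cnt_hold), and 13222757 / 2880084 is
about 4.6 failed attempts per successful acquisition.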
With your patch and the same load:

  max      total    count  avg   cnt_hold  cnt_lock  name
    3         20        9    2          0         0  kern/uipc_usrreq.c:1028 (unp_mtx)
    3         22        9    2          0         0  kern/uipc_usrreq.c:1161 (unp_mtx)
    5         29        9    3          0         2  kern/uipc_usrreq.c:1065 (unp_global_mtx)
    5         53       18    2          0     76488  kern/uipc_usrreq.c:287 (unp_global_mtx)
    6         33        9    3          0         0  kern/uipc_usrreq.c:236 (unp_mtx)
    6         37        9    4          0         0  kern/uipc_usrreq.c:819 (unp_mtx)
    7          7        1    7          0         0  vm/uma_core.c:2101 (unpcb)
    8         49        9    5          0         0  kern/uipc_usrreq.c:1101 (unp_mtx)
   11        136       18    7          0         1  kern/uipc_usrreq.c:458 (unp_global_mtx)
   32        143        9   15          0         1  kern/uipc_usrreq.c:1160 (unp_global_mtx)
   44        472       18   26          0         0  kern/uipc_usrreq.c:801 (unp_mtx)
  123        310        9   34          0         0  kern/uipc_usrreq.c:1100 (unp_mtx)
  147        452        9   50          0         0  kern/uipc_usrreq.c:1099 (unp_mtx)
  172        748        9   83          0         0  kern/uipc_usrreq.c:473 (unp_mtx)
  337       1592        9  176          0         0  kern/uipc_usrreq.c:1147 (unp_mtx)
  350       1790        9  198          0         0  kern/uipc_usrreq.c:1146 (unp_mtx)
  780   39405928   320038  123          0         0  kern/uipc_usrreq.c:618 (unp_mtx)
   18        140        9   15          1         0  kern/uipc_usrreq.c:235 (unp_global_mtx)
   70        717       18   39          1         3  kern/uipc_usrreq.c:800 (unp_global_mtx)
  528       2444        9  271          1         1  kern/uipc_usrreq.c:1089 (unp_global_mtx)
  158        616        9   68          2         2  kern/uipc_usrreq.c:476 (unp_mtx)
  794  175382857  2880084   60          2      7686  kern/uipc_usrreq.c:574 (unp_mtx)
    4         25        9    2          3         2  kern/uipc_usrreq.c:422 (unp_global_mtx)
  186        874        9   97          3         3  kern/uipc_usrreq.c:472 (unp_global_mtx)
  768   33783759   320038  105       7442         0  kern/uipc_usrreq.c:696 (unp_mtx)
...
  465     913127   320045    2      43130     35046  kern/uipc_socket.c:1101 (so_snd)
  483    2453927   628737    3      44768     46177  kern/kern_sig.c:1002 (process lock)
  767  124298544  2880082   43      70037     59994  kern/uipc_usrreq.c:581 (so_snd)
  794   45176699   320038  141      83252     72140  kern/uipc_usrreq.c:617 (unp_global_mtx)
  549    9858281  3200210    3     579269    712643  kern/kern_resource.c:1172 (sleep mtxpool)
  554   17122245   631715   27     641888    268243  kern/kern_descrip.c:420 (filedesc structure)
  388    3009912   631753    4     653540    260590  kern/kern_descrip.c:368 (filedesc structure)
  642   49626755  3681446   13    1642954    682669  kern/kern_descrip.c:1988 (filedesc structure)
  530   13802687  3681446    3    1663244    616899  kern/kern_descrip.c:1967 (filedesc structure)
  477   23472709  2810986    8    5671248   1900047  kern/kern_synch.c:220 (process lock)

The top 10 heavily contended mutexes are very different (but note that the
number of mutex acquisitions, column 3, is about the same).  There is not
much contention on unp_global_mtx any longer, but there is a lot more on
some of the other mutexes, especially the process lock via msleep().
Off-hand I don't know what the cause of this bottleneck is (note: libthr
is used as the threading library, as libpthread is not ported to sparc64).

Also, a lot of the contention that used to be on the unp lock seems to
have fallen through onto contending *two* of the filedesc locks (both at
about 1.6 million contentions).  This may also help to explain the
performance drop.
With only 6 clients, the contention is about an order of magnitude less on
most of the top 10 entries, even though the number of mutex calls is only
about 25% fewer than with 8 clients:

  max      total    count  avg   cnt_hold  cnt_lock  name
  195     715786   240037    2      47462     48821  kern/uipc_socket.c:1101 (so_snd)
  524    3456427   480079    7      50257     53368  kern/kern_descrip.c:420 (filedesc structure)
  647   21650810   240030   90      50609         2  kern/uipc_usrreq.c:705 (so_rcv)
  710   37962453   240031  158      63743     57814  kern/uipc_usrreq.c:617 (unp_global_mtx)
  345    1624193   488866    3      80349     62950  kern/kern_descrip.c:368 (filedesc structure)
  595  108074003  2160067   50      83327     63451  kern/uipc_usrreq.c:581 (so_snd)
  453    3706420   519735    7     119947    181434  kern/kern_synch.c:220 (process lock)
  469   13085667  2800771    4     122344    132478  kern/kern_descrip.c:1988 (filedesc structure)
  320    8814736  2800771    3     200492    148967  kern/kern_descrip.c:1967 (filedesc structure)
  440    7591194  2400171    3     544692    507583  kern/kern_resource.c:1172 (sleep mtxpool)

In summary, this is a good test case, since it shows both the benefits of
your patch and the areas of remaining concern.

Yours sincerely,

Kristian D. Kennaway

From owner-freebsd-performance@FreeBSD.ORG Sat May 6 22:39:59 2006
Date: Sun, 07 May 2006 00:39:54 +0200
From: Attila Nagy
To: Robert Watson
Cc: performance@FreeBSD.org, current@FreeBSD.org
Subject: Re: Fine-grained locking for POSIX local sockets (UNIX domain sockets)
Message-ID: <445D25BA.9020508@fsn.hu>
In-Reply-To: <20060506214838.Q46997@fledge.watson.org>

On 2006. 05. 06. 22:50, Robert Watson wrote:
>> The machine is a quad core Xeon LV server; the client side is sysbench,
>> accessing mysql 4.1.8 on a socket.  Heap table, simple test.
> Which threading library is that with, btw?
> If libpthread, could you run the same test with libthr, and vice versa?

libthr, and with dynamically linked mysql, because when I link it with
-static it dies with sig11 at startup.  You can find the updated picture
at the previous link.

--
Attila Nagy                                   e-mail: Attila.Nagy@fsn.hu
Free Software Network (FSN.HU)                phone: +3630 306 6758
http://www.fsn.hu/