Date: Sat, 20 Oct 2007 19:55:16 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
To: arch@FreeBSD.org
Subject: Lock profiling results on TCP and an 8.x project
Message-ID: <20071020184330.C70919@fledge.watson.org>
Dear all:

This is just an FYI e-mail, since I have some early measurement results I thought I'd share. I've started to look at decomposing tcbinfo, and preparatory to that, I ran some lock profiling on simple TCP workloads to explore where and how contention is arising.

I used a three-machine configuration in the netperf cluster: two client boxes (tiger-1, tiger-3) linked to a 4-core amd64 server (cheetah) by two dedicated gigabit Ethernet links. In one test, I ran netserver on cheetah, and in the other, netrate's httpd; in both tests, I ran the respective clients on tiger-1 and tiger-3. One important property of this configuration is that, because there are independent network links, the ithreads for the two devices can be scheduled independently and run the network stack to completion via direct dispatch, offering the opportunity for full parallelism in the TCP input path (subject to limits in our locking model). Each sample was gathered for approximately 10 seconds during the run (40 seconds of CPU time across the 4 cores).

In the netperf test, I used two independent TCP streams, one per interface, with the TCP stream benchmark in the steady state. This should essentially consist of cheetah receiving large data packets and sending back small ACKs; in principle, the two workloads are entirely independent, although in practice TCP locking doesn't allow that, and you get potential interactions due to the memory allocator, scheduler, etc.

In the http test, I configured 32 workers each on tiger-1 and tiger-3, and serviced them with a 128-worker httpd on cheetah. The file transferred was 1k, and it was the same 1k file repeatedly sent via sendfile. Unlike the netperf test, this resulted in very little steady-state TCP traffic: the entire request fits in one segment, and the file fits in a second segment. Also, workers are presumably able to move back and forth between work sources, and there's a single shared listen socket. I.e., opportunities for completely independent operation are significantly reduced, and there are lots of globally visible TCP state changes.

Netperf test top wait_total locks:

  Seconds  Instance
    5.75s  tcp_usrreq.c:729 (inp)
    2.18s  tcp_input.c:479 (inp)
    1.67s  tcp_input.c:400 (tcp)
    0.32s  uipc_socket.c:1424 (so_rcv)
    0.28s  tcp_input.c:1191 (so_rcv)
    0.20s  kern_timeout.c:419 (callout)
    0.09s  route.c:147 (radix node head)
    ...

In this test, the top four locking points are responsible for consuming 25% of available CPU*. We can reasonably assume that the contention on 'inp', and to a lesser degree 'so_rcv', is between the ithread and netserver processes for each network interface, and that they duke it out largely as a result of generating ACKs, moving data in and out of socket buffers, etc. Only the 'tcp' lock reflects interference between the two otherwise independent sessions operating over the independent links.

Http test top wait_total locks:

  Seconds  Instance
    8.50s  tcp_input.c:400 (tcp)
    2.21s  tcp_usrreq.c:568 (tcp)
    1.96s  tcp_usrreq.c:955 (tcp)
    0.78s  subr_turnstile.c:546 (chain)
    0.52s  tcp_usrreq.c:606 (inp)
    0.16s  subr_turnstile.c:536 (chain)
    0.13s  tcp_input.c:2867 (so_rcv)
    0.12s  kern_timeout.c:419 (callout)
    0.08s  route.c:147 (radix node head)
    ...

In this test, the top four locking points are responsible for consuming 34% of available CPU*. Here, it is clear that the global 'tcp' lock is responsible for most of the suffering, as a result of the lock being held across the input path for most packets.
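To make the hold-time distinction concrete, here is a much-simplified sketch of the input-path locking pattern being described; it is illustrative only, not the actual tcp_input() code. The INP_INFO_*() and INP_*() macros are the real ones from netinet/in_pcb.h, but tcp_lookup_connection(), segment_may_change_state(), and tcp_process_segment() are hypothetical stand-ins for the lookup, classification, and processing logic:

/*
 * Much-simplified sketch of the input-path locking pattern described
 * above; illustrative only, not the actual tcp_input() code.
 */
static void
tcp_input_sketch(struct mbuf *m, struct tcphdr *th)
{
	struct inpcb *inp;

	/*
	 * The global tcbinfo lock protects the connection tables, so
	 * every inbound segment on every CPU serializes here first --
	 * this is the 'tcp' lock in the profiles above.
	 */
	INP_INFO_WLOCK(&tcbinfo);
	inp = tcp_lookup_connection(m, th);	/* hypothetical */
	if (inp == NULL) {
		INP_INFO_WUNLOCK(&tcbinfo);
		m_freem(m);			/* no connection; drop */
		return;
	}
	INP_LOCK(inp);				/* per-connection 'inp' lock */

	if (!segment_may_change_state(th)) {	/* hypothetical */
		/*
		 * Steady-state data and ACK segments can't create or
		 * destroy connections, so the global lock can be
		 * dropped right after the lookup and the segment
		 * processed under the per-connection lock alone: the
		 * netperf case.
		 */
		INP_INFO_WUNLOCK(&tcbinfo);
		tcp_process_segment(inp, m, th);	/* hypothetical */
	} else {
		/*
		 * SYN, FIN, RST and friends may change globally
		 * visible connection state, so the global lock is held
		 * across the whole input path: the http case, where
		 * nearly every segment is setup or teardown.
		 */
		tcp_process_segment(inp, m, th);
		INP_INFO_WUNLOCK(&tcbinfo);
	}
	INP_UNLOCK(inp);
}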
This extended holding is in contrast to the steady state flow, in which most packets require only brief tcbinfo lookups, and not the extended hold times required for packets that may lead to state changes (SYN, FIN, etc.). Also, this is the send path, which is directly dispatched from the user code all the way to the interface queue or device driver, so there's no heavy contention in the handoff between the two (and hence no 'inp' hammering) in this direction. Jeff and Attilio tell me that the turnstile contention is simply a symptom of heavy contention on the other mutexes in the workload, and not a first-order effect.

These results appear to confirm that we need to look at breaking out the tcbinfo lock as a key goal for 8.0, with serious thought about an MFC once the work has stabilized and if it is ABI-safe. These two workloads represent the extremes of TCP workloads, and real-world configurations will fall in between; but as the number of cores goes up, and with it the desire to spread work over CPUs, so will contention on a single global lock. An important part of this will be establishing models for distributing the work over CPUs in such a way as to avoid contention while still allowing load balancing.
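For what it's worth, here is a purely illustrative sketch of one such model -- hashing connections into a small array of lock groups, so that independent connections contend only when they collide on a group. Every name in it is invented for the example; this is a sketch of the idea, not a design proposal:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/*
 * Illustrative only: split the single global pcbinfo lock into an
 * array of lock groups selected by a hash of the connection 4-tuple.
 * All names here are invented for the example.
 */
#define	PCB_LOCK_GROUPS	16	/* would want to scale with ncpus */

struct pcbinfo_group {
	struct mtx	pg_lock;	/* protects one slice of the pcb hash */
};

static struct pcbinfo_group pcb_groups[PCB_LOCK_GROUPS];

static void
pcb_groups_init(void)
{
	int i;

	for (i = 0; i < PCB_LOCK_GROUPS; i++)
		mtx_init(&pcb_groups[i].pg_lock, "pcbgroup", NULL, MTX_DEF);
}

/*
 * Map a connection to its lock group; any reasonable 4-tuple hash
 * would do, and picking one that lines up with how inbound work is
 * distributed over CPUs is the interesting design question.
 */
static struct mtx *
pcb_group_lock(uint32_t faddr, uint16_t fport, uint32_t laddr, uint16_t lport)
{
	uint32_t h;

	h = faddr ^ laddr ^ ((uint32_t)fport << 16 | lport);
	return (&pcb_groups[h % PCB_LOCK_GROUPS].pg_lock);
}

Lookups and state changes for a connection would then take only its group lock, so the two netperf streams above would, with luck, never touch the same lock; the load-balancing question is what happens when the hash and the scheduler disagree about where a connection's work should run.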
Anyhow, just an early report as I continue my investigation of this issue...

* When talking about percentages of available CPU, I make the assumption that, due to a sufficient quantity of CPUs, in most cases lock acquisition will occur as a result of adaptive spinning rather than sleeping. In the http case, this is not true, since the number of potential workers exceeds the number of CPUs, hence the turnstile contention. However, as sleeping on locks is itself very expensive, it's reasonable to assume we would recover a lot of CPU nonetheless.

Robert N M Watson
Computer Laboratory
University of Cambridge