Date:        Mon, 14 Jul 2008 22:34:46 +1000 (EST)
From:        Bruce Evans <brde@optusnet.com.au>
To:          Robert Watson <rwatson@FreeBSD.org>
Cc:          FreeBSD Net <freebsd-net@FreeBSD.org>, Andre Oppermann <andre@FreeBSD.org>,
             Ingo Flaschberger <if@xip.at>, Paul <paul@gtcomm.net>
Subject:     Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Message-ID:  <20080714212912.D885@besplex.bde.org>
In-Reply-To: <20080707142018.U63144@fledge.watson.org>
References:  <4867420D.7090406@gtcomm.net> <486A7E45.3030902@gtcomm.net>
             <486A8F24.5010000@gtcomm.net> <486A9A0E.6060308@elischer.org>
             <486B41D5.3060609@gtcomm.net> <alpine.LFD.1.10.0807021052041.557@filebunker.xip.at>
             <486B4F11.6040906@gtcomm.net> <alpine.LFD.1.10.0807021155280.557@filebunker.xip.at>
             <486BC7F5.5070604@gtcomm.net> <20080703160540.W6369@delplex.bde.org>
             <486C7F93.7010308@gtcomm.net> <20080703195521.O6973@delplex.bde.org>
             <486D35A0.4000302@gtcomm.net> <alpine.LFD.1.10.0807041106591.19613@filebunker.xip.at>
             <486DF1A3.9000409@gtcomm.net> <alpine.LFD.1.10.0807041303490.20760@filebunker.xip.at>
             <486E65E6.3060301@gtcomm.net> <alpine.LFD.1.10.0807052356130.2145@filebunker.xip.at>
             <4871DB8E.5070903@freebsd.org> <20080707191918.B4703@besplex.bde.org>
             <4871FB66.1060406@freebsd.org> <20080707213356.G7572@besplex.bde.org>
             <20080707134036.S63144@fledge.watson.org> <20080707224659.B7844@besplex.bde.org>
             <20080707142018.U63144@fledge.watson.org>
On Mon, 7 Jul 2008, Robert Watson wrote:

> On Mon, 7 Jul 2008, Bruce Evans wrote:
>
>>> (1) sendto() to a specific address and port on a socket that has been
>>>     bound to INADDR_ANY and a specific port.
>>>
>>> (2) sendto() on a specific address and port on a socket that has been
>>>     bound to a specific IP address (not INADDR_ANY) and a specific port.
>>>
>>> (3) send() on a socket that has been connect()'d to a specific IP address
>>>     and a specific port, and bound to INADDR_ANY and a specific port.
>>>
>>> (4) send() on a socket that has been connect()'d to a specific IP address
>>>     and a specific port, and bound to a specific IP address (not
>>>     INADDR_ANY) and a specific port.
>>>
>>> The last of these should really be quite a bit faster than the first of
>>> these, but I'd be interested in seeing specific measurements for each if
>>> that's possible!
>>
>> Not sure if I understand networking well enough to set these up quickly.
>> Does netrate use one of (3) or (4) now?
>
> (3) and (4) are effectively the same thing, I think, since connect(2) should
> force the selection of a source IP address, but I think it's not a bad idea
> to confirm that. :-)
>
> The structure of the desired micro-benchmark here is basically:
> ...

I hacked netblast.c to do this:

% --- /usr/src/tools/tools/netrate/netblast/netblast.c	Fri Dec 16 17:02:44 2005
% +++ netblast.c	Mon Jul 14 21:26:52 2008
% @@ -44,9 +44,11 @@
%  {
% 
% -	fprintf(stderr, "netblast [ip] [port] [payloadsize] [duration]\n");
% -	exit(-1);
% +	fprintf(stderr, "netblast ip port payloadsize duration bind connect\n");
% +	exit(1);
%  }
% 
% +static int gconnected;
%  static int global_stop_flag;
% +static struct sockaddr_in *gsin;
% 
%  static void
% @@ -116,6 +118,13 @@
%  		counter++;
%  	}
% -	if (send(s, packet, packet_len, 0) < 0)
% +	if (gconnected && send(s, packet, packet_len, 0) < 0) {
%  		send_errors++;
% +		usleep(1000);
% +	}
% +	if (!gconnected && sendto(s, packet, packet_len, 0,
% +	    (struct sockaddr *)gsin, sizeof(*gsin)) < 0) {
% +		send_errors++;
% +		usleep(1000);
% +	}
%  	send_calls++;
%  }
% @@ -146,9 +155,10 @@
%  	struct sockaddr_in sin;
%  	char *dummy, *packet;
% -	int s;
% +	int bind_desired, connect_desired, s;
% 
% -	if (argc != 5)
% +	if (argc != 7)
%  		usage();
% 
% +	gsin = &sin;
%  	bzero(&sin, sizeof(sin));
%  	sin.sin_len = sizeof(sin);
% @@ -176,4 +186,7 @@
%  		usage();
% 
% +	bind_desired = (strcmp(argv[5], "b") == 0);
% +	connect_desired = (strcmp(argv[6], "c") == 0);
% +
%  	packet = malloc(payloadsize);
%  	if (packet == NULL) {
% @@ -189,7 +202,19 @@
%  	}
% 
% -	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
% -		perror("connect");
% -		return (-1);
% +	if (bind_desired) {
% +		struct sockaddr_in osin;
% +
% +		osin = sin;
% +		if (inet_aton("0", &sin.sin_addr) == 0)
% +			perror("inet_aton(0)");
% +		if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
% +			err(-1, "bind");
% +		sin = osin;
% +	}
% +
% +	if (connect_desired) {
% +		if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
% +			err(-1, "connect");
% +		gconnected = 1;
%  	}

This also fixes some bugs in usage() (bogus [] around non-optional args and
bogus exit code) and adds a sleep after send failure.  Without the sleep,
netblast distorts the measurements by taking 100% CPU.  This depends on
kernel queues having enough buffering to not run dry during the sleep time
(rounded up to a tick boundary).  I use

    ifq_maxlen = DRIVER_TX_RING_CNT + imax(2 * tick / 4, 10000) = 10512

for DRIVER = bge and HZ = 100.  This is actually wrong now.
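To make the arithmetic behind that 10512 explicit, here is a standalone
sketch of the sizing (not the actual driver change); HZ = 100 is taken from
the text, and a bge tx ring size of 512 is an assumption inferred from the
total:

/*
 * Standalone sketch of the queue sizing above, not the real driver patch.
 * Assumptions: the bge tx ring holds 512 descriptors (inferred from
 * 10512 - 10000) and HZ = 100, so one tick is 10000 usec.
 */
#include <stdio.h>

#define	HZ			100		/* assumed kernel HZ */
#define	TICK_US			(1000000 / HZ)	/* usec per tick = 10000 */
#define	DRIVER_TX_RING_CNT	512		/* bge tx ring size, assumed */

static int
imax(int a, int b)
{
	return (a > b ? a : b);
}

int
main(void)
{
	/*
	 * Tx ring entries, plus the packets the hardware could drain while
	 * the sender's usleep(1000) is rounded up to a tick boundary (the
	 * "2") at 4 usec per packet (the "4"), with a floor of 10000
	 * entries.
	 */
	int ifq_maxlen;

	ifq_maxlen = DRIVER_TX_RING_CNT + imax(2 * TICK_US / 4, 10000);
	printf("ifq_maxlen = %d\n", ifq_maxlen);	/* 512 + 10000 = 10512 */
	return (0);
}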
The magic 2 is to round up to a tick boundary and the magic 4 is for bge
taking a minimum of 4 usec per packet on old hardware, but bge actually takes
about 1.5 usec on the test hardware and I'd like it to take 0.66 usec.  The
queues rarely run dry in practice, but running dry just a few times for a few
msec each would explain some anomalies.  Old SGI ttcp uses a select timeout
of 18 msec here.  nttcp and netsend use more sophisticated methods that don't
work unless HZ is too small.  It's just impossible for a program to schedule
its sleeps with a fine enough resolution to ensure waking up before the queue
runs dry, unless HZ is too small or the queue is too large.  select() for
writing doesn't work for the queue part of socket i/o.

Results:

~5.2 sendto (1):  630 kpps   98% CPU  11 cm/p   (cache misses/packet (min))
-cur sendto:      590 kpps  100% CPU  10 cm/p   (July 8 -current)
            (2):  no significant difference - see below
~5.2 send   (3):  620 kpps   75% CPU   9.5 cm/p
-cur send:        520 kpps   60% CPU   8 cm/p
            (4):  no significant difference - see below

send() has lower CPU overheads as expected.  For some reason, send() gets
lower throughput than sendto().  I think the reason is just that the queue
runs dry: the lower CPU overhead makes it possible for the userland sender to
outrun the hardware -- userland sees more ENOBUFS and sleeps more often, so
it sometimes sleeps too long due to my out-of-date hack for increasing the
queue length.  For some reason, this affects -current much more than ~5.2
(the bge drivers in each have lots of modifications which are supposed to be
equivalent here).  Probably the same reason.

sendto() still has 5-10% higher overhead in -current than in ~5.2 and runs
out of CPU.  It also runs out of CPU under ~5.2 when testing ttcp.

> If you look at the design of the higher performance UDP applications, they
> will generally bind a specific IP (perhaps every IP on the host with its own
> socket), and if they do sustained communication to a specific endpoint they
> will use connect(2) rather than providing an address for each send(2) system
> call to the kernel.

I couldn't see any effect from binding.  I'm only testing sending, and it
doesn't seem to be possible to bind to anything except local addresses
(0.0.0.0, the NIC's address and 127.0.0.1), but these seem to be equivalent
(with no extra work for translation on every packet?) and seem to be used by
default anyway.  In the patched netblast above, sin.sin_addr has to be set to
the receiver's IP from the command line (else it defaults to a local
address), and the patch temporarily sets it back to 0.0.0.0 so as to use the
same sin for the local bind().

> udp_output(2) makes the trade-offs there fairly clear: with the most recent
> rev, the optimal case is one connect(2) has been called, allowing a single
> inpcb read lock and no global data structure access, vs. an application
> calling sendto(2) for each system call and the local binding remaining
> INADDR_ANY.  Middle ground applications, such as named(8) will force a local
> binding using bind(2), but then still have to pass an address to each
> sendto(2).  In the future, this case will be further optimized in our code
> by using a global read lock rather than a global write lock: we have to
> check for collisions, but we don't actually have to reserve the new 4-tuple
> for the UDP socket as it's an ephemeral association rather than a
> connect(2).

The July 8 -current should have this rev.
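As an illustration of the fully-bound, connected case that the quoted text
calls optimal (case (4) in the list at the top), here is a minimal
self-contained sketch.  It is not the netblast patch above; the addresses,
ports, packet size and the ENOBUFS back-off are placeholder assumptions.

/*
 * Sketch of case (4): bind a specific local address, connect(2) to the peer,
 * then use plain send(2) so the kernel can take the connected fast path.
 * All addresses and sizes are placeholders.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in lsin, rsin;
	char packet[512];
	long i;
	int s;

	if ((s = socket(PF_INET, SOCK_DGRAM, 0)) < 0)
		err(1, "socket");

	/* Case (4): bind a specific local IP (not INADDR_ANY) and port. */
	memset(&lsin, 0, sizeof(lsin));
	lsin.sin_len = sizeof(lsin);
	lsin.sin_family = AF_INET;
	lsin.sin_port = htons(40000);			/* placeholder local port */
	lsin.sin_addr.s_addr = inet_addr("10.0.0.1");	/* placeholder NIC address */
	if (bind(s, (struct sockaddr *)&lsin, sizeof(lsin)) < 0)
		err(1, "bind");

	/* connect(2) fixes the 4-tuple once, instead of once per sendto(2). */
	memset(&rsin, 0, sizeof(rsin));
	rsin.sin_len = sizeof(rsin);
	rsin.sin_family = AF_INET;
	rsin.sin_port = htons(40001);			/* placeholder receiver port */
	rsin.sin_addr.s_addr = inet_addr("10.0.0.2");	/* placeholder receiver address */
	if (connect(s, (struct sockaddr *)&rsin, sizeof(rsin)) < 0)
		err(1, "connect");

	memset(packet, 0, sizeof(packet));
	for (i = 0; i < 1000000; i++) {
		/* No destination per call: the connected path. */
		if (send(s, packet, sizeof(packet), 0) < 0 &&
		    errno == ENOBUFS)
			usleep(1000);	/* back off instead of spinning */
	}
	close(s);
	return (0);
}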
Note that I'm not testing SMP or stressing locking, or nontrivial routing
tables, or forwarding, and don't plan to.  UP with a direct connection is
hard enough, and short enough of CPU, to understand and make efficient.
Locking barely shows up in older tests, only partly because it is mostly
inline.

Bruce