Date: Thu, 12 Apr 2012 13:18:59 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Luigi Rizzo <rizzo@iet.unipi.it> Cc: Barney Wolff <barney@databus.com>, Andre Oppermann <andre@freebsd.org>, current@freebsd.org, net@freebsd.org Subject: Re: strange ping response times... Message-ID: <20120412112401.Q981@besplex.bde.org> In-Reply-To: <20120411131920.GA61027@onelab2.iet.unipi.it> References: <20120410225257.GB53350@onelab2.iet.unipi.it> <4F84B6DB.5040904@freebsd.org> <20120410230500.GA22829@pit.databus.com> <20120410233211.GA53829@onelab2.iet.unipi.it> <4F855E5E.5000107@freebsd.org> <20120411110036.GA60031@onelab2.iet.unipi.it> <4F857631.8040309@freebsd.org> <20120411131920.GA61027@onelab2.iet.unipi.it>
next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, 11 Apr 2012, Luigi Rizzo wrote: > On Wed, Apr 11, 2012 at 02:16:49PM +0200, Andre Oppermann wrote: >> On 11.04.2012 13:00, Luigi Rizzo wrote: >>> On Wed, Apr 11, 2012 at 12:35:10PM +0200, Andre Oppermann wrote: >>>> On 11.04.2012 01:32, Luigi Rizzo wrote: >>>> >>>> Things going through loopback go through a NETISR and may >>>> end up queued to avoid LOR situations. In addition per-cpu >>>> queues with hash-distribution for affinity may cause your >>>> packet to be processed by a different core. Hence the additional >>>> delay. >>> >>> so you suggest that the (de)scheduling is costing several microseconds ? >> >> Not directly. I'm just trying to explain what's going on to >> get a better idea where it may go wrong. >> >> There may be a poor ISR/scheduler interaction that prevents that >> causes the packet to be processed only on the next tick or something >> like that. I don't have a better explanation for this. It's certainly abysmally slow. Just the extra context switching made in FreeBSD-5 made the RTT for pinging localhost 3-4 times slower than in FreeBSD-3 in old tests (I compared with FreeBSD-3 instead of FreeBSD-4 since general bloat had already made FreeBSD-4 significantly slower, although not 3-4 times). Direct dispatch of netisrs never did anything good in old tests, and the situation doesn't seem to have improved -- you now need an i7 2600 (SMP?) to get the same speed as my Athlon64 2000 (UP) in the best cases for both (2-3 usec RTT). SMP and multiple cores give more chances for scheduler pessimizations. > ok, some final remarks just for archival purposes > (still related to the loopback ping) > > ping takes a timestamp in userspace before trying to transmit > the packet, and then the timestamp for the received packet > is recorded in the kernel (in the interrupt or netisr thread > i believe -- anyways, not in userspace). No, all timestamps recorded used by ping are recorded in userland. IIRC, there is no kernel timestamping at all for ping packets, unless ping is invoked with "-M time" to make it use ICMP_TSTAMP, and ICMP_TSTAMP gives at best milliseconds resolution so it is useless for measuring RTTs in the 2-999 usec range. (ICMP_TSTAMP uses iptime(), and the protocol only supports milliseconds resolution, which was good enough for 1 Mbps ethernet. iptime() is more broken than that (except in my version), since it uses getmicrotime() instead of microtime(). getmicrotime() gives at best 1/HZ resolution, so it is not even good enough for 1 Mbps ethernet when HZ is small, and now it may give extra inaccuracies from stopping the 1/HZ clock while in sleep states.) This reminds me that slow timecounters make measuring small differences in times difficult. It can take longer to read the timecounter than the entire RTT. I tested this by pessimizing kern.timecounter.hardware from TSC to i8254. On my test system, clock_gettime() with CLOCK_MONOTONIC takes an average of 273 nsec with the TSC timecounter and 4695 nsec with the i8254 timecounter. ping uses gettimeofday() which is slightly slower and more broken (since it uses CLOCK_REALTIME). My normal ping -fq localhost RTT is 2-3 usec (closer to 3; another bug in this area is that the timestamps only have microseconds resolution so you can't see if 3 is actually more like 2.5. I was thinking of changing the resolution to nanoseconds 8-10 years ago, before the FreeBSD-5 pessimizations and CPU speeds hitting a wall made this not really necessary), but the kernel I'm testing with uses ipfw which bloats the RTT to 8-9 usec. Then kern.timecounter.hardware=i8254 bloates the RTT to 24-25! That's 16 usec extra, enough for the extra overhead of 4 gettimeofday() calls. Timecounter statistics confirm that there are many more than 2 timecounter calls per packet: - 7 binuptime calls per packet. That's the hardware part that is very slow with an i8254 timecounter. It apparently takes more like 3000 nsec than 4695 nsec (to fit 7 in 24-25 usec). - 3 bintime calls per packet. bintime calls binuptime, so this accounts for 3 of the above 7. The other 4 are apparently for context switching. There are 2 context switches per packet :-(. I can't explain why there are apparently 2 timestamps per context switch. (Note that -current uses the inferior cputicker mechanism instead of timecounters for timestamping context switches. It does this because some timecounters are very slow. But when the timecounter is the TSC, it binuptime() only takes a few cycles more than cpu_ticks(). (The above time of 273 nsec for reading the TSC timecounter is from userland. The kernel part takes only about 30 nsec, while cpu_ticks() might take 15 nsec.) So -current wouldn't be pessimized for this part by changing the timecounter to i8254, but without the pessimization it would be only a few nsec faster than old kernels provided the timecounter for old kernels is the TSC.) - 2 microtime calls per packet. These are for the 2 gettimeofday()s. microtime calls bintime, so this accounts for 2 of the above 3. The other 1 is unaccounted for. - 1 getmicrouptime call per packet. This doesn't involve the hardware. This is for the select() timeout. If we fixed the resolution bugs in select(), then we would have to be careful not to pessimize it too much by replacing this by microuptime. - 0 calls to iptime() per packet. Benchmarks like this show what a bad idea the get*time() timecounter APIs are. When you actually need to read the time a lot, you need high resolution and thus the full functions, and optimizing for when only low resolution is needed only optimizes the part that doesn't take very long anyway (except for broken programs). > A thing that seems to make > the difference is the choice of idle states: > machdep.idle = {acpi, hlt, mwait, spin} > hw.acpi.cpu.cx_lowest={C1, C2, C3} > results in 10-15% of samples with significantly different values > (see table below, all values in microseconds) > > config normal top 10-15% > ----------------------------------------- > acpi+c3 13 85 > acpi+c2 10 40 > acpi+c1 7 10 > > spin 7 8 > hlt,mwait 6..10 (no clear distinction) > ping -f 5-6 > > This suggests that sometimes the responding thread > may be started on a sleeping core and so takes > longer to wake up. > > The high values are not periodic but the ratio is close to > the 128/1000 which relates profhz and tick, so that might be > something to look at. I guess this is to be expected on a multi-core system. The scheduler should probably try to use a different core for netisrs, but this would cause both cores to block waiting for each other, so both might sleep. Any batching of netisrs would give bursts with longer sleeps in between. I didn't see any large differences between ping -f and ping -i 0.01 on a UP system. My HZ is 100, so -i 0.01 is not really possible and gives more like 0.015 (average) or perhaps 0.02 or even 0.03 after synchronization, as we discussed previously. -i 0.1 is closer to possible and gave a small increase in RTT (-i 1.0 gives the default and a much larger increase). The increases are absolute and not so large relatively starting with the slowness given by ipfw. Note that ping -f doesn't actually flood, since it waits for 1/100 seconds for something to come back, and 1/100 was only a short time with 1 Mbps ethernet. This can be worked around to some extent using ping -i to give a wait time of less than 1/100 seconds. It is still impossible to flood, since wait times of less than 1/HZ are impossible, and the resolution of ping -i is gratuitously limited to 1 msec. You could try setting the -i arg to 1/HZ to see the effects of this, or hack the hard-coded -f timeout to 1 usec. With hardware, there are many more ways for the RTT to vary. One way is that although ping -f doesn't flood, injecting an extra packet on some 1/100 second boundaries can cause queue lengths to build up. This might be only with my version of ping (it keeps track of the boundaries an might send an extra packet when it shouldn't). When the queue lengths increase for any reason (due to injecting packets or when there is a delay for other system activity), packets are sent and received in bursts; throughput goes up almost proporitionally to the queue lengths, and RTT goes up. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120412112401.Q981>