Date: Sun, 13 Jan 2013 19:09:40 +0100
From: Marius Strobl <marius@alchemy.franken.de>
To: Alexander Motin <mav@FreeBSD.org>
Cc: Davide Italiano <davide@FreeBSD.org>, FreeBSD Current <freebsd-current@FreeBSD.org>, freebsd-arch@FreeBSD.org
Subject: Re: [RFC/RFT] calloutng
Message-ID: <20130113180940.GM26039@alchemy.franken.de>
In-Reply-To: <50EBF921.2000304@FreeBSD.org>
References: <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> <50D03173.9080904@FreeBSD.org> <20121225232126.GA47692@alchemy.franken.de> <50DB4EFE.2020600@FreeBSD.org> <20130106152313.GD26039@alchemy.franken.de> <50EBF921.2000304@FreeBSD.org>

On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote:
> On 06.01.2013 17:23, Marius Strobl wrote:
> > On Wed, Dec 26, 2012 at 09:24:46PM +0200, Alexander Motin wrote:
> >> On 26.12.2012 01:21, Marius Strobl wrote:
> >>> On Tue, Dec 18, 2012 at 11:03:47AM +0200, Alexander Motin wrote:
> >>>> Experiments with dummynet shown ineffective support for very short
> >>>> tick-based callouts. New version fixes that, allowing to get as many
> >>>> tick-based callout events as hz value permits, while still be able to
> >>>> aggregate events and generating minimum of interrupts.
> >>>>
> >>>> Also this version modifies system load average calculation to fix some
> >>>> cases existing in HEAD and 9 branches, that could be fixed with new
> >>>> direct callout functionality.
> >>>>
> >>>> http://people.freebsd.org/~mav/calloutng_12_17.patch
> >>>>
> >>>> With several important changes made last time I am going to delay commit
> >>>> to HEAD for another week to do more testing. Comments and new test cases
> >>>> are welcome. Thanks for staying tuned and commenting.
> >>>
> >>> FYI, I gave both calloutng_12_15_1.patch and calloutng_12_17.patch a
> >>> try on sparc64 and it at least survives a buildworld there. However,
> >>> with the patched kernels, buildworld times seem to increase slightly but
> >>> reproducible by 1-2% (I only did four runs but typically buildworld
> >>> times are rather stable and don't vary more than a minute for the
> >>> same kernel and source here). Is this an expected trade-off (system
> >>> time as such doesn't seem to increase)?
> >>
> >> I don't think build process uses significant number of callouts to
> >> affect results directly. I think this additional time could be result of
> >> the deeper next event look up, done by the new code, that is practically
> >> useless for sparc64, which effectively has no cpu_idle() routine.
> >> It wouldn't affect system time and wouldn't show up in any statistics
> >> (except PMC or something alike) because it is executed inside timer
> >> hardware interrupt handler. If my guess is right, that is a part that
> >> probably still could be optimized. I'll look on it. Thanks.
> >>
> >>> Is there anything specific to test?
> >>
> >> Since the most of code is MI, for sparc64 I would mostly look on related
> >> MD parts (eventtimers and timecounters) to make sure they are working
> >> reliably in more stressful conditions. I still have some worries about
> >> possible deadlock on hardware where IPIs are used to fetch present time
> >> from other CPU.
> >
> > Well, I've just learnt two things the hard way:
> > a) We really need the mutex in that path.
> > b) Assuming that the initial synchronization of the counters is good
> >    enough and they won't drift considerably across the CPUs so we can
> >    always use the local one makes things go south pretty soon after
> >    boot. At least with your calloutng_12_26.patch applied.
>
> Do you think it means they are not really synchronized for some reason?

There's definitely no hardware in place which would synchronize them.
I've no idea how to properly measure the difference between two tick
counters, but I think it's rather their drift and not the software
synchronization we do when starting APs that is causing problems.
Mainly, because I can't really think of a better algorithm for doing
the latter when starting the APs. The symptoms are that about 30 to 60
seconds after that I start to see weird timeouts from device drivers.
I'd need to check how long these timeouts actually are to see whether
it could be a problem right from the start though.
In any case, it seems foolish to think that synchronizing them once
would be sufficient and they won't drift until shutdown. Linux probably
also doesn't keep re-synchronizing them without a reason. Just using a
single timecounter source simply appears to be the better choice.
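
One rough way to put a number on the difference between two tick
counters is a round-trip estimate of the kind network time protocols
use: sample the local counter, have the other CPU report its counter,
then sample the local counter again. The following is a minimal sketch
of the arithmetic only, assuming the request and the reply take equally
long. tick_offset_estimate() and tick_offset_error() are hypothetical
names, not existing FreeBSD functions, and how the remote sample is
obtained (for example via an IPI) is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not FreeBSD code: estimate how far a remote CPU's
 * tick counter is ahead of the local one from a single round trip.
 *
 *   t0: local counter when the request is sent
 *   tr: remote counter when the request is handled
 *   t1: local counter when the reply arrives
 *
 * Assumption: both legs of the round trip take equally long, so the
 * remote sample corresponds to the local midpoint t0 + (t1 - t0) / 2.
 * A negative result means the remote counter is behind the local one.
 */
static int64_t
tick_offset_estimate(uint64_t t0, uint64_t tr, uint64_t t1)
{
	uint64_t mid;

	assert(t1 >= t0);
	mid = t0 + (t1 - t0) / 2;
	return ((int64_t)(tr - mid));
}

/* Upper bound on the error of the estimate: half the round-trip time. */
static uint64_t
tick_offset_error(uint64_t t0, uint64_t t1)
{
	assert(t1 >= t0);
	return ((t1 - t0) / 2);
}
```

Repeating the measurement and keeping the sample with the smallest
round trip tightens the error bound; comparing estimates taken some
seconds apart would show drift as opposed to a constant initial offset.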
>
> > I'm not really sure what to do about that. Earlier you already said
> > that sched_bind(9) also isn't an option in case if td_critnest > 1.
> > To be honest, I don't really understand why using a spin lock in the
> > timecounter path makes sparc64 the only problematic architecture
> > for your changes. The x86 i8254_get_timecount() also uses a spin lock
> > so it should be in the same boat.
>
> The problem is not in using spinlock, but in waiting for other CPU while
> spinlock is held. Other CPU may also hold spinlock and wait for
> something, causing deadlock. i8254 code uses spinlock just to atomically
> access hardware registers, so it causes no problems.

Okay, but wouldn't that be a general problem then? Pretty much
anything triggering an IPI holds smp_ipi_mtx while doing so and the
lower level IPI stuff waits for other CPU(s), including on x86.

>
> > The affected machines are equipped with a x86-style south bridge
> > which exposes a power management unit (intended to be used as a SMBus
> > bridge only in these machines) on the PCI bus. Actually, this device
> > also includes an ACPI power management timer. However, I've just
> > spent a day trying to get that one working without success - it
> > just doesn't increment. Probably its clock input isn't connected as
> > it's not intended to be used in these machines.
> > That south bridge also includes 8254 compatible timers on the ISA/
> > LPC side, but are hidden from the OFW device tree. I can hack these
> > devices into existence and give it a try, but even if that works this
> > likely would use the same code as the x86 i8254_get_timecount() so I
> > don't see what would be gained with that.
> >
> > The last thing in order to avoid using the tick counter as timecounter
> > in the MP case I can think of is that the Broadcom MACs in the affected
> > machines also provide a counter driven by a 1 MHz clock. If that's good
> > enough for a timecounter I can hook these up (in case these work ...)
> > and hack bge(4) to not detach in that case (given that we can't detach
> > timecounters ...).
>
> i8254 on x86 is also just a bit above 1MHz.
>
> >> Here is small tool we are using for test correctness and performance of
> >> different user-level APIs: http://people.freebsd.org/~mav/testsleep.c
> >>
> >
> > I've run Ian's set of tests on a v215 with and without your
> > calloutng_12_26.patch and on a v210 (these uses the IPI approach)
> > with the latter also applied.
> > I'm not really sure what to make out of the numbers.
> >
> >                    v215 w/o          v215 w/          v210 w/
> > ---------- ---------------- ---------------- ----------------
> > select          1   1999.61      1     23.87      1     29.97
> > poll            1   1999.70      1   1069.61      1   1075.24
> > usleep          1   1999.86      1     23.43      1     28.99
> > nanosleep       1    999.92      1     23.28      1     28.66
> > kqueue          1   1000.12      1   1071.13      1   1076.35
> > kqueueto        1    999.56      1     26.33      1     31.34
> > syscall         1      1.89      1      1.92      1      2.88

FYI, these are the results of the v215 (btw., these (ab)use a bus
cycle counter of the host-PCI-bridge as timecounter) with your
calloutng_12_17.patch and kern.timecounter.alloweddeviation=0:

select          1      23.82
poll            1    1008.23
usleep          1      23.31
nanosleep       1      23.17
kqueue          1    1010.35
kqueueto        1      26.26
syscall         1       1.91

select        300     307.72
poll          300    1008.23
usleep        300     307.64
nanosleep     300      23.21
kqueue        300    1010.49
kqueueto      300      26.27
syscall       300       1.92

select       3000    3009.95
poll         3000    3013.33
usleep       3000    3013.56
nanosleep    3000      23.17
kqueue       3000    3011.09
kqueueto     3000      26.24
syscall      3000       1.91

select      30000   30013.51
poll        30000   30010.63
usleep      30000   30010.64
nanosleep   30000      36.91
kqueue      30000   30012.38
kqueueto    30000      39.90
syscall     30000       1.90

select     300000  300017.52
poll       300000  300013.00
usleep     300000  300012.64
nanosleep  300000     307.59
kqueue     300000  300017.07
kqueueto   300000     310.24
syscall    300000       1.93

>
> Numbers are not bad, respecting the fact that to protect from lost
> interrupts eventtimer code on sparc64 now sets minimal programming
> interval to 15us.
> It was made to reduce race window between the timer
> read-modify-write and some long NMIs.

Uhm, there are no NMIs on sparc64. Does it make sense to bypass this
adjustment on sparc64?

> May be with rereading counter
> after programming comparator (same as done for HPET, reading which is
> probably much more expensive) this value could be reduced.
>

I see. There are some bigger fish to fry at the moment though :)

Marius
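
For reference, the "reread counter after programming comparator" idea
quoted above can be sketched against a simulated timer. This is only an
illustration under stated assumptions, not the sparc64 or HPET driver
code: struct sim_timer and sim_timer_arm() are hypothetical, and the
write_cost field merely models the time that passes while the
comparator is being written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated one-shot timer; a stand-in for real hardware registers. */
struct sim_timer {
	uint64_t counter;	/* free-running counter */
	uint64_t comparator;	/* match value that raises the interrupt */
	uint64_t write_cost;	/* ticks that elapse while programming */
};

/*
 * Program the comparator `delta` ticks ahead, then re-read the counter.
 * Returns true if the counter has already reached or passed the
 * comparator, i.e. the match may have been missed and the caller has
 * to handle the event itself (or retry with a larger delta).
 */
static bool
sim_timer_arm(struct sim_timer *t, uint64_t delta)
{
	uint64_t now;

	assert(t != NULL);
	now = t->counter;
	t->comparator = now + delta;
	t->counter += t->write_cost;	/* time passes during the write */
	now = t->counter;		/* the re-read */
	/* Signed difference keeps the check correct across wrap-around. */
	return ((int64_t)(now - t->comparator) >= 0);
}
```

With such a check in place, a short delta that loses the race is
detected instead of silently dropping the interrupt, which is what
would allow a minimum programming interval smaller than a fixed safety
margin.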