Date: Sun, 13 Jan 2013 21:36:11 +0200
From: Alexander Motin <mav@FreeBSD.org>
To: Marius Strobl <marius@alchemy.franken.de>
Cc: Davide Italiano <davide@FreeBSD.org>,
    FreeBSD Current <freebsd-current@FreeBSD.org>,
    freebsd-arch@FreeBSD.org
Subject: Re: [RFC/RFT] calloutng
Message-ID: <50F30CAB.3000001@FreeBSD.org>
In-Reply-To: <20130113180940.GM26039@alchemy.franken.de>
References: <50CCAB99.4040308@FreeBSD.org>
    <50CE5B54.3050905@FreeBSD.org>
    <50D03173.9080904@FreeBSD.org>
    <20121225232126.GA47692@alchemy.franken.de>
    <50DB4EFE.2020600@FreeBSD.org>
    <20130106152313.GD26039@alchemy.franken.de>
    <50EBF921.2000304@FreeBSD.org>
    <20130113180940.GM26039@alchemy.franken.de>

On 13.01.2013 20:09, Marius Strobl wrote:
> On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote:
>> On 06.01.2013 17:23, Marius Strobl wrote:
>>> I'm not really sure what to do about that. Earlier you already said
>>> that sched_bind(9) also isn't an option in case if td_critnest > 1.
>>> To be honest, I don't really understand why using a spin lock in the
>>> timecounter path makes sparc64 the only problematic architecture
>>> for your changes. The x86 i8254_get_timecount() also uses a spin lock
>>> so it should be in the same boat.
>>
>> The problem is not in using spinlock, but in waiting for other CPU while
>> spinlock is held. Other CPU may also hold spinlock and wait for
>> something, causing deadlock. i8254 code uses spinlock just to atomically
>> access hardware registers, so it causes no problems.
>
> Okay, but wouldn't that be a general problem then? Pretty much
> anything triggering an IPI holds smp_ipi_mtx while doing so and
> the lower level IPI stuff waits for other CPU(s), including on
> x86.

The problem is general. But it works now because a single smp_ipi_mtx
is used in all cases where an IPI result is waited for. As long as the
spinning happens with interrupts still enabled, there are no deadlocks.
But the problem reappears if any different lock is used, or if locks
are nested.

In the existing code in HEAD and 9, timecounters are never called with
a spin mutex held. I intentionally tried to avoid that in the existing
eventtimers code. Callout code, at the same time, can be called in any
environment with any locks held. And the new callout code may need to
know the precise current time in any of those conditions. An attempt to
use an IPI and wait there can be fatal.

>>> The affected machines are equipped with a x86-style south bridge
>>> which exposes a power management unit (intended to be used as a SMBus
>>> bridge only in these machines) on the PCI bus. Actually, this device
>>> also includes an ACPI power management timer.
>>> However, I've just
>>> spent a day trying to get that one working without success - it
>>> just doesn't increment. Probably its clock input isn't connected as
>>> it's not intended to be used in these machines.
>>> That south bridge also includes 8254 compatible timers on the ISA/
>>> LPC side, but they are hidden from the OFW device tree. I can hack these
>>> devices into existence and give it a try, but even if that works this
>>> likely would use the same code as the x86 i8254_get_timecount() so I
>>> don't see what would be gained with that.
>>>
>>> The last thing in order to avoid using the tick counter as timecounter
>>> in the MP case I can think of is that the Broadcom MACs in the affected
>>> machines also provide a counter driven by a 1 MHz clock. If that's good
>>> enough for a timecounter I can hook these up (in case these work ...)
>>> and hack bge(4) to not detach in that case (given that we can't detach
>>> timecounters ...).
>>
>> i8254 on x86 is also just a bit above 1MHz.
>>
>>>> Here is a small tool we are using to test correctness and performance of
>>>> different user-level APIs: http://people.freebsd.org/~mav/testsleep.c
>>>>
>>>
>>> I've run Ian's set of tests on a v215 with and without your
>>> calloutng_12_26.patch and on a v210 (these use the IPI approach)
>>> with the latter also applied.
>>> I'm not really sure what to make out of the numbers.
>>>
>>>                v215 w/o         v215 w/          v210 w/
>>> ---------- ---------------- ---------------- ----------------
>>> select          1   1999.61      1     23.87      1     29.97
>>> poll            1   1999.70      1   1069.61      1   1075.24
>>> usleep          1   1999.86      1     23.43      1     28.99
>>> nanosleep       1    999.92      1     23.28      1     28.66
>>> kqueue          1   1000.12      1   1071.13      1   1076.35
>>> kqueueto        1    999.56      1     26.33      1     31.34
>>> syscall         1      1.89      1      1.92      1      2.88
>
> FYI, these are the results of the v215 (btw., these (ab)use a bus
> cycle counter of the host-PCI-bridge as timecounter) with your
> calloutng_12_17.patch and kern.timecounter.alloweddeviation=0:
> select          1      23.82
> poll            1    1008.23
> usleep          1      23.31
> nanosleep       1      23.17
> kqueue          1    1010.35
> kqueueto        1      26.26
> syscall         1       1.91
> select        300     307.72
> poll          300    1008.23
> usleep        300     307.64
> nanosleep     300      23.21
> kqueue        300    1010.49
> kqueueto      300      26.27
> syscall       300       1.92
> select       3000    3009.95
> poll         3000    3013.33
> usleep       3000    3013.56
> nanosleep    3000      23.17
> kqueue       3000    3011.09
> kqueueto     3000      26.24
> syscall      3000       1.91
> select      30000   30013.51
> poll        30000   30010.63
> usleep      30000   30010.64
> nanosleep   30000      36.91
> kqueue      30000   30012.38
> kqueueto    30000      39.90
> syscall     30000       1.90
> select     300000  300017.52
> poll       300000  300013.00
> usleep     300000  300012.64
> nanosleep  300000     307.59
> kqueue     300000  300017.07
> kqueueto   300000     310.24
> syscall    300000       1.93

It seems that the extra delay is only about 10-17us.

>> Numbers are not bad, respecting the fact that to protect from lost
>> interrupts eventtimer code on sparc64 now sets minimal programming
>> interval to 15us. It was made to reduce race window between the timer
>> read-modify-write and some long NMIs.
>
> Uhm, there are no NMIs on sparc64. Does it make sense to bypass this
> adjustment on sparc64?

If it is not possible or not good to stop the timer during programming,
there will always be some race window between code execution and timer
ticking. So some minimal safety range should be reserved. Though it
probably can be significantly reduced.
In the case of x86/HPET there is the additional factor of NMIs, which
extends the race to an unpredictable level and so makes an additional
post-read almost mandatory.

>> May be with rereading counter
>> after programming comparator (same as done for HPET, reading which is
>> probably much more expensive) this value could be reduced.
>
> I see. There are some bigger fish to fry at the moment though :)

:)

-- 
Alexander Motin
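
For illustration, the race between programming a one-shot comparator and
the still-running counter, and the post-read check discussed above, can
be sketched roughly as follows. This is only a minimal simulation under
stated assumptions: struct sim_timer, sim_read(), sim_write_cmp() and
program_oneshot() are hypothetical names invented for this sketch, the
"hardware" is a plain in-memory counter, and none of this is the actual
sparc64 or HPET driver code.

```c
#include <stdint.h>

/* Hypothetical simulated timer: a free-running counter plus a comparator. */
struct sim_timer {
	uint32_t counter;	/* current count */
	uint32_t comparator;	/* event is expected when counter reaches this */
	uint32_t write_cost;	/* ticks that pass while the write lands */
};

static uint32_t
sim_read(struct sim_timer *t)
{
	return (t->counter);
}

static void
sim_write_cmp(struct sim_timer *t, uint32_t value)
{
	t->comparator = value;
	/* The counter is not stopped, so it keeps ticking meanwhile. */
	t->counter += t->write_cost;
}

/*
 * Arm a one-shot event 'delta' ticks from now, never closer than
 * 'min_delta'.  Returns 0 if the event is safely in the future, or 1 if
 * the post-read shows the counter has already reached the comparator,
 * in which case the interrupt may be lost and the caller has to handle
 * the event itself (or reprogram).
 */
static int
program_oneshot(struct sim_timer *t, uint32_t delta, uint32_t min_delta)
{
	uint32_t now, target;

	if (delta < min_delta)
		delta = min_delta;	/* reserved safety range */
	now = sim_read(t);
	target = now + delta;		/* wraps naturally at 32 bits */
	sim_write_cmp(t, target);
	now = sim_read(t);		/* post-read closes the race window */
	if ((int32_t)(target - now) <= 0)
		return (1);
	return (0);
}
```

With a write cost of 5 ticks, a 3 tick request and no minimum is
reported as already passed, while the same request with a minimum of 15
ticks is armed safely. A larger minimum trades latency for safety; the
post-read is what would allow that minimum to shrink.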