Date: Mon, 14 Jan 2013 11:38:20 +1100 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Alexander Motin <mav@FreeBSD.org> Cc: Davide Italiano <davide@FreeBSD.org>, freebsd-arch@FreeBSD.org, FreeBSD Current <freebsd-current@FreeBSD.org>, Marius Strobl <marius@alchemy.franken.de> Subject: Re: [RFC/RFT] calloutng Message-ID: <20130114102118.V1045@besplex.bde.org> In-Reply-To: <50F30CAB.3000001@FreeBSD.org> References: <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> <50D03173.9080904@FreeBSD.org> <20121225232126.GA47692@alchemy.franken.de> <50DB4EFE.2020600@FreeBSD.org> <20130106152313.GD26039@alchemy.franken.de> <50EBF921.2000304@FreeBSD.org> <20130113180940.GM26039@alchemy.franken.de> <50F30CAB.3000001@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, 13 Jan 2013, Alexander Motin wrote: > On 13.01.2013 20:09, Marius Strobl wrote: >> On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote: >>> On 06.01.2013 17:23, Marius Strobl wrote: >>>> I'm not really sure what to do about that. Earlier you already said >>>> that sched_bind(9) also isn't an option in case if td_critnest > 1. >>>> To be honest, I don't really unerstand why using a spin lock in the >>>> timecounter path makes sparc64 the only problematic architecture >>>> for your changes. The x86 i8254_get_timecount() also uses a spin lock >>>> so it should be in the same boat. >>> >>> The problem is not in using spinlock, but in waiting for other CPU while >>> spinlock is held. Other CPU may also hold spinlock and wait for >>> something, causing deadlock. i8254 code uses spinlock just to atomically >>> access hardware registers, so it causes no problems. >> >> Okay, but wouldn't that be a general problem then? Pretty much >> anything triggering an IPI holds smp_ipi_mtx while doing so and >> the lower level IPI stuff waits for other CPU(s), including on >> x86. > > The problem is general. But now it works because single smp_ipi_mtx is > used in all cases where IPI result is waited. As soon as spinning > happens with interrupts still enabled, there is no deadlocks. But > problem reappears if any different lock is used, or locks are nested. > > In existing code in HEAD and 9 timecounters are never called with spin > mutex held. I intentionally tried to avoid that in existing eventtimers > code. Er, timecounters are called with a spin mutex held in existing code: though it is dangerous to do so, timecounters are called from fast interrupt handlers for very timekeeping-critical purposes: - to implement the TIOCTIMESTAMP ioctl (except this is broken in -current). This was a primitive version of pps timestamping. - for pps timestamping. The interrupt handler (which should be a fast interrupt handler to minimize latency) calls pps_capture() which calls tc_get_timecount() and does other "lock-free" accesses to the timecounter state. This still works in -current (at least there is still code for it). OTOH, all drivers that call pps_capture() from their interrupt handler then immediately call pps_event(). This has always been very broken, and became even more broken with SMPng. pps_event() does many more timecounter and pps accesses whose locking is unclear at best, and in some configurations it calls hardpps(), which is only locked by Giant, despite comments in kern_ntptime.c still saying that it (and many other functions in kern_ntptime.c) must be called at splclock() or higher. splclock() is of course now null, but the locking requirements in kern_ntptime.c haven't changed much. kern_ntptime.c always needed to be locked by the equivalent of a spin mutex, which is stronger locking than was given by splclock(). pps_event() would have to aquire the spin mutex before calling hardpps(), although this is bad for fast interrupt handlers. The correct implementation is probably to only do the capture part from fast interrupt handlers. > Callout code same time can be called in any environment with any > locks held. And new callout code may need to know precise current time > in any of those conditions. Attempt to use an IPI and wait there can be > fatal. Callout code can't be called from such a general "any" environment as timecounter code. Not from a fast interrupt handler. Not from an NMI or IPI handler. I hope. But timecounter code has a good chance of working even for the last 2 environments, due to its design requirement of working in the first. The spinlock in the i8254 timecounter certainly breaks some cases. For example, suppose the lock is held for a timecounter read from normal context. It masks hardware interrupts on the current CPU (except in my version). It doesn't mask NMIs or other traps. So if the NMI or other trap handler does a timecounter hardware call, there is deadlock in at least the !SMP case. In my version, it blocks normal interrupts later if they occur, but doesn't block fast interrupts, so the pps_capture() call would deadlock if it occurs, like a timecounter call from an NMI. I avoid this by not using pps in any fast interrupt handler, and by only using the i8254 timecounter for testing. I do use pps in a (nonstandard) x86 RTC clock interrupt handler. My clock interrupt handlers are all non-fast to avoid this and other locking problems. >> FYI, these are the results of the v215 (btw., these (ab)use a bus >> cycle counter of the host-PCI-bridge as timecounter) with your >> calloutng_12_17.patch and kern.timecounter.alloweddeviation=0: >> select 1 23.82 >> poll 1 1008.23 >> usleep 1 23.31 >> nanosleep 1 23.17 >> kqueue 1 1010.35 >> kqueueto 1 26.26 >> syscall 1 1.91 >> select 300 307.72 >> poll 300 1008.23 >> usleep 300 307.64 >> nanosleep 300 23.21 Please fix the tv_nsec initialization so that we can see if nanosleep() and kqueue() work. (The nanosleep() timeout here is actually 300 nsec, so 23.21 usec being shorter than 300 usec is not a kernel bug.) >> kqueue 300 1010.49 >> kqueueto 300 26.27 >> syscall 300 1.92 >> ... > > It seems that extra delay is only about 10-17us. Sometimes only 7. But 23-26 for short timeouts. For long timeouts, it should be closer to 0 extra, since the overhead for scheduling and handling timer interrupts is smaller than the timeout so it can be subtracted from the timeout to get accuracy. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130114102118.V1045>