Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 14 Jan 2013 11:38:20 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Alexander Motin <mav@FreeBSD.org>
Cc:        Davide Italiano <davide@FreeBSD.org>, freebsd-arch@FreeBSD.org, FreeBSD Current <freebsd-current@FreeBSD.org>, Marius Strobl <marius@alchemy.franken.de>
Subject:   Re: [RFC/RFT] calloutng
Message-ID:  <20130114102118.V1045@besplex.bde.org>
In-Reply-To: <50F30CAB.3000001@FreeBSD.org>
References:  <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> <50D03173.9080904@FreeBSD.org> <20121225232126.GA47692@alchemy.franken.de> <50DB4EFE.2020600@FreeBSD.org> <20130106152313.GD26039@alchemy.franken.de> <50EBF921.2000304@FreeBSD.org> <20130113180940.GM26039@alchemy.franken.de> <50F30CAB.3000001@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, 13 Jan 2013, Alexander Motin wrote:

> On 13.01.2013 20:09, Marius Strobl wrote:
>> On Tue, Jan 08, 2013 at 12:46:57PM +0200, Alexander Motin wrote:
>>> On 06.01.2013 17:23, Marius Strobl wrote:
>>>> I'm not really sure what to do about that. Earlier you already said
>>>> that sched_bind(9) also isn't an option in case if td_critnest > 1.
>>>> To be honest, I don't really unerstand why using a spin lock in the
>>>> timecounter path makes sparc64 the only problematic architecture
>>>> for your changes. The x86 i8254_get_timecount() also uses a spin lock
>>>> so it should be in the same boat.
>>>
>>> The problem is not in using spinlock, but in waiting for other CPU while
>>> spinlock is held. Other CPU may also hold spinlock and wait for
>>> something, causing deadlock. i8254 code uses spinlock just to atomically
>>> access hardware registers, so it causes no problems.
>>
>> Okay, but wouldn't that be a general problem then? Pretty much
>> anything triggering an IPI holds smp_ipi_mtx while doing so and
>> the lower level IPI stuff waits for other CPU(s), including on
>> x86.
>
> The problem is general. But now it works because single smp_ipi_mtx is
> used in all cases where IPI result is waited. As soon as spinning
> happens with interrupts still enabled, there is no deadlocks. But
> problem reappears if any different lock is used, or locks are nested.
>
> In existing code in HEAD and 9 timecounters are never called with spin
> mutex held.  I intentionally tried to avoid that in existing eventtimers
> code.

Er, timecounters are called with a spin mutex held in existing code:
though it is dangerous to do so, timecounters are called from fast
interrupt handlers for very timekeeping-critical purposes:
- to implement the TIOCTIMESTAMP ioctl (except this is broken in
   -current).  This was a primitive version of pps timestamping.
- for pps timestamping.  The interrupt handler (which should be a fast
   interrupt handler to minimize latency) calls pps_capture() which
   calls tc_get_timecount() and does other "lock-free" accesses to the
   timecounter state.  This still works in -current (at least there is
   still code for it).

   OTOH, all drivers that call pps_capture() from their interrupt handler
   then immediately call pps_event().  This has always been very broken,
   and became even more broken with SMPng.  pps_event() does many more
   timecounter and pps accesses whose locking is unclear at best, and
   in some configurations it calls hardpps(), which is only locked by
   Giant, despite comments in kern_ntptime.c still saying that it (and
   many other functions in kern_ntptime.c) must be called at splclock()
   or higher.  splclock() is of course now null, but the locking
   requirements in kern_ntptime.c haven't changed much.  kern_ntptime.c
   always needed to be locked by the equivalent of a spin mutex, which
   is stronger locking than was given by splclock().  pps_event() would
   have to aquire the spin mutex before calling hardpps(), although
   this is bad for fast interrupt handlers.  The correct implementation
   is probably to only do the capture part from fast interrupt handlers.

> Callout code same time can be called in any environment with any
> locks held. And new callout code may need to know precise current time
> in any of those conditions. Attempt to use an IPI and wait there can be
> fatal.

Callout code can't be called from such a general "any" environment as
timecounter code.  Not from a fast interrupt handler.  Not from an NMI
or IPI handler.  I hope.  But timecounter code has a good chance of
working even for the last 2 environments, due to its design requirement
of working in the first.

The spinlock in the i8254 timecounter certainly breaks some cases.
For example, suppose the lock is held for a timecounter read from
normal context.  It masks hardware interrupts on the current CPU (except
in my version).  It doesn't mask NMIs or other traps.  So if the NMI
or other trap handler does a timecounter hardware call, there is
deadlock in at least the !SMP case.  In my version, it blocks normal
interrupts later if they occur, but doesn't block fast interrupts, so
the pps_capture() call would deadlock if it occurs, like a timecounter
call from an NMI.  I avoid this by not using pps in any fast interrupt
handler, and by only using the i8254 timecounter for testing.  I do
use pps in a (nonstandard) x86 RTC clock interrupt handler.  My clock
interrupt handlers are all non-fast to avoid this and other locking
problems.

>> FYI, these are the results of the v215 (btw., these (ab)use a bus
>> cycle counter of the host-PCI-bridge as timecounter) with your
>> calloutng_12_17.patch and kern.timecounter.alloweddeviation=0:
>> select         1     23.82
>> poll           1   1008.23
>> usleep         1     23.31
>> nanosleep      1     23.17
>> kqueue         1   1010.35
>> kqueueto       1     26.26
>> syscall        1      1.91
>> select       300    307.72
>> poll         300   1008.23
>> usleep       300    307.64
>> nanosleep    300     23.21

Please fix the tv_nsec initialization so that we can see if nanosleep()
and kqueue() work.  (The nanosleep() timeout here is actually 300 nsec,
so 23.21 usec being shorter than 300 usec is not a kernel bug.)

>> kqueue       300   1010.49
>> kqueueto     300     26.27
>> syscall      300      1.92
>> ...
>
> It seems that extra delay is only about 10-17us.

Sometimes only 7.  But 23-26 for short timeouts.

For long timeouts, it should be closer to 0 extra, since the overhead for
scheduling and handling timer interrupts is smaller than the timeout so
it can be subtracted from the timeout to get accuracy.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130114102118.V1045>