Date: Thu, 31 Mar 2005 20:50:50 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Uwe Doering <gemini@geminix.org>
Cc: Joshua Coombs <jcoombs@gwi.net>
Subject: Re: kern/79339: [patch] Kernel time code sync with improvements from DragonFly
Message-ID: <20050331183426.O20748@delplex.bde.org>
In-Reply-To: <424B9C67.6090201@geminix.org>
References: <200503301440.j2UEe7s9078005@freefall.freebsd.org> <424B9C67.6090201@geminix.org>

On Thu, 31 Mar 2005, Uwe Doering wrote:

> Joshua Coombs wrote:
>> Testing with wakeup_latency.c on a 5.3-Rel box shows the same symptom set.
>> I've not yet tested the proposed fix on 5-x. I will try duplicating this
>> issue on 6-current as well to nail down the problem scope.
>
> Please also look at what's actually in DragonFly's CVS repository. Your PR
> is based on the original patch, while the code in DragonFly is more
> sophisticated. Namely, tvtohz() was split into two functions, tvtohz_low()
> and tvtohz_high(), which replace the original function depending on the
> context tvtohz() appears in.
>
> From this I conclude that the original patch is insufficient (likely to break
> parts of the kernel), and that integrating this improvement into FreeBSD
> might not be as easy and straightforward as it appears to be at first glance.
> On the other hand, with some effort it ought to be doable.

Indeed. Here is a discussion of some of the bugs in the patch:

% >Fix:
% /usr/src/sys/kern/kern_clock.c
% 325c325
% < / tick + 1;
% ---
% > / tick;
% 328c328
% < + ((unsigned long)usec + (tick - 1)) / tick + 1;
% ---
% > + ((unsigned long)usec + (tick - 1)) / tick;

This breaks all callers of tvtohz() except the one that is changed in the
patch to expect this API change. The comment before tvtohz() still says
that tvtohz() adds 1.

% /usr/src/sys/kern/kern_time.c
% 232c232
% < int error;
% ---
% > int error, sleepticks;
% 241a242
% > sleepticks = tvtohz(&tv);
% 243c244
% < tvtohz(&tv));
% ---
% > (sleepticks < 1)? 1 : sleepticks);

This is more or less correct. 1 should be subtracted from tvtohz() in
callers that do a careful comparison of the times before and after the
sleep so that they can tell if the sleep time has completely expired. The
function here (nanosleep1()) is not quite such a caller. It does a sloppy
comparison of times, using getnanouptime() instead of nanouptime().
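
As a rough illustration of the rounding at issue, here is a user-space
sketch. It is not the kernel's tvtohz(): the names model_tvtohz() and
min_elapsed_usec() are hypothetical, overflow handling is omitted, and hz is
assumed to divide 1000000 evenly. It shows why the extra tick matters: a
sleep that starts just before a tick boundary loses almost a whole tick.

```c
/*
 * Hypothetical model of the conversion discussed above: round the
 * interval up to whole ticks, then optionally add 1 as tvtohz() does.
 * Assumes hz divides 1000000 evenly and the result does not overflow.
 */
static unsigned long
model_tvtohz(unsigned long sec, unsigned long usec, unsigned long hz,
    int add_one)
{
	unsigned long tick = 1000000UL / hz;	/* usec per tick */
	unsigned long ticks = sec * hz + (usec + (tick - 1)) / tick;

	return (add_one ? ticks + 1 : ticks);
}

/*
 * Shortest time a sleep of `ticks' ticks is guaranteed to last: the
 * first tick may arrive almost immediately, so only ticks - 1 full
 * tick periods are certain to elapse.
 */
static unsigned long
min_elapsed_usec(unsigned long ticks, unsigned long hz)
{
	return (ticks == 0 ? 0 : (ticks - 1) * (1000000UL / hz));
}
```

With hz = 100, a 15000 usec request gives 3 ticks with the extra tick (at
least 20000 usec guaranteed) and 2 ticks without it (only 10000 usec
guaranteed). A caller that drops the extra tick therefore has to re-check
the time accurately and be prepared to sleep again.
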
getnanouptime() has a resolution of 1/ticktock_hz, where ticktock_hz is
approximately min(hz, 1000) (normally just hz), so there is a possible
error of 2/ticktock_hz in the comparison. I think all the errors go the
same way, so the maximum error is 1/ticktock_hz. The extra tick added by
tvtohz() accidentally compensates for this error. Synchronization effects
may reduce (or increase?) the error. The first getnanouptime() is
unsynchronized, but ones done just after timeout returns are synced with
clock interrupts, so they give a fairly accurate time every
hz/ticktock_hz hardclock interrupts. Anyway, if 1 is subtracted from
tvtohz(), then nanouptime() should be used to avoid these errors.

There are many other callers like nanosleep1(): the ones for select(2),
poll(2) and setitimer(2). These all depend on tvtohz() adding 1 to ensure
that they sleep for the specified interval, and they all do sloppy
comparisons like nanosleep1(), so they all need similar changes if you want
timeouts to be synchronized with 1/HZ second boundaries as perfectly as
possible.

% 252c253,254
% < *rmt = ts;
% ---
% > rmt->tv_sec = ts.tv_sec;
% > rmt->tv_nsec = ts.tv_nsec;
% 258c260,261
% < ts3 = ts;
% ---
% > ts3.tv_sec = ts.tv_sec;
% > ts3.tv_nsec = ts.tv_nsec;

These changes just introduce style bugs.

% 260a264,265
% > if (tv.tv_sec == 0 && tv.tv_usec < tick)
% > return (0);

This can't be right. We have just not-so-carefully checked whether the
time has expired, and only get here when it hasn't. (tv.tv_sec == 0 &&
tv.tv_usec < tick) means that we would have preferred the sleep time to be
less than 1 tick. We had to request a sleep of exactly 1 tick because less
than 1 is impossible (this is with 1 subtracted from tvtohz()). Sleeping
for exactly 1 tick is also impossible, so we have woken up after an
interval of anywhere between 0+epsilon and (1-epsilon+latency) seconds.
The interval may be significantly smaller or larger than `tv' and we must
go back to sleep if it is smaller.
The above change breaks this.

I think the problem that this change is supposed to fix is related to the
tick frequency not being an exact multiple of 1/HZ. Also, to avoid sleeping
longer than necessary, we should try to wake up 1 tick early and then
decide whether to sleep another tick or 2 to finish. Note that although
tvtohz() always rounds up, physical sleep intervals are always shorter than
the specified timeout, so waking up 1 tick early is very common for
unsynchronized sleeps. Thus if we subtract 1 from tvtohz(), we often wake
up 1 tick early as a side effect, which is what we want, but there is a
problem: suppose that everything is in perfect sync, but the hardclock
interrupt period is slightly less than 1/HZ seconds. Then we may wake up
5 usec or so early and decide to go back to sleep, giving a large error.
Changes later in the patch are related to this.

I think we shouldn't do anything special here except possibly return early
if `tv' is very small. Going around the loop in nanosleep1() an extra time
is a small pessimization. Using nanouptime() to get the decision of whether
to loop right is a pessimization too, but it is relatively small.

% /usr/src/sys/i386/isa/clock.c
% 113c113,114
% < #define TIMER_DIV(x) ((timer_freq + (x) / 2) / (x))
% ---
% > #define TIMER_DIV(x) (timer_freq / (x))
% > #define FRAC_ADJUST(x) (timer_freq - ((timer freq / (x)) * (x)))

Reducing TIMER_DIV() unconditionally would be harmless under FreeBSD. Its
rounding to nearest dates from when there was little more than hardclock
ticks for timekeeping. Now HZ and the hardclock interrupt frequency are
almost unrelated to timekeeping.

% 141a143
% > u_int timer0_frac_freq;
% 204a207,209
% > int phase;
% > int delta;
% >
% 215a221,236
% >
% > phase = 1000000 / timer0_frac_freq;
% > delta = timecounter->tc_microtime.tv_usec % phase;

tc_microtime.tv_usec is not quite the right thing to use here. It is
updated every tick or two so it might be up to date, but it has unnecessary
jitter.
microtime() would give a more accurate timestamp. I think microtime() and
not microuptime() is the correct function to use here, since we want to
sync with the real time. OTOH, nanosleep1() and friends use the uptime, so
they must be looked at some more to determine the effects of using
different time scales on syncing. I think the synchronization done here is
honored by nanosleep1() despite the different scales, and sync is only lost
when the clock is changed using settimeofday() (then everything gets out of
sync).

% > #if 1
% > disable_intr();

The clock should be read inside this critical section.

% > if (delta < (phase >> 1)) {
% > 	outb(TIMER_CNTR0, timer0_max_count & 0xff);
% > 	outb(TIMER_CNTR0, timer0_max_count >> 8);
% > } else {
% > 	outb(TIMER_CNTR0, (timer0_max_count +1) & 0xff);
% > 	outb(TIMER_CNTR0, (timer0_max_count +1) >> 8);
% > 	++i8254_offset;
% > }

I think i8254_offset needs to be reinitialized every time the maximum count
is reprogrammed. This is not done in set_timer_freq(); however, most
callers of set_timer_freq() initialize or update the i8254 timecounter
immediately after, and testing shows that this reduces lost ticks to an
acceptable value (usually, and hopefully always < 10). Correctly
reprogramming the i8254 on every interrupt is harder. Losing even 1 tick
per interrupt is too much, but I think the above can sometimes lose 100 (if
clkintr() is delayed for that long, which can easily happen especially in
RELENG_4 since clkintr() is not a fast interrupt handler there). See nearby
code that calls i8254_get_timecount() inside a critical section for a way
to reduce the error to at most 5 ticks. It takes about 5 ticks just to read
the counter. This is still far too large to do on every clock tick. All of
this only matters if the i8254 is used for timekeeping.
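
To make the divisor arithmetic behind the TIMER_DIV() change and the
alternation between timer0_max_count and timer0_max_count + 1 concrete, here
is a small sketch. It assumes the conventional i8254 input clock of
1193182 Hz; the function names are hypothetical and merely mirror the macros
in the patch.

```c
/* Assumed i8254 input clock in Hz (the conventional PC value). */
#define ASSUMED_TIMER_FREQ	1193182U

/* Round-to-nearest divisor, as in the existing TIMER_DIV(). */
static unsigned int
timer_div_nearest(unsigned int timer_freq, unsigned int hz)
{
	return ((timer_freq + hz / 2) / hz);
}

/* Truncating divisor, as in the patched TIMER_DIV(). */
static unsigned int
timer_div_trunc(unsigned int timer_freq, unsigned int hz)
{
	return (timer_freq / hz);
}

/* Input clock cycles per second left over by the truncating divisor. */
static unsigned int
frac_adjust(unsigned int timer_freq, unsigned int hz)
{
	return (timer_freq - (timer_freq / hz) * hz);
}
```

For hz = 100 the two divisors are 11932 and 11931, and the truncating
divisor leaves 82 cycles per second unaccounted for. Neither divisor gives
exactly 100 interrupts per second, so a scheme that wants to stay in phase
with real time has to mix the two counts, which is what the quoted hunk
appears to attempt.
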

% > enable_intr();
% > #endif
% >
% 236a258
% > timer0_frac_freq = new_rate;
% 247,248c269,270
% < if ((timer0_prescaler_count += timer0_max_count)
% < >= hardclock_max_count) {
% ---
% > timer0_prescaler_count += timer0_max_count;
% > if (timer0_prescaler_count >= hardclock_max_count) {

This change is just to style.

% 689a712
% > timer0_frac_freq = intr_freq;

The changes seem to be too simple to give a PLL. I didn't check the details
for this.

% 1221c1244
% < count = timer0_max_count - ((high << 8) | low);
% ---
% > count = timer0_max_count + 1 - ((high << 8) | low);

Always adding 1 here seems to be wrong. Shouldn't you only add 1 if
timer0_max_count isn't actually the max count, i.e., when the max count has
been programmed to be 1 more than usual? All references to timer0_max_count
are potentially wrong when timer0_max_count isn't actually the max count.
You add 1 to i8254_offset in the above; this seems to be to adjust for 1 of
the references being wrong, but it doesn't seem to adjust for `count' being
1 too large.

% A sawtooth is still present, but the accuracy is MUCH better. I suspect my
% hack application of the PLL function isn't correct or my P133 is slow
% enough that I'm observing some other latencies. I have observed occasional
% negative offsets, which according to the article are strictly forbidden by
% RFCs, so please check my work. I believe they were the result of my
% playing with a hz value too high for the machine to reasonably handle, and
% are not occurring with saner values for hz.

I only agree with the non-hardware changes (don't sleep for an extra tick
in nanosleep1() and friends if this is easy to avoid). All that perfect
sync of real time with the hardclock() clock gives is the possibility of
waking up on precisely 1/HZ boundaries relative to real time (with whole
seconds being boundaries). System activity lengthens sleeps by
indeterminate amounts except on unloaded systems.
The average error for a random sleep on an unloaded system would still be
0.5/HZ (or 1.5/HZ without the nanosleep1() change).

Bruce