Date: Tue, 18 Jul 2017 18:08:14 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Ian Lepore <ian@freebsd.org> Cc: Bruce Evans <brde@optusnet.com.au>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r320901 - in head/sys: amd64/amd64 isa x86/isa Message-ID: <20170718153823.K1105@besplex.bde.org> In-Reply-To: <1500308973.22314.89.camel@freebsd.org> References: <201707120242.v6C2gvDB026199@repo.freebsd.org> <20170712224803.G1991@besplex.bde.org> <1500308973.22314.89.camel@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 17 Jul 2017, Ian Lepore wrote: > On Thu, 2017-07-13 at 01:42 +1000, Bruce Evans wrote: >> On Wed, 12 Jul 2017, Ian Lepore wrote: >> >>> Log: >>> Protect access to the AT realtime clock with its own mutex. >>> >>> The mutex protecting access to the registered realtime clock should not be >>> overloaded to protect access to the atrtc hardware, which might not even be >>> the registered rtc. More importantly, the resettodr mutex needs to be >>> eliminated to remove locking/sleeping restrictions on clock drivers, and >>> that can't happen if MD code for amd64 depends on it. This change moves the >>> protection into what's really being protected: access to the atrtc date and >>> time registers. >> >> The spin mutex provided some protection against timing bugs caused by >> preemption, but it is impossible to hold the mutex around the user code >> that is setting the time, so both the kernel and user code should check >> if the operation took too long and fail/retry if it did. >> I could barely read the reply, due to corruption of whitespace to spaces and \xa0. Formatting fixed, except all tabs are corrupted to spaces, and there are extra blank lines after quotes. > > I can't figure out what you're talking about here. There is no > userland code involved in any of this stuff. I seemed to have edited out examples with pseudo-code. Times are normally set by userland, and there is a larger race for the userland part than the kernel part: decide_time_to_set(); --> Preempt to run another thread. // Back here much later. Could be several quanta of 10 msec each // later. The time to set is now wrong by a large amount. clock_settime(CLOCK_REALTIME, ...); // further races here // Should check here to see if the above took too long, and retry // from the top if so. But on slow and/or loaded systems, this // might give endless retries. We should hold a mutex or something // around the whole operation to try to limit preemption. ntpd can be configured to run at high priority (-N flag, which gives realtime priority 0 on FreeBSD), but I haven't noticed any systems where this is used (except mine of course). ntpd mostly uses delta- micor-adjustments which are not affected much by the time taken to make the adjustment. At least old versions of ntpd don't do anything like the above when they set or step the time using clock_settime(). They just do the syscall. If this gives an error of only several milliseconds, then ntpd will correct the error eventually using delta-micro-adjustments. The kernel could also increase the thread's priority, but never did. Mutexes would help a bit. Parts of the operation are now in a critical section, which works similarly to increasing the priority to realtime 0. But nothing in the kernel can protect agains the race in the userland part. Kernel code using mutexes (of any sort) should look like the above retry loop if the mutex has to be dropped: // Have time to set. again: mtx_lock_spin(&time_mtx); ... r = set_time(); mtx_unlock_spin(&time_mtx); if (r != 0 || operation_took_too_long()) goto again; If set_time() has to drop the mutex, then it might abort and return nonzero to reach the retry at the top level, or retry or continue itself and depend on the top level retrying if necessary. >> With correct code like that, spin mutexes are still good for limiting >> retries. I think it is best to try to hold one around the entire kernel >> part of the operation, and release it when sleeping at lower levels. >> > > I'm not sure what this is about either. The current code for getting > the atrtc has no retries. It is too incorrect to retry. >>> Modified: head/sys/x86/isa/atrtc.c >>> ============================================================================== >>> --- head/sys/x86/isa/atrtc.c Tue Jul 11 21:55:20 2017 (r320900) >>> +++ head/sys/x86/isa/atrtc.c Wed Jul 12 02:42:57 2017 (r320901) >>> @@ -53,9 +53,17 @@ __FBSDID("$FreeBSD$"); >>> #include >>> #include "clock_if.h" >>> >>> +/* >>> + * clock_lock protects low-level access to individual hardware registers. >>> + * atrtc_time_lock protects the entire sequence of accessing multiple registers >>> + * to read or write the date and time. >>> + */ >>> #define RTC_LOCK do { if (!kdb_active) mtx_lock_spin(&clock_lock); } while (0) >>> #define RTC_UNLOCK do { if (!kdb_active) mtx_unlock_spin(&clock_lock); } while (0) >> >> This has a lot of old design and implementation bugs: >> - it abuses the i8254 mutex for the rtc >> - it obfuscates this using macros >> - locking individual register accesses is buggy. Register accesses need to >> be locked in groups like your new code does >> - the kdb_active hack is buggy (it can't use mutexes and is unnecessarily >> non-reentrant) >> The locks should be separate and statically allocated. >> > > I've always assumed the locking of individual register access was > because rtcin() and writertc() are global functions used by other parts > of the kernel to access individual bytes of cmos nvram, and that > locking the whole device for the duration of the larger operations > involved might cause problems. > > For example, the nvram driver does byte at a time accesses in a loop > that includes uimove() calls, so holding a lock on all atrtc hardware > access for the duration of that isn't a good option. > > It does seem like the atrtc code should have its own register-access > mutex rather than using clock_lock. As you mentioned, the fact that > it's borrowing a mutex is pretty well obscured; I hadn't noticed it. Forcing all accesses to go through rtcin() and writertc() was originally the only method of access control. writertc() was actually open-coded using outb(). This worked on non-preemptive kernels except for problems with interrupts. In 386BSD and and FreeBSD-1, the rtc didn't have an interrupt handler so there were no actual problems with interrupts. This depended on interrupt handlers not calling rtcin() or resettodr(). rtcin() was only called by inittodr() and the non-interrupt parts of the fd driver. This was broken when the rtc interrupt handler was added. The handler calls rtcin(), so everything that calls rtcin() must lock out the rtc interrupt handler or rtcin() itself must do this. This was broken as late as FreeBSD-3. The handler runs at splclock(), but rtcin() and its callers don't call splclock(). This didn't cause many problems since there aren't many callers. IIRC, the fd driver only calls rtcin() for initialization, and there was no cmos driver. The main race was for settimeofday() calling resettodr(). This is a very rare operation (it should be once per boot, with only micro-adjustments by ntpd after that). FreeBSD-4 has the necessary splclock() locking. SMPng broke this again. The splclock() locking became null. The main race is still with resettodr(). This became Giant-locked. But rtcintr() became a (broken) fast interrupt handler, so it is fundamentally Giant-unsafe and the Giant locking of resettodr() doesn't help. Most other subsystems avoided similar problems by using Giant locking for everything including interrupt handlers. This was eventually fixed by abusing the i8254 mutex for the rtc. >>> >>> +struct mtx atrtc_time_lock; >>> +MTX_SYSINIT(atrtc_lock_init, &atrtc_time_lock, "atrtc", MTX_DEF); >>> + >> >> I think the old mutex should be used here, after fixing it. Locking >> individual register access is a special case of locking groups of >> register acceses. It must be a spin mutex for low-level use, and that >> is good for high-level use too, to complete the high-level operations >> as soon as possible. >> > > I'm not sure what you're referring to when you say "the old mutex". clock_lock, after separating it from the i8254. >> You used this mutex for efirt.c, but I think that is another error like >> sharing the i8254 mutex (or the resettodr mutex). Different clock >> drivers should just use different mutexes. Perhaps some drivers actually >> need sleep mutexes. >> > > I assumed the reason the efirt code was using the same mutex that > protected access to the at rtc was because under the hood, the efi rt > clock is really the same hardware too, accessed via bios code. See kib's reply. I didn't know about any accesses in bios code. If that is written in asm (perhaps for resume code), then it wouldn't automatically get the common locking provided by rtcin() and writertc(). If the accesses are in the bios itself (vm86 or emulated or SMM), then who knows what happens. > >>> ... >>> @@ -352,6 +364,7 @@ atrtc_gettime(device_t dev, struct timespec *ts) >>> * to make sure that no more than 240us pass after we start reading, >>> * and try again if so. >>> */ >>> + mtx_lock(&atrtc_time_lock); >>> while (rtcin(RTC_STATUSA) & RTCSA_TUP) >>> continue; >>> critical_enter(); >> >> Note the comment about the 240us window. On i386, this was broken by >> SMPng, and almost fixed by starting the locking outside of the loop >> (a spin mutex would fix it). >> >> On i386, before SMPng, there was an splhigh() starting before the loop. >> rtcin() also used splhigh(). This was only shortly before SMPng -- >> FreeBSD-4 had both splhigh()'s, but FreeBSD-3 had neither. >> >> With no locking, an interrupt after the read of the status register can >> occur, and interrupt handling can easily take longer than the 240us window >> even if it doesn't cause preemption. >> >> critical_enter() in the above gives poor man's locking which blocks >> preemption but not-so-fast interrupt handlers. >> >> The race is not very important since this code only runs at boot time. >> > > And at resume from sleep/lowpower modes, I think. This is not > something I can test easily, but I see calls to inittodr() from apm and > pmtimer code. Even so, I think it qualifies as called very > infrequently. Right. I finally got jkim to update the corresponding acpi code just last year (I don't understand acpi and try not to touch it). The acpi timer code is better obfuscated than the pmtimer code and still attempts to do less, but the ambitious parts of the pmtimer code don't work anyway and are mostly turned off. The corresponding acpi code was turned off for i386. > That's why I think the right way to fix the coherent reading problem is > not trying harder to stay within a 244us window, but rather to stop > even trying to. Instead, look at that flag only to prevent starting > the operation while an update is in progress, and use the more usual > "loop until you read the same values twice" logic to get a coherent > read. That's especially true considering that taking longer than 244us > to do the reads is only a problem once per second; during the other > 999756us you'll still get coherent values. Good idea. My inittodr() waits for the seconds register to change. This increases boot time by an average of 0.5 seconds, but is needed to get a result that is not out of date by an average of 0.5 seconds. After the seconds register changes, the window for reading the other registers is almost 1 second, but the reads must complete in a few microseconds anyway else the result is still too out of date. Debugging printf()s show that the reads usually complete in 10-50 usec with low latency (faster on old systems with faster ISA buses). Accuracy for reading is not very important because accuracy for writing is very low (1-2 seconds). My resettodr() only writes if the difference is more than 2 seconds. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20170718153823.K1105>