From: Jung-uk Kim <jkim@FreeBSD.org>
To: Bruce Evans
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org,
 src-committers@FreeBSD.org
Date: Tue, 15 Mar 2011 15:26:11 -0400
Subject: Re: svn commit: r219646 - head/sys/x86/isa

On Tuesday 15 March 2011 02:13 pm, Bruce Evans wrote:
> On Tue, 15 Mar 2011, Jung-uk Kim wrote:
> > On Monday 14 March 2011 10:31 pm, Bruce Evans wrote:
> >> On Mon, 14 Mar 2011, Jung-uk Kim wrote:
> >>> Log:
> >>>   When TSC is unavailable, broken or disabled and the current
> >>>   timecounter has better quality than i8254 timer, use it for
> >>>   DELAY(9).
> >>
> >> You cannot use a random timecounter for DELAY().  DELAY() is
> >> carefully written to not use any locks, so that it doesn't
> >> deadlock when called in the "any" context from a bad console
> >> driver like syscons when syscons is called from ddb.  (Bad
> >> console drivers may have their own locks which deadlock in ddb,
> >> but that is another problem.)  Even the i8254 timer, which is
> >> the only timer that should be used by DELAY(), normally uses
> >> locks that used to cause deadlock, but DELAY() was unbroken to
> >> bypass the locks when it is called from ddb.
> >
> > I disagree.  I think we should keep away from the i8254 as much
> > as possible.
>
> It is adequate for DELAY(), and is the only timer that is available
> on all x86.  You need large complications in DELAY() to make it
> work as well as the old code using the i8254, for no gain until
> there is an x86 without an i8254.

Intel started killing off the i8254 (and other "legacy" ISA devices),
starting with the "Mobile Internet Device" platforms.  Even on other
so-called "legacy-free" platforms, it isn't adequate any more because
of unpredictable SMI# interference.

> > Actually, the i8254 is the only timer that requires locking in
> > the first place, because of its braindead design.  All other x86
> > timers do not require any locking at all.  We even sacrificed the
> > upper 32 bits of HPET in favor of i386. :-(
>
> Did we?  Timecounters use only 32 bits as an optimization.  Scaling
> of 64-bit values is slow even on 64-bit machines since it needs
> 128-bit intermediate values for the multiplication/shift method to
> work.

No, I am talking about reading the raw value of the HPET.  For amd64,
it can be used as a good alternative to the TSC on legacy-free
platforms if we detect it earlier.  Unfortunately, it is only useful
as a timecounter ATM.
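[Aside: a rough sketch of the multiplication/shift method and of why
a 32-bit delta is the cheap case.  The names are invented and this is
not the kern_tc.c code; scale is assumed to be 2^64 / frequency, and
__uint128_t is a gcc/clang extension on 64-bit targets.]

	#include <stdint.h>

	/*
	 * 32-bit delta, as the timecounter code uses: as long as the
	 * delta covers less than one second of ticks (the periodic
	 * tc_windup() keeps it that small), scale * delta fits in 64
	 * bits, so one plain multiply is enough.  The result is the
	 * elapsed time in units of 2^-64 seconds.
	 */
	static uint64_t
	sketch_scale32(uint32_t delta, uint64_t scale)
	{

		return (scale * (uint64_t)delta);
	}

	/*
	 * 64-bit delta: the product no longer fits in 64 bits, so a
	 * 128-bit intermediate (or a multi-word multiply by hand) is
	 * needed -- the cost referred to above.
	 */
	static uint64_t
	sketch_scale64(uint64_t delta, uint64_t scale, uint64_t *secp)
	{
		__uint128_t prod;

		prod = (__uint128_t)scale * delta;
		*secp = (uint64_t)(prod >> 64);	/* whole seconds */
		return ((uint64_t)prod);	/* fractional part */
	}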
> >> Cpufreq and other calibration code should use a normal
> >> timecounter via nanouptime() in all cases and never go near low
> >> level timecounters or DELAY().  There are (hopefully minor)
> >> problems getting the nanotime() initialized before it is used.
> >> The dummy timecounter is just what you don't want to use.
> >
> > That's exactly the problem.
>
> Your fix won't make much difference then.  DELAY() will start with
> a dummy timecounter and probably also a zero tsc_freq, so it will
> use the i8254 for various things, probably including initial
> calibration of the TSC/cpufreq.  Later it may see a better
> timecounter and use that to get a better calibration of the
> TSC/cpufreq.  But its caller can just as easily check for a better
> timecounter and not use DELAY() if it can use a timecounter,
> without messing with DELAY()'s internals and having to guard
> against reentrancy from ddb.  delay_timecounter() already has all
> the code for this, except the check for a better timecounter is
> external.

Yes, it doesn't make much difference for that ATM.  However, it DOES
make a difference when DELAY() is used by device drivers.  What I
really want is for DELAY() to do the right thing, e.g., hold no spin
lock, not be pinned to a CPU, no tsc_freq hackery, and no side
effects of any kind.

> >> Even in this patch, it isn't clear that the low level
> >> timecounters are initialized before they are used.
> >
> > The timecounter is always initialized first, then set.
>
> It is still unclear whether they are initialized enough before they
> are used, since their setting is neither locked nor atomic.
> Consider the TSC initialization.  This is init_TSC() from
> startrtclock() and init_TSC_tc() from cpu_initclocks().  The first
> should make the low-level timecounter usable, but this is unclear.
> The second attaches the low-level timecounter to the high-level
> timecounter data structures.  There is no locking for this except
> possibly Giant.  tsc_freq is initialized early, so the tests of it
> will succeed long before the TSC can become the high-level
> timecounter.  It is probably usable at a low level, but this is
> unclear.  tsc_freq is 64 bits, and the accesses to it are
> non-atomic, so especially on 32-bit machines there are races
> accessing it.  In fact, I can now see a reproducible bug in DELAY():
> - find an x86 that can run at 4GHz
> - run it in 386 mode
> - start it so that tsc_freq is 0x100000000ULL
> - use the machdep.tsc_freq sysctl to tune tsc_freq to 0xFFFFFFFF

machdep.tsc_freq should never have modified tsc_freq in the first
place.  I am going to remove that soon.

> - trace through this sysctl using gdb, and using a low quality
>   console driver like syscons which calls DELAY().
> - the 2 words in tsc_freq will normally be written from the low
>   word up.  It will be seen to have value 0x1FFFFFFFFULL after only
>   the first word is written.  Not too bad -- just off by a factor
>   of 2.
> - I can't quite see how to make it have a really broken value.  At
>   first I tried an example with it starting with a value of
>   0xFFFFFFFF and changing it to 0x100000000ULL.  Then after
>   changing only its low word, it is 0, so DELAY() will not use the
>   TSC.
> Related non-reproducible bugs are also easy to see:
> - on one CPU, spin calling the sysctl to toggle tsc_freq between
>   4G-1 and 4G
> - on another CPU, spin doing something that calls DELAY().  Since
>   there is no locking or atomic accesses for the writes to tsc_freq
>   in the sysctl or for the reads of it in DELAY(), not to mention
>   elsewhere, tsc_freq can read as zero when it is being changed
>   between 2 nonzero values.  Not too bad if it reads as zero --
>   then it won't be used.  But uses of it following it being read as
>   nonzero may find it with changed values that are very bad, for
>   example zero.
> These bugs are in the old code in DELAY() that uses the TSC, and
> perhaps in all code of the form "if (tsc_freq != 0) use()".
> Fixes for them should probably use a generation count.
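[Aside: to make the torn update concrete, here is a rough sketch of
what the write looks like on i386.  The names are invented and this
is only an illustration, not kernel code; a real fix would also need
the generation count suggested above (or atomic 64-bit accesses) plus
the appropriate barriers.]

	#include <stdint.h>

	/*
	 * A 64-bit "tsc_freq" as a 32-bit machine actually stores it:
	 * two 32-bit words with no atomicity between them.
	 */
	static volatile union {
		uint64_t	u64;
		uint32_t	u32[2];	/* [0] is the low word on x86 */
	} sketch_tsc_freq = { .u64 = 0x100000000ULL };	/* 4GHz */

	static void
	sketch_set_tsc_freq(uint64_t freq)	/* e.g. 0xFFFFFFFF */
	{

		sketch_tsc_freq.u32[0] = (uint32_t)freq; /* low word first */
		/*
		 * A reader sampling the 64-bit value here sees the new
		 * low word and the old high word: 0x1FFFFFFFF, about
		 * twice the new frequency -- the factor-of-2 error
		 * described above.
		 */
		sketch_tsc_freq.u32[1] = (uint32_t)(freq >> 32);
	}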
> init_TSC_tc() finishes with tc_init(&tsc_timecounter), which does
> most of the work for attaching to the higher-level timecounter
> structures.  tc_init() has no locking or atomic operations, which I
> think gives it races in some contexts, but not in ones involving
> single stepping it using ddb, since then there are no other CPUs to
> race with.  The races are the potentially-unordered writes of all
> the variables set by tc_init().

Now don't you think we should really kill delay-by-TSC? ;-)

> >>> Modified: head/sys/x86/isa/clock.c
> >>> ==============================================================================
> >>> --- head/sys/x86/isa/clock.c	Mon Mar 14 19:31:43 2011	(r219645)
> >>> +++ head/sys/x86/isa/clock.c	Mon Mar 14 22:05:59 2011	(r219646)
> >>> @@ -245,6 +245,42 @@ getit(void)
> >>>  	return ((high << 8) | low);
> >>>  }
> >>>
> >>> +static __inline void
> >>> +delay_tsc(int n)
> >>> +{
> >>> +	uint64_t start, end, now;
> >>> +
> >>> +	sched_pin();
> >>> +	start = rdtsc();
> >>> +	end = start + (tsc_freq * n) / 1000000;
> >>> +	do {
> >>> +		cpu_spinwait();
> >>> +		now = rdtsc();
> >>> +	} while (now < end || (now > start && end < start));
> >>> +	sched_unpin();
> >>> +}
> >>
> >> You cannot call random scheduling code from DELAY(), since the
> >> scheduling code is not designed to be called in the "any"
> >> context.  As it happens, sched_pin() and sched_unpin() are safe,
> >> and were already used in the TSC case.
> >
> > Actually, I would really like to get rid of this code.  It isn't
> > safe for many reasons, e.g., its frequency may change while in
> > the loop, and the end result is unpredictable if the TSC is not
> > invariant.  We may limit its use to the invariant case, though.
>
> Me too.  It can also be used if the TSC is the timecounter, since
> if it is good enough for a timecounter then it is good enough for
> this.  But it is a lot of work for no benefit to make all cases
> work.
>
> >>> +static __inline void
> >>> +delay_timecounter(struct timecounter *tc, int n)
> >>> +{
> >>> +	uint64_t end, now;
> >>> +	u_int last, mask, u;
> >>> +
> >>> +	mask = tc->tc_counter_mask;
> >>> +	last = tc->tc_get_timecount(tc) & mask;
> >>> +	end = tc->tc_frequency * n / 1000000;
> >>
> >> This depends on the delicate timecounter locking to be safe.  I
> >> think it is safe except possibly in tc_get_timecount(), for the
> >> same reasons that nanouptime() can run safely with no visible
> >> locking.  tc must be the current timecounter/timehands, which is
> >> guaranteed to not be in an in-between state or become so while
> >> we are using it, despite there being no visible locking.
> >> Actually there are some minor races:
> >> - in nanouptime(), the timecounter generation is checked to
> >>   avoid using a new generation if necessary, but updates of the
> >>   generation count and the fields that it protects are not
> >>   properly atomic.
> >> - here, the check of the generation is missing.
> >> So this works without races when called from ddb (since ddb
> >> stops other CPUs), but has minor races in general.
> >
> > The window of the races is very narrow here, and it only happens
> > on i386, where reading a uint64_t is not atomic, if my
> > understanding is correct.
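[Aside: for reference, here is roughly the shape of the generation
check that nanouptime() and friends use, adapted to a bare counter
read.  This is a simplified sketch from memory, not the kern_tc.c
code: memory barriers are omitted, and since 'timehands' is static to
kern_tc.c, doing this from clock.c would need a small helper exported
from there.]

	static u_int
	sketch_get_timecount(void)
	{
		struct timehands *th;
		struct timecounter *tc;
		u_int count, gen;

		do {
			th = timehands;		/* current timehands */
			gen = th->th_generation;
			tc = th->th_counter;
			count = tc->tc_get_timecount(tc) &
			    tc->tc_counter_mask;
			/* retry if the timehands changed underneath us */
		} while (gen == 0 || gen != th->th_generation);
		return (count);
	}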
>
> I can easily make race windows very wide by single stepping them,
> and usually do for testing code like this :-).
>
> > It cannot be any worse than tsc_freq anyway. ;-)
>
> Yes, but it extends old mistakes.
>
> >> Optimizing DELAY() was bogus.  Now the loop for this is
> >> duplicated.
> >
> > "Bogus" optimization is not exactly duplicated here, if you meant
> > that "(now < end || (now > start && end < start))" is bogus.  Its
> > bogusness comes from the fact that the TSC is 64-bit.  However, a
> > timecounter is effectively 32-bit because of its mask.  Now
> > overflow is carefully accounted for, and it is (almost) correct.
>
> I meant not using the i8254.  Even DELAY()'s API limits it to 1 us
> resolution.  Who cares if the hardware read extends this to 5 us?

Ah, I see.

> Duplication can be avoided by replacing the rdtsc() by conditional
> code like (using_tsc ? rdtsc() : (tc->tc_get_timecount(tc) & mask))
> in the loop.  We have a 1 us resolution and a pause in the loop, so
> we don't care that the loop takes a little longer due to the
> conditional in it.

Agreed.

> My userland code for this does a full sysctl to read an emulated
> timecounter for this.  Its loop is:
>
> %	binuptime(&start_bt);
> %	start_tsc = rdtsc();
> %	do {
> %		binuptime(&sample_bt);
> %		tsc = rdtsc();
> %		binuptime(&bt);
> %		bintime_sub(&bt, &sample_bt);
> %		if (bintimecmp(&bt, &best_bt, <)) {
> %			best_bt = bt;
> %			best_sample_bt = sample_bt;
> %			best_tsc = tsc;
> %		}
> %		bt = sample_bt;
> %		bintime_sub(&bt, &start_bt);
> %	} while (++nsamples < min_nsamples ||
> %	    bintimecmp(&best_bt, min_error_btp, >));
>
> binuptime() is emulated so that it has the same API as the kernel.
> It is actually clock_gettime(CLOCK_MONOTONIC, &ts) followed by
> error handling and conversion to bintime.  The loop is missing the
> equivalent of sched_pin().  I don't know exactly how to do that in
> userland.  Is there a single syscall that does it?  I used
> cpuset(1).

Not that I know of.

Jung-uk Kim
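[Aside: for the archives, a rough sketch of the merged loop suggested
above.  This is only an illustration, not the committed code:
"using_tsc" is assumed to have been decided by the caller, and the
tsc_freq / tc_frequency reads are still subject to the races
discussed earlier.]

	static void
	sketch_delay(int usec, int using_tsc, struct timecounter *tc)
	{
		uint64_t end, freq, now;
		u_int last, mask, u;

		freq = using_tsc ? tsc_freq : tc->tc_frequency;
		mask = using_tsc ? 0xffffffffu : tc->tc_counter_mask;
		end = freq * usec / 1000000;	/* ticks to wait */
		last = (using_tsc ? (u_int)rdtsc() :
		    tc->tc_get_timecount(tc)) & mask;
		now = 0;
		while (now < end) {
			cpu_spinwait();
			u = (using_tsc ? (u_int)rdtsc() :
			    tc->tc_get_timecount(tc)) & mask;
			/*
			 * Accumulate masked deltas so a narrow counter
			 * wrapping is handled, assuming it is polled
			 * more often than it wraps.
			 */
			now += (u - last) & mask;
			last = u;
		}
	}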