Date: Thu, 8 May 2014 17:17:42 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alan Somers <asomers@freebsd.org>
Cc: "svn-src-head@freebsd.org" <svn-src-head@freebsd.org>,
    "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>,
    "src-committers@freebsd.org" <src-committers@freebsd.org>,
    Bruce Evans <brde@optusnet.com.au>
Subject: Re: svn commit: r265472 - head/bin/dd
Message-ID: <20140508171730.T1548@besplex.bde.org>
In-Reply-To: <CAOtMX2h_%2B1G18Nv5JvDE0H7_TZ96p81JotOwhq1Jm-dOOeahPw@mail.gmail.com>
References: <201405062206.s46M6dxW060155@svn.freebsd.org> <20140507113345.B923@besplex.bde.org> <CAOtMX2h_%2B1G18Nv5JvDE0H7_TZ96p81JotOwhq1Jm-dOOeahPw@mail.gmail.com>
On Wed, 7 May 2014, Alan Somers wrote:

> On Tue, May 6, 2014 at 9:47 PM, Bruce Evans <brde@optusnet.com.au> wrote:
>> On Tue, 6 May 2014, Alan Somers wrote:

This is about some minor details that I didn't reply to, saved for
later followups.

>>> +	if (clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv))
>>> +		err(EX_OSERR, "clock_gettime");
>>> +	if (clock_getres(CLOCK_MONOTONIC_PRECISE, &tv_res))
>>> +		err(EX_OSERR, "clock_getres");
>>
>> clock_getres() is almost useless, and is useless here.  It is broken
>> as designed, since the precision may be less than 1 nanosecond but
>> 1 nanosecond is the smallest positive representable value, but that
>> is not a problem here since clock_gettime() also can't distinguish
>> differences smaller than 1 nanosecond.
>
> Since it's reporting the clock resolution and not precision, and since
> clock_gettime() only reports with 1ns resolution, I don't think it's a
> problem for clock_getres to report with 1ns resolution too.

I got most of the backwardness backwards.  The syscall is
clock_getres(), not clock_getprec(), and the variable name matches
this.  But what it returns is the precision.  The resolution is just
that of a timespec (1 nanosecond).  No API is needed to report this.
APIs are needed to report:
- the precision.  The API is misnamed clock_getres().
- the granularity.  This is the minimum time between successive
  measurements.  It can be determined by actually doing some
  measurements (see the sketch below).
- the accuracy.  No API is available.

For clocks based on timecounters, we use the timecounter clock period
rounded up to nanoseconds for the precision.  With a TSC, this is
always 1 nanosecond above 1GHz.  dd needs more like the granularity
than the precision, but it doesn't really matter, since the runtime
must be much larger than the granularity for the statistics to be
accurate, and usually is.

>> The fixup is now only reachable in 3 cases that can't happen:
>> - when the monotonic time goes backwards due to a kernel bug
>> - when the monotonic time doesn't increase, so that the difference is 0.
>>   Oops, this can happen for timecounters with very low "precision".
>>   You don't need to know the "precision" to check for this.
>
> On my Xeon E5504 systems, I can see adjacent calls to clock_gettime
> return equal values when using one of the _FAST clocks.  It can't be
> proven that this case will never happen with any other clock either,
> so the program needs to handle it.

Hrmph.  This is either from the design error of the existence of the
_FAST clocks, or from the design error of the existence of TSC-low.

First, the _FAST clocks are only supposed to have a resolution of
1/hz.  clock_getres() is quite broken here.  It returns the
timecounter precision for the _FAST clocks too.  Also, if it returned
1/hz, then it would be inconsistent with the libc implementation of
clock_gettime().  The latter gives the timecounter precision.

TSC-low intentionally destroys the hardware TSC precision by right
shifting, due to FUD and to handle a minor problem above 4GHz.  The
shift used to be excessive in most cases.  On freefall it used to be
about 7, so the precision of ~1/2.67 nsec was reduced to ~48 nsec.
This was easy to see in test programs for such things.  Now the shift
is just 1.  Since 1<<1 is less than 2.67, the loss of precision from
the shift is less than the loss of precision from converting from
bintimes to timespecs.  The shift is still a pessimization.

sysctl has a read-only tunable kern.timecounter.tsc_shift.  Use of
this seems to be quite broken.
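
(Aside: the granularity mentioned above can be measured directly.  A
minimal sketch, not dd's code; it uses plain CLOCK_MONOTONIC rather
than dd's CLOCK_MONOTONIC_PRECISE, and a single iteration is not
statistically meaningful.)

#include <err.h>
#include <stdio.h>
#include <time.h>

/*
 * Estimate the clock granularity: spin until clock_gettime() returns
 * a value different from the first one and print the difference.
 */
int
main(void)
{
	struct timespec t0, t1;

	if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
		err(1, "clock_gettime");
	do {
		if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
			err(1, "clock_gettime");
	} while (t1.tv_sec == t0.tv_sec && t1.tv_nsec == t0.tv_nsec);
	printf("granularity ~ %lld ns\n",
	    (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
	    (t1.tv_nsec - t0.tv_nsec));
	return (0);
}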
The shift count is determined dynamically, and the tunable barely
affects this.  The active shift count is not written back to the
tunable, so you can't easily see what it is.  However, the shift count
is now always 1 except in exceptional cases.  The tunable defaults to
1.  This is for CPU speeds between 2GHz and 4GHz, to implement the
support for the FUD at these speeds.  Above 4GHz, the shift is
increased to 2 without changing the tunable.  Above 8GHz, the shift is
increased to 3.  That can't happen yet, but you can tune higher to get
a higher shift count at lower speeds.  You can also tune to 0 to avoid
the shift up to 4GHz.

The shift, together with some fencing pessimizations that are not even
done in the kernel version (only libc), is due to FUD.  rdtsc is not a
serializing instruction, so its direct use may give surprising
results.  I think it is serialized with respect to itself on the same
CPU.  It is obviously not serialized with respect to other
instructions on the same CPU.  So it doesn't work properly in code
like "rdtsc; <save results>; v++; rdtsc; <compare results>", even with
quite a bit more than v++ between the rdtsc's.  Normally there is much
more than v++ between rdtsc's, so code like this works well in
practice.

When the rdtsc's are on separate CPUs, it is just a bug to depend on
their order unless there are synchronization instructions for more
than the rdtsc's.  The old kernel code is sloppy about such things.
It tries to do everything without atomic locking or mutexes.  This
mostly works, but I think it depends on the slowness of syscalls and
locking in unrelated code for some corner cases.  Syscalls put
hundreds or thousands of instructions between successive timecounter
hardware reads, so even if these reads are done on different CPUs the
first one has had plenty of time to complete.  Also, one CPU's TSC is
acausal with respect to another's; the difference is hopefully a
backwards step of at most a couple of cycles.  This would be lost in
the noise of the hundreds or thousands of cycles for the slow
syscalls.  Also, any context switch will do lots of locking operations
that may synchronize the rdtsc's.

There is official FUD about some of these problems.  An early "fix"
was to shift the TSC count.  I think this "works" just by breaking the
precision of the counter enough for backwards steps to be invisible in
most cases.  A large shift count of 7 reduces the precision to 128
cycles.  That should hide most problems.  But I think it only works in
about 127 of 128 problem cases if the problem is an acausality of 2
cycles.  Suppose CPU1 reads the TSC at time 128 and sees 128, and CPU2
reads the TSC at time 129 and sees 127.  CPU2 does the read later but
sees an earlier time.  I chose the times near a multiple of 128 so
that even rounding to a multiple of 128 doesn't fix the problem.  The
current normal shift count of 1 can't hide so many problem cases.  It
can probably hide more than a few cycles of acausality, since the
shift instruction itself is so slow (several cycles).

libc worries about the locking problems more than the kernel, and uses
some fence instructions.  Oops, these are in the kernel now.  They are
easier to see in the kernel too (they are spelled as *fence in asm
there, but as rmb() in libc).  Fence instructions don't serialize
rdtsc, but may be needed for something.  The rmb()'s in libc are
replacements for atomic ops.  Such locking operations are
intentionally left out of the software parts of the kernel since the
algorithm is supposed to work without them (it only clearly works for
UP in-order).
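
For illustration only (my sketch with gcc/clang x86 intrinsics, not
the kernel's or libc's actual code), a fenced user-level TSC read
looks something like this:

#include <stdint.h>
#include <x86intrin.h>		/* __rdtsc() and _mm_lfence() */

/*
 * rdtsc is not serializing, so a bare __rdtsc() may be reordered
 * around nearby loads.  Issuing lfence first keeps the counter read
 * from being hoisted above earlier loads.  (The kernel picks lfence
 * or mfence per CPU, as noted below.)
 */
static inline uint64_t
fenced_rdtsc(void)
{

	_mm_lfence();
	return (__rdtsc());
}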
However, the kernel now gets locking operations (mfence or lfence) in
some TSC read functions, depending on the CPU (fences are mostly
selected according to whether the CPU supports SSE2; lfence is
preferred, but mfence is used on AMD CPUs for some reason).  There is
lots of bloat to support this.  libc only has the shifting
pessimization.

>> - when the monotonic time does increase, but by an amount smaller than
>>   the "precision".  This indicates that the "precision" is wrong.

We have the reverse bug that the precision is too small for the _FAST
syscall case.  The precision is adjusted to match the shifts.

>> In the second case, fixing up to the "precision" may give a large
>> estimate.  The fixup might as well be to a nominal value like 1
>> nanosecond or 1 second.  CLOCK_MONOTONIC can't have a very low
>> precision, and the timing for runs that don't take as long as a
>> large multiple of the precision is inaccurate.  We could also
>> report the result as <indeterminate> in this case.
>
> The second case is the one I'm most concerned about.  Assuming that
> the precision is correct, clock_getres() seems like the best value
> for the fixup.  Anything less than the reported precision would be
> unnecessarily small and give unnecessarily inaccurate results.
> Anything greater would make an implicit and unportable assumption
> about the speed of the hardware.  Do you really think it's a problem
> to fixup to clock_getres() ?

And it is the case broken for the _FAST syscall case (except this case
shouldn't exist, and dd doesn't use it).  Then the time only changes
every 1/hz seconds and the fixup converts differences of nearly 1/hz
seconds (but 0 due to the granularity) to 1 nanosecond (for x86 with
TSC).  With hz = 1000, the error is a factor of 1000000.

I would just use an arbitrary fixup.  I think I pointed out that
ping(8) doesn't worry about this.  It just assumes that the precision
of gettimeofday() is the same as its resolution (1 usec) and that no
times of interest below 1 usec occur (not quite true, since ping
latency is in the microseconds range and you can do very short tests
using ping -fq -c1).

>>> @@ -77,7 +83,7 @@ summary(void)
>>>  	    st.trunc, (st.trunc == 1) ? "block" : "blocks");
>>>  	if (!(ddflags & C_NOXFER)) {
>>>  		(void)fprintf(stderr,
>>> -		    "%ju bytes transferred in %.6f secs (%.0f bytes/sec)\n",
>>> +		    "%ju bytes transferred in %.9f secs (%.0f bytes/sec)\n",
>>
>> nanoseconds resolution is excessive here, and changes the output
>> format.  The only use of it is to debug cases where the output is
>> garbage due to the interval being about 1 nanosecond.  Printing
>> nanoseconds resolution is also inconsistent with the fussy
>> "precision" adjustment above.
>
> The higher resolution printf doesn't conflict with the resolution
> adjustment above.  Freefall actually reports 1ns resolution.  But I
> can buy that it's not useful to the user.  Would you like me to
> change it back to %.6 ?

Yes, just change back.  %.6f is probably excessive too.  4.4BSD uses
just seconds and %u.

> Even if nanosecond resolution isn't useful, monotonicity is.  Nobody
> should be using a nonmonotonic clock just to measure durations.  I
> started an audit of all of FreeBSD to look for other programs that
> use gettimeofday to measure durations.  I haven't finished, but I've
> already found a lot, including xz, ping, hastd, fetch, systat,
> powerd, and others.  I don't have time to fix them, though.  Would
> you be interested, or do you know anyone else who would?

There are indeed a lot.  Too many for me to fix :-).
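
For reference, the pattern those programs need is small.  A minimal
sketch (my illustration, not any particular program's code), using an
arbitrary 1-nanosecond floor for a zero difference as discussed above:

#include <err.h>
#include <time.h>

/*
 * Measure an elapsed time for statistics: use CLOCK_MONOTONIC, never
 * gettimeofday(), and clamp a zero difference to an arbitrary small
 * value so a later bytes/sec division can't blow up.
 */
static double
elapsed_seconds(const struct timespec *start)
{
	struct timespec now;
	double secs;

	if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
		err(1, "clock_gettime");
	secs = (now.tv_sec - start->tv_sec) +
	    (now.tv_nsec - start->tv_nsec) * 1e-9;
	if (secs <= 0)
		secs = 1e-9;		/* arbitrary fixup */
	return (secs);
}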
The problem is limited, since for short runs the realtime isn't
stepped, and for long runs the real time may be more appropriate.

Hmm, cron uses CLOCK_REALTIME, sleep(1 or 60) and nanosleep(at most
600), while crontab uses gettimeofday() and sleep(1).  It has real
problems that are hopefully mostly avoided by using short sleeps and
special handling for minute rollovers.  Realtime is appropriate for
it.

It is unclear what time even sleep() gives.  It should sleep on
monotonic time that is not broken by suspension, but sleep() is too
old for POSIX to say anything about that.  POSIX mentions the old
alarm() implementation.  FreeBSD now implements it using nanosleep(),
but nanosleep() is specified to sleep on CLOCK_REALTIME.  Oops, I
found some POSIX words that may allow not-so-bizarre behaviour for
nanosleep().  From an old draft:

% 6688 CS If the value of the CLOCK_REALTIME clock is set via clock_settime( ), the new value of the clock
% 6689 shall be used to determine the time at which the system shall awaken a thread blocked on an
% 6690 absolute clock_nanosleep( ) call based upon the CLOCK_REALTIME clock. If the absolute time
% 6691 requested at the invocation of such a time service is before the new value of the clock, the call
% 6692 shall return immediately as if the clock had reached the requested time normally.
% 6693 Setting the value of the CLOCK_REALTIME clock via clock_settime( ) shall have no effect on any
% 6694 thread that is blocked on a relative clock_nanosleep( ) call. Consequently, the call shall return
% 6695 when the requested relative interval elapses, independently of the new or old value of the clock.

So for a relative clock_nanosleep(), even when the clock id is
CLOCK_REALTIME, stepping the clock doesn't affect the interval.  But
it is now unclear on which clock the interval is measured.  And what
happens for leap seconds, where the clock is stepped by a non-POSIX
method?

For nanosleep():

% 26874 system. But, except for the case of being interrupted by a signal, the suspension time shall not be
% 26875 less than the time specified by rqtp, as measured by the system clock, CLOCK_REALTIME.
% 26876 The use of the nanosleep( ) function has no effect on the action or blockage of any signal.

Here there is no mention of stepping the time, and no option to
measure the time by a clock other than CLOCK_REALTIME.  Does
CLOCK_REALTIME "measure" the time across steps?

Bruce