Date: Wed, 7 May 2014 13:47:31 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alan Somers <asomers@freebsd.org>
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r265472 - head/bin/dd
Message-ID: <20140507113345.B923@besplex.bde.org>
In-Reply-To: <201405062206.s46M6dxW060155@svn.freebsd.org>
References: <201405062206.s46M6dxW060155@svn.freebsd.org>
On Tue, 6 May 2014, Alan Somers wrote:

> Log:
>   dd(1) uses gettimeofday(2) to compute the throughput statistics.  However,
>   gettimeofday returns the system clock, which may jump forward or back,
>   especially if NTP is in use.  If the time jumps backwards, then dd will see
>   negative elapsed time, round it up to 1usec, and print an absurdly fast
>   transfer rate.
>
>   The solution is to use clock_gettime(2) with CLOCK_MONOTONIC_PRECISE as the
>   clock_id.  That clock advances steadily, regardless of changes to the system
>   clock.
>
>   Reviewed by:	delphij
>   MFC after:	3 days
>   Sponsored by:	Spectra Logic
> ...
> Modified: head/bin/dd/dd.c
> ==============================================================================
> --- head/bin/dd/dd.c	Tue May  6 22:04:50 2014	(r265471)
> +++ head/bin/dd/dd.c	Tue May  6 22:06:39 2014	(r265472)
> @@ -50,7 +50,6 @@ __FBSDID("$FreeBSD$");
>  #include <sys/conf.h>
>  #include <sys/disklabel.h>
>  #include <sys/filio.h>
> -#include <sys/time.h>
>
>  #include <ctype.h>
>  #include <err.h>
> @@ -61,6 +60,8 @@ __FBSDID("$FreeBSD$");
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
> +#include <sysexits.h>

Use of <sysexits.h> is a style bug.  It is not used in BSD or KNF code
like dd used to be.

> +#include <time.h>
>  #include <unistd.h>
>
>  #include "dd.h"
> @@ -123,7 +124,7 @@ static void
>  setup(void)
>  {
>  	u_int cnt;
> -	struct timeval tv;
> +	struct timespec tv;
>
>  	if (in.name == NULL) {
>  		in.name = "stdin";
> @@ -240,8 +241,9 @@ setup(void)
>  		ctab = casetab;
>  	}
>
> -	(void)gettimeofday(&tv, NULL);

Not checking for errors, and especially voiding the result to say that
errors need not be checked for, was sloppy.

> -	st.start = tv.tv_sec + tv.tv_usec * 1e-6;
> +	if (clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv))
> +		err(EX_OSERR, "clock_gettime");

The existence of CLOCK_MONOTONIC_PRECISE is a design error.  It is the
same as the standard CLOCK_MONOTONIC.  Use of the former is just
gratuitously unportable.  It ensures a compile-time failure on OSes
(including old versions of FreeBSD) that don't have the
CLOCK_MONOTONIC_PRECISE mistake.  It gives a runtime error on old
versions of FreeBSD that have clock_gettime() and CLOCK_MONOTONIC but
not CLOCK_MONOTONIC_PRECISE.  Apart from this error, clock_gettime() is
as likely to fail as gettimeofday(), or exit().
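Something like the following untested sketch would be enough for a
portable version; the helper name is only for illustration, and err(1,
...) is used instead of the sysexits codes:

	#include <err.h>
	#include <time.h>

	/*
	 * Read the monotonic clock as a double, using only the standard
	 * CLOCK_MONOTONIC id.  This builds and runs wherever
	 * clock_gettime() exists, not just on systems that grew the
	 * _PRECISE alias.
	 */
	static double
	monotonic_seconds(void)
	{
		struct timespec ts;

		if (clock_gettime(CLOCK_MONOTONIC, &ts) == -1)
			err(1, "clock_gettime");
		return (ts.tv_sec + ts.tv_nsec * 1.0e-9);
	}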
The existence of CLOCK_MONOTONIC_FAST is a larger design error.  It is
not fast, but just sloppy, broken and not very fast (sloppiness is the
intentional part of its design).  Of course there is no need to use it
to speed up the whole 2 clock_gettime() calls in dd.  It is almost
useless for other applications too.

In non-broken implementations, the speed of CLOCK_* tends to be
dominated by the syscall time even when the timer hardware is slow.  On
freefall now, the implementation is about half as good as possible, and
takes between 32 and 35 nanoseconds for all clock ids, with the least
accurate clock id TIME_SECOND tying with CLOCK_REALTIME for slowest
(only about 1 nanosecond slower) since it has the worst implementation.

(The implementation on freefall avoids syscalls provided the timer
hardware is the TSC.  The hardware timer and binuptime() are used for
all syscalls including TIME_SECOND.  TIME_SECOND is slowest because it
does extra work to zero tv_nsec after calculating it with maximal
precision.  Doing it this way gives the correct time in seconds, but is
incompatible with the kernel (syscall and kernel seconds variable)
since the kernel calculations are sloppy.  The kernel time for seconds
and for the sloppy "FAST" clock ids and for the corresponding kernel
timestamp functions lags the precise time by up to 1/HZ seconds.  This
difference can easily be seen from userland.

The hardware part of the 32-35 nanoseconds is 2-4 nanoseconds.  This is
in sloppy tests.  Reading the TSC is unserialized, and loops reading it
may be pipelined too well to time it accurately.  Also, the times read
from it are not completely accurate.  But clock_gettime() doesn't
serialize it either.  Serializing it would make it much slower (maybe
50 nanoseconds).  2-4 nanoseconds is almost as fast as on AthlonXP 10
years ago!  rdtsc tends to be slower than that on Intel CPUs and on
modern CPUs.  Synchronization in hardware makes it slower on modern
CPUs.  ISTR it took about 65 cycles on a previous generation of Intel
CPUs.  32-35 nanoseconds for the libc implementation is actually less
than half as good as is possible if the hardware part really is only
2-4 nanoseconds.  There is a lot of overhead from layering and from
working around hardware bugs that usually don't exist.)

Bad implementations can be quite bad.  With an i8254 timecounter, it
takes about 5000 nanoseconds to read the hardware.  On a 486, it took
an additional 10000-20000 nanoseconds for overheads including a
syscall.  On a Pentium1, it only took 1000-2000 nanoseconds for
overheads in old versions of FreeBSD (much more now).  With an
ACPI-"fast" timecounter, it takes 1000-2000 nanoseconds to read the
hardware.  HPET is only a few times faster.  This does dominate syscall
times on modern CPUs.  Syscall times are more like 100-200 nanoseconds
on modern CPUs.  The libc implementation reduces this to 30-60
nanoseconds.
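Numbers like these are easy to reproduce with a trivial loop.  A rough,
untested sketch (the iteration count is arbitrary, error checking is
omitted, and the usual caveats about unserialized TSC reads apply):

	#include <stdio.h>
	#include <time.h>

	#define	NITER	10000000

	int
	main(void)
	{
		struct timespec start, stop, ts;
		double elapsed;
		int i;

		/* Time NITER back-to-back clock_gettime() calls. */
		clock_gettime(CLOCK_MONOTONIC, &start);
		for (i = 0; i < NITER; i++)
			clock_gettime(CLOCK_MONOTONIC, &ts);
		clock_gettime(CLOCK_MONOTONIC, &stop);

		elapsed = (stop.tv_sec - start.tv_sec) +
		    (stop.tv_nsec - start.tv_nsec) * 1.0e-9;
		printf("%.1f ns per clock_gettime() call\n",
		    elapsed * 1.0e9 / NITER);
		return (0);
	}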
> +	st.start = tv.tv_sec + tv.tv_nsec * 1.0e-9;
>  }
>
>  static void
>
> Modified: head/bin/dd/misc.c
> ==============================================================================
> --- head/bin/dd/misc.c	Tue May  6 22:04:50 2014	(r265471)
> +++ head/bin/dd/misc.c	Tue May  6 22:06:39 2014	(r265472)
> @@ -40,14 +40,16 @@ static char sccsid[] = "@(#)misc.c	8.3 (
>  __FBSDID("$FreeBSD$");
>
>  #include <sys/types.h>
> -#include <sys/time.h>
>
> +#include <err.h>
>  #include <errno.h>
>  #include <inttypes.h>
>  #include <signal.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
> +#include <sysexits.h>
> +#include <time.h>
>  #include <unistd.h>
>
>  #include "dd.h"
> @@ -56,16 +58,20 @@ __FBSDID("$FreeBSD$");
>  void
>  summary(void)
>  {
> -	struct timeval tv;
> -	double secs;
> +	struct timespec tv, tv_res;
> +	double secs, res;
>
>  	if (ddflags & C_NOINFO)
>  		return;
>
> -	(void)gettimeofday(&tv, NULL);
> -	secs = tv.tv_sec + tv.tv_usec * 1e-6 - st.start;
> -	if (secs < 1e-6)
> -		secs = 1e-6;

Bogus fixup.  secs had microseconds granularity except possibly for
tiny roundoff errors.  I think the roundoff errors cannot give a result
near but not identical to 1e-6 or 0 unless tv_sec has a garbage value.
The difference can be 0 in the unlikely event that the run takes less
than 1 microsecond, or negative when the time goes backwards.  Fixing
up 0 to 1e-6 is reasonable.  Fixing up negative differences to 1e-6
avoids division by 0 but gives garbage results.  I think this was only
intended to fix up 0, and fixing up negative differences is accidental.

> +	if (clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv))
> +		err(EX_OSERR, "clock_gettime");
> +	if (clock_getres(CLOCK_MONOTONIC_PRECISE, &tv_res))
> +		err(EX_OSERR, "clock_getres");

clock_getres() is almost useless, and is useless here.  It is broken as
designed, since the precision may be less than 1 nanosecond but 1
nanosecond is the smallest positive representable value; that is not a
problem here since clock_gettime() also can't distinguish differences
smaller than 1 nanosecond.

> +	secs = tv.tv_sec + tv.tv_nsec * 1.0e-9 - st.start;
> +	res = tv_res.tv_sec + tv_res.tv_nsec * 1.0e-9;

IIRC, POSIX is confused about resolution vs precision and
clock_getres() actually returns the resolution, so this variable has
the correct name.  I quote "precision" in the following to indicate
this confusion.

> +	if (secs < res)
> +		secs = res;

The fixup is now only reachable in 3 cases that can't happen:
- when the monotonic time goes backwards due to a kernel bug
- when the monotonic time doesn't increase, so that the difference is
  0.  Oops, this can happen for timecounters with very low "precision".
  You don't need to know the "precision" to check for this.
- when the monotonic time does increase, but by an amount smaller than
  the "precision".  This indicates that the "precision" is wrong.

In the 3rd case, the actual difference may as well be used.  It is
probably slightly wrong at worst (say 999 nanoseconds when the
"precision" is 1000 nanoseconds).  The timing is going to be very
inaccurate for short runs, so another 10% inaccuracy doesn't matter.
At worst, the difference might be 1 nanosecond when it should be
hundreds of nanoseconds.  (I don't see how this can happen without a
kernel bug.  Just reading the time can take thousands of nanoseconds.)
The result will be garbage, but it won't overflow.

In the 1st case, we don't know the correct fixup, and a tiny value from
clock_getprec() is not as good as anything, except to make the garbage
in the result more obvious.  In the second case, fixing up to the
"precision" may give a large estimate.  The fixup might as well be to a
nominal value like 1 nanosecond or 1 second.  CLOCK_MONOTONIC can't
have a very low precision, and the timing for runs that don't take as
long as a large multiple of the precision is inaccurate.  We could also
report the result as <indeterminate> in this case.
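A simpler variant that drops clock_getres() entirely and only fixes up
a non-positive difference might look like this untested sketch; the
helper name and the 1-nanosecond floor are only illustrative nominal
choices, not what dd does:

	#include <err.h>
	#include <time.h>

	/*
	 * Elapsed seconds since 'start' (itself taken from
	 * CLOCK_MONOTONIC).  No clock_getres(); a non-positive
	 * difference is clamped to a nominal floor so the bytes/sec
	 * division can't blow up.
	 */
	static double
	elapsed_seconds(double start)
	{
		struct timespec tv;
		double secs;

		if (clock_gettime(CLOCK_MONOTONIC, &tv) == -1)
			err(1, "clock_gettime");
		secs = tv.tv_sec + tv.tv_nsec * 1.0e-9 - start;
		if (secs <= 0.0)
			secs = 1.0e-9;	/* nominal; result is garbage anyway */
		return (secs);
	}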
>  	(void)fprintf(stderr,
>  	    "%ju+%ju records in\n%ju+%ju records out\n",
>  	    st.in_full, st.in_part, st.out_full, st.out_part);
> @@ -77,7 +83,7 @@ summary(void)
>  		    st.trunc, (st.trunc == 1) ? "block" : "blocks");
>  	if (!(ddflags & C_NOXFER)) {
>  		(void)fprintf(stderr,
> -		    "%ju bytes transferred in %.6f secs (%.0f bytes/sec)\n",
> +		    "%ju bytes transferred in %.9f secs (%.0f bytes/sec)\n",

Nanoseconds resolution is excessive here, and changes the output
format.  The only use of it is to debug cases where the output is
garbage due to the interval being about 1 nanosecond.  Printing
nanoseconds resolution is also inconsistent with the fussy "precision"
adjustment above.

>  		    st.bytes, secs, st.bytes / secs);
>  	}
>  	need_summary = 0;

In practice, we can depend on at least microseconds "precision" since
even gettimeofday() gave that as a precision, and gettimeofday() was
fixed on at least x86 20-25 years ago so that it had almost the same
"precision" (resolution) as its granularity.

In ping(8), we still just use microseconds resolution and even
non-monotonic gettimeofday().  It only had milliseconds resolution in
BSD 20-25 years ago and in Linux much more recently.  dg fixed it to
use the microseconds resolution that became available when
gettimeofday() was fixed.  clock_gettime() didn't exist then.  I tried
converting it to use nanoseconds resolution and wasn't happy with the
results.  Formats like %.9f are needed to see the full resolution, but
networking latency is still too large, so resolution finer than 1
microsecond is rarely useful, and the extra digits from %.9f format
just make the value harder to read.

Bruce