Date: Wed, 4 Jun 2014 23:12:59 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Bryan Drewery <bdrewery@freebsd.org>
Cc: svn-src-head@freebsd.org, Alexander Motin <mav@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org
Subject: Re: svn commit: r267029 - head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs
Message-ID: <20140604212719.C1160@besplex.bde.org>
In-Reply-To: <538EA8E4.1070300@FreeBSD.org>
References: <201406032106.s53L63oR085624@svn.freebsd.org> <538EA8E4.1070300@FreeBSD.org>
On Wed, 4 Jun 2014, Bryan Drewery wrote:

> On 6/3/2014 4:06 PM, Alexander Motin wrote:
>> Author: mav
>> Date: Tue Jun  3 21:06:03 2014
>> New Revision: 267029
>> URL: http://svnweb.freebsd.org/changeset/base/267029
>>
>> Log:
>>   Replace gethrtime() with cpu_ticks(), as source of random for the taskqueue
>>   selection.  gethrtime() in our port updated with HZ rate, so unusable for
>>   this specific purpose, completely draining benefit of multiple taskqueues.
>>
>>   MFC after:	2 weeks
>>
>> Modified:
>>   head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
>>
>> Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
>> ==============================================================================
>> --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c	Tue Jun  3 21:02:19 2014	(r267028)
>> +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c	Tue Jun  3 21:06:03 2014	(r267029)
>> @@ -953,7 +953,7 @@ spa_taskq_dispatch_ent(spa_t *spa, zio_t
>>  	if (tqs->stqs_count == 1) {
>>  		tq = tqs->stqs_taskq[0];
>>  	} else {
>> -		tq = tqs->stqs_taskq[gethrtime() % tqs->stqs_count];
>> +		tq = tqs->stqs_taskq[cpu_ticks() % tqs->stqs_count];
>>  	}
>>
>>  	taskq_dispatch_ent(tq, func, arg, flags, ent);

Bugs remaining in this include:

- the comment before this function still says that gethrtime() is used:

	/*
	 * Dispatch a task to the appropriate taskq for the ZFS I/O type and priority.
	 * Note that a type may have multiple discrete taskqs to avoid lock contention
	 * on the taskq itself. In that case we choose which taskq at random by using
	 * the low bits of gethrtime().
	 */

- timing functions are a bogus source of random bits.  The
  get_cyclecount() function should be used for this, though its
  existence is a bug and its name and implementation are worse.  On
  some arches (actually just i386, but this should include arm too),
  cpu_ticks() is the same as get_cyclecount().
cpu_ticks() has to be correct for timing, but get_cyclecount() doesn't, so get_cyclecount() can be more optimized; on some arches get_cyclecount() is correct for timing too.  The implementation varies gratuitously with the arch.  Only arm is very bad:

- arm: get_cyclecount() is binuptime() with bogus scaling.  This shows
  another bug in the above change -- depending on the details of the
  ticker and its scaling, there may be no random low bits at all.
  stqs_count is dynamically initialized and it is not clear if it is a
  power of 2, or a prime...  The bogus scaling is:

	binuptime(&bt);
	return ((uint64_t)bt.sec << 56 | bt.frac >> 8);

  The 8 lower bits may or may not be random (or even nonzero), but they
  are always discarded.  The remaining lower bits may or may not be
  random (or even zero).  56 top bits of the seconds part are always
  discarded.  Old versions were better: they xor'ed the seconds and
  fraction parts.  That is inconsistent with the bad function name,
  but better for randomness.

It is still an error to use only the low bits.  Using all the bits modulo a prime would work well:

	tq = tqs->stqs_taskq[cpu_ticks() % BIGPRIME % tqs->stqs_count];

The magic BIGPRIME only needs to be relatively prime to the infinite-precision maxof(typeof(get_cyclecount())) + 1, so it doesn't depend on the details of get_cyclecount().  For general use, BIGPRIME should be 64 bits, but 32 bits should be enough for anyone, and the 64-bit division for excessive generality would be slow.  The existing 64-bit division is already excessively general and gratuitously slow.  Both cpu_ticks() and gethrtime() are 64 bits; the division should have been reduced to 32 bits using (uint32_t)getmumbletime() % count.

When ntpd is active, the lower bits in the fraction normally contain significant randomness from micro-adjustments, so they should not be discarded.
Otherwise, there is really no randomness in the lower bits, but they may change in step with the high bits, depending on the details of the clock frequency and scaling.

- i386: get_cyclecount() is cpu_ticks().  This is different from amd64
  because rdtsc() is not always available, and cpu_ticks() handles the
  details of sometimes using a timecounter.  This is unnecessarily slow
  when cpu_ticks() doesn't use rdtsc().  cpu_ticks() isn't inline, and
  it uses lots of function pointers.  It even uses a function pointer
  for rdtsc().

cpu_ticks() is much more important than get_cyclecount().  It should be optimized.  arm should use it even if it is not optimized.

get_cyclecount() is abused as a cycle count in:

- netinet/sctp_os_bsd.h.  microtime() with an x86 TSC timecounter is
  plenty fast and accurate enough for timestamping 1 Gbps ethernet.
  (I sometimes use nanotime(), but find that actually using the
  nanoseconds resolution usually just makes the output too verbose,
  with noise in the last 3 decimal digits, so the output formatting
  routine needs to reduce to microseconds.)

- kern/ktr_time.c.  The undocumented, unsupported option KTR_TIME
  defaults to get_cyclecount().  KTR needs to timestamp low-level
  functions, so the timestamp function needs to be lower-level, and
  nanotime() might not be low enough.  However, i386 sometimes uses
  the timecounter, and arm always does, so nanotime() must be low
  enough.  nanotime() also gives nicely biased and scaled timestamps.
  KTR has null support for scaling the timestamps that it makes.  In
  the kernel, it just prints raw values in %10.10lld format.
  ktrdump(1) is further from knowing how to scale them.  It just
  prints them, or differences of them, in %16ju format.  16 is
  excessively wide, yet not wide enough to print UINT64_MAX (that
  takes 20 decimal digits; only 16 hex digits).  At least it doesn't
  use the long long abomination.

- kern/subr_bus.c.  get_cyclecount() is not abused, but its result is
  assigned to a variable named attachtime.
This is fed to random_harvest() as the variable named *entropy, so it is clearly random and not a time.

The existence of get_cyclecount() is a bug, since all uses are satisfied with a timestamp, and it is better to optimize the timestamp functions than to have special versions of them just for randomness.  It was created before cpu_ticks() existed.  cpu_ticks() is a fast enough timestamp function for all random uses (it basically uses a cycle counter if available, else falls back to using the timecounter).

However, the bugs pointed out above show that it is confusing to use functions whose names are related to timers as a source of randomness.  A name with "random" in it would be better.  Also, some arches may have a random register (in the CPU, so it is fast to read) but no x86-style cycle counter or fast timecounter.  These arches should use the random register for the random function.  The random register just doesn't need to be very wide or crypto quality.

Some of the bugs are documented in get_cyclecount(9).  It is documented as monotonically increasing, although the xor version on arm was far from that and the current version wraps every 256 seconds.

Bruce