From owner-freebsd-arch@FreeBSD.ORG Fri Jun 1 21:21:55 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CB19D106564A; Fri, 1 Jun 2012 21:21:55 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail06.syd.optusnet.com.au (mail06.syd.optusnet.com.au [211.29.132.187]) by mx1.freebsd.org (Postfix) with ESMTP id 2BAE98FC0C; Fri, 1 Jun 2012 21:21:55 +0000 (UTC) Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au (c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232]) by mail06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q51LLjQW002570 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 2 Jun 2012 07:21:46 +1000 Date: Sat, 2 Jun 2012 07:21:45 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Giovanni Trematerra In-Reply-To: Message-ID: <20120602044306.S4049@besplex.bde.org> References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Attilio Rao , alc@freebsd.org, Konstantin Belousov , Alexander Kabaev , freebsd-arch@freebsd.org Subject: Re: [RFC] Kernel shared variables X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Jun 2012 21:21:55 -0000 On Fri, 1 Jun 2012, Giovanni Trematerra wrote: > I'd like to discuss a way to provide a mechanism to share some read-only > data between kernel and user space programs avoiding syscall overhead, > implementing some them, such as gettimeofday(3) and time(3) as ordinary > user space routine. This is particularly unsuitable for implementing gettimeofday(), since for it to work you would need to use approximately 1 CPU spinning in the kernel to update the time every microsecond. For time(3), it only needs a relatively slow update. For clock_gettime() with nansoeconds precision, it is even more unsuitable. For clock_gettime() with precisions between 1 second and 1 microseconds, it is intermediately unsuitable. It also requires some complications for locking/atomicity and coherency (much the same as in the kernel. Not just for times. For times, the kernel handles locking/atomicity fairly well, and coherency fairly badly. > The patch at > http://www.trematerra.net/patches/ksvar_experimental.patch > > is in a very experimental stage. It's just a proof-of-concept. > Only works for an AMD64 kernel and only for 64-bit applications. > The idea is to have all the variables that we want to share between kernel > and user space into one or more consecutive pages of memory that will be > mapped read-only into every running process. At the start of the first > shared page > there'll be a table with as many entries as the number of the shared variables. > Each entry is a 32-bit value that is the offset between the start of the shared > page and the start of the variable in the page. The user space processes need > to find out the map address of shared page and use the table to access to the > shared variables. On amd64, 2 32-bit values or 64-bit values with most bits 0 or 1 can be packed/encoded into 1 64-bit value to give a certain atomicity without locking. The corresponding i386 packing into 1 32-bit value doesn't work so well. > ... > Just as proof of concept I re-implemented gettimeofday(3) in user space. > First of all I didn't remove the entry into the syscall.master, just renamed the > sys_gettimeofday. I need it for the fallback path. > In the kernel I introduced a struct wall_clock. > > +struct wall_clock > +{ > + struct timeval tv; > + struct timezone tz; > +}; This is much larger than 64 bits. struct timezone is relatively unimportant. struct timeval is bloated on amd64 (128 bits), but can be packed into 64 bits (works for a few hundred years). On i386, it could be packed into 20 bits for tv_usec and 12 bits for an offset for tv_sec. > Now we defined a shared variable at index WALL_CLOCK_INDEX of type > struct wall_clock and named wall_clock. > Inside handleevents I update the info exported by wall_clock. > > + struct timeval tv; > + > + /* update time for userspace gettimeofday */ > + microtime(&tv); This is supposed to have no races (microtime() uses a generation counter to recover from them), and gives the necessary microseconds precision, provided it is called every microsecond. This doesn't quite require a CPU spinning to update. You could also unmap the variable 1 microsecond after it is accessed and take a pagefault to update it for the next access. > + wall_clock.tv = tv; This has races (when userland accesses it in the middle of the kernel update). > + wall_clock.tz.tz_minuteswest = tz_minuteswest; > + wall_clock.tz.tz_dsttime = tz_dsttime; This has races, but rarely changes. > +int > +gettimeofday(struct timeval *tp, struct timezone *tzp) > +{ > + > + /* fallback to syscall if kernel doesn't export ksvar */ > + if (!KSVAR_IS_ACTIVE()) > + return (sys_gettimeofday(tp, tzp)); > + > + if (tp != NULL) > + *tp = KSVAR(wall_clock)->tv; Races. These can be fixed using a generation counter as in the kernel, but it is much harder since we are not in control of the update times and the updates must occur much more frequently. In the kernel, the critical updates occur only every 1-10 msec (except when timers are stopped). This gives an offset, and a difference is added to this, with atomicity for the difference given even naturally by the hardware or by locking the hardware. Then generation count changes only every 1-10 msec, and itself is updated only by implicit serialization instructions. No ordering is enforced for stores and loads, but the implicit serializations have 1-10 msec to flush the stores and loads, in any order provided that they happen during this time, for the generation count to work. > + if (tzp != NULL) > + *tzp = KSVAR(wall_clock)->tz; Minor races. > + return (0); > +} I don't see how to implement gettimeofday() in userland without duplicating most of the kernel's timecounter code. The kernel can keep an offset, updated every 1-10 msec. Userland calls the hardware timecounter and scales it delicately as in the kernel, but with even more complications to synchronize with ntpd micro-adjustments and other threads and timecounter hardware. > Now when a process will call getimeofday, will call that function actually. > If the process makes a lot of call to gettimeofday, we will see a > performance boost. Without the above problems being solved, we would see it not working precisely when it is used a lot. It would report time differences of 0 (or occasionally garbage from races) for events separated by hundreds of thousands of nanoseconds. gettimeofday()'s precision and accuracy are well established, so there must be many programs that depend on them. The newer clock_gettime() interfaces give more choice and must be used if sub-microsecond precision is wanted and sub-microsecond accuracy is hoped for (with normal X86 hardware, an accuracy of ~10 nanosecond is nearly possible for small differences in relative times but not for absolute times). But again, the low precision clock ids aren't very useful, since if you actually use them enough to notice their slowness, then they won't distinguish different events. These ids are especially bogus when clock_gettime() is implemented as a syscall, since the syscall overhead is about 90% of the total overhead provided the timecounter hardware is efficient (clock_getttime(CLOCK_MONOTONIC, ...) might take 300 nsec, while clock_gettime(CLOCK_MONOTONIC_FAST_N_BROKEN, ...) takes 330 nsec. Then you may as well use precise version). But if clock_gettime(CLOCK_MONOTONIC, ...) is a syscall as it needs to be for good precision, while clock_gettime(CLOCK_MONOTONIC_FAST_N_BROKEN, ...) is memory mapped, then the latter becomes almost useful. BTW, the FAST_N_BROKEN clock ids advertise bogus precision in clock_getres(), so you can't tell (except from their names) how much worse their precision is than that of the unbroken versions. (clock_getres() is bogusly named (POSIX standard), since it returns the precision (the resolution is 1 nsec for all timespec interfaces). POSIX clearly doesn't intend for clock_getres() to actually return the resolution, since that is known at compile time and parts of the specification of clock_getres() require precision semantics. Many parts are confused in other ways, at least in 2001 draft7, and are not implemented in FreeBSD. E.g., clock_settime() is specified to truncate the time to a multiple of the "resolution" returned by clock_getres() with the same id. This more or less assumes that the clock is in hardware ticks, with no micro-adjustments. FreeBSD does micro-adjustments and delicate scaling of hardware clocks, with a low-level resolution of 2**-64 seconds, and doesn't truncate the time in clock_settime(). File times are more interesting, since they often have to be truncated to fit on disks; POSIX only started specifying the details of this about 4 years ago. In this mail, I try to use "precision" consistently instead of "resolution", except where describing POSIX bugs.) For old clock ids, FreeBSD returns something reasonable (the hardware or virtual clock update period, rounded up), except for CLOCK_PROF and CLOCK_VIRTUAL, it returns the period of the unrelated hz clock (the actual clock is an impossible-to-express combination of the stathz clock for sampling the user+sys decomposition, the cputick clock for the total time, and these limited by the resolution of the getrusage()/calcru() timeval-based API, 1 usec for the tiemval resolution is probably best on fast machine, but I use 1/stathz seconds.) For newer clock ids, FreeBSD the precision of the precise clock for all the non-precise clocks (the latter have a virtual clock update period of tc_tick/hz seconds, so they should advertise that as their precision). CLOCK_THREAD_CPUTIME_ID uses the cputicker as its basic clock, but rounds to usec, so its precision is an integral multiple of 1000 nsec. The scaling for this is quite different and broken than that for the old clock ids. It rounds down instead of up, and then rounds up to 1000 nsec iff the previous value was 0. The first scaling step has the nonsense factor of 1000000 instead of 1000000000, and the result is accidentally correct in the usual case where the cputicker frequency is < 1000000, else garbage (not a multiple of 1000, and usually too small): % int % kern_clock_getres(struct thread *td, clockid_t clock_id, struct timespec *ts) % { % % ts->tv_sec = 0; % switch (clock_id) { % case CLOCK_REALTIME: % case CLOCK_REALTIME_FAST: % case CLOCK_REALTIME_PRECISE: % case CLOCK_MONOTONIC: % case CLOCK_MONOTONIC_FAST: % case CLOCK_MONOTONIC_PRECISE: % case CLOCK_UPTIME: % case CLOCK_UPTIME_FAST: % case CLOCK_UPTIME_PRECISE: Most of these shouldn't be here (or anywhere). % /* % * Round up the result of the division cheaply by adding 1. % * Rounding up is especially important if rounding down % * would give 0. Perfect rounding is unimportant. % */ % ts->tv_nsec = 1000000000 / tc_getfrequency() + 1; Correct value for CLOCK_REALTIME and CLOCK_MONOTONIC. Also for unportable aliases of these, but those shouldn't exist. % break; % case CLOCK_VIRTUAL: % case CLOCK_PROF: % /* Accurately round up here because we can do so cheaply. */ % ts->tv_nsec = (1000000000 + hz - 1) / hz; Should use stathz ore maybe just the resolution of a timeval (except when the precision is limited by the cputicker more than by timevals, use the former limit). % break; % case CLOCK_SECOND: % ts->tv_sec = 1; % ts->tv_nsec = 0; % break; Correct. Unlike most cases, clock_gettime() with this returns an integral multiple of the precision. For hardware clock ids with precise hardware, we don't really know the precision and don't want to round to it (it will be a small number of nsec, perhaps 0, plus a fraction of a nanosec, and we don't want to round everything based on these fractions). But for software clock ids, the resolution will be about 1-10msec and rounding to a multiple of 1 or 10 msec would be good if that is the update frequency (it will often be 1/hz with hz a nice multiple of 10, except for inaccuracies in the interrupt frequency). % case CLOCK_THREAD_CPUTIME_ID: % /* sync with cputick2usec */ % ts->tv_nsec = 1000000 / cpu_tickrate(); % if (ts->tv_nsec == 0) % ts->tv_nsec = 1000; % break; Nonsense scaling. When cpu_tickrate() <= 1000000, then result is too small by a factor of 1000 and often invalid since it is not a multiple of 1000. But on most or all supported arches, cpu_tickrate() is > 1000000 so the result of the division is 0 and this is fixed up to the correct value of 1000. The correct scaling is more like the above: - use the same scale factor as above. Here we are scaling to nsec, so it is almost irrelvant that cputick2usec() scales to usec. We should just use 1000 from cputick2usec()'s limit on the precision in the (usual) case that that limit is stricter (higher) than the one got by correct scaling here - when scaling to nsec, round up as above, not down as this does now - when scaling to nsec, round up efficiently as above, not with branching logic as this does now. But we probably need the branching logic to clamp up to 1000. Note that clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...) scales to usec just to use old KPIs which only support timevals, although clock_gettime() uses timespecs. % default: % return (EINVAL); % } % return (0); % } Bruce