Date: Thu, 21 Aug 2014 06:22:55 +1000 (EST)
From: Bruce Evans
To: John Baldwin
Cc: Mateusz Guzik, Robert Watson, Johan Schuijt, freebsd-arch@freebsd.org,
    Konstantin Belousov
Subject: Re: [PATCH 0/2] plug capability races
Message-ID: <20140821044234.H11472@besplex.bde.org>
In-Reply-To: <201408201111.47601.jhb@freebsd.org>

On Wed, 20 Aug 2014, John Baldwin wrote:

> On Friday, August 15, 2014 9:34:59 pm Bruce Evans wrote:
>> On Fri, 15 Aug 2014, John Baldwin wrote:
>>
>>> One thing I would like to see is for the timecounter code to be
>>> adapted to use the seq API instead of doing it by hand (the
>>> timecounter code is also missing barriers due to doing it by hand).
>>
>> Locking in the timecounter code is poor (1), but I fear a general
>> mechanism would be slower.  Also, the timecounter code now extends
>> into userland, so purely kernel locking cannot work for it.  The
>> userland part is more careful about locking than the kernel.  It has
>> memory barriers and other pessimizations which were intentionally
>> left out of the kernel locking for timecounters.  If these barriers
>> are actually necessary, then they give the silly situation that
>> there are fewer races for userland timecounting than for kernel
>> timecounting, provided userland mostly does direct accesses instead
>> of syscalls and kernel uses of timecounters are infrequent enough
>> not to race often with the userland accesses.
>
> Yes, the userland code is more correct here.  The barriers are indeed
> missing in the kernel part, and adding them should give something
> equivalent to a correctly working seq API, as it is doing the same
> thing.

Userland is technically correct, but this defeats the point of the
intended algorithm.

I now remember a bit more about the algorithm.  There are several
generations of timehands.  Each generation remains stable for several
clock ticks, which is tens of milliseconds at 100 Hz.
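For readers without kern_tc.c open, the arrangement is roughly the
following sketch.  The names are from the real code, but the real
struct has many more members and the real ring is built from about ten
statically linked elements, so take this as a model, not as the actual
declarations.

	#include <sys/types.h>
	#include <sys/time.h>		/* struct bintime, struct timeval */

	struct timehands {
		u_int64_t	th_scale;	/* counter->bintime multiplier */
		struct bintime	th_offset;	/* uptime at the last windup */
		struct timeval	th_microtime;	/* wall time at the last windup */
		volatile u_int	th_generation;	/* bumped per update; 0 = in flux */
		struct timehands *th_next;	/* next element of the ring */
	};

	/* A ring of generations; assume the links are set up at boot. */
	static struct timehands ths[10];
	static struct timehands *volatile timehands = &ths[0];

tc_windup() fills in the element after the current one and then
advances the timehands pointer to it, once per tick, so a published
element is not rewritten until the pointer has gone all the way around
the ring.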
Normally there is no problem with just using the old pointer read from
timehands (except there is no serialization for updating timehands
itself (*)).  However, the thread might be preempted for several clock
ticks.  That is enough time for the old generation to change.  The
generation count is used to detect such changes.  Again, it doesn't
matter if the generation count is out of date, unless it is out of
date by a few generations.  So the algorithm works unless the CPU
de-serializes things by more than a few clock ticks.  I think no real
CPUs do that.  Virtual CPUs can do that, but I think they aren't a
problem in practice.  Single stepping in ddb gives a sort of virtual
CPU and breaks the algorithm, since time runs much faster outside the
stepped process and may advance several generations per step.  The
generation count protects against using a changed timehands, but may
cause binuptime() to never terminate instead.  It takes much weirder
virtualization than that to break the generation count itself.  Any
normal preemption or abnormal stopping of CPUs uses locks galore,
which synchronize everything on at least x86.

Variable-tick kernels give another problem.  They sometimes issue
virtual clock interrupts to catch up.  I think they take some care
with tc_windup(), but perhaps not enough.  tc_windup() calls must be
separated so that the timehands don't cycle too fast or too slow, in
either real time or time related to other system operation (there are
hard real time requirements, mainly for reading real hardware
timecounters before they overflow).

(*):

% binuptime(struct bintime *bt)
% {
% 	struct timehands *th;
% 	u_int gen;
%
% 	do {
% 		th = timehands;

Since tc_windup() doesn't dream of memory ordering either, the
timehands pointer read here may be in the future of the data it points
to.  That is much worse than it being in the past.  Barriers would be
cheap in tc_windup(), but useless if they require barriers in
binuptime() to work.

tc_windup() is normally called from the clock interrupt handler.
There are several mutexes (or at least atomic ops that give
synchronization on at least x86 SMP) before and after it.  These give
serialization very soon after the changes.  The fix (without adding
any barrier instructions) is easy: simply run the timehands update 1
or 2 generations behind the update of what it points to.  This gives
even more than time-domain locking, since the accidental
synchronization from the interrupt handler gives ordering between the
update of the pointed-to data and the timehands pointer.

% 		gen = th->th_generation;

It doesn't matter if the generation count is in the future, but it
needs to be the one that was written together with the rest of the
timehands, whether that write is in the past or the future.

% 		*bt = th->th_offset;
% 		bintime_addx(bt, th->th_scale * tc_delta(th));
% 	} while (gen == 0 || gen != th->th_generation);
% }

Now the timehands update code:

% 	/*
% 	 * Now that the struct timehands is again consistent, set the new
% 	 * generation number, making sure to not make it zero.
% 	 */

It is only sure to be consistent on in-order CPUs.

% 	if (++ogen == 0)
% 		ogen = 1;
% 	th->th_generation = ogen;
%
% 	/* Go live with the new struct timehands. */
% #ifdef FFCLOCK
% 	switch (sysclock_active) {
% 	case SYSCLOCK_FBCK:
% #endif

I don't like the FFCLOCK complications.  They interact with the
locking bugs a little here.

% 		time_second = th->th_microtime.tv_sec;
% 		time_uptime = th->th_offset.sec;

Old versions had only these 2 statements before setting timehands and
returning.  Those are racy enough, and using these variables is racier
still.  They have type time_t, so they might be 64 bits on 32-bit
arches, and then reading them might not be atomic.  In practice, very
strong time-domain locking applies: the races won't occur until the
top bits start actually being used, a mere 24 years from now, and then
there will be a race window of a few microseconds.  The generation
count should be used to make accesses to these variables technically
correct, and slow.
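Something like the following hypothetical helper would do it.  The
loop is the one from binuptime() above; the function name is made up.

	time_t
	time_second_gen(void)
	{
		struct timehands *th;
		time_t t;
		u_int gen;

		/* Retry until a stable generation brackets the read. */
		do {
			th = timehands;
			gen = th->th_generation;
			t = th->th_microtime.tv_sec;
		} while (gen == 0 || gen != th->th_generation);
		return (t);
	}

Of course it has the same missing-barrier problem as binuptime()
itself until the update side is fixed.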
% #ifdef FFCLOCK
% 		break;
% 	case SYSCLOCK_FFWD:
% 		time_second = fftimehands->tick_time_lerp.sec;
% 		time_uptime = fftimehands->tick_time_lerp.sec - ffclock_boottime.sec;
% 		break;

Perhaps more races, from the more complicated expressions.  Also a
style bug (a long line).

% 	}
% #endif
%
% 	timehands = th;
% 	timekeep_push_vdso();
% }

timekeep_push_vdso() has a couple of atomic stores in it.  Perhaps
these give perfect serialization for the user variables.  On some
arches they accidentally sync the kernel variables a little earlier
than the accidental sync from the interrupt handler, but still out of
order with the kernel variable updates.  Again, this shouldn't be
needed -- use a delayed pointer update for the user variables too.
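To be concrete, the delayed update could be something like the
following sketch, using the struct from the earlier sketch.
tc_fill_timehands() is a made-up name for the part of tc_windup() that
fills in the next ring element and bumps its generation; this
illustrates the idea and is not a patch.

	/* Made-up helper: fill the next ring element, bump its gen. */
	static struct timehands *tc_fill_timehands(void);

	static struct timehands *th_pending;	/* filled, not yet published */

	static void
	tc_windup_sketch(void)
	{
		struct timehands *th;

		/*
		 * Fill in the next ring element now, but only publish
		 * the element filled in on the previous tick.  By now
		 * the accidental serialization around the clock
		 * interrupt handler has made that element's stores
		 * visible to all CPUs, so no barrier instructions are
		 * needed on either the writer or the reader side.
		 */
		th = tc_fill_timehands();
		if (th_pending != NULL)
			timehands = th_pending;
		th_pending = th;
	}

The same trick would serve for the user variables: have
timekeep_push_vdso() push the previous generation's data instead of
this tick's.

Bruce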