Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 24 Jun 2013 23:16:33 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Gleb Smirnoff <glebius@FreeBSD.org>
Cc:        Konstantin Belousov <kostikbel@gmail.com>, svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, Bruce Evans <brde@optusnet.com.au>
Subject:   Re: svn commit: r252032 - head/sys/amd64/include
Message-ID:  <20130624211428.O2235@besplex.bde.org>
In-Reply-To: <20130624081056.GD1214@FreeBSD.org>
References:  <20130621065839.J916@besplex.bde.org> <20130621081116.E1151@besplex.bde.org> <20130621090207.F1318@besplex.bde.org> <20130621064901.GS1214@FreeBSD.org> <20130621184140.G848@besplex.bde.org> <20130621135427.GA1214@FreeBSD.org> <20130622110352.J2033@besplex.bde.org> <20130622124832.S2347@besplex.bde.org> <20130622174921.I3112@besplex.bde.org> <20130623073343.GY91021@kib.kiev.ua> <20130624081056.GD1214@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 24 Jun 2013, Gleb Smirnoff wrote:

> On Sun, Jun 23, 2013 at 10:33:43AM +0300, Konstantin Belousov wrote:
> K> On Sat, Jun 22, 2013 at 06:58:15PM +1000, Bruce Evans wrote:
> K> > So the i386 version be simply "addl; adcl" to memory.  Each store in
> K> > this is atomic at the per-CPU level.  If there is no carry, then the
> K> > separate stores are equivalent to adding separate nonnegative values and
> K> > the counter value is valid at all times.  If there is carry, then the
> K> > separate stores are equivalent to adding a negative value followed by
> K> > a larger positive value.  The counter transiently goes backwards, but
> K> > no information is lost and the counter is eventually set to the correct
> K> > forward-going value.
> K>
> K> This is quite interesting idea, but I still did not decided if it
> K> acceptable.  The issue is that we could add the carry to the other
> K> processor counter, if the preemption kicks in at right time between
> K> two instructions.  I did not found any argument why would it be
> K> wrong, the races for fetch operation seems to be the same with either
> K> local update method.
>
> This would be wrong since update isn't locked. Thus, if we are put on
> other CPU between two instructions, and in second instruction updating
> another CPU counter simultaneously with the original CPU we were on,
> then we are losing an update.

Hmm, this is subtle.  The update is never lost, but is applied to
different counter, non-atomically at a later time.  Non-atomicity
only matters when there is a carry since the counter only goes
transiently backards in this case.  For example: initial state:

     CPU1 counter:  00000000 ffffffff
     CPU2 counter:  00000000 fffffffe

Start adding 1 to the first counter, doing it non-atomically by
incrementing the low word first.

     CPU1 counter:  00000000 00000000  (carry in CPU1 eflags)
     CPU2 counter:  00000000 fffffffe

Get preempted at this point:

     CPU1 counter:  00000000 00000000
     CPU2 counter:  00000000 fffffffe  (carry in CPU2 eflags)

The fetching code running at this point will see a wrong value, but
has more serious bugs.  Once it is fixed, it can try using a heuristic
to detect this case, or the fix might involve a heuristic that detects
this case too.  I was thinking of keeping a global count of the previous
value for each counter, and adding 0x100000000 when the count goes backwards.
However, in the worst each component of a counter may be making the
transition from 0xffffffff to 0x100000000, with the carry bit in eflags
for all CPUs.  Detecting this requires watching the low word of all
components of all counters.  Increments must be limited to considerably
less than 32 bits, and negative increments should be disallowed, so
that the low word doesn't wap faster than it can be watched.

CPU2 runs and completes the increment by adding the carry to the high
word:

     CPU1 counter:  00000000 00000000
     CPU2 counter:  00000001 fffffffe

The total count becomes correct once the second word is written.
Statistics for each CPU should be correct, but losing this race is
so rare that maybe we especially don't care about this subcase.  There
is currently no API for exporting the counter statistics for each CPU.

Similarly if the preemption is to another thread on the same CPU, or
to another thread that doesn't touch this counter and then back to the
original thread (much later) on the same CPU.  There may be any number
of updates to the low word before the corresponding updates to the high
word, mixed in any order except for the low word being updated before
the high word for each increment.  State for the half-done updates is
saved in carry flags and registers in any number of threads.  If large
increments are allowed, the value may appear to have had many negative
increments.

> Yes, the probability of such event is extremely low, but still possible.

> The idea on counter(9) is that although fetching might be not precise,
> the internal value of a counter is consistent and doesn't lose even a
> single update. This would allow later to use counter(9) as a cheap
> reference counter.

The fetching code has essentially the same race (in the 32-bit case) that
is laboriously worked around in the update code, irrespective of how
atomic the update code is, even with 1 CPU:

     CPU1 counter:  00000000 ffffffff

Start reading it on CPU1, using 32-bit load of the low word (the compiler
could make this worse or just different by loading the high word first):

                             ffffffff  (read this value)

Get preempted; counter update occurs and adds 1 perfectly atomically:

     CPU1 counter:  00000001 00000000

Back to fetching code on CPU1; read high word:

                    00000001 ffffffff  (read this value; it is garbage)

If the compiler loads the high word first, it will see the value off by
~2**32 in the opposite direction (00000000 00000000).

Like I said before, this can probably be handled by rereading each
pcpu counter until it doesn't change.  Or if large increments are not
allowed, just read the low word, the high word, and the low word again
until the low word doesn't change.  This works on the CPU owning pcpu
counter provided the counter update is atomic.  On other CPUs, I think
it depends on MD memory ordering being fairly strict.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130624211428.O2235>