Date: Sun, 23 Jun 2013 19:57:57 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, Gleb Smirnoff,
 src-committers@freebsd.org, Bruce Evans
Subject: Re: svn commit: r252032 - head/sys/amd64/include
In-Reply-To: <20130623073343.GY91021@kib.kiev.ua>
Message-ID: <20130623181458.J2256@besplex.bde.org>

On Sun, 23 Jun 2013, Konstantin Belousov wrote:

> On Sat, Jun 22, 2013 at 06:58:15PM +1000, Bruce Evans wrote:
>> So the i386 version can simply be "addl; adcl" to memory.  Each store
>> in this is atomic at the per-CPU level.  If there is no carry, then
>> the separate stores are equivalent to adding separate nonnegative
>> values, and the counter value is valid at all times.  If there is
>> carry, then the separate stores are equivalent to adding a negative
>> value followed by a larger positive value.  The counter transiently
>> goes backwards, but no information is lost and the counter is
>> eventually set to the correct forward-going value.
>
> This is quite an interesting idea, but I have not yet decided whether
> it is acceptable.  The issue is that we could add the carry to another
> processor's counter if preemption kicks in at just the right time
> between the two instructions.  I have not found any argument why that
> would be wrong; the races for the fetch operation seem to be the same
> with either local update method.

I thought about exploiting the not-so-weak memory ordering to fix the
fetches.  On at least x86, stores are not reordered, so if the storer
does a simple "addl; adcl", then the order is not much different from:

	load low 32 bits
	store low 32 bits
	load high 32 bits
	store high 32 bits

The loads can be in any order, but must be before the stores to the
same location.
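For concreteness, a minimal sketch of what such an updater could look
like (the function name, the %fs-based per-CPU addressing, and the
restriction to 32-bit increments are my assumptions for illustration,
not any committed code):

	#include <sys/types.h>

	/*
	 * Per-CPU 64-bit counter update as two plain adds to memory.
	 * "p" is the offset of the counter within the per-CPU region,
	 * which is addressed through %fs.  Each store is atomic at the
	 * per-CPU level; between the addl and the adcl the counter may
	 * transiently read backwards, as described above, but no
	 * increment is ever lost.
	 */
	static inline void
	counter_32bit_add(uint64_t *p, uint32_t inc)
	{
		__asm __volatile(
		    "addl	%1,%%fs:(%0)\n\t"	/* low word; may set carry */
		    "adcl	$0,%%fs:4(%0)"		/* fold carry into high word */
		    : : "r" (p), "ri" (inc) : "memory", "cc");
	}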
I think the fetching code can depend on this (maybe only on x86, but
the fetching code needs to be MD anyway).  Thus it can detect most
races by simply rereading the counters in almost any order until they
don't change (a rough fetch loop is sketched below).  Without this, the
fetching code can easily read high and low bits from unrelated stores
(when it is preempted, the race window is large), and the sophisticated
cmpxchg8b code to make each store atomic is almost useless.

The case that can't be fixed by rereading the counters is when the
fetching code runs in between the stores.  If the stores are on another
CPU that is currently executing them, then we can keep checking that
the counters don't change for the maximum time that any pair of stores
takes to complete.  This time is quite long and difficult to determine
(but it can be bounded by hard-disabling interrupts in the storer, and
limited by prefetching).  If the stores are on the same CPU, then we
have no good way to detect that another thread is in the middle of
them, or to switch back to that thread.  So we currently prevent this
case using slow methods.

> On the other hand, the state of the counters array would be quite
> confusing to observe when debugging.

I thought of lots of variations, but couldn't find one that works
perfectly.  One idea (that goes with the sign check on the low 32 bits)
is to use a misaligned add to memory to copy the 31st bit as a carry
bit to the high word.  The value of the counter is supposed to be

	[unsigned value of low 32 bits] +
	    ([unsigned value of next 24 bits] << 31)

(high 8 bits not used) at all times, with the 31st bit zero at most
times so that the carry operation is rarely executed.  This format is
only slightly confusing for debugging (it basically has two 31st bits,
with the one in the low 32 bits rarely used).  This format can be
updated using something like:

	addl	%1,%%fs:(%0)		# only small 32-bit increments supported
	jns	8f
	addl	$0x80,%%fs:3(%0)	# misaligned memory access
8:	;

No problem if this is not preempted.  The idea of the misaligned memory
access is that the CPU needs to make this atomic enough even though it
crosses a 32-bit boundary.  BTW, is anything guaranteed if the access
also crosses a cache line or page boundary?  The OS would be involved
for page boundaries if there is a page fault.  Better not step on bugs
in that.  What about misaligned 64-bit or 128-bit stores done by
cmpxchg8b or SSE?  Counters are 64-bit aligned, so I think that at
least CPUs that are basically 64-bit must handle this like I want.

The above doesn't work if it is preempted -- then it might do the carry
operation more than once.  But it is easy to lock using cmpxchg or by
disabling interrupts.  Disabling interrupts requires only small code:

	addl	%1,%%fs:(%0)		# only small 32-bit increments supported
	jns	8f
	pushfl
	cli
	testb	$0x80,%%fs:3(%0)
	je	1f			# carry already done by preemption
	addl	$0x80,%%fs:3(%0)	# misaligned memory access
1:	popfl
8:	;

Another idea is to use the high bits of the high word to encode state.
They can be set atomically enough using addl/andl/orl/xorl since they
are per-CPU.  I couldn't quite get this to work.  You could increment
them to indicate an update in progress.  This only requires 1 extra add
to memory if the adcl is always done: e.g., first add 0x01000000 to the
high word; then addl to the low word; then adcl of
(inc >> 32) - 0x01000000 to the high word.  Increments must be limited
to not change the state bits, of course.
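Returning to the fetch side, a rough sketch of the reread loop
mentioned above (this assumes the x86 store ordering just discussed;
the helper name and the bare-pointer interface are hypothetical):

	#include <sys/types.h>

	/*
	 * Reread one CPU's counter halves until two consecutive reads
	 * agree.  This catches the common race of pairing the low word
	 * of one store with the high word of an unrelated one; it
	 * cannot catch a read that lands exactly between the addl and
	 * the adcl on the owning CPU.
	 */
	static uint64_t
	counter_fetch_one(volatile uint32_t *p)
	{
		uint64_t old, new;

		new = (uint64_t)p[1] << 32 | p[0];
		do {
			old = new;
			new = (uint64_t)p[1] << 32 | p[0];
		} while (new != old);
		return (new);
	}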
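Spelled out, that state-bits sequence might look like the following,
where %1 is the low 32 bits of the increment and %2 is precomputed as
(inc >> 32) - 0x01000000.  The 0x01000000 constant assumes the top 8
bits of the high word hold an update-in-progress count; this is only an
illustration of the idea:

	addl	$0x01000000,%%fs:4(%0)	# bump update-in-progress count
	addl	%1,%%fs:(%0)		# add low 32 bits, maybe setting carry
	adcl	%2,%%fs:4(%0)		# add high bits plus carry; drops the count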
The fetching code can easily examine these state bits to learn that
there is a problem, but it has no good way of handling the problem.
The code that set the bits might have been preempted by the fetching
code.  It takes something like mutex priority propagation to get back
to the code that finishes the update and clears the state bits.  But
we are intentionally not using full mutexes.

Bruce